
Zach Probst · 5 min read

We are happy to announce the release of Nodestream 0.13. This release includes a number of new features and improvements. However, its breaking changes are minimal, so upgrading should be a breeze.

Breaking Changes

Unified File Extractors

In the past, we had separate file extractors for local, remote, and S3 files. This was a bit cumbersome for a couple of reasons:

  • On the maintainability side, we had to make sure that all of these extractors were kept in sync with each other.
  • On the user side, it was limiting to have to choose between these extractors when the only difference was the location of the file.

Starting with this release, we have unified these extractors into a single UnifiedFileExtractor. This extractor handles local, remote, and S3 files, so no functionality is lost.
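As a rough sketch of how this might look in a pipeline (the implementation path and the shape of the sources argument are assumptions for illustration; see the docs for the exact interface):

```yaml
# Hypothetical pipeline step: one extractor, several file locations.
# The implementation path and argument names below are illustrative.
- implementation: nodestream.pipeline.extractors.files:UnifiedFileExtractor
  arguments:
    sources:
    - type: local            # files on disk, matched by glob
      globs:
      - data/*.json
    - type: s3               # objects in an S3 bucket
      bucket: my-bucket
      prefix: exports/
    - type: http             # remote files fetched over HTTP
      urls:
      - https://example.com/data.json
```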

Check out the docs on it here.


Changing Source Node Behavior With properties

In the past, the properties key automatically lowercased all string property values, because there was a single set of normalization rules for the entire source node interpretation. This was limiting: it was not possible to apply different normalization rules to keys and properties.

Starting with this release, the properties key no longer automatically lowercases string property values. Instead, you can define normalization rules for keys and properties separately, via the key_normalization and property_normalization keys. If you specify the normalization key, however, it applies to both keys and properties and defaults to lowercasing all string values.
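For example, a source_node interpretation might now look something like this (a minimal sketch; the do_* flags shown are the ones nodestream already uses elsewhere, but check the docs for the exact set):

```yaml
- type: source_node
  node_type: User
  key:
    email: !jmespath email
  key_normalization:
    do_lowercase_strings: true   # keys are still lowercased here
  properties:
    name: !jmespath name
  property_normalization:
    do_trim_whitespace: true     # properties are no longer lowercased by default
```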

New Features

Squashing Migrations

In the past, migrations were applied one by one, in order. If you were iterating on a data model during development, you could end up adding migration after migration, many of which were merely intermediate steps by the time you went to production. As a result, the number of migrations could grow quite large, with a lot of "messy" migrations along the way.

Starting with this release, you can squash migrations: take a set of migrations and collapse them into a single, optimized set. This is useful for keeping the migration count down and making the data model easier to understand. The old migrations are still stored in the project, so you can always go back to them if you need to. If a database has only partially applied a sequence of migrations that was later squashed, the squashed migration can't be used; in that case, the logic falls back to the original migrations.


Check out the docs on it here.

Compressed File Handling

Many users have data stored in compressed files. This release adds support for .gz and .bz2 compressed files to the UnifiedFileExtractor.

Check out the docs on it here.


Improved LLM Compatible Schema Printing

The llm format represents the schema of a graph in a form that is easy for a large language model to consume. This release improves how the schema is printed in that format.

In short, it uses a cypher-esque syntax to represent the schema of the graph:

```
Node Types:
Person: first_name: STRING, last_name: STRING, last_ingested_at: DATETIME, age: STRING
Number: number: STRING, last_ingested_at: DATETIME
Relationship Types:
KNOWS: last_ingested_at: DATETIME
Adjacencies:
(:Person)-[:KNOWS]->(:Person)
```


Improved Error Messages in Value Providers

Nodestream uses value providers to extract values from documents and map them to the graph. When a value provider fails, the resulting error can be tricky to debug. This release improves the error messages raised by value providers to make debugging easier.


DynamoDB Extractor

DynamoDB is a popular NoSQL database. This release adds support for DynamoDB as a first-class citizen via the DynamoDBExtractor.
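A pipeline step using it might look roughly like this (the implementation path and argument names are assumptions for illustration; the docs have the real interface):

```yaml
# Hypothetical usage; the path and arguments below are illustrative.
- implementation: nodestream.pipeline.extractors.stores.dynamodb:DynamoDBExtractor
  arguments:
    table_name: people
```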

Check out the docs on it here.


SQS and Queue Extractor Support

Many users have data stored in SQS and other queue services. This release adds support for them via the QueueExtractor. Conceptually, this extractor is similar to the StreamExtractor, but for queues.
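Sketching what this might look like for SQS (again, the implementation path and argument names are illustrative assumptions, not the documented interface):

```yaml
# Hypothetical usage; the path and arguments below are illustrative.
- implementation: nodestream.pipeline.extractors.queues:QueueExtractor
  arguments:
    connector: sqs
    queue_url: https://sqs.us-east-1.amazonaws.com/000000000000/my-queue
```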

Check out the docs on it here.


Release Attestations

0.13 marks the first release where nodestream and all of its dependencies are signed and attested via GitHub's attestation support. This means you can verify that the code you are running is the code that was intended to be released.

Dependency Updates

This release includes updates to dependencies to keep Nodestream up to date with the latest and greatest. Some dependencies that were updated include:

  • httpx to >=0.27.0
  • uvloop to >=0.17.0, <=0.19.0 (Not installed/used on Python 3.13 due to compatibility issues)
  • numpy to >=2.0.0
  • pyarrow to 17.0.0
  • Python 3.13 has been added to the supported versions matrix.
  • A variety of other dependencies have had their supported versions widened to be more permissive.

Bug Fixes

  • Fixed a bug where schema inference was not working correctly in some cases (with switch interpretations).
  • Fixed a variety of bugs related to the pipeline runtime that were causing mishandled errors.

Zach Probst · 5 min read

We are happy to announce the release of Nodestream 0.12. This release marks the largest update to Nodestream since its inception. We've spent a lot of time improving the core of nodestream and we're excited to share it with you.

Before we get into the details, we want to thank the community for their support and feedback. In that spirit, we have completely revamped the documentation to make it easier to use and navigate. This release comes with two headline features: Database Migrations and Multi-Database Support.

Major Features

Database Migrations

In the past, nodestream attempted to automatically create indexes and constraints on the database based on your pipeline at runtime. This was done by introspecting the schema of the entire project and generating the appropriate queries to create the indexes and constraints. This was a very powerful feature but it had a few drawbacks:

  • It was redundant. The same indexes and constraints were being created with IF NOT EXISTS clauses every time the pipeline was run.
  • It was slow. The queries were being executed serially and the pipeline was locked until they were all completed.
  • It was error prone. If the database was not in a state that allowed for the creation of the indexes and constraints, the pipeline would fail.
  • It was high friction. There was no way to refactor the database without manual intervention. If the schema changed, the pipeline would fail and the user would have to manually remove the indexes, constraints, and sometimes data before running the pipeline again.

To address these issues, nodestream 0.12 introduces the concept of migrations. Migrations encapsulate changes to the database schema in a form that can be applied incrementally. Conceptually, they are similar to migrations in Django, Rails, Neo4j-Migrations, and Flyway.


Migrations are defined in a directory called migrations in the root of your project. Each migration is a yaml file that contains data about the migration and its dependencies. You can create migrations by running the nodestream migrations make command.
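To give a feel for the shape of a migration file, here is a minimal sketch (the field and operation names are illustrative, not the exact schema):

```yaml
# Illustrative migration file; real field and operation names may differ.
name: add_person_node_type
dependencies: []                # names of migrations this one depends on
operations:
- operation: create_node_type   # hypothetical operation name
  arguments:
    name: Person
    keys:
    - email
```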

Check out the changes to the tutorial on Database Migrations as well as the new tutorial on Working With Migrations to learn more.


Multi-Database Support

Prior to this release, the only supported database was neo4j. While it is a category-leading database, the goal of nodestream is to be database agnostic and to afford developers the ability to use the database or databases that best fit their needs. As such, we are happy to announce that nodestream now supports Amazon Neptune and Amazon Neptune Analytics. To accommodate this, we have moved the neo4j database connector into a separate package called nodestream-plugin-neo4j and added a new package called nodestream-plugin-neptune.

Starting with this release, you can use the --database flag to generate Neptune boilerplate configuration.
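The generated target in nodestream.yaml might then look something like this (a sketch; the exact keys come from the nodestream-plugin-neptune docs):

```yaml
# Illustrative target configuration; exact keys may differ.
targets:
  my-neptune:
    database: neptune
    mode: analytics      # or "database" for a Neptune cluster endpoint
    graph_id: g-abc1234
```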


Check out the docs on it here.


Other Features

Parquet Support

Many customers have data stored in parquet format. Parquet is a columnar storage format that is optimized for reading and writing large datasets. We are happy to announce that nodestream now supports parquet as a first class citizen.
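For instance, pointing the file extractor at parquet files could look like this (a sketch assuming the file extractor interface of this release):

```yaml
# Illustrative; assumes the 0.12-era file extractor interface.
- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
    - data/*.parquet
```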

Check out the docs on it here.


Include Properties From Maps

In the past, each property you wanted to include in the pipeline had to be explicitly defined in the pipeline configuration. This was cumbersome and error prone. Starting with this release, you can include all properties at once by supplying an expression that returns a map at the properties key, instead of a mapping of property names to expressions.

For example, here is how this looks on the source_node and properties interpretations:

```yaml
- type: source_node
  node_type: User
  key:
    email: !jmespath email
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_trim_whitespace: true
- type: properties
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_lowercase_strings: true
```

Check out the docs on it here.


Performance Improvements

We've made a small number of performance improvements to the core of nodestream that should result in faster processing times and lower memory usage. Most notably, we now cache the last_ingested_at timestamp for nodes and relationships to reduce the number of objects we create in memory. We've observed a 10% improvement in processing times and a 5% reduction in memory usage in our testing.
