Nodestream 0.13 Release

August 9, 2024 · 5 min read

Maintainer of Nodestream

We are happy to announce the release of Nodestream 0.13. This release includes a number of new features and improvements. However, its breaking changes are minimal, so upgrading should be a breeze.

Breaking Changes

Unified File Extractors

In the past, we had separate file extractors for local, remote, and S3 files. This was a bit cumbersome for a couple of reasons:

On the maintainability side, we had to make sure that all of these extractors were kept in sync with each other.
On the user side, it was limiting to have to choose between these extractors when the only difference was the location of the file.

Starting with this release, we have unified these extractors into a single UnifiedFileExtractor extractor. This extractor can handle local, remote, and S3 files (so functionality is not lost).

Check out the docs on it here.

Core Contributors to this feature include:

Changing Source Node Behavior With `properties`

In the past, the properties key automatically lower cased all string property values. This was because there was one set of normalization rules for the entire source node interpretation. However, this was a bit limiting because it was not possible to have different normalization rules for different properties.

Starting with this release, the properties key no longer automatically lower cases all string property values. Instead, you can now define normalization rules for keys and properties separately (via the key_normalization and property_normalization properties). However, if you specify the normalization key, it will apply to both keys and properties and will default to lower casing all string property values.

New Features

Squashing Migrations

In the past, migrations were applied one by one in order. If during development, you were constantly iterating on a data model you could be constantly adding migrations. This resulted in a lot of migrations being applied that were essentially intermediary when going to production. As a result, the migration node count could get quite large with a lot of "messy" migrations.

Starting with this release, you can now squash migrations. This means that you can take a set of migrations and squash them into a single, optimized set of migrations. This can be useful for cleaning up the migration node count and making it easier to understand the data model. Additionally, the old migrations are still stored in the project, so you can always go back to them if you need to. If a database has partially applied a sequence of migrations that was squashed, we can't used the squashed migration. Instead, the logic will fall back to the original migrations.

Core Contributors to this feature include:

Check out the docs on it here.

Compressed File Handling

Many users have data stored in compressed files. This release adds support for compressed files that are .gz, .bz2 in format. This support is available in the UnifiedFileExtractor extractor.

Check out the docs on it here.

Core Contributors to this feature include:

Improved LLM Compatible Schema Printing

The llm format is a format that is used to represent the schema of a graph. This release adds improved support for printing the schema in a format that is compatible with an llm.

In short, it uses a cypher-esque syntax to represent the schema of the graph:

Node Types:
Person: first_name: STRING, last_name: STRING, last_ingested_at: DATETIME, age: STRING
Number: number: STRING, last_ingested_at: DATETIME
Relationship Types:
KNOWS: last_ingested_at: DATETIME
Adjancies:
(:Person)-[:KNOWS]->(:Person)

Core Contributors to this feature include:

Improved Error Messages in Value Providers

Nodestream uses value providers to extract values from documents and map them to graph. Every time you get an error in a value provider, it can be a bit tricky to debug. This release adds improved error messages to value providers to make it easier to debug issues.

Core Contributors to this feature include:

DynamoDB Extractor

DynamoDB is a popular NoSQL database that is used by many people to store data. This release adds support for DynamoDB as a first class citizen via the DynamoDBExtractor.

Check out the docs on it here.

Core Contributors to this feature include:

SQS and Queue Extractor Support

Many users have data stored in SQS and other queue services. This release adds support for SQS and other queue services via the QueueExtractor. Concecptually, this extractor is similar to the StreamExtractor but for queues.

Check out the docs on it here.

Core Contributors to this feature include:

Release Attestations

0.13 marks the first release were nodestream and all of its dependencies are signed and attested to via Github's Attestation support. This means that you can be sure that the code you are running is the code that was intended to be run.

Dependency Updates

This release includes updates to dependencies to keep Nodestream up to date with the latest and greatest. Some dependencies that were updated include:

httpx to >=0.27.0
uvloop to >=0.17.0, <=0.19.0 (Not installed/used on Python 3.13 due to compatibility issues)
numpy to >=2.0.0
pyarrow to 17.0.0
python 3.13 has been added to the supported versions matrix.
A variety of other dependencies have had their supported versions widened to be more permissive.

Bug Fixes

Fixed a bug where schema inference was not working correctly in some cases (with switch interpretations).
Fixed a variety of bugs related to the pipeline runtime that were causing mishandled errors.

Breaking Changes​

Unified File Extractors​

Changing Source Node Behavior With properties​

New Features​

Squashing Migrations​

Compressed File Handling​

Improved LLM Compatible Schema Printing​

Improved Error Messages in Value Providers​

DynamoDB Extractor​

SQS and Queue Extractor Support​

Release Attestations​

Dependency Updates​

Bug Fixes​

Breaking Changes

Unified File Extractors

Changing Source Node Behavior With `properties`

New Features

Squashing Migrations

Compressed File Handling

Improved LLM Compatible Schema Printing

Improved Error Messages in Value Providers

DynamoDB Extractor

SQS and Queue Extractor Support

Release Attestations

Dependency Updates

Bug Fixes