
Zach Probst · 5 min read

We are happy to announce the release of Nodestream 0.13. This release includes a number of new features and improvements. However, its breaking changes are minimal, so upgrading should be a breeze.

Breaking Changes

Unified File Extractors

In the past, we had separate file extractors for local, remote, and S3 files. This was a bit cumbersome for a couple of reasons:

  • On the maintainability side, we had to make sure that all of these extractors were kept in sync with each other.
  • On the user side, it was limiting to have to choose between these extractors when the only difference was the location of the file.

Starting with this release, we have unified these extractors into a single UnifiedFileExtractor. This extractor handles local, remote, and S3 files, so no functionality is lost.
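As a rough sketch of how this might look in a pipeline (the implementation path and the shape of the sources argument are assumptions for illustration; see the docs for the exact interface):

```yaml
# Hypothetical pipeline step: one extractor, several file locations.
# The implementation path and argument names below are illustrative.
- implementation: nodestream.pipeline.extractors.files:UnifiedFileExtractor
  arguments:
    sources:
    - type: local            # files on disk, matched by glob
      globs:
      - data/*.json
    - type: s3               # objects in an S3 bucket
      bucket: my-bucket
      prefix: exports/
    - type: http             # remote files fetched over HTTP
      urls:
      - https://example.com/data.json
```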

Check out the docs on it here.


Changing Source Node Behavior With properties

In the past, the properties key automatically lowercased all string property values, because there was a single set of normalization rules for the entire source node interpretation. This was limiting: it was not possible to apply different normalization rules to keys and properties.

Starting with this release, the properties key no longer automatically lowercases string property values. Instead, you can define normalization rules for keys and properties separately, via the key_normalization and property_normalization keys. If you specify the normalization key, however, it applies to both keys and properties and defaults to lowercasing all string values.
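For example, a source_node interpretation might now look something like this (a minimal sketch; the do_* flags shown are the ones nodestream already uses elsewhere, but check the docs for the exact set):

```yaml
- type: source_node
  node_type: User
  key:
    email: !jmespath email
  key_normalization:
    do_lowercase_strings: true   # keys are still lowercased here
  properties:
    name: !jmespath name
  property_normalization:
    do_trim_whitespace: true     # properties are no longer lowercased by default
```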

New Features

Squashing Migrations

In the past, migrations were applied one by one, in order. If you were iterating on a data model during development, you could end up adding migration after migration, many of which were merely intermediate steps by the time you went to production. As a result, the number of migrations could grow quite large, with a lot of "messy" migrations along the way.

Starting with this release, you can squash migrations: take a set of migrations and collapse them into a single, optimized set. This is useful for keeping the migration count down and making the data model easier to understand. The old migrations are still stored in the project, so you can always go back to them if you need to. If a database has only partially applied a sequence of migrations that was later squashed, the squashed migration can't be used; in that case, the logic falls back to the original migrations.


Check out the docs on it here.

Compressed File Handling

Many users have data stored in compressed files. This release adds support for .gz and .bz2 compressed files to the UnifiedFileExtractor.

Check out the docs on it here.


Improved LLM Compatible Schema Printing

The llm format represents the schema of a graph in a form that is easy for a large language model to consume. This release improves how the schema is printed in that format.

In short, it uses a cypher-esque syntax to represent the schema of the graph:

```
Node Types:
Person: first_name: STRING, last_name: STRING, last_ingested_at: DATETIME, age: STRING
Number: number: STRING, last_ingested_at: DATETIME
Relationship Types:
KNOWS: last_ingested_at: DATETIME
Adjacencies:
(:Person)-[:KNOWS]->(:Person)
```


Improved Error Messages in Value Providers

Nodestream uses value providers to extract values from documents and map them to the graph. When a value provider fails, the resulting error can be tricky to debug. This release improves the error messages raised by value providers to make debugging easier.


DynamoDB Extractor

DynamoDB is a popular NoSQL database. This release adds support for DynamoDB as a first-class citizen via the DynamoDBExtractor.
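A pipeline step using it might look roughly like this (the implementation path and argument names are assumptions for illustration; the docs have the real interface):

```yaml
# Hypothetical usage; the path and arguments below are illustrative.
- implementation: nodestream.pipeline.extractors.stores.dynamodb:DynamoDBExtractor
  arguments:
    table_name: people
```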

Check out the docs on it here.


SQS and Queue Extractor Support

Many users have data stored in SQS and other queue services. This release adds support for them via the QueueExtractor. Conceptually, this extractor is similar to the StreamExtractor, but for queues.
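Sketching what this might look like for SQS (again, the implementation path and argument names are illustrative assumptions, not the documented interface):

```yaml
# Hypothetical usage; the path and arguments below are illustrative.
- implementation: nodestream.pipeline.extractors.queues:QueueExtractor
  arguments:
    connector: sqs
    queue_url: https://sqs.us-east-1.amazonaws.com/000000000000/my-queue
```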

Check out the docs on it here.


Release Attestations

0.13 marks the first release where nodestream and all of its dependencies are signed and attested via GitHub's attestation support. This means you can verify that the code you are running is the code that was intended to be released.

Dependency Updates

This release includes updates to dependencies to keep Nodestream up to date with the latest and greatest. Some dependencies that were updated include:

  • httpx to >=0.27.0
  • uvloop to >=0.17.0, <=0.19.0 (Not installed/used on Python 3.13 due to compatibility issues)
  • numpy to >=2.0.0
  • pyarrow to 17.0.0
  • Python 3.13 has been added to the supported versions matrix.
  • A variety of other dependencies have had their supported versions widened to be more permissive.

Bug Fixes

  • Fixed a bug where schema inference was not working correctly in some cases (with switch interpretations).
  • Fixed a variety of bugs related to the pipeline runtime that were causing mishandled errors.

Zach Probst · 5 min read

We are happy to announce the release of Nodestream 0.12. This release marks the largest update to Nodestream since its inception. We've spent a lot of time improving the core of nodestream and we're excited to share it with you.

Before we get into the details, we want to thank the community for their support and feedback. In that spirit, we have completely revamped the documentation to make it easier to use and navigate. This release comes with two headline features: Database Migrations and Multi-Database Support.

Major Features

Database Migrations

In the past, nodestream attempted to automatically create indexes and constraints on the database based on your pipeline at runtime. This was done by introspecting the schema of the entire project and generating the appropriate queries to create the indexes and constraints. This was a very powerful feature but it had a few drawbacks:

  • It was redundant. The same indexes and constraints were being created with IF NOT EXISTS clauses every time the pipeline was run.
  • It was slow. The queries were being executed serially and the pipeline was locked until they were all completed.
  • It was error prone. If the database was not in a state that allowed for the creation of the indexes and constraints, the pipeline would fail.
  • It was high friction. There was no way to refactor the database without manual intervention. If the schema changed, the pipeline would fail and the user would have to manually remove the indexes, constraints, and sometimes data before running the pipeline again.

To address these issues, nodestream 0.12 introduces the concept of migrations. Migrations encapsulate changes to the database schema in a form that can be applied incrementally. Conceptually, they are similar to migrations in Django, Rails, Neo4j-Migrations, and Flyway.


Migrations are defined in a directory called migrations in the root of your project. Each migration is a yaml file that contains data about the migration and its dependencies. You can create migrations by running the nodestream migrations make command.
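To give a feel for the shape of a migration file, here is a minimal sketch (the field and operation names are illustrative, not the exact schema):

```yaml
# Illustrative migration file; real field and operation names may differ.
name: add_person_node_type
dependencies: []                # names of migrations this one depends on
operations:
- operation: create_node_type   # hypothetical operation name
  arguments:
    name: Person
    keys:
    - email
```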

Check out the changes to the tutorial on Database Migrations as well as the new tutorial on Working With Migrations to learn more.


Multi-Database Support

Prior to this release, the only supported database was neo4j. While it is a category-leading database, the goal of nodestream is to be database agnostic and to afford developers the ability to use the database or databases that best fit their needs. As such, we are happy to announce that nodestream now supports Amazon Neptune and Amazon Neptune Analytics. To accommodate this, we have moved the neo4j database connector into a separate package called nodestream-plugin-neo4j and added a new package called nodestream-plugin-neptune.

Starting with this release, you can use the --database flag to generate Neptune boilerplate configuration.
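The generated target in nodestream.yaml might then look something like this (a sketch; the exact keys come from the nodestream-plugin-neptune docs):

```yaml
# Illustrative target configuration; exact keys may differ.
targets:
  my-neptune:
    database: neptune
    mode: analytics      # or "database" for a Neptune cluster endpoint
    graph_id: g-abc1234
```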


Check out the docs on it here.


Other Features

Parquet Support

Many customers have data stored in parquet format. Parquet is a columnar storage format that is optimized for reading and writing large datasets. We are happy to announce that nodestream now supports parquet as a first class citizen.
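For instance, pointing the file extractor at parquet files could look like this (a sketch assuming the file extractor interface of this release):

```yaml
# Illustrative; assumes the 0.12-era file extractor interface.
- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
    - data/*.parquet
```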

Check out the docs on it here.


Include Properties From Maps

In the past, each property you wanted to include in the pipeline had to be explicitly defined in the pipeline configuration. This was cumbersome and error prone. Starting with this release, you can include all properties at once by supplying an expression that returns a map at the properties key, instead of a mapping of property names to expressions.

For example, here is how this looks on the source_node and properties interpretations:

```yaml
- type: source_node
  node_type: User
  key:
    email: !jmespath email
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_trim_whitespace: true
- type: properties
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_lowercase_strings: true
```

Check out the docs on it here.


Performance Improvements

We've made a small number of performance improvements to the core of nodestream that should result in faster processing times and lower memory usage. Most notably, we now cache the last_ingested_at timestamp for nodes and relationships to reduce the number of objects we create in memory. We've observed a 10% improvement in processing times and a 5% reduction in memory usage in our testing.
