7 posts tagged with "nodestream"

Nodestream 0.14 Release

February 24, 2025 · 3 min read

Maintainer of Nodestream

We are happy to announce the release of Nodestream 0.14. This release includes a number of new features and improvements. However, its breaking changes are minimal, so upgrading should be a breeze.

Breaking Changes

File Extractors

In 0.13, we introduced a unified file handling extractor. However, we kept the old extractors around for backwards compatibility. Starting with this release, we have removed the old extractors. Everything is now handled by the UnifiedFileExtractor extractor, which has been renamed to FileExtractor.

Check out the docs on it here.

Core Contributors to this feature include:

New Features

Object Storage APIs

Nodestream now has a new object storage abstraction. These APIs allow steps in your pipeline to interact with object storage to persist data between executions. We see incredible value in this feature as it allows for more complex pipelines to be built. We've implemented several features in this release that leverage this new abstraction.

You can persist objects locally or in the cloud via AWS S3. Like much of the framework, it is pluggable and can be extended to support other object storage providers.
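To make the idea concrete, here is a minimal sketch of what a pluggable object store could look like. The `ObjectStore` and `LocalDirectoryObjectStore` names and their `put`/`get`/`delete` methods are hypothetical illustrations, not Nodestream's actual API; an S3-backed store would implement the same three methods.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class ObjectStore(ABC):
    """Hypothetical interface: persist small blobs between pipeline runs."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class LocalDirectoryObjectStore(ObjectStore):
    """Stores each object as a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> Optional[bytes]:
        path = self.root / key
        return path.read_bytes() if path.exists() else None

    def delete(self, key: str) -> None:
        (self.root / key).unlink(missing_ok=True)


# Round-trip one object through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = LocalDirectoryObjectStore(Path(tmp))
    store.put("state", b"hello")
    roundtrip = store.get("state")
    store.delete("state")
    after_delete = store.get("state")
```

Because pipeline steps only depend on the abstract interface, swapping the backing store does not require changing the steps themselves.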

Check out the docs on it here.

Core Contributors to this feature include:

Extractor Checkpointing

Previously, extractors would always start from the beginning of their data source. If a pipeline crashed or was interrupted, the extractor would start from the beginning of the data source again. This led to duplicate data being extracted, processed, and inserted into the database.

With nodestream 0.14, extractors can now checkpoint their progress. This means that if a pipeline crashes or is interrupted, the extractor will start from where it left off. To do this, the extractor stores a checkpoint via its object storage. Therefore, in order to use this feature, you must have object storage configured during your pipeline execution. Checkpoints are cleared when a pipeline completes successfully.
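The resume behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not Nodestream's extractor API: `CheckpointingExtractor` is a hypothetical class, and a plain dictionary stands in for the configured object storage.

```python
import json
from typing import Dict, Iterator, List, Optional


class CheckpointingExtractor:
    """Hypothetical extractor that records how many records were handed off."""

    def __init__(self, records: List[str], storage: Dict[str, bytes]) -> None:
        self.records = records
        self.storage = storage  # stand-in for configured object storage

    def _load_offset(self) -> int:
        raw: Optional[bytes] = self.storage.get("checkpoint")
        return json.loads(raw)["offset"] if raw else 0

    def extract(self) -> Iterator[str]:
        for index in range(self._load_offset(), len(self.records)):
            yield self.records[index]
            # The consumer asked for more, so this record is done: save progress.
            self.storage["checkpoint"] = json.dumps({"offset": index + 1}).encode()
        # A successful run clears the checkpoint.
        self.storage.pop("checkpoint", None)


storage: Dict[str, bytes] = {}
source = ["a", "b", "c", "d"]

# First run: simulate a crash while processing "c".
processed = []
try:
    for record in CheckpointingExtractor(source, storage).extract():
        if record == "c":
            raise RuntimeError("simulated crash")
        processed.append(record)
except RuntimeError:
    pass

# Second run: picks up at the record that was in flight during the crash.
resumed = list(CheckpointingExtractor(source, storage).extract())
```

In this sketch the checkpoint is written only after a record has been fully handed off, so the record that was in flight during the crash is delivered again rather than lost.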

Curious how to implement this for your extractor? We've updated the tutorial on it here.

Core Contributors to this feature include:

Record Schema Enforcement

Nodestream now has a new record schema enforcement feature. Nodestream pipelines tend to be highly dependent on the schema of the data being processed. Depending on the data source, the schema can change over time. This can lead to pipelines failing or producing incorrect results.

With this new feature, you can now enforce a schema on your data. This means that if the schema of the data changes, the pipeline will skip or warn about the records that do not match the schema.

Not only that, but you can also use this feature to automatically infer the schema of your data and then enforce it.
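As a rough illustration of infer-then-enforce, the sketch below infers a flat schema from a sample record and then separates conforming records from the ones that would be skipped or warned about. The `infer_schema`, `conforms`, and `enforce` helpers are hypothetical and much simpler than a real schema system; they are not Nodestream's API.

```python
from typing import Any, Dict, Iterable, List, Tuple

Schema = Dict[str, type]


def infer_schema(samples: Iterable[Dict[str, Any]]) -> Schema:
    """Infer a flat field -> type mapping from the first sample record."""
    first = next(iter(samples))
    return {field: type(value) for field, value in first.items()}


def conforms(record: Dict[str, Any], schema: Schema) -> bool:
    """A record conforms when it has exactly the expected fields and types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )


def enforce(
    records: Iterable[Dict[str, Any]], schema: Schema
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into (kept, skipped) according to the schema."""
    kept: List[Dict[str, Any]] = []
    skipped: List[Dict[str, Any]] = []
    for record in records:
        (kept if conforms(record, schema) else skipped).append(record)
    return kept, skipped


records = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": "unknown"},  # wrong type for age
    {"name": "Linus"},  # missing field
]
schema = infer_schema(records[:1])
kept, skipped = enforce(records, schema)
```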

Check out the docs on it here.

Core Contributors to this feature include:

Nodestream 0.13 Release

August 9, 2024 · 5 min read

Zach Probst

Maintainer of Nodestream

We are happy to announce the release of Nodestream 0.13. This release includes a number of new features and improvements. However, its breaking changes are minimal, so upgrading should be a breeze.

Breaking Changes

Unified File Extractors

In the past, we had separate file extractors for local, remote, and S3 files. This was a bit cumbersome for a couple of reasons:

On the maintainability side, we had to make sure that all of these extractors were kept in sync with each other.
On the user side, it was limiting to have to choose between these extractors when the only difference was the location of the file.

Starting with this release, we have unified these extractors into a single UnifiedFileExtractor extractor. This extractor can handle local, remote, and S3 files (so functionality is not lost).

Check out the docs on it here.

Core Contributors to this feature include:

Changing Source Node Behavior With `properties`

In the past, the properties key automatically lower cased all string property values. This was because there was one set of normalization rules for the entire source node interpretation. However, this was a bit limiting because it was not possible to have different normalization rules for different properties.

Starting with this release, the properties key no longer automatically lower cases all string property values. Instead, you can now define normalization rules for keys and properties separately (via the key_normalization and property_normalization properties). However, if you specify the normalization key, it will apply to both keys and properties and will default to lower casing all string property values.

New Features

Squashing Migrations

In the past, migrations were applied one by one in order. If you were constantly iterating on a data model during development, you could end up constantly adding migrations. This resulted in a lot of migrations being applied that were essentially intermediary when going to production. As a result, the migration node count could get quite large with a lot of "messy" migrations.

Starting with this release, you can now squash migrations. This means that you can take a set of migrations and squash them into a single, optimized set of migrations. This can be useful for cleaning up the migration node count and making it easier to understand the data model. Additionally, the old migrations are still stored in the project, so you can always go back to them if you need to. If a database has partially applied a sequence of migrations that was squashed, we can't use the squashed migration. Instead, the logic will fall back to the original migrations.
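The fallback rule can be sketched as a small decision function. This is an assumption-laden illustration: `migrations_to_run` and the migration names are hypothetical, and the real logic may track more state than a set of applied names.

```python
from typing import List, Set


def migrations_to_run(original: List[str], squashed: str, applied: Set[str]) -> List[str]:
    """Pick which migrations still need to run against one database."""
    if squashed in applied or applied.issuperset(original):
        return []  # everything the squash covers is already in place
    if applied.isdisjoint(original):
        return [squashed]  # fresh database: take the optimized path
    # Partially applied: fall back to the remaining original migrations.
    return [name for name in original if name not in applied]


original = ["0001_initial", "0002_add_person", "0003_add_age"]

fresh = migrations_to_run(original, "0001_squashed", applied=set())
partial = migrations_to_run(original, "0001_squashed", applied={"0001_initial"})
complete = migrations_to_run(original, "0001_squashed", applied=set(original))
```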

Core Contributors to this feature include:

Check out the docs on it here.

Compressed File Handling

Many users have data stored in compressed files. This release adds support for compressed files in .gz and .bz2 format. This support is available in the UnifiedFileExtractor extractor.
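For intuition, extension-based decompression can be done with the Python standard library alone. The sketch below is not how the extractor is implemented; `open_maybe_compressed` is a hypothetical helper that picks an opener from the final file extension.

```python
import bz2
import gzip
import tempfile
from pathlib import Path
from typing import IO

# Map a file's final extension to the function that opens it.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_maybe_compressed(path: Path) -> IO[str]:
    """Open a file as text, decompressing based on its final extension."""
    opener = OPENERS.get(path.suffix, open)
    return opener(path, "rt", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.csv.gz"
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("name\nAda\n")
    with open_maybe_compressed(path) as handle:
        contents = handle.read()
```

Files without a recognized extension fall through to the built-in `open`, so uncompressed inputs keep working unchanged.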

Check out the docs on it here.

Core Contributors to this feature include:

Improved LLM Compatible Schema Printing

The llm format is a format that is used to represent the schema of a graph. This release adds improved support for printing the schema in a format that is compatible with an llm.

In short, it uses a cypher-esque syntax to represent the schema of the graph:

Node Types:
Person: first_name: STRING, last_name: STRING, last_ingested_at: DATETIME, age: STRING
Number: number: STRING, last_ingested_at: DATETIME
Relationship Types:
KNOWS: last_ingested_at: DATETIME
Adjancies:
(:Person)-[:KNOWS]->(:Person)
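To show how little is needed to produce lines in this style, here is a small sketch that renders one type line and one adjacency line. The `render_type` and `render_adjacency` helpers are hypothetical and only mirror the example output above; they are not the printer that ships with Nodestream.

```python
from typing import Dict


def render_type(name: str, properties: Dict[str, str]) -> str:
    """Render a node or relationship type as `Name: prop: TYPE, ...`."""
    rendered = ", ".join(f"{prop}: {kind}" for prop, kind in properties.items())
    return f"{name}: {rendered}"


def render_adjacency(source: str, relationship: str, target: str) -> str:
    """Render an adjacency using a cypher-esque pattern."""
    return f"(:{source})-[:{relationship}]->(:{target})"


number_line = render_type("Number", {"number": "STRING", "last_ingested_at": "DATETIME"})
knows_line = render_adjacency("Person", "KNOWS", "Person")
```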

Core Contributors to this feature include:

Improved Error Messages in Value Providers

Nodestream uses value providers to extract values from documents and map them to the graph. When you get an error in a value provider, it can be a bit tricky to debug. This release adds improved error messages to value providers to make it easier to debug issues.

Core Contributors to this feature include:

DynamoDB Extractor

DynamoDB is a popular NoSQL database that is used by many people to store data. This release adds support for DynamoDB as a first class citizen via the DynamoDBExtractor.

Check out the docs on it here.

Core Contributors to this feature include:

SQS and Queue Extractor Support

Many users have data stored in SQS and other queue services. This release adds support for SQS and other queue services via the QueueExtractor. Conceptually, this extractor is similar to the StreamExtractor but for queues.

Check out the docs on it here.

Core Contributors to this feature include:

Release Attestations

0.13 marks the first release where nodestream and all of its dependencies are signed and attested to via GitHub's Attestation support. This means that you can be sure that the code you are running is the code that was intended to be run.

Dependency Updates

This release includes updates to dependencies to keep Nodestream up to date with the latest and greatest. Some dependencies that were updated include:

httpx to >=0.27.0
uvloop to >=0.17.0, <=0.19.0 (Not installed/used on Python 3.13 due to compatibility issues)
numpy to >=2.0.0
pyarrow to 17.0.0
Python 3.13 has been added to the supported versions matrix.
A variety of other dependencies have had their supported versions widened to be more permissive.

Bug Fixes

Fixed a bug where schema inference was not working correctly in some cases (with switch interpretations).
Fixed a variety of bugs related to the pipeline runtime that were causing mishandled errors.

Migrations Design in Nodestream 0.12

May 14, 2024 · 7 min read

Zach Probst

Maintainer of Nodestream

In the release notes for Nodestream 0.12, we mentioned that we had added support for migrations. This is a feature that we have been wanting to add for a long time, and we are excited to finally have it in place. In this post, we will discuss what migrations are, why they are important, and how they work in Nodestream.

Evolutionary Database Design

Evolutionary database design is the idea that the database schema should evolve over time as the application changes. This is in contrast to the traditional approach of creating a fixed schema at the beginning of a project and then never changing it. With evolutionary database design, the schema is treated as a living document that can be updated and modified as needed. If you want to go deep into this topic, we recommend reading Martin Fowler's page on the subject.

Why Migrations?

Migrations are a way to manage the evolution of the database schema in a controlled and repeatable way. They allow you to define the changes that need to be made to the schema in a series of files that can be run in sequence. This makes it easy to track changes to the schema over time and to apply those changes to multiple environments, such as development, staging, and production.

Surveying All The Types of Schema Changes

Graph databases are schema-less, but the data model is still defined by the relationships between nodes and edges and the properties of those nodes and edges. This means that there is still a schema to manage, even if it is not as rigid as a traditional relational database schema. Since nodestream is agnostic to the underlying database, we need to be able to support migrations for all types of databases that nodestream can work with. Therefore, we need to support migrations that are designed against an abstract graph model and leave the implementation details to the specific database connector. So let's examine the types of schema changes that can exist in a graph database:

Creating New Node and Edge Types

The most basic type of schema change is creating new node and edge types. This is equivalent to creating a new table in a relational database. When you create a new node or edge type, you may need to define the properties that it will have and the relationships that it will have with other nodes and edges.

Depending on the underlying database, this might involve creating a new index or constraint to enforce the uniqueness of the new node or edge type.

Removing Node and Edge Types

Conversely, you may also need to remove existing node and edge types. This is equivalent to dropping a table in a relational database. Most graph databases do not support leaving orphaned nodes or edges, so you may need to delete all nodes and edges of the type that you are removing.

Adding Properties to Nodes and Edges

Another common type of schema change is adding properties to existing nodes and edges. This is equivalent to adding a new column to a table in a relational database. When you add a property to a node or edge, you may need to define a default value for that property or update existing nodes and edges to have a value for that property.

One tricky case is when you add a property that is part of the node's or edge's key. In this case, you may need to update the key of the node or edge to include the new property.

Removing Properties from Nodes and Edges

Conversely, you may also need to remove properties from existing nodes and edges. This is equivalent to dropping a column from a table in a relational database. When you remove a property from a node or edge, you may need to update existing nodes and edges to remove the value for that property.

Adding and Removing Indexes

Another common type of schema change is adding and removing indexes. Indexes are used to speed up queries by allowing the database to quickly find nodes and edges that match certain criteria. When you add an index, you may need to define the properties that the index will be based on and the type of index that will be used. When you remove an index, you may need to update existing indexes to remove the properties that the index was based on.

Topological Changes

Finally, you may need to make topological changes to the schema such as adding or removing relationships between nodes and edges. This is equivalent to adding or removing foreign keys in a relational database.

When you change the adjacency of nodes and edges, you may want to clean up the data to ensure that it is consistent with the new schema. This may involve updating existing nodes and edges to reflect the new relationships or deleting nodes and edges that are no longer needed.

How Migrations Work in Nodestream

In nodestream, migrations are defined as a series of yaml files that describe the changes that need to be made to the schema. Each migration file contains a list of operations that need to be performed, such as creating a new node type or adding a property to an existing node type.

When you run nodestream migrations make, nodestream will create a new migration file in the migrations directory. That process works roughly like this:

Nodestream will look at the current state of the schema by initializing and introspecting all pipelines (A).
Build the state of the schema that is represented by the current migrations (B).
Diff the two states (A and B) to determine the changes that need to be made to the schema.
Generate a new migration file that describes the changes that need to be made to the schema.
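The diff step above can be sketched with plain set arithmetic. This is a simplified illustration rather than nodestream's implementation: the schema state is reduced to node types and property names, and the operation names (`create_node_type`, `add_property`, and so on) are hypothetical.

```python
from typing import Dict, List, Set

SchemaState = Dict[str, Set[str]]  # node type name -> property names


def diff_states(pipelines: SchemaState, migrations: SchemaState) -> List[dict]:
    """Return operations that move the migrations state (B) to the pipelines state (A)."""
    operations: List[dict] = []
    for node_type in sorted(pipelines.keys() - migrations.keys()):
        operations.append(
            {
                "operation": "create_node_type",
                "name": node_type,
                "properties": sorted(pipelines[node_type]),
            }
        )
    for node_type in sorted(migrations.keys() - pipelines.keys()):
        operations.append({"operation": "drop_node_type", "name": node_type})
    for node_type in sorted(pipelines.keys() & migrations.keys()):
        for prop in sorted(pipelines[node_type] - migrations[node_type]):
            operations.append(
                {"operation": "add_property", "node_type": node_type, "name": prop}
            )
        for prop in sorted(migrations[node_type] - pipelines[node_type]):
            operations.append(
                {"operation": "drop_property", "node_type": node_type, "name": prop}
            )
    return operations


state_a = {"Person": {"name", "age"}, "Company": {"name"}}  # from pipelines (A)
state_b = {"Person": {"name"}}  # from existing migrations (B)
operations = diff_states(state_a, state_b)
```

The resulting list is what would be written into the new migration file; an empty list means the migrations already describe the current pipelines.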

When you run nodestream migrations run, nodestream will apply the migrations in sequence to evolve the schema. That process works roughly like this:

Nodestream reads the migration files into memory and builds a graph of the dependencies between the migrations.
Nodestream runs the migrations in topological order, applying the changes to the schema as it goes.
Nodestream keeps track of which migrations have been applied so that it can skip them in the future.
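The ordering and skipping steps above can be sketched with the standard library's `graphlib` module (Python 3.9+). The migration names and the `pending_in_order` helper are hypothetical; the sketch only shows the technique of running a dependency graph in topological order while skipping what has already been applied.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def pending_in_order(dependencies: Dict[str, Set[str]], applied: Set[str]) -> List[str]:
    """Order migrations so dependencies come first, skipping applied ones."""
    order = TopologicalSorter(dependencies).static_order()
    return [name for name in order if name not in applied]


# Each migration maps to the set of migrations it depends on.
dependencies = {
    "0001_initial": set(),
    "0002_add_person": {"0001_initial"},
    "0003_add_age": {"0002_add_person"},
}

to_apply = pending_in_order(dependencies, applied={"0001_initial"})
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependency graph contains a cycle, which is a useful failure mode for a misconfigured set of migrations.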

Crucially, nodestream does not track all possible schema changes. Topological changes are not tracked (see here), so you will need to handle those manually. Additionally, nodestream does not support rolling back migrations, so you will need to handle that manually as well.

Are Migrations Any Good?

Wondering what Martin Fowler would think of this design given his page on the subject? He describes the concept of "evolutionary database design" with a set of characteristics. Some of them are more organizational than technical.

However, some of the technical ones are:

All Database Artifacts are Version Controlled with Application Code: Nodestream's migrations are intended to be source controlled files that are run in sequence and define their dependencies. This makes it easy to evolve changes and continuously integrate them (which is another of the characteristics).
All Database Changes are Database Refactorings: Nodestream's migrations are a series of database refactorings that are run in sequence. This makes it easy to track changes to the schema over time and to apply those changes to multiple environments, such as development, staging, and production. We are detecting the changes that need to be made to the schema and applying them in a controlled and repeatable way.
Clearly Separate Database Access Code: You generally don't need to write database access code in nodestream, so this is taken care of 🎆
Automate the Refactorings: This is the main point of migrations. They are automated and can be run in sequence to evolve the schema.

We are happy with the design of the migrations in nodestream and we think that they are a good fit for the project. As we've mentioned, there are still some major evolutions to be made to migrations, such as the ability to rollback a migration, but we are confident that we are on the right track.

Nodestream Neptune Support

April 26, 2024 · 3 min read

Cole Greer

Neptune Plugin Maintainer

The recent release of Nodestream 0.12 has introduced support for Amazon Neptune as the first step towards broader multi-database support. Nodestream provides a flexible tool to perform bulk ETL into Amazon Neptune Database and Amazon Neptune Analytics.

This post will give a quick overview of the new Amazon Neptune support, offer some examples to get started, and list some features planned for future releases.

Overview

Support for AWS Neptune is split into two modes, DB and Analytics. Both modes leverage the AWS SDK to load data via batched openCypher queries. Nodestream is compatible with Neptune DB engine version 1.2.1.1 or higher, as well as Neptune Analytics.

Capabilities

Nodestream with Neptune currently supports standard ETL pipelines as well as time to live (TTL) pipelines. ETL pipelines enable bulk data ingestion into Neptune from a much broader range of data sources and formats than have previously been possible in Neptune.

Nodestream's TTL mechanism also enables new capabilities not previously available in Neptune. By annotating ingested graph elements with timestamps, Nodestream is able to create pipelines which automatically expire and remove data that has passed a configured lifespan.
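The expiry decision itself is a simple timestamp comparison. The sketch below is only an illustration of that idea: `expired_ids` is a hypothetical helper working on an in-memory mapping, whereas a real TTL pipeline would issue queries against the database.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_ids(
    last_ingested_at: Dict[str, datetime], ttl: timedelta, now: datetime
) -> List[str]:
    """Return ids of elements whose last ingestion is older than the lifespan."""
    cutoff = now - ttl
    return sorted(
        element_id for element_id, seen in last_ingested_at.items() if seen < cutoff
    )


now = datetime(2024, 4, 26, tzinfo=timezone.utc)
timestamps = {
    "person-1": now - timedelta(days=1),  # recently ingested, kept
    "person-2": now - timedelta(days=45),  # stale, expired
}
to_remove = expired_ids(timestamps, ttl=timedelta(days=30), now=now)
```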

Usage

Prerequisites

Neptune must be reachable from whichever environment you intend to run Nodestream. Both Neptune Database, as well as Neptune Analytics with a private endpoint, are restricted to VPC-only access. If you intend to use a Neptune Analytics graph with a public endpoint, no special considerations are required.

Check out the Neptune User-Guide for more information about connecting to a VPC-only host. You can test the connection with this curl command:

curl https://<NEPTUNE_ENDPOINT>:<PORT>/openCypher/status

IAM Auth

Nodestream fully supports IAM Authentication when connecting to Amazon Neptune, as long as credentials are properly configured. See the boto3 credentials guide for more instruction on correctly configuring credentials.

Configuration Examples

The connection configuration for Neptune contains a switch between two modes: database and analytics. Database mode will connect using a Neptune Database cluster or instance endpoint, while analytics mode will connect via the graph identifier.

Neptune Database:

targets:
  db-one:
    database: neptune
    mode: database
    host: https://<NEPTUNE_ENDPOINT>:<PORT>

Neptune Analytics:

targets:
  db-one:
    database: neptune
    mode: analytics
    graph_id: <GRAPH_IDENTIFIER>

Check out the Nodestream basics tutorial for a complete guide to getting started with Nodestream and Neptune.

Future Roadmap

We have several new features planned to bring additional functionality in upcoming releases.

One feature we are excited to bring to the Nodestream Neptune plugin is support for the new Nodestream Migrations API. Some migrations are not applicable in Neptune as it does not use user-defined indices. However, support for applicable migrations, such as renaming properties, will be added in an upcoming release.

We are additionally planning to add expanded datatype support. Currently, the Neptune plugin supports string, boolean, and numeric types. Datetime types are automatically converted into epoch timestamps. We aim to expand this list such that any extracted types which are supported by Neptune can be loaded without casting or conversion.
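As an illustration of the datetime conversion described above, here is a minimal sketch. The `to_loadable_value` helper is hypothetical, and the choice of whole seconds (rather than milliseconds) is an assumption made for this example, not a statement about what the plugin emits.

```python
from datetime import datetime, timezone
from typing import Any


def to_loadable_value(value: Any) -> Any:
    """Convert datetimes to epoch seconds; pass other supported types through."""
    if isinstance(value, datetime):
        return int(value.timestamp())  # assumption: whole seconds since the epoch
    return value


one_day_after_epoch = to_loadable_value(datetime(1970, 1, 2, tzinfo=timezone.utc))
unchanged = to_loadable_value("already a string")
```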

Our future work will also include further performance assessments and optimizations. We will continue to optimize the generated queries in order to maximize the performance and scalability of Nodestream with Neptune.

Get Involved

The inclusion of new features is heavily dependent on community feedback. If there are any additional features or configurations which you would find valuable, please create an issue on GitHub with the request.

Nodestream 0.12 Release

April 5, 2024 · 5 min read

Zach Probst

Maintainer of Nodestream

We are happy to announce the release of Nodestream 0.12. This release marks the largest update to Nodestream since its inception. We've spent a lot of time improving the core of nodestream and we're excited to share it with you.

Before we get into the details, we want to thank the community for their support and feedback. As such, we have completely revamped the documentation to make it easier to use and navigate. This release comes with two headline features: Database Migrations and Multi-Database Support.

Major Features

Database Migrations

In the past, nodestream attempted to automatically create indexes and constraints on the database based on your pipeline at runtime. This was done by introspecting the schema of the entire project and generating the appropriate queries to create the indexes and constraints. This was a very powerful feature but it had a few drawbacks:

It was redundant. The same indexes and constraints were being created with IF NOT EXISTS clauses every time the pipeline was run.
It was slow. The queries were being executed serially and the pipeline was locked until they were all completed.
It was error prone. If the database was not in a state that allowed for the creation of the indexes and constraints, the pipeline would fail.
It was high friction. There was no way to refactor the database without manual intervention. If the schema changed, the pipeline would fail and the user would have to manually remove the indexes, constraints, and sometimes data before running the pipeline again.

To address these issues, nodestream 0.12 has introduced the concept of migrations. Migrations are a way of encapsulating changes to the database schema in a way that can be applied incrementally. Conceptually, they are similar to the migrations in the Django, Rails, Neo4j Migrations, and Flyway frameworks.

Migrations are defined in a directory called migrations in the root of your project. Each migration is a yaml file that contains data about the migration and its dependencies. You can create migrations by running the nodestream migrations make command.

Check out the changes to the tutorial on Database Migrations as well as the new tutorial on Working With Migrations to learn more.

Core Contributors to this feature include:

Multi-Database Support

Prior to this release, the only database that was supported was neo4j. While this is a category leading database, the goal of nodestream is to be database agnostic and afford developers the ability to use the database or databases that best fit their needs. As such, we are happy to announce that nodestream now supports Amazon Neptune and Amazon Neptune Analytics. To accommodate that, we have moved the neo4j database connector into a separate package called nodestream-plugin-neo4j and added a new package called nodestream-plugin-neptune.

Starting with this release, you can use the --database flag to generate Neptune boilerplate configuration.

Check out the docs on it here.

Core Contributors to this feature include:

Other Features

Parquet Support

Many customers have data stored in parquet format. Parquet is a columnar storage format that is optimized for reading and writing large datasets. We are happy to announce that nodestream now supports parquet as a first class citizen.

Check out the docs on it here.

Core Contributors to this feature include:

Include Properties From Maps

In the past, each property you wanted to include in the pipeline had to be explicitly defined in the pipeline configuration. This was a bit cumbersome and error prone. Starting with this release, you can now include all properties by defining an expression that returns a map at the properties key directly instead of a mapping of property names to expressions.

Here are two examples, one using the source_node interpretation and one using the properties interpretation:

- type: source_node
  node_type: User
  key:
    email: !jmespath email
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_trim_whitespace: true

- type: properties
  properties: !jmespath path.to.properties.mapping
  normalization:
    do_lowercase_strings: true

Check out the docs on it here.

Core Contributors to this feature include:

Performance Improvements

We've made a small number of performance improvements to the core of nodestream that should result in faster processing times and lower memory usage. Most notably, we've cached the last_ingested_at timestamp for nodes and relationships to reduce the number of times we create objects in memory. We've observed a 10% improvement in processing times and a 5% reduction in memory usage in our testing.

Core Contributors to this feature include:

Software Vulnerability Analysis using SBOMs, Amazon Neptune, and Nodestream

April 5, 2024 · 8 min read

Dave Bechberger

Neptune Plugin Architect and SBOM Plugin Creator

Note: Both the Nodestream Neptune and Nodestream SBOM plugins are currently preview releases

Recently (March 2024), a severe vulnerability was found to have been added to a common library, XZ Utils. Unfortunately, serious software vulnerabilities are not isolated incidents: in late 2021, a critical security vulnerability was discovered in a commonly used logging library, Log4j. While the origins of the issues differ (Log4j was an oversight, while XZ was an explicit backdoor), the outcome for users was the same in the end. Once each vulnerability was known, companies and individuals spent many hours combing through countless applications, looking for and patching systems running vulnerable versions of the software.

As this effort was ongoing, many were asking, "Isn't there a better way to track this information?"

In this post, we will discuss the work we have been doing around creating a plugin for Nodestream that provides a unified graph model for SBOM ingestion and analysis. We will combine this with the plugin for Amazon Neptune to demonstrate how you can find insights for software vulnerabilities in application stacks. Let’s first talk a bit about what an SBOM is and why you should use a graph for analysis.

What is a Software Bill of Materials (SBOM) and why use Graphs

A software bill of materials (SBOM) is a critical component of software development and management, helping organizations to improve the transparency, security, and reliability of their software applications. An SBOM acts as an "ingredient list" of libraries and components of a software application that:

Enables software creators to track dependencies within their applications
Provides security personnel the ability to examine and assess the risk of potential vulnerabilities within an environment
Provides legal personnel with the information needed to assure that a particular software is in compliance with all licensing requirements.

A software bill of materials (SBOM) is a comprehensive list of the components, libraries, and dependencies used in a software application or system. It provides a detailed breakdown of the software's architecture, including the names, versions, licenses, and optionally the vulnerabilities of each component and describes the complex dependencies and relationships between components of a software system, including multi-level hierarchies and recursive relationships.

Graphs are excellent for modeling these kinds of interconnected relationships, with nodes representing components and edges representing dependencies and relationships between these components. Graph data structures handle recursive relationships very naturally, making it easy to analyze networks and flows. Using graph algorithms and metrics allows you to analyze and identify critical components and dependencies, single points of failure, security vulnerabilities, license compatibilities, etc. for use cases such as:

Dependency graphs - These show how different components in the software relate to and depend on each other. Graphs make these complex relationships easier to visualize.
Vulnerability Graphs - Graphs make it easy to determine and assign associated risks with different vulnerabilities to prioritize fixing known issues.
Supply chain graphs - SBOMs trace the components and dependencies up the software supply chain. Graphs can illustrate the flow of open-source components from lower-level suppliers up to the final product. This helps identify vulnerabilities or licensing issues in the supply chain.

How to use Graphs for SBOM analysis

While using graphs to assist with SBOM analysis is not new, it also has not been trivial to get the data loaded in due to differing formats, with the two most popular being CycloneDX and SPDX. To assist with the data loading and analysis, I recently worked on an SBOM plugin for Nodestream to provide a simple way to load SBOMs into an opinionated graph data model from local files, GitHub, or Amazon Inspector. Nodestream is a Python framework for performing graph database ETL. The SBOM plugin extends this framework to provide a

Loading SBOM Data into our Graph

To get started loading your SBOM files into Amazon Neptune, we first need to set up an Amazon Neptune Analytics Graph as well as a Neptune Notebook to perform our analysis. To configure a Neptune Analytics Graph you can follow the documentation here: https://docs.aws.amazon.com/neptune-analytics/latest/userguide/create-graph-using-console.html

Neptune Notebooks is a managed open-source graph-notebook project that provides a plethora of Jupyter extensions and sample notebooks that make it easy to interact with and learn to use a Neptune Analytics graph. This can be configured using the documentation here: https://docs.aws.amazon.com/neptune-analytics/latest/userguide/notebooks.html

Now that we have set up our database and analysis environment, we next need to install the Nodestream plugins for Neptune and SBOM.

pip install -q pyyaml nodestream-plugin-neptune nodestream-plugin-sbom

With those plugins installed, all we need to do is set our configuration in the nodestream.yaml file as shown below. In this example, we are going to load the SBOM files for Nodestream, the Nodestream Neptune Plugin, and the Nodestream SBOM plugin into our database, directly from GitHub.

plugins:
- name: sbom
  config:
    repos: [nodestream-proj/nodestream, nodestream-proj/nodestream-plugin-sbom, nodestream-proj/nodestream-plugin-neptune]
targets:
  my-neptune:
    database: neptune
    graph_id: g-<GRAPH ID>
    mode: analytics

With our configuration set up, we can run the import using the following command:

nodestream run sbom_github --target my-neptune

After we run the data load, we get a graph that is similar to the image below.

SBOM Model Overview

What does our graph look like?

Let’s take a look at the types of data that we are storing in our graph. The plugin uses the opinionated graph data model shown below to represent SBOM data files.

SBOM Graph schema

This model contains the following elements:

Node Types

Document - This represents the SBOM document as well as the metadata associated with that SBOM.
Component - This represents a specific component of a software system.
Reference - This represents a reference to any external system which the system wanted to include as a reference. This can range from package managers, URLs to external websites, etc.
Vulnerability - This represents a specific known vulnerability for a component.
License - The license for the component or package.

Edge Types

DESCRIBES/DEPENDS_ON/DEPENDENCY_OF/DESCRIBED_BY/CONTAINS - This represents the type of relationship between a Document and a Component in the system.
REFERS_TO - This represents a reference between a Component and a Reference.
AFFECTS - This represents that a particular Component is affected by the connected Vulnerability.

The properties associated with each element will vary depending on the input format used, and the optional information contained in each file.

Analyzing SBOMs

Now that we have our data loaded into our graph, the next step is to start to extract insights into what is actually important in our SBOM data.

One common use case is to investigate shared dependencies across projects. Shared dependencies allow development and security teams to better understand the security posture of the organization through identification of shared risks. Let's start by taking a look at the most shared dependencies between these projects using the query below.

MATCH (n:Component)
WHERE exists(n.name)
CALL neptune.algo.degree(n, {traversalDirection: 'inbound', edgeLabels: ['DEPENDS_ON']})
YIELD node, degree
RETURN node.name, degree
ORDER BY degree DESC
LIMIT 10

Running this query will show us that there are quite a few dependencies that are shared across all three projects. To do this analysis, we used a graph algorithm known as Degree Centrality, which counts the number of edges connected to a node. This measure of how connected the node is can in turn indicate the node's importance and level of influence in the network.

Results

Running the query below shows us that there are 31 Components that are shared across all the projects.

MATCH (n:Component)
WHERE exists(n.name)
CALL neptune.algo.degree(n, {traversalDirection: 'inbound', edgeLabels: ['DEPENDS_ON']})
YIELD node, degree
WHERE degree=3
RETURN count(node)
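To make the degree calculation concrete, here is a minimal plain-Python sketch of inbound degree over DEPENDS_ON edges. The documents and components are hypothetical toy data, not the real SBOM contents, and the code only mirrors the idea behind the queries above; it does not call Neptune.

```python
from collections import Counter

# Hypothetical toy data: (source, edge_label, target) triples.
edges = [
    ("doc-a", "DEPENDS_ON", "requests"),
    ("doc-b", "DEPENDS_ON", "requests"),
    ("doc-c", "DEPENDS_ON", "requests"),
    ("doc-a", "DEPENDS_ON", "pyyaml"),
    ("doc-b", "DEPENDS_ON", "pyyaml"),
    ("doc-c", "DEPENDS_ON", "boto3"),
    ("doc-a", "DESCRIBES", "boto3"),  # ignored: not a DEPENDS_ON edge
]


def inbound_degree(edges, edge_label):
    """Count edges with the given label that point at each target node."""
    return Counter(target for _, label, target in edges if label == edge_label)


degree = inbound_degree(edges, "DEPENDS_ON")

# Equivalent of ORDER BY degree DESC LIMIT 10.
top_ten = degree.most_common(10)

# Equivalent of WHERE degree = 3: components depended on by all three documents.
shared_by_all = sorted(name for name, d in degree.items() if d == 3)

print(top_ten)
print(shared_by_all)
```

In this toy graph a degree of 3 means "shared by all three projects" only because each document has at most one DEPENDS_ON edge to any given component.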

Since this is a closely connected group of projects, it is no surprise that there are many shared components. One of the strengths of graphs is the ability to visualize the connectedness between data, so let’s take a look at how they are connected.

MATCH (n:Component)
WHERE exists(n.name)
CALL neptune.algo.degree(n, {traversalDirection: 'inbound', edgeLabels: ['DEPENDS_ON']})
YIELD node, degree
WHERE degree = 3
WITH node, degree
MATCH p=(node)-[]-()
RETURN p

Results

Another common use case is to investigate licensing across multiple projects. This sort of investigation benefits from the connectedness of the graph, which lets us find how component licenses relate to each other. Let’s take a look at what other licenses are associated with the lgpl-2.1-or-later licensed components.

MATCH p=(l:License)<-[:LICENSED_BY]-(:Component)<-[:DEPENDS_ON]-(:Document)
-[:DEPENDS_ON]->(:Component)-[:LICENSED_BY]->(l2)
WHERE l.name = 'lgpl-2.1-or-later' and l<>l2
RETURN DISTINCT l2.name

Results

As we see, there are quite a few other licenses used in these projects. We can leverage the visual nature of graph results to gain some insight into how components are connected. In this case, let’s see how components licensed under lgpl-2.1-or-later are connected to components licensed under the unlicense.

MATCH p=(l:License)<-[:LICENSED_BY]-(:Component)<-[:DEPENDS_ON]-(:Document)
-[:DEPENDS_ON]->(:Component)-[:LICENSED_BY]->(l2)
WHERE l.name = 'lgpl-2.1-or-later' and l<>l2
RETURN DISTINCT l2.name

Results

We see that there exists one path in our graph between these two licenses.
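The traversal used in the license queries above can be sketched in plain Python: start from a license, walk back to the components that carry it, up to the documents that depend on those components, and then out to the licenses of the other components in the same document. The data here is hypothetical, and the sketch assumes each license name identifies a single License node.

```python
# Hypothetical toy data, not the real SBOM contents.
depends_on = {  # Document -> Components it DEPENDS_ON
    "doc-a": ["comp-1", "comp-2", "comp-4"],
    "doc-b": ["comp-3"],
}
licensed_by = {  # Component -> Licenses it is LICENSED_BY
    "comp-1": ["lgpl-2.1-or-later"],
    "comp-2": ["mit"],
    "comp-3": ["apache-2.0"],
    "comp-4": ["unlicense"],
}


def co_occurring_licenses(start, depends_on, licensed_by):
    """Licenses of components that share a document with a `start`-licensed component."""
    found = set()
    for components in depends_on.values():
        for first in components:
            if start not in licensed_by.get(first, ()):
                continue
            for second in components:
                if second == first:
                    # A single DEPENDS_ON edge is not reused within one pattern match.
                    continue
                found.update(
                    name for name in licensed_by.get(second, ()) if name != start
                )
    return sorted(found)


result = co_occurring_licenses("lgpl-2.1-or-later", depends_on, licensed_by)
print(result)
```

Here apache-2.0 is not reported because doc-b contains no lgpl-2.1-or-later component, which is the same filtering the graph pattern performs through the shared Document node.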

Next Steps

As we have seen, using graphs to perform analysis of SBOM data can be a powerful tool in your toolbox to gain insights into the connections between software projects. What I have shown here is only the beginning of the types of analysis you can perform with this data. For a more detailed walkthrough of using graphs for SBOM analysis, I recommend taking a look at the following notebooks:

Welcome

March 30, 2024 · One min read

Zach Probst

Maintainer of Nodestream

Welcome to the new nodestream documentation and project site! We are excited to share with you the new features and improvements we have been working on.

We have been working hard to improve the documentation and make it easier to use and navigate. We have also been working on improving the project site to make it easier to find the information you need.

We hope you find the new documentation and project site helpful and easy to use!

By the way, thanks to the Docusaurus team for creating such a great tool!

If you have any questions or feedback, please feel free to reach out to us on GitHub!
