Configuring Projects and Pipelines
How to manage the configuration of your nodestream project.
Fundamentals
A nodestream pipeline is declarative. As a result, the pipeline tends to be a combination of configuration and logic. While the logic is likely to remain static, the configuration is likely to change between environments.
For example, you may have a pipeline that reads from an API endpoint in development, but a different API endpoint in production. Ideally, you would be able to inject the configuration for the API endpoint into the pipeline at runtime.
Argument Resolvers
Argument resolvers solve the problem of injecting configuration into a pipeline.
They are fundamental to the nodestream DSL.
Each argument resolver is a YAML tag that is resolved to a configuration value at runtime.
For example, the `!env` argument resolver resolves to the value of an environment variable.
Configuring Pipelines
Injecting Environment Variables with !env
As mentioned, the `!env` argument resolver resolves to the value of an environment variable.
This is useful for injecting configuration into a pipeline at runtime.
Let's take our example from earlier. Imagine the start of our pipeline file looks like this:
```yaml
- implementation: nodestream.pipeline.extractors:SimpleApiExtractor
  arguments:
    url: https://example.com/people
```
In development, we want to use a different API endpoint.
We can use the `!env` argument resolver to inject the configuration at runtime.
```yaml
- implementation: nodestream.pipeline.extractors:SimpleApiExtractor
  arguments:
    url: !env API_ENDPOINT
```
Now, we can set the `API_ENDPOINT` environment variable to `https://example.com/people` in production, and `https://example.com/people-dev` in development.
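To make the mechanics concrete, the behavior of an `!env`-style tag can be sketched with plain PyYAML. This is an illustrative sketch only, not nodestream's actual implementation; the `EnvLoader` class and helper names are invented for the example.

```python
import os
import yaml  # PyYAML

# Illustrative sketch only (not nodestream's implementation): a YAML
# constructor that resolves !env tags to environment variable values.
class EnvLoader(yaml.SafeLoader):
    pass

def _resolve_env(loader, node):
    # The scalar following the tag is the environment variable name.
    return os.environ[loader.construct_scalar(node)]

EnvLoader.add_constructor("!env", _resolve_env)

os.environ["API_ENDPOINT"] = "https://example.com/people-dev"
doc = yaml.load("url: !env API_ENDPOINT", Loader=EnvLoader)
print(doc["url"])  # https://example.com/people-dev
```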
NOTE: If you are looking to inject environment variables from a `.env` file, you can use the `nodestream-plugin-dotenv` package to load the `.env` file and inject the environment variables into the process. Read more here.
Injecting File Contents with !include
The `!include` argument resolver resolves to the contents of a file.
This is most useful for injecting large configuration blobs into a pipeline and for sharing interpretations between pipelines.
For example, imagine we have two pipelines that share a common set of interpretations. We can extract the interpretations into a file and include it in both pipelines.
```yaml
- implementation: nodestream.interpreting:Interpreter
  arguments:
    interpretations: !include common/interpretations.yaml
```
The path is relative to the project root, not the pipeline file.
Injecting Configuration with !config
The `!config` argument resolver resolves to the value of a configuration key.
`!config` is useful for injecting configuration by a specified name and allows for a more structured configuration.
The Project Configuration section explains how to set configuration in the project.
Configuration can then be used inline in the pipeline.
```yaml
- implementation: nodestream.pipeline.extractors:SimpleApiExtractor
  arguments:
    url: !config api_endpoint
```
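The resolution step can be pictured as a lookup: the key after `!config` is looked up in the project's configuration mapping. The sketch below is illustrative only, not nodestream's implementation; the `ConfigLoader` class and the `SCOPE_CONFIG` mapping are invented for the example.

```python
import yaml  # PyYAML

# Hypothetical configuration mapping, as might be defined at the scope
# level in nodestream.yaml (see the Scope Level Configuration section).
SCOPE_CONFIG = {"api_endpoint": "https://example.com/api"}

# Illustrative sketch only (not nodestream's implementation): a YAML
# constructor that resolves !config tags against the mapping above.
class ConfigLoader(yaml.SafeLoader):
    pass

def _resolve_config(loader, node):
    return SCOPE_CONFIG[loader.construct_scalar(node)]

ConfigLoader.add_constructor("!config", _resolve_config)

doc = yaml.load("url: !config api_endpoint", Loader=ConfigLoader)
print(doc["url"])  # https://example.com/api
```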
Annotations
Annotations are a way to accomplish two tasks:
- Add metadata to a pipeline
- Conditionally include or exclude parts of a pipeline
Annotations are optionally added to each step in a pipeline. They are a list of strings that are effectively tags for the step.
For example, let's say in a live environment we want to pull data from an API, but in a test environment we want to pull data from a file. We can use annotations to conditionally include the appropriate step.
```yaml
- implementation: nodestream.pipeline.extractors:SimpleApiExtractor
  arguments:
    url: !env API_ENDPOINT
  annotations:
    - live
- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
      - !env FILE_PATH
  annotations:
    - test
```
Then, when we want to run the pipeline in a live environment, we can use the `--annotation` flag to include only the steps with the `live` annotation.
```shell
nodestream run my-pipeline --annotation live
```
Note that steps with no annotations are always included, and if you supply multiple annotations, steps that match any of them are included.
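That inclusion rule can be expressed as a small predicate. This is an illustrative sketch of the rule as described above, not nodestream's actual code; the function name is invented.

```python
# Illustrative sketch (not nodestream's code) of the rule described
# above: a step runs when it has no annotations, or when any of its
# annotations matches one supplied via --annotation.
def step_included(step_annotations, requested_annotations):
    if not step_annotations:
        return True  # unannotated steps are always included
    return any(a in requested_annotations for a in step_annotations)

print(step_included([], {"live"}))        # True
print(step_included(["live"], {"live"}))  # True
print(step_included(["test"], {"live"}))  # False
```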
Configuring Projects
Annotations
Annotations are key-value pairs that are used to add metadata to a project.
They are useful for adding metadata to a project that can be used by plugins or other tools.
For example, you may want to add a `schedule` annotation to a project to indicate how often the pipelines in the project should be run by some scheduler.
Annotations are displayed when you run `nodestream show` and are especially useful in `nodestream show --json` for programmatic access to the metadata.
Targets
A target is a destination graph database for the output of a pipeline. For each target, you can specify the connection details and credentials. The configuration depends heavily on the target database type, so it is best to refer to the documentation for the target database you are using here.
```yaml
# nodestream.yaml
targets:
  my-target:
    database: neo4j # The type of database; neo4j, neptune, etc.
    # ... Connection details and credentials for your target database
```
In other parts of the project, or from the CLI, you can refer to the target by its name, `my-target` in this case.
Delaying Evaluation with !delayed
The `!delayed` argument resolver delays evaluation of the argument until the consuming component (if it supports delayed arguments) is ready to evaluate it.
This is useful for injecting configuration that is not available at pipeline-definition time, or for allowing the component to re-evaluate the argument at runtime.
For instance, the neo4j connector can accept `!delayed` for the username and password arguments.
```yaml
targets:
  my-target:
    database: neo4j
    uri: bolt://localhost:7687
    username: !delayed
      value: !env MY_NEO4J_USERNAME
    password: !delayed
      value: !env MY_NEO4J_PASSWORD
```
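The idea behind delayed evaluation can be sketched as handing the component a callable instead of a plain value, so the component resolves it only when needed and can resolve it again later. This is an illustration of the concept, not nodestream's implementation; the `Delayed` class is invented for the example.

```python
import os

# Illustrative sketch of delayed evaluation (not nodestream's
# implementation): wrap the resolution in a zero-argument callable that
# the consuming component invokes when it actually needs the value.
class Delayed:
    def __init__(self, resolve):
        self._resolve = resolve

    def get(self):
        # Re-resolves on every call, so a changed value is picked up.
        return self._resolve()

username = Delayed(lambda: os.environ["MY_NEO4J_USERNAME"])

# The variable can be set (or changed) after the pipeline is defined.
os.environ["MY_NEO4J_USERNAME"] = "neo4j"
print(username.get())  # neo4j
```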
Scopes And Pipelines
A nodestream project is a collection of pipelines. Pipelines are organized into one or more scopes. Conceptually, you can think of a scope as a namespace or folder for pipelines.
For example, you may have a scope for all of your data warehouse pipelines, and another scope for all of your web scraping pipelines.
Another common pattern is to have a scope for each way the pipelines are used, such as `cron` and `perpetual` to represent pipelines that run on a schedule and pipelines that are always running, respectively.
Regardless of how you choose to organize your pipelines, scopes are a way to group pipelines together and nodestream does not enforce any particular organization.
Simple Configuration
```yaml
scopes:
  cron:
    pipelines:
      - pipelines/warehouse.yaml
      - pipelines/web-scraping.yaml
  perpetual:
    pipelines:
      - pipelines/kafka-consumer.yaml
```
Scope Level Configuration
Configuration (that can be used via `!config`) can be set at the scope level.
This is useful for setting configuration that is shared across multiple pipelines in a scope.
```yaml
scopes:
  cron:
    config:
      api_endpoint: https://example.com/api
    pipelines:
      - pipelines/warehouse.yaml
      - pipelines/web-scraping.yaml
  perpetual:
    config:
      api_endpoint: https://example.com/api
    pipelines:
      - pipelines/kafka-consumer.yaml
```
Note that configuration can use other argument resolvers, such as `!env` and `!include`, to inject environment variables and file contents into the configuration.
For example:
```yaml
scopes:
  cron:
    config:
      api_endpoint: !env API_ENDPOINT
    pipelines:
      - pipelines/warehouse.yaml
      - pipelines/web-scraping.yaml
  # ...
```
Scope Level Annotations
Annotations can be set at the scope level. The effect is that all pipelines in the scope inherit the annotations.
```yaml
scopes:
  cron:
    annotations:
      schedule: "0 0 * * *"
    pipelines:
      - pipelines/warehouse.yaml
      - pipelines/web-scraping.yaml
```
Scope Level Targets
Targets can be set at the scope level. The effect is that all pipelines in a scope implicitly write to the target(s) specified at the scope level.
```yaml
scopes:
  cron:
    targets:
      - my-target
    pipelines:
      - pipelines/warehouse.yaml
      - pipelines/web-scraping.yaml
```
Naming Pipelines
By default, the name of the pipeline is the name of the file without the `.yaml` extension.
You can override the name of the pipeline by specifying the `name` key in the pipeline configuration.
```yaml
scopes:
  cron:
    pipelines:
      - name: my-warehouse-pipeline
        path: pipelines/warehouse.yaml
```
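The default naming rule corresponds to taking the file path's stem, which can be seen with the standard library:

```python
from pathlib import Path

# The default pipeline name is the file name without its extension,
# i.e. the path's stem.
default_name = Path("pipelines/warehouse.yaml").stem
print(default_name)  # warehouse
```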
Pipeline Level Annotations
Annotations can be set at the pipeline level. This is useful for adding metadata to a pipeline that is not shared with other pipelines in the scope.
```yaml
scopes:
  cron:
    pipelines:
      - path: pipelines/warehouse.yaml
        annotations:
          schedule: "0 0 * * *"
      - path: pipelines/web-scraping.yaml
        annotations:
          schedule: "0 10 * * *"
```
Pipeline Level Targets
Targets can be set at the pipeline level. This is useful for specifying that a pipeline writes to a different target than the one specified at the scope level.
```yaml
scopes:
  cron:
    pipelines:
      - path: pipelines/warehouse.yaml
        targets:
          - my-other-target
      - path: pipelines/web-scraping.yaml
```
You can also specify that a pipeline does not inherit the targets from its scope by setting the `exclude_inherited_targets` key.
```yaml
scopes:
  cron:
    targets:
      - my-target
    pipelines:
      - pipelines/web-scraping.yaml
      - path: pipelines/warehouse.yaml
        exclude_inherited_targets: true
        targets:
          - my-other-target
```
In this case, the `warehouse` pipeline writes to `my-other-target` and the `web-scraping` pipeline writes to `my-target`.
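The inheritance rules above can be summarized in a small sketch. This is illustrative only, not nodestream's code; the function name is invented. A pipeline's effective targets are its own targets combined with the scope's, unless `exclude_inherited_targets` is set.

```python
# Illustrative sketch (not nodestream's code) of target inheritance:
# pipeline-level targets are combined with scope-level targets unless
# exclude_inherited_targets is set on the pipeline.
def effective_targets(scope_targets, pipeline_targets=(),
                      exclude_inherited_targets=False):
    own = list(pipeline_targets)
    if exclude_inherited_targets:
        return own
    return own + [t for t in scope_targets if t not in own]

# The warehouse pipeline from the example above:
print(effective_targets(["my-target"], ["my-other-target"],
                        exclude_inherited_targets=True))  # ['my-other-target']
# The web-scraping pipeline inherits the scope target:
print(effective_targets(["my-target"]))                   # ['my-target']
```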
Project Plugins
Specifying Plugin Configuration
In nodestream, some plugins can be used to compose new scopes and pipelines into your project.
The `!config` YAML tag, followed by a key, can be used to provide end-user-configurable values for project plugin pipelines.
For example, in a project plugin pipeline, you can use the `!config` tag to specify the username and password for the pipeline.
```yaml
# pipeline-inside-project-plugin.yaml
- implementation: myPlugin.testPipeline:SomePluginExtractor
  arguments:
    username: !config 'service_username'
    password: !config 'service_password'
```
In the above example, the `!config` tag is used to specify the username and password for the pipeline.
End users can provide values for the plugin's `!config` tags in their nodestream.yaml file.
This feature is particularly useful for supplying user-provided information such as URLs and credentials.
```yaml
plugins:
  - name: myPlugin
    config:
      service_username: !env MY_TEST_PLUGIN_USERNAME
      service_password: !env MY_TEST_PLUGIN_PASSWORD
  - name: otherPlugin
    config:
      service_base_url: "https://otherurl.com"
```
Specifying Plugin Annotations
Plugins can also be annotated with metadata, just as your own pipelines and scopes can.
The `annotations` key can be used to specify metadata for a plugin.
```yaml
plugins:
  - name: myPlugin
    annotations:
      schedule: "0 0 * * *"
```
Specifying Plugin Targets
Plugins can also be configured to write to specific targets.
The `targets` key can be used to specify the targets that the plugin writes to.
```yaml
plugins:
  - name: myPlugin
    targets:
      - my-target
```
Advanced Plugin Configuration
Just as with scopes and pipelines, you can use a complex configuration for plugins. For example, this uses a combination of `config`, `annotations`, and `targets` for a plugin, as well as the `exclude_inherited_targets` key.
```yaml
plugins:
  - name: myPlugin
    config:
      service_base_url: "https://otherurl.com"
    targets:
      - target1
      - target2
    pipelines:
      - name: plugin_pipeline_1
        exclude_inherited_targets: True
        annotations:
          my_annotation: True
        targets:
          - target3
```