MovieLens Recommendation Engine

Introduction

The MovieLens dataset is a classic dataset for recommendation systems. It contains a set of ratings given by a set of users to a set of movies. The dataset is available in several sizes; we will use the smallest one, which contains 100,000 ratings.

Data

The data is available here as a zip file. When you unzip it, you will find the three csv files we need: ratings.csv, movies.csv, and tags.csv. We'll ingest all three of these files into our graph.

ratings.csv

The ratings.csv file contains the following columns:

  • userId: the id of the user who rated the movie (an integer)
  • movieId: the id of the movie that was rated (an integer)
  • rating: the rating given by the user (a float)
  • timestamp: the time when the rating was given (an integer)

movies.csv

The movies.csv file contains the following columns:

  • movieId: the id of the movie (an integer)
  • title: the title of the movie (a string)
  • genres: the genres of the movie (a string), separated by |

tags.csv

The tags.csv file contains the following columns:

  • userId: the id of the user who tagged the movie (an integer)
  • movieId: the id of the movie that was tagged (an integer)
  • tag: the tag given by the user (a string)
  • timestamp: the time when the tag was given (an integer)
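To make the formats concrete, the first lines of each file look roughly like this. The header rows match the columns listed above; the data rows are illustrative examples, not exact values from the dataset:

ratings.csv
userId,movieId,rating,timestamp
1,296,4.0,964982703

movies.csv
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

tags.csv
userId,movieId,tag,timestamp
2,60756,funny,1445714994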

Getting Started

First, we need to create a new nodestream project:

nodestream new movies
nodestream remove default sample # remove the default sample pipeline

Then, we'll copy the data files into a new data directory inside our project.

mkdir -p movies/data
cp path/to/ratings.csv path/to/movies.csv path/to/tags.csv movies/data

Now we can cd into the project and scaffold the pipelines for each file:

cd movies
nodestream scaffold ratings
nodestream scaffold movies
nodestream scaffold tags

This will create a new pipeline for each file, and we can now edit the ratings pipeline to ingest the ratings.csv file. Before continuing, be sure to configure your database connection in the nodestream.yaml file. See the Databases section for more information.
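For reference, a target definition in nodestream.yaml might look roughly like the snippet below. This is only a sketch that assumes the Neo4j database plugin; the exact keys depend on which database plugin you use, and the connection details are placeholders. See the Databases section for the authoritative reference.

targets:
  my-db:
    database: neo4j
    uri: bolt://localhost:7687
    username: neo4j
    password: neo4j123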

Building the Pipelines

In this example, we'll build a graph with the following schema: User, Movie, Genre, and Tag nodes, with RATED relationships from users to movies, HAS_GENRE relationships from movies to genres, TAGGED relationships from users to tags, and APPLIED_TO relationships from tags to movies.

Ratings

We'll start by editing the ratings pipeline to ingest the ratings.csv file. Open the pipelines/ratings.yaml file and delete the default content. Then, add the following content:

- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
      - data/ratings.csv

This will tell the pipeline to extract the data from the ratings.csv file. However, this does not tell the pipeline how to ingest the data into the graph. We'll need to add a new step to the pipeline to do that.

# ... previous content

- implementation: nodestream.interpreting:Interpreter
  arguments:
    interpretations:
      - type: source_node
        node_type: User
        key:
          id: !jmespath userId
      - type: relationship
        relationship_type: RATED
        node_type: Movie
        node_key:
          id: !jmespath movieId
        relationship_properties:
          rating: !jmespath rating
          at: !jmespath timestamp

This tells the pipeline how to interpret each record: it creates the User and Movie nodes, and the RATED relationships between them. The !jmespath expressions extract values from each csv record; here, the userId, movieId, rating, and timestamp columns.
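Conceptually, each row of ratings.csv becomes a small pattern in the graph. The property values below are illustrative, not actual rows from the dataset:

(u:User {id: 1})-[:RATED {rating: 4.0, at: 964982703}]->(m:Movie {id: 296})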

Now that we have the ratings pipeline ready, we can generate the migrations and run the pipeline:

nodestream migrations make
nodestream migrations run -t my-db
nodestream run ratings -t my-db

Movies

We'll do the same for the movies pipeline. Open the pipelines/movies.yaml file and delete the default content. Then, add the following content:

- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
      - data/movies.csv

This will tell the pipeline to extract the data from the movies.csv file. We'll need to add a new step to the pipeline to ingest the data into the graph.

# ... previous content

- implementation: nodestream.interpreting:Interpreter
  arguments:
    interpretations:
      - type: source_node
        node_type: Movie
        key:
          id: !jmespath movieId
        properties:
          title: !jmespath title
      - type: relationship
        relationship_type: HAS_GENRE
        node_type: Genre
        find_many: true
        node_key:
          id: !split
            data: !jmespath genres
            delimiter: '|'

This tells the pipeline to create the Movie nodes, along with a Genre node and a HAS_GENRE relationship for each value in the pipe-separated genres column: the !split tag splits the genres string on |, and find_many: true creates one relationship per resulting value. Now that we have the movies pipeline ready, we can generate the migrations and run the pipeline:

nodestream migrations make
nodestream migrations run -t my-db
nodestream run movies -t my-db
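For reference, each row of movies.csv should end up as a pattern along these lines, with one HAS_GENRE relationship per genre value (the title and genres shown are illustrative):

(m:Movie {id: 1, title: 'Toy Story (1995)'})-[:HAS_GENRE]->(:Genre {id: 'Adventure'})
(m)-[:HAS_GENRE]->(:Genre {id: 'Animation'})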

Tags

We'll do the same for the tags pipeline. Open the pipelines/tags.yaml file and delete the default content. Then, add the following content:

- implementation: nodestream.pipeline.extractors:FileExtractor
  arguments:
    globs:
      - data/tags.csv

This will tell the pipeline to extract the data from the tags.csv file. We'll need to add a new step to the pipeline to ingest the data into the graph.

# ... previous content

- implementation: nodestream.interpreting:Interpreter
  arguments:
    interpretations:
      - type: source_node
        node_type: Tag
        key:
          value: !jmespath tag
      - type: relationship
        relationship_type: TAGGED
        node_type: User
        node_key:
          id: !jmespath userId
        relationship_properties:
          at: !jmespath timestamp
        outbound: false
      - type: relationship
        relationship_type: APPLIED_TO
        node_type: Movie
        node_key:
          id: !jmespath movieId

This tells the pipeline to create the Tag nodes, the TAGGED relationships from users to tags (outbound: false reverses the direction, so the edge points from the User to the Tag), and the APPLIED_TO relationships from tags to movies. The !jmespath expressions extract values from the csv file. Now that we have the tags pipeline ready, we can generate the migrations and run the pipeline:

nodestream migrations make
nodestream migrations run -t my-db
nodestream run tags -t my-db
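Because outbound: false is set on the TAGGED relationship, each row of tags.csv should produce a pattern like the one below, with the edge pointing from the User into the Tag (the values shown are illustrative):

(u:User {id: 2})-[:TAGGED {at: 1445714994}]->(t:Tag {value: 'funny'})-[:APPLIED_TO]->(m:Movie {id: 60756})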

Verifying the Data

We can now verify that the data was ingested into the graph by running some queries.

MATCH (u:User)-[r:RATED]->(m:Movie)
RETURN u, r, m
LIMIT 10

MATCH (t:Tag)-[:APPLIED_TO]->(m:Movie)
RETURN m, t
LIMIT 10

MATCH (m:Movie)
RETURN m
LIMIT 10

MATCH (u:User)
RETURN u
LIMIT 10

MATCH (t:Tag)
RETURN t
LIMIT 10

MATCH (g:Genre)
RETURN g
LIMIT 10

MATCH (u:User)-[r:RATED]->(m:Movie)-[:HAS_GENRE]->(g:Genre)
RETURN u, r, m, g
LIMIT 10

This should return some data from the graph, showing that the data was ingested correctly.

Conclusion

Success 🎉!

Now that we have the data ingested into the graph, we can start building recommendation algorithms. We can use the ratings and tags data to build a collaborative filtering algorithm, and the genres data to build a content-based filtering algorithm. We can also use the movies data to build a graph-based recommendation algorithm.
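As a starting point, a simple collaborative-filtering-style Cypher query over this schema could recommend movies that were rated highly by users who rated the same movies as a given user. The sketch below is only a starting point, not a finished algorithm; the user id and rating threshold are illustrative:

MATCH (me:User {id: 1})-[:RATED]->(:Movie)<-[:RATED]-(peer:User)-[r:RATED]->(rec:Movie)
WHERE r.rating >= 4.0 AND NOT (me)-[:RATED]->(rec)
RETURN rec.title AS title, count(*) AS score
ORDER BY score DESC
LIMIT 10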