Calculate unique values in parallel for Neo4j Node creation

Hi Kettle and Neo4j fans!

So maybe that title is a little bit over the top, but it sums up what this transformation does.

Here is what we do in this transformation:

  • Read a text file
  • Calculate unique values over 2 columns
  • Create Neo4j nodes for each unique value

To do this we first normalize the columns, effectively doubling the number of rows in the set.  Then we do some cleanup (removing double quotes).  The secret sauce is then to do a partitioned unique value calculation (5 partitions means 5 parallel threads).  By partitioning the data on the single column we guarantee that the same value always ends up on the same partition (step copy).  For this we use a hash set step (Unique Hash Set), which keeps the values in memory to avoid a costly “Sort/Unique” operation.  While we have the data in parallel step copies, we also load the data in parallel into Neo4j.  Make sure you drop indexes on the Node/Label you’re loading to avoid transaction issues.
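For readers who think better in code than in step diagrams, here is a rough Python sketch of the same idea. It is not how Kettle implements it; the function names, the CRC32 hash and the thread pool are my own assumptions, chosen only to illustrate the data flow (and Python threads won't give you the real parallel speed-up the step copies do).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

PARTITIONS = 5  # mirrors the 5 step copies mentioned above


def normalize(rows):
    """Turn each (col_a, col_b) row into two single-value rows and strip double quotes."""
    for col_a, col_b in rows:
        yield col_a.replace('"', '')
        yield col_b.replace('"', '')


def partition_of(value, partitions=PARTITIONS):
    # Hypothetical partitioning rule: a deterministic hash, so equal
    # values always land on the same partition.
    return zlib.crc32(value.encode("utf-8")) % partitions


def unique_values(rows, partitions=PARTITIONS):
    """Return one set of unique values per partition."""
    buckets = [[] for _ in range(partitions)]
    for value in normalize(rows):
        buckets[partition_of(value, partitions)].append(value)
    # Each partition de-duplicates on its own with an in-memory hash set,
    # so no global sort is needed.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        return list(pool.map(set, buckets))
```

Because a given value can only ever live in one partition, the per-partition sets never overlap, and each one could be handed to its own Neo4j loader without coordinating with the others.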

This allowed me to condense 28M rows at the starting point into 5M unique values and load those in just over 1 minute on my laptop.  I’ll post a more comprehensive walk-through example later, but I wanted to show you this strategy because it can help those of you out there in need of decent data loading capacity.

Cheers,
Matt

Catching up with Kettle REMIX

Dear Kettle and Neo4j friends,

Since I joined the Neo4j team in April I haven’t given you any updates despite the fact that a lot of activity has been taking place in both the Neo4j and Kettle realms.

First and foremost, you can grab the cool Neo4j plugins from neo4j.kettle.be (the plugin in the marketplace is always out of date since it takes weeks to update the metadata).

Then based on valuable feedback from community members we’ve updated the DataSet plugin (including unit testing) to include relative paths for filenames (for easier git support), to avoid modifying transformation metadata and to set custom variables or parameters.

Kettle unit testing ready for prime time

I’ve also created a plugin to make debugging transformations and jobs a bit easier.  You can do things like set specific logging levels on steps (or only for a few rows) and work with zoom levels.

Right-clicking on a step, you can choose “Logging…” and set logging specifics.

Then, back on the subject of Neo4j, I’ve created a plugin to log the execution results of transformations and jobs (and a bit of their metadata) to Neo4j.

Graph of a transformation executing a bunch of steps. Metadata on the left, logging nodes on the right.

Those working with Azure might enjoy the Event Hubs plugins for a bit of data streaming action in Kettle.

The Kettle Needful Things plugin aims to fix bugs and solve silly problems in Kettle.  For now it sets the correct local metastore on Carte servers AND… features a new launcher script called Maitre.  Maitre supports transformations and jobs, as well as local, remote and clustered execution.

The Kettle Environment plugin aims to take a stab at life-cycle management by allowing you to define a list of Environments:

The Environments dialog shown at the start of Spoon

In each Environment you can set all sorts of metadata but also the location of the Kettle and MetaStore home folders.

Finally, because downloading, patching, installing and configuring all this is a lot of work, I’ve created an automated process which does this for you on a daily basis (for testing), so you can download Kettle Community Edition version 8.1.0.0 patched to 8.1.0.4 with all the extra plugins above in its 1GB glory at: remix.kettle.be

To get it on your machine simply run:

wget remix.kettle.be -O remix.zip

You can also give these plugins (except for Needful Things and Environment) a try live on my sandbox WebSpoon server.  You can easily run your own WebSpoon from the Docker container, which is also updated daily.

If you have suggestions, bugs or rants, please feel free to leave them here or in the respective GitHub projects.  Any feedback is, as always, more than welcome.  In fact, thank you all for the feedback given so far.  It’s making all the difference.  If you feel the need to contribute more opinions on the subject of Kettle, feel free to send me a mail (mattcasters at gmail dot com) to join our kettle-community Slack channel.

Enjoy!

Matt