Maybe that title is a little over the top, but it sums up what this transformation does:
Here is what we do in this transformation:
Read a text file
Calculate unique values over 2 columns
Create Neo4j nodes for each unique value
To do this we first normalize the two columns into one, effectively doubling the number of rows in the set. Then we do some cleanup (removing double quotes). The secret sauce is to do a partitioned unique-value calculation (5 partitions means 5 parallel threads). By partitioning the data on the single column we guarantee that the same value always ends up in the same partition (step copy). For this we use a Unique Hash Set step, which uses memory to avoid a costly sort/unique operation. While we have the data in parallel step copies, we also load it in parallel into Neo4j. Make sure you drop indexes on the node/label you’re loading to avoid transaction issues.
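The partitioning trick can be sketched in a few lines of Python. This is a hypothetical, sequential stand-in for Kettle's parallel step copies (the function and variable names are mine, not Kettle's): hashing the value picks the partition, so identical values always land in the same bucket and a simple in-memory set per bucket deduplicates them without a sort.

```python
from collections import defaultdict

def partition_unique(values, partitions=5):
    """Hash-partition values so identical values always land in the
    same partition, then deduplicate each partition with an in-memory
    set -- the idea behind a partitioned Unique Hash Set step, which
    avoids a costly sort/unique over the whole data set."""
    buckets = defaultdict(set)
    for value in values:
        # Same value -> same hash -> same partition, so each
        # partition can deduplicate independently (and in parallel).
        buckets[hash(value) % partitions].add(value)
    return buckets

# Two "columns" normalized into a single stream of values:
raw = ['"alice"', 'bob', 'alice', '"carol"', 'bob']
cleaned = [v.strip('"') for v in raw]      # cleanup: remove double quotes
buckets = partition_unique(cleaned, partitions=5)
unique_values = set().union(*buckets.values())
```

In the real transformation each bucket would be a separate step copy feeding its own Neo4j load step; here they are just entries in one dictionary.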
This allowed me to condense the 28M rows at the starting point into 5M unique values and load those in just over a minute on my laptop. I’ll post a more comprehensive walk-through example later, but I wanted to show you this strategy because it can help those of you in need of decent data-loading capacity.
Since I joined the Neo4j team in April I haven’t given you any updates, despite the fact that a lot of activity has been taking place in both the Neo4j and Kettle realms.
First and foremost, you can grab the cool Neo4j plugins from neo4j.kettle.be (the plugin in the marketplace is always out of date since it takes weeks to update the metadata).
Then, based on valuable feedback from community members, we’ve updated the DataSet plugin (including unit testing): it now supports relative paths for filenames (for easier git support), avoids modifying transformation metadata, and can set custom variables or parameters.
I’ve also created a plugin to debug transformations and jobs a bit more easily. You can do things like set specific logging levels on steps (or only for a few rows) and work with zoom levels.
Then, back on the subject of Neo4j, I’ve created a plugin to log the execution results of transformations and jobs (and a bit of their metadata) to Neo4j.
Those working with Azure might enjoy the Event Hubs plugins for a bit of data streaming action in Kettle.
The Kettle Needful Things plugin aims to fix bugs and solve silly problems in Kettle. For now it sets the correct local metastore on Carte servers AND… features a new launcher script called Maitre. Maitre supports transformations and jobs, as well as local, remote and clustered execution.
The Kettle Environment plugin aims to take a stab at life-cycle management by allowing you to define a list of Environments:
In each Environment you can set all sorts of metadata but also the location of the Kettle and MetaStore home folders.
Finally, because downloading, patching, installing and configuring all this is a lot of work, I’ve created an automated process which does this for you on a daily basis (for testing), so you can download Kettle Community Edition version 126.96.36.199 patched to 188.8.131.52 with all the extra plugins above, in all its 1GB glory, at: remix.kettle.be
To get it on your machine simply run:
wget remix.kettle.be -O remix.zip
You can also give these plugins (except for Needful Things and Environment) a try live on my sandbox WebSpoon server. You can easily run your own WebSpoon from the Docker container, which is also updated daily.
If you have suggestions, bugs, or rants, please feel free to leave them here or in the respective GitHub projects. Any feedback is, as always, more than welcome. In fact, thank you all for the feedback given so far; it’s making all the difference. If you feel the need to contribute more opinions on the subject of Kettle, feel free to send me a mail (mattcasters at gmail dot com) to join our kettle-community Slack channel.