Calculate unique values in parallel for Neo4j Node creation

Hi Kettle and Neo4j fans!

So maybe that title is a little bit over the top, but it sums up what this transformation does.

Here is what we do in this transformation:

  • Read a text file
  • Calculate unique values over 2 columns
  • Create Neo4j nodes for each unique value

To do this we first normalize the 2 columns into a single column, effectively doubling the number of rows in the set. Then we do some cleanup (remove double quotes). The secret sauce is then to do a partitioned unique value calculation (5 partitions means 5 parallel threads). By partitioning the data on the single normalized column we guarantee that identical values always end up on the same partition (step copy), so each copy can remove duplicates on its own without ever seeing the other copies' data. For the unique calculation itself we use a Hash set (Unique Hash Set) step, which keeps the values it has seen in memory and so avoids a costly “Sort/Unique” operation. While we have the data in parallel step copies, we also load the data in parallel into Neo4j. Make sure you drop indexes on the Node/Label you’re loading to avoid transaction issues.
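For those who prefer to see the idea in code, here is a minimal Python sketch of the same strategy outside of Kettle. It is not how the Kettle steps are implemented: the function names, the CRC32 partitioning hash and the thread pool are all assumptions made for illustration, and the Neo4j load is left out (each partition simply returns its unique values). Python threads also won't give you the same speed-up as Kettle step copies, so read it as a picture of the data flow rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_PARTITIONS = 5  # same number as the 5 step copies in the transformation


def normalize(rows):
    """Turn each 2-column row into 2 single-value rows, removing double quotes."""
    for col_a, col_b in rows:
        yield col_a.replace('"', '')
        yield col_b.replace('"', '')


def partition_of(value, num_partitions=NUM_PARTITIONS):
    """Pick a partition from the value itself (hypothetical hash choice: CRC32).

    The same value always maps to the same partition number.
    """
    return crc32(value.encode("utf-8")) % num_partitions


def unique_in_partition(values):
    """Hash-set based de-duplication: no sorting, but the set lives in memory."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


def partitioned_unique(rows, num_partitions=NUM_PARTITIONS):
    """Normalize, partition on the value, then de-duplicate each partition."""
    buckets = [[] for _ in range(num_partitions)]
    for value in normalize(rows):
        buckets[partition_of(value, num_partitions)].append(value)
    # One worker per partition; in Kettle each step copy would also
    # write its own unique values to Neo4j at this point.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(unique_in_partition, buckets))
```

Calling `partitioned_unique([('"a"', 'b'), ('b', '"c"')])` gives back 5 lists, one per partition, and every distinct value shows up in exactly one of them. That is the property that makes it safe to create the nodes from each partition independently.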

This allowed me to condense 28M rows at the starting point into 5M unique values and load those in just over 1 minute on my laptop. I’ll post a more comprehensive walk-through example later, but I wanted to show you this strategy now because it can help those of you out there who need decent data loading capacity.