Calculate unique values in parallel for Neo4j Node creation

Hi Kettle and Neo4j fans!

So maybe that title is a little bit over the top, but it sums up what this transformation does.

Here is what we do in this transformation:

  • Read a text file
  • Calculate unique values over 2 columns
  • Create Neo4j nodes for each unique value

To do this we first normalize the columns, effectively doubling the number of rows in the set.  Then we do some cleanup (removing double quotes).  The secret sauce is to do a partitioned unique value calculation (5 partitions means 5 parallel threads). By partitioning the data on that single column we guarantee that the same values end up on the same partition (step copy).  For this we use a hash set (“Unique Hash Set”) step, which uses memory to avoid a costly “Sort/Unique” operation.  While we have the data in parallel step copies, we also load it in parallel into Neo4j.  Make sure you drop indexes on the node label you’re loading to avoid transaction issues.
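
Conceptually, partitioning on that column works like hashing the value and taking the result modulo the number of partitions. The sketch below only illustrates that principle (it is not Kettle’s actual partitioning code): identical values always hash to the same partition, so each step copy can keep its own in-memory set of unique values without ever seeing a duplicate held by another copy.

var partitions = 5; // 5 partitions means 5 parallel step copies

// Return the partition (step copy) a given value is routed to.
function partitionFor(value) {
  var hash = 0;
  for (var i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) & 0x7fffffff;
  }
  return hash % partitions;
}

// Every occurrence of the same value lands on the same copy:
// partitionFor("Alice") always returns the same number.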

This allowed me to condense the 28M rows at the starting point into 5M unique values and load those in just over 1 minute on my laptop.  I’ll post a more comprehensive walk-through example later, but I wanted to show you this strategy because it can help those of you out there in need of decent data loading capacity.

Cheers,
Matt

Catching up with Kettle REMIX

Dear Kettle and Neo4j friends,

Since I joined the Neo4j team in April I haven’t given you any updates despite the fact that a lot of activity has been taking place in both the Neo4j and Kettle realms.

First and foremost, you can grab the cool Neo4j plugins from neo4j.kettle.be (the plugin in the marketplace is always out of date since it takes weeks to update the metadata).

Then, based on valuable feedback from community members, we’ve updated the DataSet plugin (including unit testing) to support relative paths for filenames (for easier git support), to avoid modifying transformation metadata and to set custom variables or parameters.

Kettle unit testing ready for prime time

I’ve also created a plugin to make debugging transformations and jobs a bit easier.  You can do things like set specific logging levels on steps (or only for a few rows) and work with zoom levels.

Right-clicking on a step, you can choose “Logging…” and set logging specifics.

Then, back on the subject of Neo4j, I’ve created a plugin to log the execution results of transformations and jobs (and a bit of their metadata) to Neo4j.

Graph of a transformation executing a bunch of steps. Metadata on the left, logging nodes on the right.

Those working with Azure might enjoy the Event Hubs plugins for a bit of data streaming action in Kettle.

The Kettle Needful Things plugin aims to fix bugs and solve silly problems in Kettle.  For now it sets the correct local metastore on Carte servers AND… features a new launcher script called Maitre.  Maitre supports transformations and jobs, as well as local, remote and clustered execution.

The Kettle Environment plugin aims to take a stab at life-cycle management by allowing you to define a list of Environments:

The Environments dialog shown at the start of Spoon

In each Environment you can set all sorts of metadata but also the location of the Kettle and MetaStore home folders.

Finally, because downloading, patching, installing and configuring all this is a lot of work, I’ve created an automated process which does this for you on a daily basis (for testing), so you can download Kettle Community Edition version 8.1.0.0 patched to 8.1.0.4, with all the extra plugins above, in its 1GB glory at: remix.kettle.be

To get it on your machine simply run:

wget remix.kettle.be -O remix.zip

You can also give these plugins (except for Needful Things and Environment) a try live on my sandbox WebSpoon server.  You can easily run your own WebSpoon from the Docker container, which is also updated daily.

If you have suggestions, bugs or rants, please feel free to leave them here or in the respective GitHub projects.  Any feedback is, as always, more than welcome.  In fact, thank you all for the feedback given so far.  It’s making all the difference.  If you feel the need to contribute more opinions on the subject of Kettle, feel free to send me a mail (mattcasters at gmail dot com) to join our kettle-community Slack channel.

Enjoy!

Matt

Farewell Pentaho

Dear Kettle friends,

12 years ago I joined a wonderful team of people at Pentaho who thought they could make a real change in the world of business analytics. At that point I had recently open sourced my own data integration tool (then still called ‘ETL’) called Kettle, and so I joined in the role of Chief Architect of Data Integration. The title sounded great and the job included everything from writing articles (and a book), massive amounts of coding, testing, software releases, giving support, doing training, workshops, … In other words, life was simply doing everything I possibly and impossibly could to make our software succeed when deployed by our users. With Kettle now being one of the most popular data integration tools on the planet, I think it’s safe to say that this goal has been reached and that it’s time for me to move on.

I don’t just want to announce my exit from Pentaho/Hitachi Vantara. I would also like to thank all the people involved in making our success happen. First and foremost I want to express my gratitude to the founders (Richard, Doug, James, Marc, …) for even including a crazy Belgian like myself on the team but I also want to extend my warmest thanks to everyone who I got to become friends with at Pentaho for the always positive and constructive attitude. Without exaggeration I can say it’s been a lot of fun.

I would also explicitly like to thank the whole community of users of Kettle (now called Pentaho Data Integration). Without your invaluable support in the form of new plugins, bug reports, documentation, forum posts, talks, … we could never have pulled off what we did in the past 12 years! I hope we will continue to meet at one of the many popular community events.

Finally I want to thank everyone at Hitachi and Hitachi Vantara for being such a positive and welcoming group of people. I know that Kettle is used all over Hitachi and I’m quite confident this piece of software will not let you down any time soon.

Now I’m going to go skiing for a week and when I get back it’s time to hunt for a new job. I can’t wait to see what impossible problems need solving out there…

Cheers,
Matt

Pentaho Community Meetup 2016 recap

Dear friends,

I just came back from PCM16, the 9th annual edition of our European Pentaho Community Meetup.  We had close to 200 registrations for this event, of which about 150 people showed up, making this the biggest one so far.  Even though veterans of the conference like myself really appreciate the warmth of previous locations like Barcelona and Cascais, I have to admit we did get a great venue in Antwerp this year, with 2 large rooms, great catering and top notch audiovisual support in a nice part of the city center. (Free high speed Antwerp city WiFi, yeah!)

Content-wise everything was more than OK, with back-to-back presentations on a large variety of subjects and, I’m happy to say, lots of Kettle related stuff as well.

For an in-depth recap of the content you can see here for the technical track and here for the other sessions.

Personally I was touched by the incredibly positive response from the audience after my presentation on the state of the PDI unit testing project.  However, the big bomb was dropped when Hiromu Hota from Hitachi Research America started to present a new “WebSpoon” project.  You could almost hear everyone think: “Oh no, not another attempt at making a new Kettle web interface”.  However, 2 minutes into the presentation everyone in the audience started to realize that it was the real, original Spoon with all the functionality it has, ported 1:1 to a web browser on your laptop, thin device, phone or tablet. Applause spontaneously erupted, Twitter exploded and people couldn’t stop talking about it until I left the PCM crowd a day later.  Now I’m obviously very happy we managed to keep the WebSpoon project a secret for the past few months, and it’s impossible to thank Hota-san enough for traveling all the way to present at our event this weekend.

My heartfelt thanks also go out to Bart Maertens and the whole know.bi crew for making PCM16 a wonderful experience and an unforgettable event!

See you all at PCM17!

Matt

Pentaho Kettle 解决方案

您好球迷! (Hello, fans!)

I just received a copy of the Chinese translation of Pentaho Kettle Solutions.

Right off the bat, when we made it open source over 8 years ago, Kettle became very popular in China.  Among the very first Chinese people who helped out with bug fixing and translation was Jason Chu, a Pentaho BI consultant who lives in Beijing.  Jason spent a few years of his life translating Pentaho Kettle Solutions, a book written by yours truly, Jos van Dongen and of course my colleague Roland Bouman.

The new book is called Pentaho Kettle解决方案: 使用PDI构建开源ETL解决方案 and is available from Amazon.cn.

Matt holding Chinese Pentaho Kettle Solutions book

While I can’t read Chinese (at all), the look and feel of the book is excellent: the graphics are great and, for example, the screenshots were re-done with the Chinese localization of Spoon.  The book is published by Wiley, and from what I’ve learned of their publishing process, it is guaranteed to offer excellent results.  Congratulations to all involved in making this book possible!

It’s impossible to thank Jason enough for all the hard work he has done, now also in translating our book, and with this post I would like to recommend the end result to all Chinese readers!

Have fun with Kettle wherever you are!

Matt

Announcing Pentaho 5.0 Community Edition

Dear Pentaho friends,

It’s been almost a year since we released Kettle 4.4 as part of the Pentaho 4.8.0 suite community edition release. Today we’re very happy to announce the availability of Pentaho 5.0 community edition.

Now obviously, since this is open source software which is always available through continuous integration builds, we’re talking about the availability of builds which have been sufficiently verified by Pentaho QA, community members, partners and so on to be considered a release.

Also brand new is the community landing page:

community.pentaho.com

On it you’ll find convenient links to the downloads, the source code, the CI builds and so on.

Thank you all for your patience while we delivered this new step forward.  In the coming weeks I’ll do a series of “New in Kettle 5.0” blog posts to inform you about what we did.  In short, these are the highlights:

  • Better previewing with dedicated window in run tab
  • Easier looping with transformation and job executor steps
  • Internal metrics gathering with Gantt chart
  • An improved plugin marketplace
  • Tons of Big Data improvements
  • Improved logging backend
  • Lots of new steps and job entries
  • A series of new plugin systems to make it easier to extend Kettle
  • Lots of new configuration options for tuning the engine back-ends
  • Integrated help system in the step and job entry dialogs

Even more important though:

  • Over 1500 bug fixes and feature requests
  • Metadata compatible with previous versions
  • API compatible with 4.x

I would like to thank and congratulate the whole community on this excellent release!

Happy downloading!
Matt

P.S. Keeping up with tradition, let me share some statistics… The Kettle project, trimmed down to not include any plugins, now contains over 840,000 lines of code.  Kettle 4.4 was downloaded over 244,000 times in the last year, from SourceForge alone.

Figuring out the layout of a delimited file

Hey Kettle friends,

Recently I got the question whether or not PDI would be able to automatically figure out, at runtime, what the layout of a delimited text file is, simply by looking at the header of the file and by doing various scans of the file to determine the delimiter, the data types of the columns and so on.

Now obviously you could also do without a header row, but then the columns would have to be called something like “field1” to “fieldN”.

First, we want to see which delimiter, from a list of possible delimiters, is used in the file.  We want to do this by reading every line of the file in question and doing a character-by-character analysis of each line.  We’ll count the number of occurrences of all possible delimiters and then pick the most often used delimiter at the end and set it as a variable.

Detecting the delimiter

The hardest part of this is doing the counting in JavaScript.  We’re using an associative array:

// A simple algorithm: count every ',', ';', '/' and tab, see which one is used most often
//
var delimiters = [ ',', ';', '/', '\t' ];
var counters = {};
var i, d;
for (i = 0; i < line.length; i++) {
  var c = line.charAt(i);
  for (d = 0; d < delimiters.length; d++) {
    if (c == delimiters[d]) {
      if (counters[c] == null) {
        counters[c] = 1;
      } else {
        counters[c]++;
      }
    }
  }
}

// Pick the delimiter with the highest count
var maxC = 0;
var max = 0;
for (i = 0; i < delimiters.length; i++) {
  if (counters[delimiters[i]] > max) {
    maxC = i;
    max = counters[delimiters[i]];
  }
}

var delimiter = "" + delimiters[maxC];
var count = counters[delimiters[maxC]];

So the above script populates an associative array of counters, which is then evaluated to see which delimiter wins.  Obviously it should be easy to modify the list of possible participants in this delimiter game.

Now that we found the delimiter of the file, we can go ahead and determine the field names from the header of the file:

Detect field names

Again, we’re using a bit of JavaScript in the above:

// Split the header line using the delimiter we detected earlier
var fields = line.split(getVariable("DELIMITER", ""));
var i;
for (i = 0; i < fields.length; i++) {
  // Emit one output row per field name found in the header
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();
  newRow[rowIndex++] = fields[i];
  putRow(newRow);
}

var field = null; // placeholder, presumably for the step's "field" output column

As you can see, this script converts the header line into separate rows, one per field name.

Previewing the field names detection

OK, but this still doesn’t solve the complete puzzle, as we still need to figure out the hardest part: the data types, lengths, precisions and conversion masks of the various fields.  To do this we pass the column names we detected into another transformation:

metadata-detector-parser

So the “CSV file input” step is configured automatically by the parent transformation to read all fields as Strings.  Then there is another piece of JavaScript that lets us re-use the same functionality that automatically detects data types behind the “Get Fields” button of the “Text file input” and “CSV file input” step dialogs.  In this case the work is done by a class called “StringEvaluator”, which we’ll re-use here:

// For every incoming field, feed its String value to the matching evaluator.
// (The "evaluators" list holds one StringEvaluator per field.)
var i;
for (i = 0; i < getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);
}
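
The snippet above assumes an “evaluators” list with one StringEvaluator per incoming field. That list isn’t shown here; a minimal sketch of how it could be built, once, before the first value is evaluated, might look like the following (the fully qualified class name and the constructor argument are assumptions and may differ between Kettle versions):

// Sketch only: build one StringEvaluator per incoming field.
// (Class name and constructor argument are assumptions; check your Kettle version.)
var evaluators = new java.util.ArrayList();
var f;
for (f = 0; f < getInputRowMeta().size(); f++) {
  evaluators.add(new org.pentaho.di.core.util.StringEvaluator(true));
}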

At the end of the step’s life, when the script ends, we have another block of code:

// Write the results as new rows: one row per field, with the advised metadata
var i;
for (i = 0; i < getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);

  // Ask the evaluator which data type it advises after seeing all values
  var evalResult = evaluator.getAdvicedResult();
  var valueMeta = evalResult.getConversionMeta();

  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();

  // Field name followed by the detected type, length, precision and masks
  newRow[rowIndex++] = getInputRowMeta().getValueMeta(i).getName();
  newRow[rowIndex++] = valueMeta.getTypeDesc();
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getLength());
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getPrecision());
  newRow[rowIndex++] = valueMeta.getConversionMask();
  newRow[rowIndex++] = valueMeta.getDecimalSymbol();
  newRow[rowIndex++] = valueMeta.getGroupingSymbol();

  putRow(newRow);
}

After some cleanup we finally have the data types and everything we need to properly read in the file.
You can find the complete example over here.  You can use this on Kettle version 4.4.0 or above.

The output of the example is an XML file called layout.xml.  Obviously you can store it in different formats and/or inject it into another transformation.  You can then, for example, do complete normalization of every column value and store it in a database table.  Not that I would recommend such a generic structure, but it’s possible.

Enjoy!

Matt

Pentaho 5.0 blends right in!

Dear Pentaho friends,

Ever since a number of projects joined forces under the Pentaho umbrella (over 7 years ago), we have been looking for ways to create more synergy across this complete software stack.  That is why today I’m exceptionally happy to be able to announce not just version 5.0 of Pentaho Data Integration, but also a new way to integrate Data Integration, Reporting, Analysis, Dashboarding and Data Mining through one single interface called Data Blending, available in Pentaho Business Analytics 5.0 (Commercial Edition).

Data Blending allows a data integration user to create a transformation capable of delivering data directly to our other Pentaho Business Analytics tools (and even non-Pentaho tools).  Traditionally data is delivered to these tools through a relational database.  However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can’t wait until the database tables are updated.  This for example leads to a new kind of big data architecture with many moving parts:

Evolving Big Data Architectures

From what we can see in use at major deployments with our customers, mixing Big Data, NoSQL and classical RDBMS technologies is more the rule than the exception.

So, how did we solve this puzzle?

The main problem we faced early on was that the default language used under the covers, in just about any user-facing business intelligence tool, is SQL.  At first glance it seems that the worlds of data integration and SQL are not compatible.  In DI we read from a multitude of data sources, such as databases, spreadsheets, NoSQL and Big Data sources, XML and JSON files, web services and much more.  However, SQL itself is a mini-ETL environment of its own, as it selects, filters, counts and aggregates data.  So we figured that it might be easiest if we translated the SQL used by the various BI tools into Pentaho Data Integration transformations.  This way, Pentaho Data Integration is doing what it does best, directed not by manually designed transformations but by SQL.  This is at the heart of the Pentaho Data Blending solution.

The data blending service architecture

To ensure that the “automatic” part of the data chain doesn’t become an impossible-to-figure-out “black box”, we once more made good use of existing PDI technologies.  We’re logging all executed queries on the Data Integration Server (or Carte server), so you have a full view of all the work being done:

Data Blending Transparency

In addition to this, the statistics from the queries can be logged and viewed in the operations data mart, giving you insight into which data is queried and how often.

We sincerely hope that you like these new powerful options for Pentaho Business Analytics 5.0!

Enjoy!

Matt

– If you want to learn more about the new features in this 5.0 release, Pentaho is hosting a webinar and demonstration on September 24th – two options to register: EMEA & North America time zones.

– Also check out the new exciting capabilities we deliver today on top of MongoDB!

Matt Casters
Chief of Data Integration at Pentaho, Kettle project founder