Blog finally restored

Dear Kettle & Neo4j fans,

Having your website hacked is bad enough. If you know me a little bit, though, you know that I keep my software up to date (automatic WordPress updates!) and my passwords impossible to guess. My laptop runs a Linux variant with a secure UEFI boot loader, an encrypted hard drive and so on. So it's painful to see your provider throw websites onto the same server just to be cheap, allowing the sites to compromise each other with cross-site scripting attacks. After over 13 years of problem-free, loyal service, www.ibridge.be just got hacked and subsequently blocked.

Anyway, that's basically the reason why I've resurrected http://www.kettle.be and moved the blog to my own server, with Linux AppArmor and the works. At least now if I get hacked it will be my own darned fault.

The reason it took a while to get this done is that I've actually been working really hard. Besides supporting the Neo4j field engineers and sales efforts in our Solutions team, our software needs to move forward as well:

It's a tremendous amount of fun, but restoring that WordPress backup just wasn't much fun and less of a priority. Let me know if there's anything that feels wrong 😉

Cheers,

Matt

Calculate unique values in parallel for Neo4j Node creation

Hi Kettle and Neo4j fans!

So maybe that title is a little bit over the top, but it sums up what this transformation does.

Here is what we do in this transformation:

  • Read a text file
  • Calculate unique values over 2 columns
  • Create Neo4j nodes for each unique value

To do this we first normalize the columns, effectively doubling the number of rows in the set. Then we do some cleanup (remove double quotes). The secret sauce is to do a partitioned unique-value calculation (5 partitions means 5 parallel threads). By partitioning the data on the single remaining column we guarantee that identical values end up in the same partition (step copy). For this we use a hash set ("Unique rows (HashSet)") step, which works in memory to avoid a costly "Sort rows"/"Unique rows" pass. While we have the data in parallel step copies, we also load it into Neo4j in parallel. Make sure you drop indexes on the node label you're loading to avoid transaction issues.
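
The principle is easy to sketch in a few lines of JavaScript. This is an illustration only, not what the Kettle steps do internally: equal values always hash to the same partition number, so each step copy only ever sees its own subset and can deduplicate it with a simple in-memory set.

// Sketch of the partition + unique-hash-set idea (illustration only).
//
var PARTITIONS = 5;
var sets = [];                      // one in-memory set per partition
for (var p = 0; p < PARTITIONS; p++) {
  sets.push({});
}

// Route a value to a partition: equal values always hash to the same
// number, so they always land in the same partition (step copy).
function partitionOf(value) {
  var hash = 0;
  for (var i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) & 0x7fffffff;
  }
  return hash % PARTITIONS;
}

// Keep a value only the first time its partition sees it.
function isUnique(value) {
  var set = sets[partitionOf(value)];
  if (set.hasOwnProperty(value)) {
    return false;                   // this partition has already seen it
  }
  set[value] = true;                // first time: remember it, keep the row
  return true;
}

Because the routing is deterministic, no value can show up in two partitions, which is why the parallel step copies can deduplicate and write their nodes to Neo4j independently.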

This allowed me to condense the 28M rows at the starting point into 5M unique values and load those in just over a minute on my laptop. I'll post a more comprehensive walk-through example later, but I wanted to show you this strategy because it can help those of you out there in need of decent data loading capacity.

Cheers,
Matt

Catching up with Kettle REMIX

Dear Kettle and Neo4j friends,

Since I joined the Neo4j team in April I haven't given you any updates, despite the fact that a lot of activity has been taking place in both the Neo4j and Kettle realms.

First and foremost, you can grab the cool Neo4j plugins from neo4j.kettle.be (the plugin in the marketplace is always out of date since it takes weeks to update the metadata).

Then, based on valuable feedback from community members, we've updated the DataSet plugin (including unit testing) to include relative paths for filenames (for easier git support), to avoid modifying transformation metadata and to set custom variables or parameters.

Kettle unit testing ready for prime time

I've also created a plugin to make debugging transformations and jobs a bit easier. You can do things like set specific logging levels on steps (or only for a few rows) and work with zoom levels.

Right-clicking on a step, you can choose "Logging…" and set logging specifics.

Then, back on the subject of Neo4j, I've created a plugin to log the execution results of transformations and jobs (and a bit of their metadata) to Neo4j.

Graph of a transformation executing a bunch of steps. Metadata on the left, logging nodes on the right.

Those working with Azure might enjoy the Event Hubs plugins for a bit of data streaming action in Kettle.

The Kettle Needful Things plugin aims to fix bugs and solve silly problems in Kettle. For now it sets the correct local metastore on Carte servers and… features a new launcher script called Maitre. Maitre supports transformations and jobs, and local, remote and clustered execution.

The Kettle Environment plugin aims to take a stab at life-cycle management by allowing you to define a list of Environments:

The Environments dialog shown at the start of Spoon

In each Environment you can set all sorts of metadata, as well as the location of the Kettle and MetaStore home folders.

Finally, because downloading, patching, installing and configuring all of this is a lot of work, I've created an automated process which does it for you on a daily basis (for testing), so you can download Kettle Community Edition 8.1.0.0, patched to 8.1.0.4, with all the extra plugins above, in all its 1GB glory at: remix.kettle.be

To get it on your machine simply run:

wget remix.kettle.be -O remix.zip

You can also give these plugins (except for Needful Things and Environment) a try live on my sandbox WebSpoon server.  You can easily run your own WebSpoon from the Docker container, which is also updated daily.

If you have suggestions, bugs or rants, please feel free to leave them here or in the respective GitHub projects. Any feedback is, as always, more than welcome. In fact, thank you all for the feedback given so far; it's making all the difference. If you feel the need to contribute more opinions on the subject of Kettle, feel free to send me a mail (mattcasters at gmail dot com) to join our kettle-community Slack channel.

Enjoy!

Matt

Farewell Pentaho

Dear Kettle friends,

12 years ago I joined a wonderful team of people at Pentaho who thought they could make a real change in the world of business analytics. At that point I had recently open sourced my own data integration tool (then still called 'ETL') named Kettle, and so I joined in the role of Chief Architect of Data Integration. The title sounded great and the job included everything from writing articles (and a book), massive amounts of coding, testing, software releases, giving support, doing training, workshops, … In other words, life was simply doing everything I possibly and impossibly could to make our software succeed when deployed by our users. With Kettle now being one of the most popular data integration tools on the planet, I think it's safe to say that this goal has been reached and that it's time for me to move on.

I don't just want to announce my exit from Pentaho/Hitachi Vantara; I would also like to thank all the people involved in making our success happen. First and foremost, I want to express my gratitude to the founders (Richard, Doug, James, Marc, …) for even including a crazy Belgian like myself on the team, but I also want to extend my warmest thanks to everyone I got to become friends with at Pentaho for the always positive and constructive attitude. Without exaggeration I can say it's been a lot of fun.

I would also explicitly like to thank the whole community of users of Kettle (now called Pentaho Data Integration). Without your invaluable support in the form of new plugins, bug reports, documentation, forum posts, talks, … we could never have pulled off what we did in the past 12 years! I hope we will continue to meet at one of the many popular community events.

Finally I want to thank everyone at Hitachi and Hitachi Vantara for being such a positive and welcoming group of people. I know that Kettle is used all over Hitachi and I'm quite confident this piece of software will not let you down any time soon.

Now I'm going to go skiing for a week, and when I get back it's time to hunt for a new job. I can't wait to see what impossible problems need solving out there…

Cheers,
Matt

Pentaho Community Meetup 2016 recap

Dear friends,

I just came back from PCM16, the 9th annual edition of our European Pentaho Community Meetup.  We had close to 200 registrations for this event, of which about 150 showed up, making this the biggest one so far.  Even though veterans of the conference like myself really appreciate the warmth of previous locations like Barcelona and Cascais, I have to admit we got a great venue in Antwerp this year, with 2 large rooms, great catering and top-notch audiovisual support in a nice part of the city center. (Free high-speed Antwerp city WiFi, yeah!)

Content-wise everything was more than OK, with back-to-back presentations on a large variety of subjects and, I'm happy to say, lots of Kettle-related stuff as well.

For an in-depth recap of the content you can look here for the technical track and here for the other sessions.

Personally I was touched by the incredibly positive response from the audience after my presentation on the state of the PDI unit testing project. However, the big bomb was dropped when Hiromu Hota from Hitachi Research America started to present a new "WebSpoon" project. You could almost hear everyone think: "Oh no, not another attempt at making a new Kettle web interface". However, 2 minutes into the presentation everyone in the audience started to realize that this was the real, original Spoon with all the functionality it has, ported 1:1 to a web browser on your laptop, thin device, phone or tablet. Applause spontaneously erupted, Twitter exploded and people couldn't stop talking about it until I left the PCM crowd a day later. Now I'm obviously very happy we managed to keep the WebSpoon project a secret for the past few months, and it's impossible to thank Hota-san enough for traveling all the way to present at our event this weekend.

My heartfelt thanks also go out to Bart Maertens and the whole know.bi crew for making PCM16 a wonderful experience and an unforgettable event!

See you all at PCM17!

Matt

Pentaho Kettle 解决方案

您好球迷！ (Hello, fans!)

I just received a copy of the Chinese translation of Pentaho Kettle Solutions.

Right off the bat, when we made it open source over 8 years ago, Kettle became very popular in China.  Among the very first Chinese people who helped out with bug fixing and translation was Jason Chu, a Pentaho BI consultant who lives in Beijing.  Jason spent a few years of his life translating Pentaho Kettle Solutions, a book written by yours truly, Jos van Dongen and of course my colleague Roland Bouman.

The new book is called Pentaho Kettle解决方案: 使用PDI构建开源ETL解决方案 and is available from Amazon.cn.

Matt holding Chinese Pentaho Kettle Solutions book

While I can't read Chinese (at all), the look and feel of the book is excellent: the graphics are great and, for example, the screenshots were redone with Spoon in its Chinese localization. The book is published by Wiley, and from what I've learned about their way of publishing, that guarantees excellent results. Congratulations to all involved in making this book possible!

It's impossible to thank Jason enough for all the hard work he has done, now also in translating our book, and with this post I would like to recommend the end result to all Chinese readers!

Have fun with Kettle wherever you are!

Matt

Announcing Pentaho 5.0 Community Edition

Dear Pentaho friends,

It's been almost a year since we released Kettle 4.4 as part of the Pentaho 4.8.0 suite community edition release. Today we're very happy to announce the availability of Pentaho 5.0 community edition.

Now obviously, since this is open source software which is always available through continuous integration builds, we're talking about the availability of builds which have been sufficiently verified by Pentaho QA, community members, partners and so on to be considered a release.

Also brand new is the community landing page:

community.pentaho.com

On it you'll find convenient links to the downloads, the source code, the CI builds and so on.

Thank you all for your patience while we delivered this new step forward. In the coming weeks I'll do a series of "New in Kettle 5.0" blog posts to inform you about what we did. In short, these are the highlights:

  • Better previewing with dedicated window in run tab
  • Easier looping with transformation and job executor steps
  • Internal metrics gathering with Gantt chart
  • An improved plugin marketplace
  • Tons of Big Data improvements
  • Improved logging backend
  • Lots of new steps and job entries
  • A series of new plugin systems to make it easier to extend Kettle
  • Lots of new configuration options for tuning the engine back-ends
  • Integrated help system in the step and job entry dialogs

Even more important though:

  • Over 1500 bug fixes and feature requests
  • Metadata compatible with previous versions
  • API compatible with 4.x

I would like to thank and congratulate the whole community on this excellent release!

Happy downloading!
Matt

P.S. Keeping up with tradition, let me share some statistics… The Kettle project, trimmed down to not include any plugins, now contains over 840,000 lines of code. Kettle 4.4 was downloaded over 244,000 times in the last year, from SourceForge alone.

Figuring out the layout of a delimited file

Hey Kettle friends,

Recently I was asked whether PDI would be able to automatically figure out, at run time, what the layout of a delimited text file is, simply by looking at the header of the file and by doing various scans of the file to determine the delimiter, the data types of the columns and so on.

Now obviously you could also do without a header row, but then the columns would have to be called something like "field1" to "fieldN".

First, we want to see which delimiter, from a list of possible delimiters, is used in the file. We do this by reading every line of the file in question and doing a character-by-character analysis of each line. We count the number of occurrences of all possible delimiters, then pick the most often used delimiter at the end and set it as a variable.

Detecting the delimiter

The hardest part of this is doing the counting in JavaScript. We're using an associative array:

// A simple algorithm: count every comma, semicolon, slash and tab,
// then see which one is used most often.
//
var delimiters = [ ',', ';', '/', '\t' ];
var counters = new Array();
var i, d;
for (i = 0; i < line.length; i++) {
  var c = line.charAt(i);
  for (d = 0; d < delimiters.length; d++) {
    if (c == delimiters[d]) {
      if (counters[c] == null) {
        counters[c] = 1;
      } else {
        counters[c]++;
      }
    }
  }
}

// Pick the delimiter with the highest count...
//
var maxC = 0;
var max = 0;
for (i = 0; i < delimiters.length; i++) {
  if (counters[delimiters[i]] > max) {
    maxC = i;
    max = counters[delimiters[i]];
  }
}

var delimiter = "" + delimiters[maxC];
var count = counters[delimiters[maxC]];

So the above script populates an array of counters which is then evaluated to see which delimiter wins.  Obviously it should be easy to modify the list of possible participants in this delimiters game.
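
The winning delimiter then has to be made available to the next part of the flow; the field-name script below picks it up with getVariable("DELIMITER", ""). Setting it is typically done with a "Set Variables" step, but purely as an illustration, here is roughly what it could look like from the JavaScript step itself, assuming a single result row and the step's setVariable() function:

// Illustration only: store the winning delimiter in a variable so a
// subsequent transformation can read it with getVariable("DELIMITER", "").
// The third argument is the scope: "r" pushes the variable up to the root job.
setVariable("DELIMITER", delimiter, "r");

Keep in mind that variables set this way only become visible to whatever runs afterwards (for example the next transformation in the job), not to other steps of the same transformation.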

Now that we've found the delimiter of the file, we can go ahead and determine the field names from the header of the file:

Detect field names

Again, we're using a bit of JavaScript in the above:

// Split the header line on the delimiter we detected earlier...
var fields = line.split(getVariable("DELIMITER", ""));

// ... and output one row per field name.
for (i = 0; i < fields.length; i++) {
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();
  newRow[rowIndex++] = fields[i];
  putRow(newRow);
}

// "field" is declared here so it can be defined as an output field of the step.
var field = null;

As you can see, this script converts the header line into separate rows, one per field name.

Previewing the field names detection

OK, but this still doesn't solve the complete puzzle, as we still need to figure out the hardest part: the data types, lengths, precisions and conversion masks of the various fields.  To do this we pass the column names we detected into another transformation:

metadata-detector-parser

So the "CSV file input" step is configured automatically by the parent transformation to read all fields as Strings. Then another piece of JavaScript lets us reuse the functionality behind the "Get Fields" button of the "Text File Input" and "CSV file input" step dialogs, which automatically detects data types. In this case the work is done by a class called "StringEvaluator", which we'll reuse here:

// Main script (runs for every row): feed each field's String value to its
// own StringEvaluator. A sketch of how the "evaluators" list gets created
// follows below.
//
for (i = 0; i < getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);
}
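
For completeness: the "evaluators" list used above has to exist before the first row arrives. A minimal sketch of how it could be set up, assuming the class lives at org.pentaho.di.core.util.StringEvaluator and that its constructor accepts a "trim strings first" flag (check your Kettle version for the exact signature):

// Start script (runs once, before any rows): an empty list that will hold
// one StringEvaluator per input field.
var evaluators = new java.util.ArrayList();

// At the top of the main script, before the loop above: lazily create the
// evaluators once we know how many input fields there are.
while (evaluators.size() < getInputRowMeta().size()) {
  // Assumption: the constructor takes a boolean "trim strings first" flag.
  evaluators.add(new org.pentaho.di.core.util.StringEvaluator(true));
}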

At the end of the step's life, when the script ends, we have another block of code in the end script:

// Write the results as new rows: one output row per input field, carrying
// the data type, length, precision and conversion mask that the
// StringEvaluator advises.
//
for (i = 0; i < getInputRowMeta().size(); i++) {
  var evaluator = evaluators.get(i);

  // Ask the evaluator for its advice now that it has seen all the rows.
  var evalResult = evaluator.getAdvicedResult();
  var valueMeta = evalResult.getConversionMeta();

  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();

  newRow[rowIndex++] = getInputRowMeta().getValueMeta(i).getName();
  newRow[rowIndex++] = valueMeta.getTypeDesc();
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getLength());
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getPrecision());
  newRow[rowIndex++] = valueMeta.getConversionMask();
  newRow[rowIndex++] = valueMeta.getDecimalSymbol();
  newRow[rowIndex++] = valueMeta.getGroupingSymbol();

  putRow(newRow);
}

After some cleanup we finally have the data types and everything we need to properly read in the file.
You can find the complete example over here.  You can use it on Kettle version 4.4.0 or above.

The output of the example is an XML file called layout.xml. Obviously you can store it in different formats and/or inject it into another transformation. You could then, for example, do complete normalization of every column value and store it in a database table. Not that I would recommend such a generic structure, but it's possible.

Enjoy!

Matt