EC2 : Scaling large files with S3 and Kettle

Dear Kettle fans,

Here is a problem: you want to process a large amount of data, coming from one or more CSV files.  Suppose for example you have a file like this one: (don’t download all at once, it’s 25 million rows or 2.4GB big)

Now, reading that file is a problem.  You can do all sorts of interesting things in parallel, but unless you can read the file itself in parallel, you have yourself a nice bottleneck.

As such, we created the parallel CSV Input step to come to the rescue.  While this works fine on ordinary file systems, it doesn’t offer any help out on Amazon EC2.  Over there, we don’t really have a file system yet that offers parallel access in a scalable way.  I investigated the promising s3fs project, but that one retrieved complete files in stead of chunks of data and always saves to local file first.  That in itself then becomes the bottleneck.


After even more searching over the weekend, I ended up using the JetS3t Java API to create a new “S3 Csv Input” plugin (SVN: svn:// to read the data in parallel, unbuffered and in chunks using AWS (Amazon Web Services) in this case on the REST service they provide.

The plugin allowed us to learn some interesting things about EC2 and S3…

Performance throttling

The first thing we see is that a single node never consumes more than 40-something% CPU.  We suspect that this is because of price/performance reasons.  Remember, these servers cost $US0.10/hour.  Even given this fact and the fact that we’re basically running inside a virtual machine, we can still process around 30,000 rows/s on a single node.

It scales!

The performance of a single node isn’t really all that interesting.  What we want to know is : if we read the same file with 2, 5 or 10 nodes, how long does it take then?

Well, obviously I tried it and here are the results:

–> 1 node=842 seconds, 2 nodes=433 seconds, 5 nodes=170 seconds, 10 nodes=96 seconds

This means that on 10 nodes, you go 8.8 times faster which is actually quite nice considering startup costs, etc.

The transformation

The 10 nodes scenario transformation looks like this:

This means: A cluster schema with 10 slave servers (1 carte per server) and 4 step copies per slave server (total of 40 step copies each reading around 25M/40=625k rows).


The throughput of this solution comes down to a staggering 260,000 rows/second or 935M rows/hour.


If you would have processed these 935M rows, you would have spent $US 1,00 (EURO 0,64) in EC2 rental costs and a few cents extra for the data transfer costs.


I always make it a point to provide the technology for ETL developers to use, never to dictate how that technology should be used.  However, in this case, I’m sure you’ll agree that the possibilities are endless, certainly in those situations where you want to apply complex logic, heavy calculations on individual or aggregated rows, scoring, etc. Especially those situations where a single server isn’t enough and you want to throw a bunch of servers at it.   What we have shown is that you *can* split up the load over several servers to make it scale beyond a single machine.


The main limitation seems to be how fast you can get data to upload to S3 for processing on EC2… using Kettle… and how fast you can get the data back to your own servers.  Obviously, if your services are already hosted on EC2, then that problem doesn’t exist.

Until next time,


Give MySQL a break please

In a unique display of mass hysteria, one blogger after the other and even slashdot (no, I’m not going to link) managed to take the completely innocent message that certain new enterprise features might get released as closed source only and turn it into an ongoing bad press onslaught about “MySQL closing down source code”.

Why don’t you all give MySQL a break here please?  The rule is always the same for everybody: the one that writes the code gets to pick the license.  Listen, I 100% believe in open source and I consider myself to be a big advocate, but commercial open source companies like MySQL (and Pentaho) are commercial entities.  At lease try to put yourself in their position for a second.  For example, if a customer asks you to NOT to release a piece of software they paid for, you don’t release it, it’s that simple.

In the end, what MySQL is doing is simple: they are experimenting with a commercial open source  (COS) model.  Why are they experimenting?  Because the concept of COS is very new and there are no clear guidelines.  It simply hasn’t been done before.  How do you keep growing?  How do you keep paying more open source developers?  How do you pay for the millions of web hits each day?  How do you pay for the millions of downloads, the Tera bytes of internet traffic?  How do you guarantee your long term survival?  How do you strike a balance between commercial success and widespread open source adoption?  How do you keep your investors happy as well as your community?

I guess we learned one thing the past week : it’s easier to spout criticism than to give answers to these tough questions.


3.1.0-M1 is out

For those of you that missed it: Pentaho Data Integration (Kettle) version 3.1.0-M1 is out.

A lot of new goodies are in that first milestone (development) release such as zooming, snap-to-grid, results pane integration, a series of bug fixes, improved translations, performance enhancements, parallel job entry execution, a new data validation step, a new Greenplum bulk loader step, a new Property Input step, new job entries SSH2GET and SSH2PUT, a new Split Field To Rows step (ported from plugin) and much much more.

Until next time,

Korean Kettle : ko_KR

Only a few weeks after we started to receive a lot of Japanese translations, Kim YoungWoo offered to do Korean translations.  A lot of translations came in over the weekend for both languages (as well as a host of other fixes) :

Here is the list of languages supported with the % complete next to it:

en_US : 99,80% complete  (7956)
fr_FR : 98,36% complete  (7841)
it_IT : 89,30% complete  (7119)
es_AR : 75,63% complete  (6029)
de_DE : 56,41% complete  (4497)
ja_JP : 56,26% complete  (4485)
zh_CN : 53,94% complete  (4300)
es_ES : 48,19% complete  (3842)
nl_NL : 18,09% complete  (1442)
pt_BR : 15,48% complete  (1234)
pt_PT : 15,48% complete  (1234)
ko_KR : 10,99% complete  (876)

Thanks again to all the great work done by all translators!


Revamping Spoon

Dear Kettle fans,

Together with the Pentaho UI engineers we’re trying to address a number of usability issues in Spoon. This is what we’ve come up with so far:

You can minimize, maximize, hide or show the execution results panel with a single mouse click.

Let us know what you think of the changes so far. So far we’ve been focusing on the right side of the screen. Next on the agenda is the left side panel.

Until next time,

Pentaho Community Gathering – Mainz 2008

What can I tell you?  Beyond being a great meeting place and information exchange oppertunity, it’s going to be a lot of fun with most if not all lead developers at Pentaho coming over : Thomas Morgner (Reporting, organizing), Mark Hall (Mining), Julian Hyde (Analyses) and many others, including myself and a crowd from ##pentaho.  There are even a few people like Doug coming over from the US in a desperate but futile attempt to try and understand the mysterious ways of us Europeans…

Local entertainement in Mainz

If there is interest, I can car pool from Brussels to Mainz.  Let me know far enough in advance.

Until next time,


Mantle : Pentaho platform on fire

Once in a while there are changes in software that are interesting to watch and witness.  Mantle, a new incubation project from one of Pentaho’s finest Mike D’Amour, is such a change.

Mantle is a new shell, a new layer on top of the normal Pentaho platform.  The new thing about it is that it uses AJAX (Web 2.0) using the Google Web Toolkit (GWT).   The interesting feature of GWT(*) is that it allows Java developers (like most of the devs at Pentaho) to write code that gets cross compiled into JavaScript.

One of the first things you’ll notice when you’ll hit the proof of concept site over at (forwards to an Amazon EC2 test node, be gentle) is that it is fast despite being run on a virtual server.

Not only that, right clicking on actions, reports, analyses, etc allows you to set properties, schedule, etc:

Mantle with right click in tree

It allows for a very rich experience for users as well as administrators.  One sample is the setting of security on a report:

IMHO, what is really interesting about it all is that with his in-depth knowledge of the Pentaho platform, Mike has been able to write his new Mantle shell in only a few weeks time. (in his spare time!)

As is the case with all of the work we do, but especially with proof of concepts like Mantle, we would like to invite you to give feedback on this great initiative, either here, on our ##pentaho IRC channel (on Freenode) or as issues over at the Mantle project.

Until next time,


(*) That is besides the fact that GWT carries an open source Apache license (no vendor lock-in folks) is highly optimized both in terms of JavaScript and download speed (faaaast!!), is browser independent (lowers our software maintenance cost), is well documented, has a large community and a nice API 🙂