EC2 : Scaling large files with S3 and Kettle

Dear Kettle fans,

Here is a problem: you want to process a large amount of data, coming from one or more CSV files.  Suppose for example you have a file like this one: (don’t download all at once, it’s 25 million rows or 2.4GB big)

Now, reading that file is a problem.  You can do all sorts of interesting things in parallel, but unless you can read the file itself in parallel, you have yourself a nice bottleneck.

As such, we created the parallel CSV Input step to come to the rescue.  While this works fine on ordinary file systems, it doesn’t offer any help out on Amazon EC2.  Over there, we don’t really have a file system yet that offers parallel access in a scalable way.  I investigated the promising s3fs project, but that one retrieved complete files in stead of chunks of data and always saves to local file first.  That in itself then becomes the bottleneck.


After even more searching over the weekend, I ended up using the JetS3t Java API to create a new “S3 Csv Input” plugin (SVN: svn:// to read the data in parallel, unbuffered and in chunks using AWS (Amazon Web Services) in this case on the REST service they provide.

The plugin allowed us to learn some interesting things about EC2 and S3…

Performance throttling

The first thing we see is that a single node never consumes more than 40-something% CPU.  We suspect that this is because of price/performance reasons.  Remember, these servers cost $US0.10/hour.  Even given this fact and the fact that we’re basically running inside a virtual machine, we can still process around 30,000 rows/s on a single node.

It scales!

The performance of a single node isn’t really all that interesting.  What we want to know is : if we read the same file with 2, 5 or 10 nodes, how long does it take then?

Well, obviously I tried it and here are the results:

–> 1 node=842 seconds, 2 nodes=433 seconds, 5 nodes=170 seconds, 10 nodes=96 seconds

This means that on 10 nodes, you go 8.8 times faster which is actually quite nice considering startup costs, etc.

The transformation

The 10 nodes scenario transformation looks like this:

This means: A cluster schema with 10 slave servers (1 carte per server) and 4 step copies per slave server (total of 40 step copies each reading around 25M/40=625k rows).


The throughput of this solution comes down to a staggering 260,000 rows/second or 935M rows/hour.


If you would have processed these 935M rows, you would have spent $US 1,00 (EURO 0,64) in EC2 rental costs and a few cents extra for the data transfer costs.


I always make it a point to provide the technology for ETL developers to use, never to dictate how that technology should be used.  However, in this case, I’m sure you’ll agree that the possibilities are endless, certainly in those situations where you want to apply complex logic, heavy calculations on individual or aggregated rows, scoring, etc. Especially those situations where a single server isn’t enough and you want to throw a bunch of servers at it.   What we have shown is that you *can* split up the load over several servers to make it scale beyond a single machine.


The main limitation seems to be how fast you can get data to upload to S3 for processing on EC2… using Kettle… and how fast you can get the data back to your own servers.  Obviously, if your services are already hosted on EC2, then that problem doesn’t exist.

Until next time,


9 thoughts on “EC2 : Scaling large files with S3 and Kettle”

  1. Interesting stuff, especially the nearly perfect scaling. I”ve been digging into some aspects of Kettle’s performance lately. It would seem that data conversion from source input is pretty CPU-hungry, because the per node throughput seems to be:

    1 node: 2400/842 = 2.85mb/s
    10 nodes: 2400/96 = 25mb/s aggregate (2.5mb /s /node)

    It seems like a lot of effort to get 25mb/s given that your average SATA disk can
    get sequential read speeds of 40-50mb/s.

    I just tested a simple transform against a 2.5GB file with a series of generated UUIDs and random numbers to load the data and then sum up the random numbers. Sure enough, I can barely make the disk sweat, only managing 85k rows/s (4mb/s). The CPU on this box is a bit old though, but I would not have expected this.

    Is this the primary goal of lazy conversion and parallel loading — to speed up the CPU bottleneck of conversions?

    Sorry if i’m going a bit off topic…

  2. Alex, lazy conversion will indeed speed that up.

    I wrote another blog entry on the subject:
    And another one on performance :

    The high CPU usage basically is the “Java” price you pay for Unicode conversions and data type conversions.
    To get around that we invented the concept of lazy conversion to try and avoid doing unneeded data conversions.
    If you’re loading into a database, it doesn’t really matter all that much since that’s going to be the bottleneck.

    On the slowness of the EC2 nodes: nothing we can do about that really. It’s a price/performance thing. There are faster nodes to be had, but they go at $US 0.30/hour. Also please remember that the data is coming in over REST (web services). That is bound to be a bit slower than reading from local disks. Again, the point is that it scales. Perhaps if we would add another 10 nodes, it would scale even further. I’m lacking the time to test it out at the moment.


  3. Hi,

    This is great info – thanks for the post. I’m a little confused though – is there an Amazon EC2 AMI off the shelf that has PDI pre-loaded on it? Or, do you need to set one up manually to get it running on the cloud?


  4. Hi Alex,

    I removed the AMIs since they were a few years old and as such a “little bit” outdated.
    However, in our latest book “Pentaho Kettle Solutions” you can read all about setting up your own AMI on Amazon in great detail.
    You can also read about the basics in this blog post:

    Good luck!


  5. What if the file is Gzipped?

    Can Pentaho’s S3 Input module unzip the file? It doesn’t have the Content:Compression attribute that Text File Input has.

    It seems that if anyone who has 2.5GB of text data in S3 would gzip it, to save 90% on storage and transmission costs.

  6. Mike, the S3 reader step grabs small chunks of the file in parallel from AWS, over a web service in other words.
    I’m currently not aware of an algorithm that can de-compress these GZipped chunks of data. I think you need to read the whole file at once.
    Obviously, that largely defeats the purpose of the parallel reading test. So you might as well mount the S3 on your Linux filesystem and read the file with “Text File Input”.

    I’ve been looking into the BZip2 format which you *can* unzip in parallel. There are utilities like pbzip2 that compress/uncompress in parallel. An algorithm such as that would stand a chance to work on the chunks of data we get back from S3 but not GZip I think. If I can find a Java library that can un-compress bzip2 chunks I’ll probably implement it too 🙂


  7. Thanks, Matt!
    That approach might work for s3fuse on Linux/Mac, but I think Windows uses are out of luck 🙁

    wins3fs exists, but it doesn’t seem to be stable on Win 7.

    Laying aside the parallel-download aspect,
    do you know of any other way to get a file (more importantly, a batch of files) from S3 to disk in Kettle or a related tool, so that Spoon/Kettle can automate the S3 download and then read-from-gzip process ?

  8. Thanks for the pointer!

    the gz:http://bucket solution ought to work for public s3 buckets (that don’t require a secret key. Something to consider.

    For non-publically-available data files, I guess using a call out to a downloader script is the way to go.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.