Kettle at the MySQL UC 2009

Hello Kettle fans,

Like Roland, I got confirmation earlier this week that I could present my talk on “MySQL and Pentaho Data Integration in a cloud computing setting” at the next MySQL user conference.

I’m very excited about the work we’ve done on the subject and it’s going to be great talking about it in April.

See you there!
Matt

Pentaho Data Integration vs Talend (part 1)

Hello data integration fans,

In the course of the last year or so there have been a number of benchmarks on blogs here and there that claimed a certain “result” pertaining to the performance of both Talend and Pentaho Data Integration (PDI, a.k.a. Kettle).  Usually I tend to ignore these results, but a recent benchmark got so far off track that I finally had to react.

Benchmarking itself is a very time-consuming and difficult process and in general I advise people to do their own.  That being said, let’s attack a first item that appears in most of these benchmarks: reading and copying a file.

Usually the reasoning goes like this: we want to see how fast a transformation can read a file and then how fast it can write it back to another file.  I guess the idea behind it is to get a general sense of how long it takes to process a file.

Here is a PDF that describes the results I obtained when benchmarking PDI 3.1.0GA and TOS 3.0.2GA.  The specs of the test box, etc., are also in there.

Reading a file

OK, so how fast can you read a file, how scalable is that process and what are the options?  In all situations we’re reading a 2.4GB test file with random customer data.  The download location is in the PDF on page 4 and elsewhere on this blog.

Remember that the architectures of PDI and Talend are vastly different so there are various options we can set, various configurations to try…

1) Simply reading with the “CSV File Input” step, lazy conversion enabled (see the sketch below for what that means), 500k NIO buffer: 150.8 seconds on average for PDI. Talend performs this in 112.2 seconds.

2) This test configuration is identical to 1), except that PDI now runs 2 copies of the input step.  Results: 94.2 seconds for PDI.  This test is not possible in Talend since the generated Java code is single-threaded.
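
For those wondering what “lazy conversion” actually does: the idea is to keep each field as the raw bytes that were read from the file and only convert them to a typed value (number, date, ...) when a later step really needs it.  Below is a minimal sketch of that idea in plain Java; it is not the Kettle implementation, and the class and method names are made up for illustration.

    import java.nio.charset.StandardCharsets;

    // Minimal illustration of the lazy conversion idea: keep the raw bytes around
    // and only parse them when (and if) a typed value is actually requested.
    public class LazyField {
        private final byte[] raw;   // bytes exactly as they were read from the file
        private Long parsed;        // cached result of the conversion

        public LazyField(byte[] raw) {
            this.raw = raw;
        }

        // Cheap: hand the bytes straight back, e.g. to write them to another file.
        public byte[] getRawBytes() {
            return raw;
        }

        // Expensive: only done when a step really needs the numeric value.
        public long getLongValue() {
            if (parsed == null) {
                parsed = Long.parseLong(new String(raw, StandardCharsets.US_ASCII).trim());
            }
            return parsed;
        }

        public static void main(String[] args) {
            LazyField id = new LazyField("12345".getBytes(StandardCharsets.US_ASCII));
            System.out.println(id.getRawBytes().length);  // no parsing happened yet
            System.out.println(id.getLongValue());        // parsed on first use
        }
    }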

Reading a delimited file, time in seconds, lower is better

There is a certain scalability advantage in being able to read and process files on multiple CPUs and even multiple systems across a SAN.  Talend has a serious limitation here since it can’t do that.  A 19% speed advantage for PDI may be inconsequential for simple reads, but it becomes brutal in more complex situations, with very large files and/or lots of CPUs/systems involved.  For example, we have customers that read large web log files in parallel over a high-speed SAN across a cluster of 3 or 4 machines.  Trust me, a SAN is typically faster than what any single box can process.
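
To make the parallel reading idea a bit more concrete, here is a rough Java sketch of the mechanism: a couple of threads each read their own region of the same file through an NIO channel, which is roughly what running 2 copies of the input step amounts to.  This is not the Kettle code itself, the file name is a placeholder, and line-boundary handling and row parsing are left out.

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Two threads each read their own region of the same file, roughly what
    // running two copies of an input step amounts to.
    public class ParallelRead {
        public static void main(String[] args) throws Exception {
            final String path = "customers.txt";   // placeholder for the 2.4GB test file
            final long size = new File(path).length();
            final int copies = 2;                   // the "number of copies" of the reader
            Thread[] readers = new Thread[copies];

            for (int i = 0; i < copies; i++) {
                final long start = i * (size / copies);
                final long end = (i == copies - 1) ? size : (i + 1) * (size / copies);
                readers[i] = new Thread(() -> {
                    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                         FileChannel channel = raf.getChannel()) {
                        ByteBuffer buffer = ByteBuffer.allocate(500 * 1024);  // 500k NIO buffer
                        long pos = start;
                        while (pos < end) {
                            buffer.clear();
                            int read = channel.read(buffer, pos);
                            if (read < 0) break;
                            pos += read;
                            // a real reader would re-align on line endings and parse rows here
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
                readers[i].start();
            }
            for (Thread reader : readers) {
                reader.join();
            }
        }
    }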

Writing back the data

The test itself is kinda silly, but since it is being carried around in the blogosphere, let’s set a reference: a plain copy command.  I simply copied the file and timed how long it took.  That particular copy, from my internal disk to an external USB 2.0 disk, set a reference time of 122.2 seconds. (For the exact configuration, see the PDF.)
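
For what it’s worth, the reference measurement itself is as simple as it sounds: note the clock, copy, note the clock again.  A small sketch along these lines (the file names are placeholders):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Time a plain file copy to get a baseline to compare the tools against.
    public class CopyBaseline {
        public static void main(String[] args) throws Exception {
            Path source = Paths.get("customers.txt");                      // placeholder names
            Path target = Paths.get("/media/usbdisk/customers-copy.txt");

            long startNanos = System.nanoTime();
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            long elapsedNanos = System.nanoTime() - startNanos;

            System.out.printf("copy took %.1f seconds%n", elapsedNanos / 1e9);
        }
    }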

3) Since reading in parallel is the fastest option for PDI, we retain that option and write the data back to a single target file.  PDI handles this in 196.2 seconds.  Talend can’t read in parallel, so we don’t have any results there.

4) A lot of times, these newly generated text files are just temporary files for upstream processes.  As such it might (or might not) be possible to create multiple target files.  This would increase the parallelism both in this challenge and in the upstream tasks.  PDI handles this task in 149.3 seconds.  Again, I didn’t find any parallelization options in TOS.  (A small sketch of this multi-file approach follows below the chart.)

5) Since neither 3) nor 4) is possible in Talend, I tried the single delimited reader/writer approach.  That one ran for 329.4 seconds.

Reading/writing a delimited file, time in seconds, lower is better
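
Here is a crude, single-threaded Java sketch of the multi-file idea from test 4): spread the rows over several target files in round-robin fashion so that subsequent tasks can pick the pieces up and process them in parallel.  In PDI the engine distributes the rows over the step copies for you; the file names and sample rows below are made up.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Spread the output rows over several target files in round-robin fashion so
    // that subsequent jobs can pick the pieces up and process them in parallel.
    public class MultiFileWriter {
        public static void main(String[] args) throws IOException {
            int targets = 3;   // number of output files
            BufferedWriter[] writers = new BufferedWriter[targets];
            for (int i = 0; i < targets; i++) {
                writers[i] = Files.newBufferedWriter(
                    Paths.get("customers-out-" + i + ".txt"), StandardCharsets.UTF_8);
            }

            String[] rows = {"row one", "row two", "row three", "row four"};  // stand-in data
            for (int rowNumber = 0; rowNumber < rows.length; rowNumber++) {
                BufferedWriter writer = writers[rowNumber % targets];
                writer.write(rows[rowNumber]);
                writer.newLine();
            }

            for (BufferedWriter writer : writers) {
                writer.close();
            }
        }
    }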

CPU utilisation

I also monitored the CPU utilisation of the various Talend jobs and Kettle transformations and came to the conclusion that Talend will never utilise more than 1 CPU, while Kettle uses whatever it needs and can get its hands on.  For the single-threaded scenario, the CPU utilisation is on par with the delivered performance of both tools.  There doesn’t seem to be any large difference in efficiency.
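
If you want to reproduce the CPU observations, simply watching the processes in top (or the Windows task manager) works fine.  For the curious, here is a small Java sketch that polls the CPU load of a JVM from the inside; it assumes a reasonably recent JVM where the com.sun.management extension of OperatingSystemMXBean is available, and it is only an illustration of the idea, not how the numbers above were gathered.

    import java.lang.management.ManagementFactory;

    // Poll the CPU load of the current JVM once a second.  On an n-core machine a
    // purely single-threaded job never gets much above 1/n of the total capacity.
    public class CpuWatcher {
        public static void main(String[] args) throws InterruptedException {
            com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            int cores = Runtime.getRuntime().availableProcessors();

            for (int i = 0; i < 60; i++) {
                // between 0.0 and 1.0 across all cores; negative until the first sample is ready
                double load = os.getProcessCpuLoad();
                System.out.printf("JVM CPU load: %.0f%% of %d cores%n", load * 100, cores);
                Thread.sleep(1000);
            }
        }
    }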

Conclusion

Talend wins the first test with its single-threaded reading algorithm.  I think its overhead is lower because it doesn’t run in multiple threads. (Don’t worry, we’re working on it :-))  In all the other, more complex situations, where you can indeed run in multiple threads, there is a severe performance advantage to using Kettle.  In the file reading/writing department, for example, PDI running in 3 threads with lazy conversion beats Talend by being more than twice as fast in the best-case scenario and 65% faster in the worst case.

Please remember that my laptop is not by any definition “high end” equipment, and that dual- and quad-core CPUs are commonplace these days.  It’s important to be able to use them properly.

The source code please!

Now obviously, I absolutely hate it when people post claims and benchmarks without backing them up.  Here you can find the PDI transformations and TOS jobs that were used.  With a little bit of modification I’m sure you can all run your own tests.  Just remember to be critical, even concerning these results!  Trust me when I say I’m not a TOS expert 🙂  Who knows, perhaps I used a wrong setting in TOS here or there.  All I can say is that I tried various settings and these seemed the fastest for TOS.
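
(Side note: if you haven’t run transformations outside of Spoon before, the .ktr files can also be launched from the command line with the Pan tool that ships with PDI, something along the lines of “sh pan.sh -file=read_file.ktr”, where the file name is whatever the transformation in the download is actually called.  That makes it easy to wrap the runs in a small timing script.)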

Remember also that if even a simple “file copy” can be approached with various scenarios, this certainly goes for more complex situations as well.  The other tools out there deserve that much credit too.  Just because Talend can’t run in multiple threads doesn’t mean that Informatica, IBM, SAP and the rest are not capable of doing so.

If I find the time I’ll post parts 2 and 3 later on.  Feel free to propose your own scenarios to benchmark as well.  Whatever results come of it, it will lead to the betterment of both open source tools and their communities.

Until next time,
Matt

Flying high in economic storms

Next week I’ll be in Orlando for another week of brainstorming, planning, scheming, plotting for world domination and yes, even coding.

Q : “What are you going to do next week?”
A : “The same thing I do every time when I’m in Orlando – Try to take over the world!”

So I went to Kayak and entered my flight preferences: leave and return on a Sunday, giving me a full week over there.  I was almost shocked to see that the same flight I took 2 months ago now costs less than a third of the price:

  • July 13th 2008 : BRU/MCO – MCO/BRU (over FRA) : 2,400 USD (summer time folks!)
  • October 12th 2008 : BRU/MCO – MCO/BRU (over IAD) : 2,000 USD
  • December 7th 2008 : BRU/MCO – MCO/BRU (over PHL) : 600 USD

Typically I don’t select the cheapest flight, as that would put me on an 18-hour layover in Bangkok or something like that. In the past I once spent 8 hours at Chicago airport and trust me, it’s not worth the 100 USD you can save.  You’ll spend it on Internet access, food, “beverages”, magazines, etc.

That being said, the December 7th flight is the cheapest flight with “only” 1 layover.

In the past I noticed that the airlines kept adding more and more options for me to fly to Orlando, or at least across the Atlantic Ocean.  Now that the economic downturn is upon us, perhaps there’s finally a bit of over-capacity.  After all, the last 5 flights I took from Brussels to the US were fully booked.  That’s right, folks: listening to hollering kids with Mickey Mouse ears for 9 hours straight.  Even noise-cancelling headsets have a hard time with that kind of noise.

Electronic System for Travel Authorization

Another thing of interest for the geeks among you is that you are now encouraged to apply for authorization to enter the US well in advance, replacing the hand-written green “Visa Waiver” documents.  Nobody makes a fuss about it, but from what you can read, registration is mandatory for us Europeans from January 12th, 2009 onward.  I’m sure there will be freedom fighters here and there who will be up in arms over this sort of program, but personally I’m glad that we can finally fill in those green “waiver” documents electronically at home.  From the looks of it, there’s nothing on there that you don’t already fill in manually now. (It felt kinda familiar filling them in.)

I’ll let you know how they perceive my eagerness to fill in these “hidden” electronic documents at the US border next week 🙂

Until then,
Matt