PDI cloud: massive performance roundup

Dear Kettle fans,

As expected there was a lot of interest in cloud computing at the MySQL conference last week.  It felt really good to be able to pass the Bayon Technologies white paper around to friends, contacts and analysts.  It’s one thing to demonstrate a certain scalability on your blog, it’s another entirely to have a smart man like Nicholas Goodman do the math.

Sorting massive amounts of rows is a hard problem to take on.  Making it scale on low-cost EC2 instances is interesting as it proves a certain level of scalability.  Nick ran 40 EC2 nodes in parallel to do the work and saw that it was good.  450,000 rows/s for US$4.00/hour is not bad. Note: the tests sort 300M, 600M and 1.8B line-item rows from TPC-H, at scale factors 50, 100 and 300 respectively.
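As a back-of-the-envelope check, here is a small sketch that turns the figures above (450,000 rows/s aggregate throughput, roughly US$4.00/hour for the whole 40-node cluster) into wall-clock time and cost per test. The constants are taken from the numbers in this post; real EC2 billing rounds up to whole instance-hours, so treat the cost as a lower bound.

```python
# Rough arithmetic on the white paper's figures:
# 40 EC2 nodes sorting at ~450,000 rows/s combined, ~US$4.00/hour for the cluster.
ROWS_PER_SECOND = 450_000        # aggregate throughput observed in the tests
CLUSTER_COST_PER_HOUR = 4.00     # US$ for all 40 nodes together (approximate)

def sort_time_hours(rows: int) -> float:
    """Wall-clock hours to sort `rows` rows at the observed throughput."""
    return rows / ROWS_PER_SECOND / 3600

def sort_cost(rows: int) -> float:
    """Approximate EC2 cost in US$ (ignores per-hour billing round-up)."""
    return sort_time_hours(rows) * CLUSTER_COST_PER_HOUR

for rows in (300_000_000, 600_000_000, 1_800_000_000):
    print(f"{rows / 1e6:,.0f}M rows: "
          f"{sort_time_hours(rows):.2f} h, ~US${sort_cost(rows):.2f}")
```

At these rates even the 1.8B-row run finishes in a bit over an hour for well under five dollars, which is the point of the exercise.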

Certainly, the paper made it easier for me to point to PDI's scalability, and it opened some doors for further testing on big iron at Sun Microsystems.  It was great to talk to so many people.  I even walked up to the Amazon Web Services booth at the expo to ask about the EBS performance bottleneck that was exposed by the white paper.  “It’s being worked on” was the reply 🙂

The most interesting thing about the PDI cloud integration work is that there don’t seem to be a lot of other ETL tool vendors doing it.  In fact, after a Google search or two I could only find Informatica with a SaaS (not even IaaS) offering, and I kinda doubt that closed source software is a good match for cloud computing.

So I went out there and did a presentation on the subject to explain to people how they would set it up for themselves.  The open source way is to not only do the marketing but to allow people to run their own tests and see for themselves.  That way you get valuable feedback to improve your offering.

Here is a copy of the presentation I gave: Cloud Computing with MySQL and Kettle.

I thought it was a good session although for once I didn’t get “The Question”, you know the one where people ask me how Kettle is different from Talend and where I get to comment on their lack of scalability.  Oh well, I guess you can’t win them all 🙂

Finally, people have been asking me about integration with both SQLStream on the one hand and MapReduce/Hadoop/Hive/HDFS on the other hand.  I’m happy to say that the former is in progress and that I’ve started talks with the fine folks from Cloudera to get started on the latter.  I simply loved Aaron Kimball’s tutorial @ MySQL Conf on the MapReduce subject and think that there is a lot of potential for integration with PDI to make us scale even better.

Until next time,


Next week: MySQL UC

Dear Kettle & MySQL fans!

I’m really looking forward to going to the MySQL User Conference next week, not just because I’m speaking in 2 sessions again, but perhaps also because these are “interesting” times for MySQL and Sun Microsystems.  Pivotal times, it would seem.

Here are the 2 sessions I’m going to do:

  • Cloud Computing with MySQL and Kettle : I’m particularly happy that MySQL accepted this session: it will demonstrate how easy it has become to do cloud computing exercises with tools like MySQL and Kettle.

So please drop in on our sessions and join the fun.  2 years ago my sessions drew quite a crowd and so I hope that this is again the case.  Pentaho is a sponsor of the event and even has a booth (#308) on the main show floor.  You can find me there to chat on Tuesday & Wednesday afternoon (1pm-4:30pm).  I’ll be there together with a group of people from Pentaho including Julian Hyde, James Dixon, Lars Nordwal, Lance Walter, Matt Papertsian & Jared Cornelius.

On Thursday I’ll be visiting the sages from SQLStream in the morning to talk about integrating their technology to create truly real-time data integration solutions without the need to fork over insane amounts of money.  Later that day we’ll all go see John Sichi’s session at the nearby (same building) Percona Performance Conference.

See you soon!