Kettle at Talend

Dear open source ETL friends,

Today I met with the nice folks from Talend in their offices in Paris. Contrary to what some of you might conclude from the sometimes heated technical debates on the internet, it was nice to hear that the folks from Talend and I share the same view of our position in the market: we don’t see each other as competitors. (**)

Some of you might think that this sounds a lot like a marketing slogan, so let me expand upon it a bit. The fact of the matter is that Talend and Pentaho Data Integration are sufficiently different in conception, architecture and implementation that they are in fact two distinct choices on the ETL market. While some people prefer one tool and some prefer the other, the choice is there. Having the opportunity to try out both tools for free, and to have that choice at all, is one of the most important differentiators from the traditional closed source ETL companies.

Now, don’t get me wrong: obviously, as the Pentaho Chief of Data Integration, I think that our architecture is a lot better. 🙂 However, the people from Talend have the same opinion about their ETL tool. That’s the way it should be. This is a good and healthy thing.

That should not relieve us of our responsibility, as well-behaved citizens of Open Source land, to at least try to find common ground on certain issues.
If KDE and GNOME can agree on certain desktop standards, and if Compiz and Beryl can join forces again after a rocky episode, anything should be possible.

So it was a pleasure to find out that indeed we did find some small points to work on together. I’m hoping we’ll be able to let you in on the details real soon.

I do hope that both communities read this message and act accordingly, in the same sharing, cooperative and dignified fashion. In that regard, I want to remind my readers that Pentaho Data Integration was only possible thanks to the incredible work put into dozens of fine open source libraries. I’m pretty sure the same goes for Talend.

Think about it for a minute… We’re both outpacing and outselling the traditional closed source ETL vendors, probably by a factor of 10 to 1. That is not because of our differences, but because of our similarities.

Until next time,

Matt

(**) Driving home at 280 km/h on the high-speed train, I also came to the conclusion that I really enjoyed that red wine during lunch 😉

If you don’t know… you don’t know!

Dear friends,

We had a good time the last couple of days in Lyon. It was nice to meet people actually using Kettle in their day-to-day jobs. It was great to hear all about the success stories and the future plans.
I also had the pleasure of meeting Alain, project manager at BPM, all-around nice guy and philosophical genius. At a certain point in our conversation he dropped this almost Zen-like piece of knowledge on us:

If you don’t know, you don’t know!

Boy, it doesn’t get any deeper than that, does it? In one simple sentence, Alain described the core problem of BI gap analyses AND the number one problem in software development. If you don’t know, you don’t know: you need to ask your customer, you need to get to know the requirements. All too often, developers, data warehouse designers and ICT people in general decide for the end-users what is right for them. All too often it puts serious strain on the projects. In fact, I’m absolutely sure it’s the #1 cause of project failures. (**)

So next time you decide what is the best thing for your customers and end-users, remember Alain’s wise words.

Until next time!

Matt

(**) I don’t have any statistics available on this subject, of course. If I’m proven wrong later I’ll blame it on Alain “The Crazy Swiss” Debecker.

Kettle version 2.5.2 available

Dear Kettle fans,

A lot of things have happened in terms of development in the 3.x codebase.
However, that doesn’t mean we don’t maintain the 2.5 tree anymore. A lot of fixes were back-ported from the 3.x tree and we managed to add a few new job entries as well.
Click here for a full list of the changes versus 2.5.1. A 2.5.3 source tree has been opened in which we will continue to fix problems, at least for a while longer. We will no longer add new features to this version.

Grab the goodies over here:

Binary: Kettle-2.5.2.zip
Source: Kettle-src-2.5.2.zip
Windows installer: PDI-2.5.2.exe

Instructions on how to install using the executable are over here. Just make sure to use a Microsoft Windows operating system to install on. 🙂

Version 2.5.2, like its predecessors, runs on version 1.4 or higher of the Java Runtime Environment.

Enjoy the release!

Best regards,

Matt

Ongoing spam assault

Hi friends,

There’s always a price to pay. It seems that my blog has been getting more and more popular lately. However, it has also become a spam magnet. It’s not just that the number of spam comments has increased to more than 65,000…

Apparently, there are some unfixed security holes in the WordPress software running this blog. Those can be exploited by spammers to drop all kinds of stupid links into the blogroll. That’s the reason that section on the right is gone now.

I wasn’t really in need of, or trying to sell you, those items you may have seen on occasion 😉 If you did see them, my apologies.

If anyone knows a good solution for this problem, feel free to drop me a note!

Until next time,

Matt

Test case: fast parallel flat file reading

Version 3 of Pentaho Data Integration will feature two new steps to load flat files:

  • CSV Input: to read delimited text files
  • Fixed Input: to read fixed width flat files

Neither step was designed to be as versatile as possible; we already have the regular “Text File Input” step for that. These steps were designed to be as fast as possible.

Here are the reasons why these steps are fast:

  • They use non-blocking I/O (NIO) to read data in big chunks at a time
  • They have greatly simplified algorithms without any bloat
  • They allow us to read data using lazy conversions

Besides these items, we also greatly reduced the overhead of garbage collection in the Java virtual machine and the metadata handling in our new version.
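
To give you a rough idea of what “reading in big chunks with NIO” means, here is a minimal sketch in plain Java. This is not the actual CSV Input or Fixed Input source code, just an illustration of the chunked reading idea (a real reader would also split the buffer into rows and fields, and with lazy conversion it would keep the raw bytes around until a field is actually used):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class NioChunkReader {
    public static void main(String[] args) throws IOException {
        // Read the file in large 1MB chunks instead of line by line.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        long totalBytes = 0;

        FileChannel channel = new FileInputStream(args[0]).getChannel();
        try {
            while (channel.read(buffer) != -1) {
                buffer.flip();                   // switch the buffer to reading mode
                totalBytes += buffer.remaining();
                // ... scan the buffer for row and field boundaries here ...
                buffer.clear();                  // make room for the next chunk
            }
        } finally {
            channel.close();
        }
        System.out.println("Read " + totalBytes + " bytes");
    }
}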

So where does that leave us? How fast can we read data now? Well, let’s take a really big file and try it out for ourselves to see.

Generate a text file

A test file is a problem because, for reference, you want everyone to use the same text file, and yet you can’t just post a 15GB text file on an FTP server somewhere (it’s just not practical).

So you need to generate one. I created a small C program to handle this: printFixedRows.tar.gz

A Linux (x86) binary is included in the archive, but you can compile the C program for your own platform with the following command:

cc -o printFixedRows printFixedRows.c

Then you can launch this command to get a 15GB test file:

./printFixedRows 10000000 150 N > bigfixedfile.txt

This generates 10 million rows of 1529 bytes each (a total file size of 15,290,001,529 bytes).
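
In case you can’t compile C on your platform, a quick-and-dirty Java equivalent is easy enough to write. The sketch below is not the printFixedRows.c source: the field layout is invented purely for illustration, it simply pads a few fields to fixed widths and writes the requested number of rows to standard output (e.g. “java PrintFixedRows 10000000 > bigfixedfile.txt”).

import java.io.BufferedOutputStream;
import java.io.PrintStream;

// Hypothetical stand-in for printFixedRows.c: writes <rows> fixed-width rows
// to standard output. The field layout is invented for illustration only.
public class PrintFixedRows {
    public static void main(String[] args) {
        long rows = Long.parseLong(args[0]);
        PrintStream out = new PrintStream(new BufferedOutputStream(System.out, 1024 * 1024));

        // One header line, followed by the data rows.
        out.println(pad("id", 10) + pad("name", 30) + pad("birthdate", 10) + pad("amount", 15));
        for (long i = 0; i < rows; i++) {
            out.println(pad(Long.toString(i), 10)
                      + pad("customer-" + (i % 1000), 30)
                      + pad("2007-10-12", 10)
                      + pad(Double.toString(i * 1.5), 15));
        }
        out.flush();
    }

    // Right-pad a value with spaces up to a fixed field width.
    private static String pad(String value, int width) {
        StringBuffer padded = new StringBuffer(value);
        while (padded.length() < width) padded.append(' ');
        return padded.toString();
    }
}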

Reading a text file

Now that we’re certain that the system cache is not going to make too much of a difference in the results (a 15GB file won’t fit in memory), we’re going to read this file.

Here is the transformation to read the file.

Parallel reading transformation

Performance monitoring

If we adjust the transformation and point it to the correct filename, we can run it.

We can then see that the performance is around 12,000 rows/s, or 12,000 × 1529 bytes/s ≈ 18MB/s.
Each row has 152 fields of 4 different data types (String, Integer, Date and Number). As such, this step produces about 1.8 million data fields per second.

In fact, if we look at the output of “iostat -k 5”, we basically find the same numbers:

iostat results

We also note that the CPU usage is very low, at around 22%, and that the I/O wait percentage is very high. In other words: our transformation is waiting for the disk to hand over data. The NIO system really pays off here by reading large blocks from disk at a time, offering great performance.

Parallel reading

At first glance, parallel reading wouldn’t do us any good here since the disk can’t go any faster. Having 2 processes read the same file at the same time is not the solution.

So I copied the file over to another disk and made symbolic links in the /tmp folder:

  • /tmp/0 for step copy 0 with a sym-link to biginputfile.txt on disk 1
  • /tmp/1 for step copy 1 with a sym-link to biginputfile.txt on disk 2

In my case it’s this:

0/bigfixed.txt -> /bigone/bigfixed.txt
1/bigfixed.txt -> /parking/bigfixed.txt

Please forget for a minute the fact that copying the data to another disk is slower than just reading it in the first place. I’m sure you all have fancy RAID arrays that do this transparently; I’m just a poor blogger, remember? 🙂 If you have a fast disk, you can simply fire up 2 or more copies of the step, without the trick above, to get the speedup.
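
For the curious, this is roughly the idea behind parallel reading of a fixed-width file: each step copy only reads its own slice of the data. The sketch below is not the actual Kettle code (and it ignores the header line), it just shows how copy N of M could position itself in the file and read its part in big chunks:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Illustration only: copy <copyNr> out of <nrCopies> reads its own slice of a
// fixed-width file, so multiple copies (or disks) can work in parallel.
public class ParallelSliceReader {
    public static void main(String[] args) throws IOException {
        String filename = args[0];
        int rowLength = Integer.parseInt(args[1]);  // e.g. 1529 bytes per row
        int copyNr    = Integer.parseInt(args[2]);  // this copy: 0 .. nrCopies-1
        int nrCopies  = Integer.parseInt(args[3]);

        FileChannel channel = new FileInputStream(filename).getChannel();
        long totalRows = channel.size() / rowLength;
        long firstRow  = copyNr * totalRows / nrCopies;
        long lastRow   = (copyNr + 1) * totalRows / nrCopies;   // exclusive

        // Jump straight to the first row of our slice...
        channel.position(firstRow * rowLength);

        // ...and read the slice in big chunks.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        long bytesLeft = (lastRow - firstRow) * rowLength;
        while (bytesLeft > 0) {
            buffer.clear();
            buffer.limit((int) Math.min(buffer.capacity(), bytesLeft));
            int bytesRead = channel.read(buffer);
            if (bytesRead < 0) break;                // end of file
            bytesLeft -= bytesRead;
            buffer.flip();
            // ... split the chunk into rows and fields here ...
        }
        channel.close();
    }
}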

As you all figured out by now, the speed of the transformation doubled simply by setting the number of copies to start for the “Fixed Input” step to 2. Throughput went from 18MB/s to around 40MB/s. The second disk is a bit faster as it’s an internal disk, not a USB drive, and we’re now reading data at around 26,000 rows/s (around 4 million data fields per second).

iostat results

Possibilities and conclusions

Because of all the optimizations we did, we can now process data faster than our limited I/O system can deliver it. That means that using faster disks, or in general getting more “spindles”, is a good idea if you want to read data faster.

Obviously, the same goes for the output side as well, but that’s a whole different story for another time. For now, we demonstrated that the new “Fixed Input” step can read a 15GB file in 400 seconds flat. That’s not a bad result for my laptop, any way you look at it. The 2 CPUs are even 75% idle, leaving room for all the interesting transformations you want to perform on the data.

Until next time,

Matt

Opinion: commercial BI

Now that everyone in the blogosphere has voiced their opinion on the latest SAP acquisition of BO, I can safely post my own rant as well.

I read all about the pros and cons of this deal. IMHO it just doesn’t matter a lot. In the end, the flood of open source software is going to wash away the proprietary sand castles left and right.

Yesterday we actually spent a lovely day at the beach with the family. When you start building, it feels like ages before the water will hit the castle. When the water closes in, it looks like those few centimeters of water ain’t going to make a lot of difference. However, here’s a reality check: there’s just no escaping it. After a couple of waves there was nothing left but a small pile of mud.

The analogy with closed source BI vendors is a bit flawed: it will not take 6 hours for the tide to come in, it will maybe take 6 years. However, I strongly feel that the tide is coming in and that the waves are going to hit hard.

Wars are already being fought left and right, castles are vigorously defended, there is lying and cheating, and lately even lawsuits are being filed against open source companies. In the end, it’s all going to be nothing but a small splash in the ocean.

Defending the castle with a passion

Some 8 or 9 years ago I actually started dabbling with Business Objects. It was a revolutionary business proposition at the time. No longer was there a need for expensive “SQL generators / report generators” on the mainframe. You could do the same thing and more for pocket change. You could get licenses for a thousand US dollars a seat, imagine that! It was the golden age for companies like BO, Cognos and many others.

A dot-com bubble and a few years later, there is an unprecedented consolidation wave going on. You see, the only way these big corporations can gain more market share and increase their turnover is simply by buying other companies. Never mind that the BO reporting client software is 99% the same as it was 8 years ago (most of the bugs are the same too); technically, not that much changed. The strategy of all the closed source BI companies is still the same: spend more than 75% of your turnover on sales and marketing. Innovation, software development and other costs are best kept to a minimum to keep your profit margins as high as possible. Suppose you acquired a few companies that have exactly the same business model as your own. You wouldn’t care about software alignment, integration and other stupid things, would you? It just means you have to spend less before the customer buys either product.

Spending insane amounts of money on acquiring your customer is a nice strategy … as long as it works. If it stops working because someone undercuts you substantially, there is no way out.

The speed at which Pentaho and other open source companies are innovating tells me that there is soon going to be a tipping point, when the size of the castle walls is not going to matter anymore. The business proposition of open source is just too good. It’s a win-win for the customer AND for the professional open source companies.

Until next time,

Matt

4.3 million rows per second

Earlier today I was building a test case in which I wanted to put a lot of Unicode data into a database table. The problem is, of course, that I don’t have a lot of data, just a small Excel input file.

So I made a Cartesian product with a couple of empty row generators (the output row count of a Cartesian product is the product of the input row counts):

4M rows per second transformation

It was interesting to see how fast the second join step was generating rows:

4M rows per second log

Yes, you are reading that correctly: 717 million rows processed in 165 seconds = 4.3 million rows per second.

For those of you who would love to try this on your own machine, here is an exclusive present for the readers of this blog in the form of a 3.0.0-RC2 preview of 2007/10/12 (88MB zip file). We’ve been fixing bugs like crazy, so it’s pretty stable for us, but it’s still a few weeks until we release RC2. Don’t do anything crazy with this drop! This is purely a present for the impatient ones. If you find a bug, please file it! (give us a present back :-))

Until next time,

Matt

About i18n

We’ve come a long way in Pentaho Data Integration. The amount of source code has increased dramatically over the almost 2 years since we open sourced Kettle, and the amount of text to translate into other languages has increased dramatically as well. I thought it was high time we gave the translators in our community a better tool to work with. The design philosophy in PDI has always been that everything you can change should get a GUI. So we created the Pentaho Translator:

Pentaho translator

With this single dialog we can translate all the keys that are in the Java source code. Here is how it works:

  1. Select the locale to translate into (fr_FR for example is French from France)
  2. Select the package in which there are untranslated keys:
    • light gray: 1-5 missing keys
    • gray: 6-10 missing keys
    • yellow: 11-25 missing keys
    • orange: 26-50 missing keys
    • red: > 50 missing keys
  3. Select a key from the TODO list in the middle
  4. Enter the translation to the right
  5. Hit “Apply” to move to the next key on the TODO list
  6. When you are done, hit the “Save” button to save the changes back to disk.
  7. Send the changed properties files back to me 🙂

If you want to help out, you can download our Kettle i18n package (20MB zip file). Simply unzip this file into a directory and launch Translator.bat or “sh translator.sh”. This package includes libraries for Windows and Linux (x86). UPDATE: the zip file now also includes pt_PT. You can also add your own locale in the translator.xml file.
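
If you have never worked with translations in Java: each key in the source code gets looked up in a per-locale .properties file, and the translator’s job is to fill in the missing keys for his or her locale. The snippet below is a generic illustration of that mechanism using the standard ResourceBundle class; the bundle name and key are made up, this is not the actual Kettle message API:

import java.util.Locale;
import java.util.ResourceBundle;

// Generic illustration of locale-based message lookup. The bundle name and
// key are invented for this example; Kettle uses its own message classes.
public class MessageLookupExample {
    public static void main(String[] args) {
        // messages.properties       : Translator.Button.Save = Save
        // messages_fr_FR.properties : Translator.Button.Save = Sauvegarder
        ResourceBundle bundle = ResourceBundle.getBundle("messages", new Locale("fr", "FR"));
        System.out.println(bundle.getString("Translator.Button.Save"));  // prints "Sauvegarder"
    }
}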

So join in on the translation fun and leave your mark on Pentaho Data Integration!

Until next time,

Matt

P.S. Although the source code for Translator is customized for Kettle, I’m sure it will be possible to use this tool to translate other properties-file-based Java software as well. If you are interested, contact me by e-mail or have a look at the source code of Translator2.java.