Hello data integration fans,
In the course of the last year or so there have been a number of benchmarks on blogs here and there that claimed a certain “result” pertaining to performance of both Talend and Pentaho Data Integration (PDI a.k.a. Kettle). Usually I tend to ignore these results a bit, but a recent benchmark got so far off track that I had to finally react.
Benchmarking itself is a very time-consuming and difficult process and in general I advise people to do their own. That being said, let’s attack a first item that appears in most of these benchmarks: reading and copying a file.
Usually the reasoning goes like this: we want to see how fast a transformation can read a file and then how fast it can also write it back to another file. I guess the idea behind it is to get a general sense of how long it takes to process a file.
Here is a PDF that describes the results I got when benchmarking PDI 3.1.0GA and TOS 3.0.2GA. The specs of the test box etc. are also in there.
Reading a file
OK, so how fast can you read a file, how scalable is that process and what are the options? In all situations we’re reading a 2.4GB test file with random customer data. The download location is in the PDF on page 4 and elsewhere on this blog.
Remember that the architectures of PDI and Talend are vastly different so there are various options we can set, various configurations to try…
1) Simply reading with the “CSV File Input” step, lazy conversion enabled, 500k NIO buffer: 150.8 seconds on average for PDI. Talend performs this in 112.2 seconds.
2) This test configuration is identical to 1) except that PDI now runs 2 copies of the input step. Results: 94.2 seconds for PDI. This test is not possible in Talend since the generated Java software is single threaded.
Reading a delimited file, time in seconds, lower is better
There is a certain scalability advantage in being able to read and process files on multiple CPUs and even on multiple systems across a SAN. Talend has a serious limitation here since it can’t do that. A 19% speed advantage for PDI is inconsequential for simple reads but brutal for more complex situations, very large files and/or lots of CPUs/systems involved. For example, we have customers that read large web log files in parallel over a high speed SAN across a cluster of 3 or 4 machines. Trust me, a SAN is typically faster than what any single box can process.
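To make the idea of reading one file with several copies of a reader concrete, here is a minimal sketch. It is not how the “CSV File Input” step is implemented; it only assumes that each reader gets its own byte range of the file and takes every line that starts inside that range. The function names are made up for illustration, and the sketch only counts lines instead of parsing fields.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_lines_in_range(path, start, end):
    """Count the lines that begin inside the byte range [start, end)."""
    count = 0
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard up to the next newline. A line
            # that starts exactly at `start` is kept; a line that straddles
            # the boundary is left to the reader of the previous range.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break  # end of file
            count += 1
    return count


def parallel_line_count(path, workers=2):
    """Split the file into `workers` byte ranges and count lines in each."""
    size = os.path.getsize(path)
    chunk = size // workers
    ranges = [
        (i * chunk, size if i == workers - 1 else (i + 1) * chunk)
        for i in range(workers)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda r: count_lines_in_range(path, *r), ranges)
        return sum(counts)
```

Because every line belongs to exactly one range, the readers never need to talk to each other. Note that in CPython threads mostly overlap the I/O; to spread the parsing itself over several cores you would need separate processes.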
Writing back the data
The test itself is kinda silly but since it is being carried around in the blogosphere, let’s set a reference, a copy command. I simply copied the file and timed the duration. That particular copy set a reference time of 122.2 seconds: a copy from my internal disk to an external USB 2.0 disk. (for the exact configurations see the PDF)
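If you want to set the same kind of reference on your own box, a plain timed copy could look like the sketch below. The helper name `timed_copy` is hypothetical, and the example file names in the comment are placeholders, not the ones used for this benchmark.

```python
import shutil
import time


def timed_copy(src, dst):
    """Copy src to dst and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    return time.perf_counter() - start


# Example call (placeholder paths):
# seconds = timed_copy("customers.txt", "/mnt/usb/customers-copy.txt")
```

Run it a few times and average the results; the operating system's file cache can make a second run over the same file look much faster than the first.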
3) If reading in parallel is the fastest option for PDI, we retain that option. Then we write the data back with a single target file. PDI handles this in 196.2 seconds. Talend can’t read in parallel so we don’t have any results there.
4) A lot of times, these newly generated text files are just temporary files for upstream processes. As such it might (or might not) be possible to create multiple files as target. This would increase the parallelism in both this challenge and the upstream tasks. PDI handles this task in 149.3 seconds. Again I didn’t find any parallelization options in TOS.
5) Since neither 3) nor 4) is possible in Talend I tried the single delimited reader / writer approach. That one ran for 329.4 seconds.
Reading/writing a delimited file, time in seconds, lower is better
I also monitored the CPU utilization of the various Talend jobs and Kettle transformations and came to the conclusion that Talend will never utilize more than 1 CPU while Kettle uses whatever it needs and can get its hands on. For the single threaded scenario, the CPU utilization is on par with the delivered performance of both tools. There doesn’t seem to be any large difference in efficiency.
Talend wins the first test with its single threaded reading algorithm. I think its overhead is lower because it doesn’t run in multiple threads. (Don’t worry, we’re working on it :-)) In all the other, more complex situations, where you can indeed run in multiple threads, there is a severe performance advantage to using Kettle. In the file reading/writing department for example, PDI running in 3 threads with lazy conversion beats Talend by being more than twice as fast in the best case scenario and 65% faster in the worst case.
Please remember that my laptop is not by any definition “high end” equipment and that dual and quad core CPUs are commonplace these days. It’s important to be able to use them properly.
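The comparisons can be recomputed from the timings quoted in this post. The sketch below assumes that “faster” means the baseline time divided by the candidate time; that convention and the variable names are assumptions, and the PDF holds the exact measurements.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Average timings quoted in this post, in seconds.
talend_read = 112.2          # scenario 1, Talend
pdi_read_2_copies = 94.2     # scenario 2, PDI with 2 reader copies
pdi_single_target = 196.2    # scenario 3, PDI, one target file
pdi_multi_target = 149.3     # scenario 4, PDI, multiple target files
talend_read_write = 329.4    # scenario 5, Talend, single reader / writer

print(f"parallel read:       {speedup(talend_read, pdi_read_2_copies):.2f}x")
print(f"read/write, best:    {speedup(talend_read_write, pdi_multi_target):.2f}x")
print(f"read/write, worst:   {speedup(talend_read_write, pdi_single_target):.2f}x")
```

Plug in your own timings to see how the ratios shift on different hardware.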
The source code please!
Now obviously, I absolutely hate it when people post claims and benchmarks without backing them up. Here you can find the PDI transformations and TOS jobs I used. With a little bit of modification I’m sure you all can run your own tests. Just remember to be critical, even concerning these results! Trust me when I say I’m not a TOS expert 🙂 Who knows, perhaps I used a wrong setting in TOS here or there. All I can say is that I tried various settings and that this seemed the fastest for TOS.
Remember also that if even a simple “file copy” can be approached with various scenarios, this certainly goes for more complex situations as well. Even the other tools out there deserve that much credit. Just because Talend can’t run in multiple threads, that doesn’t mean that Informatica, IBM, SAP and all the others are not capable of doing so.
If I find the time I’ll post a part 2 and 3 later on. Feel free to propose your own scenarios to benchmark as well. Whatever results come of it, it will lead to the betterment of both open source tools and communities.
Until next time,