The kindness of strangers

Dear Kettle fans,

There isn’t a week that goes by in which I don’t find myself amazed by the contributions and help that the Pentaho Data Integration project receives in all kinds of forms.  There are people contributing anything from small patches to complete steps, folks helping out others on the forum, writing documentation, writing books, translating PDI, etc.  Without any question, this has been a truly amazing experience, not just for me but for the whole Kettle project.

It’s because of that overwhelmingly positive experience that I’ve always tried to be accessible and in contact with my community in all sorts of ways.  And because of that positive vibe I have long refrained from commenting on the negative flip side of that story.

The problem is that lately things have been changing.  It’s probably caused in general by the increasing attention to open source and specifically by Kettle’s increase in popularity.  In any case, certain types of people do the following:

  • Send me personal email
  • IM me on Skype/Yahoo!/MSN/AIM/…
  • Send me all sorts of messages and questions through the forums
  • Ask questions on this blog

Usually it’s a combination of the above.  Any time now I expect folks to start sending me direct Twitter messages as well.  The question is always the same:

I have an urgent Pentaho problem.  I am incapable of using the forum for some stupid reason and so you have to help me, preferably now or within the next 15 minutes!!!!

This way, the meaning of “The kindness of strangers” becomes more and more like the one from the Nick Cave song.

I’ve just finished reading Linus’ book “Just for Fun” (Thanks again Domingo!) and his approach to the problem of staying within reach for people to contribute code while at the same time allowing yourself to have a life and a job is simple: if it ain’t fun, don’t do it.  Well, the barrage of this sort of question stopped being fun for me a long time ago.

As such, I’m going to try this approach: any question that could or should be asked on the forum is from now on silently ignored and deleted from my mailbox.  Any person that is not part of my “community” and that needlessly contacts me over IM gets blocked indefinitely.  And yes, that goes for Twitter as well.  Off-topic questions on this blog go straight to the spam folder too.  I will simply refuse to spend time on non-interesting topics.

I thought about creating a standard response e-mail, but any sort of reply is simply an encouragement to certain types of people and will only make matters worse. (been there, done that)

I’m sure everyone understands that this is the only way to free up time to work on the real problems at hand.  Thank you for your understanding in any case.

Until next time,

Matt

Gartner DI MQ

Dear Data Integration fans,

A few weeks ago, Yves de Montcheuil from Talend took a shot across the bow of Gartner for not including Talend in their Magic Quadrant (MQ) for data integration.  After that post, Andreas Bitterer from Gartner (rightfully) felt personally under attack and responded to set the record straight.

I think the discussion itself is very interesting, but it misses a very important point:

The Magic Quadrant contains companies, not trends, communities, people or software!

Think about it for a second.  In the early days of JBoss, Marc Fleury complained about the fact that only a small percentage of the users of “JBoss the software” paid anything to “JBoss the company”.  The numbers that floated around back then were 0.01% or 0.1%; I can’t remember exactly.

Those numbers make sense: I’ve heard similar figures from other commercial open source companies.  Anything in the range of 0.01% to 1% is possible.

Let’s be “optimistic” here and claim that a company like Pentaho converts 1% of all users into customers. (trust me, that figure would be really great given the millions of users out there :-))  Since each paying customer then represents 100 users, we would be disturbing our competitors’ market for 100 times our turnover: for every dollar of turnover Pentaho makes, the closed source vendors miss out on a hundred dollars.

Pentaho, and indeed Talend, see that they are seriously disturbing the market dominance of the traditional DI vendors.  And that is why Yves feels a bit mistreated by Gartner.  However, since companies like Pentaho and Talend use a disruptive business model, it is only normal that the Gartner MQ itself is also disrupted by our models.  You simply can’t be part of the system if you want to disrupt it, I guess. (*)

All that being said, it’s only a matter of time before something has got to give: open source or the Gartner DI MQ.  Yves, Andreas, let it be noted that I’m betting on the former to come out of this as the winner.

Until next time,

Matt

(*) This also partly explains why Kettle and TOS are not really competitors: we’re using the same business model and are not disrupting each other.  We offer 2 completely different choices to our users.

Pentaho Data Integration vs Talend (part 1)

Hello data integration fans,

In the course of the last year or so, a number of benchmarks have appeared on blogs here and there claiming certain “results” pertaining to the performance of both Talend and Pentaho Data Integration (PDI a.k.a. Kettle).  Usually I tend to ignore these results a bit, but a recent benchmark got so far off track that I finally had to react.

Benchmarking itself is a very time-consuming and difficult process and in general I advise people to do their own.  That being said, let’s attack the first item that appears in most of these benchmarks: reading and copying a file.

Usually the reasoning goes like this: we want to see how fast a transformation can read a file and then how fast it can write it back to another file.  I guess the idea behind it is to get a general sense of how long it takes to process a file.

Here is a PDF that describes the results I obtained when benchmarking PDI 3.1.0GA and TOS 3.0.2GA.  The specs of the test box etc. are also in there.

Reading a file

OK, so how fast can you read a file, how scalable is that process and what are the options?  In all situations we’re reading a 2.4GB test file with random customer data.  The download location is in the PDF on page 4 and elsewhere on this blog.

Remember that the architectures of PDI and Talend are vastly different, so there are various options we can set and various configurations to try…

1) Simply reading with the “CSV File Input” step, lazy conversion enabled, 500k NIO buffer: 150.8 seconds on average for PDI.  Talend performs this in 112.2 seconds.

2) This test configuration is identical to 1) except that PDI now runs 2 copies of the input step.  Result: 94.2 seconds for PDI.  This test is not possible in Talend since the generated Java software is single-threaded.

[Chart: Reading a delimited file, time in seconds, lower is better]

There is a real scalability advantage in being able to read and process files on multiple CPUs and even multiple systems across a SAN, and it is a serious limitation of Talend that it can’t do that.  The 19% speed advantage for PDI (94.2 versus 112.2 seconds) is inconsequential for simple reads, but it becomes brutal in more complex situations, with very large files and/or lots of CPUs/systems involved.  For example, we have customers that read large web log files in parallel over a high speed SAN across a cluster of 3 or 4 machines.  Trust me, a SAN is typically faster than what any single box can process.
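To make the “multiple copies of the input step” idea concrete, here is a minimal sketch in plain Java of the general technique of reading one delimited file in parallel by byte range.  This is my illustration of the principle, emphatically not PDI’s actual implementation: each reader skips the partial line at the start of its range and reads until it passes the end of its range, so every line is handled exactly once.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.concurrent.atomic.AtomicLong;

    public class ParallelRead {

        // Reads the lines of one byte range [start, end) of a delimited file.
        static class RangeReader implements Runnable {
            private final String path;
            private final long start, end;
            private final AtomicLong total;

            RangeReader(String path, long start, long end, AtomicLong total) {
                this.path = path; this.start = start; this.end = end; this.total = total;
            }

            public void run() {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(start);
                    // Skip the partial line at the start of the range:
                    // the previous range is responsible for it.
                    if (start > 0) raf.readLine();
                    long lines = 0;
                    // Keep going as long as the line we start is inside our range.
                    while (raf.getFilePointer() <= end && raf.readLine() != null) {
                        lines++; // a real reader would parse the fields here
                    }
                    total.addAndGet(lines);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            String path = args[0];
            int copies = 2; // like running 2 copies of the CSV input step
            long size = new File(path).length();
            AtomicLong total = new AtomicLong();
            Thread[] threads = new Thread[copies];
            for (int i = 0; i < copies; i++) {
                threads[i] = new Thread(new RangeReader(path, size * i / copies, size * (i + 1) / copies, total));
                threads[i].start();
            }
            for (Thread t : threads) t.join();
            System.out.println("Lines read: " + total.get());
        }
    }

Each reader thread works on its own slice of the file, which is exactly why this approach scales with the number of CPUs (or, across a SAN, with the number of machines).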

Writing back the data

The test itself is kinda silly, but since it is being carried around in the blogosphere, let’s set a reference: a plain copy command.  I simply copied the file and timed the duration.  That particular copy, from my internal disk to an external USB 2.0 disk, set a reference time of 122.2 seconds. (for the exact configurations see the PDF)
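If you want to reproduce such a reference point yourself, something as simple as the timed copy sketched below in plain Java will do.  This is just an illustration for re-running the baseline, not what either tool does internally; the 512k buffer is my arbitrary choice, roughly matching the 500k NIO buffer mentioned earlier.

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class TimedCopy {
        public static void main(String[] args) throws IOException {
            long t0 = System.currentTimeMillis();
            byte[] buf = new byte[512 * 1024]; // large copy buffer
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
                 OutputStream out = new BufferedOutputStream(new FileOutputStream(args[1]))) {
                int n;
                while ((n = in.read(buf)) >= 0) {
                    out.write(buf, 0, n); // straight byte-for-byte copy, no parsing
                }
            }
            System.out.printf("Copy took %.1f seconds%n", (System.currentTimeMillis() - t0) / 1000.0);
        }
    }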

3) Since reading in parallel is the fastest option for PDI, we retain that option and write the data back to a single target file.  PDI handles this in 196.2 seconds.  Talend can’t read in parallel, so we don’t have any results there.

4) A lot of the time, these newly generated text files are just temporary files for upstream processes.  As such it might (or might not) be possible to create multiple target files, which would increase the parallelism of both this task and the upstream ones.  PDI handles this task in 149.3 seconds.  Again, I didn’t find any parallelization options in TOS.

5) Since neither 3) nor 4) is possible in Talend, I tried the single delimited reader/writer approach.  That one ran for 329.4 seconds.

[Chart: Reading/writing a delimited file, time in seconds, lower is better]

CPU utilisation

I also monitored the CPU utilisation of the various Talend jobs and Kettle transformations and came to the conclusion that Talend will never utilise more than 1 CPU, while Kettle uses whatever it needs and can get its hands on.  For the single-threaded scenario, the CPU utilisation is on par with the delivered performance of both tools.  There doesn’t seem to be any large difference in efficiency.
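By the way, you don’t need fancy tooling to check this sort of claim yourself: plain old top or vmstat will do, and so will a trivial sampler like the sketch below, run next to the job, using the standard java.lang.management API.  On a dual core box a single-threaded job tops out around a load of 1.0 while a multi-threaded one can push towards 2.0.  (Note that getSystemLoadAverage() returns -1 on platforms where it isn’t supported.)

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class LoadSampler {
        public static void main(String[] args) throws InterruptedException {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                // Print the 1-minute system load average once per second.
                System.out.printf("load average: %.2f%n", os.getSystemLoadAverage());
                Thread.sleep(1000);
            }
        }
    }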

Conclusion

Talend wins the first test with their single-threaded reading algorithm.  I think their overhead is lower because they don’t run multiple threads. (Don’t worry, we’re working on it :-))  In all the other, more complex situations, where you can indeed run multiple threads, there is a severe performance advantage to using Kettle.  In the file reading/writing department, for example, PDI running in 3 threads with lazy conversion beats Talend by being more than twice as fast in the best case and 65% faster in the worst case.
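For those wondering what “lazy conversion” actually buys us: the idea is to hang on to the raw bytes as they come out of the file and only convert them to a typed value if and when a step really needs it, so a plain read-and-write-back pays almost no conversion cost.  A minimal sketch of the principle (again my illustration, not PDI’s actual implementation):

    // Defers parsing a field until some step actually asks for the typed value.
    final class LazyLongField {
        private final byte[] raw; // the bytes exactly as they appeared in the file
        private Long parsed;      // cached result of the one-time conversion

        LazyLongField(byte[] raw) {
            this.raw = raw;
        }

        // A transformation that merely copies rows to another file never calls this.
        long value() {
            if (parsed == null) {
                parsed = Long.parseLong(new String(raw).trim());
            }
            return parsed;
        }

        // A text output step can write the raw bytes back out untouched.
        byte[] rawBytes() {
            return raw;
        }
    }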

Please remember that my laptop is not by any definition “high end” equipment, and that dual and quad core CPUs are commonplace these days.  It’s important to be able to use them properly.

The source code please!

Now obviously, I absolutely hate it when people post claims and benchmarks without backing them up.  Here you can find the PDI transformations and TOS jobs that were used.  With a little bit of modification I’m sure you can all run your own tests.  Just remember to be critical, even of these results!  Trust me when I say I’m not a TOS expert 🙂  Who knows, perhaps I used a wrong setting in TOS here or there.  All I can say is that I tried various settings and that these seemed the fastest for TOS.

Remember also that if even a simple “file copy” can be approached with various scenarios, the same certainly goes for more complex situations.  Even the other tools out there deserve that much credit.  Just because Talend can’t run multiple threads doesn’t mean that Informatica, IBM, SAP and all the others are incapable of doing so.

If I find the time I’ll post parts 2 and 3 later on.  Feel free to propose your own scenarios to benchmark as well.  Whatever results come of it, it will lead to the betterment of both open source tools and their communities.

Until next time,
Matt

Flying high in economic storms

Next week I’ll be in Orlando for another week of brainstorming, planning, scheming, plotting for world domination and yes, even coding.

Q : “What are you going to do next week?”
A : “The same thing I do every time when I’m in Orlando – Try to take over the world!”

So I went to Kayak and entered my flight preferences: leave and return on a Sunday, giving me a full week over there.  I was almost shocked to see that the same flight I took 2 months ago now costs less than a third of what it did:

  • July 13th 2008 : BRU/MCO – MCO/BRU (over FRA) : 2,400 USD (summer time folks!)
  • October 12th 2008 : BRU/MCO – MCO/BRU (over IAD) : 2,000 USD
  • December 7th 2008 : BRU/MCO – MCO/BRU (over PHL) : 600 USD

Typically I don’t select the cheapest flight, as that would put me on an 18-hour layover in Bangkok or something like that.  In the past I once spent 8 hours at Chicago airport and trust me, it’s not worth the 100 USD you can save.  You’ll spend it on Internet access, food, “beverages”, magazines, etc.

That being said, the December 7th flight is the cheapest flight with “only” 1 layover.

In the past I’ve noticed the airlines adding more and more options for me to fly to Orlando, or at least across the Atlantic ocean.  Now that the economic downturn is upon us, perhaps there’s finally a bit of over-capacity.  After all, the last 5 flights I took from Brussels to the US were fully booked.  That’s right folks: listening to hollering kids with Mickey Mouse ears for 9 hours straight.  Even noise-canceling headsets have a hard time with that kind of noise.

Electronic System for Travel Authorization

Another thing of interest for the geeks among you is that you are now encouraged to apply electronically, well in advance, for authorization to enter the US, replacing the manually written green “Visa Waiver” documents.  Nobody makes a fuss about it, but from January 12th 2009 on, registration is mandatory for us Europeans, or so you can read.  I’m sure there are going to be freedom fighters here and there who will be up in arms over this sort of program, but personally I’m glad that we can finally fill in those green “waiver” documents electronically at home.  From the looks of it, there’s nothing on there that you don’t already fill in manually now. (it felt kinda familiar filling them in)

I’ll let you know how they perceive my eagerness to fill in these “hidden” electronic documents at the US border next week 🙂

Until then,
Matt

Canonical: take my money

Dear Canonical,

You claim that there is little money in the desktop software business and more in services.  Well, here is something I would pay money for:

Take the top selling business laptops from Dell, Acer, HP, Lenovo and offer customized distributions for them.

I would pay for that in an instant.  All too often people confuse open source with free of charge.  I’m perfectly capable of making that distinction.  In fact, I use my machines for my work and don’t want to spend days configuring all the devices on them.  As such, I would pay something like 50 USD for a customized (K)Ubuntu or perhaps 150-200 USD if it came with some sort of (e-mail) support contract for a year.

I don’t use Linux / Ubuntu because it costs less; I use it because I prefer it over Windows to do my job.  I would pay that kind of money because it would save me time and money in the long run.

Until the major hardware vendors offer decent (worldwide) support for Linux on their machines (out of the box that is), I think this is an idea with potential and I hope at least someone picks it up.  Go ahead, let me spend money on it!

Until next time,
Matt

Commercial open source is possible

It was the exception to the rule: Matt Asay’s latest blog entry was interesting.  It asked the question: “Is commercial open source possible?”  The interesting part actually comes from quotes from Lawrence Lessig’s book Remix.

The whole discussion revolves around the awkwardness of asking for money in an open source setting.  A few of the original quotes to illustrate:

What if Wal-Mart asked all customers to ‘pitch in and help Wal-Mart by sweeping at least one aisle each time you shop’?

and

Money in the sharing economy is not just inappropriate; it is poisonous.

My take on this is that there are certainly developers out there who would rather die than pay anything for open source software.  In enterprise software like Pentaho Data Integration you encounter a whole mixture, ranging from these developers-rolled-into-ETL-designers, over regular ETL designers, to business users.  As such, the awkwardness of the situation depends on the type of user and the type of situation he or she is in.

On top of that, whether software is open source or not shouldn’t matter to these users, except for the fact that it lowers risks and costs.  No, personally I see commercial open source as an extremely viable alternative, especially now that risk and cost are so high on the agenda in a lot of companies.

If I look around in my community, I don’t see any complaints about the fact that you have to pay to receive training, services or enterprise support.  Some might complain that you have to pay too much, but that group of people will always exist no matter how cheaply you sell support.  I even had one person ask me why $50 wouldn’t be enough for professional support!

How about squashing bugs that bite customers with priority?  Would that tick off our community?  It’s not showing any signs of that so far, because that’s what we’ve been doing.  In fact, I think most community members want Pentaho to do well, since it ensures the long-term viability of the software.  Here’s a thought I can’t settle myself: perhaps a lot of people see the company behind the software as part of the community.  Or perhaps they should.

My take on this whole thing: a couple of nice quotes don’t make a nice idea.  In turbulent times, it’s important to keep your eyes on the future and away from the dusty past.  The days when an open source developer was a sandal-wearing, bearded stranger are long gone.  At the same time, our customers are learning that open source is business as usual.  Better?  Cheaper?  Sure!  But nevertheless still business as usual.

Until next time,
Matt

Dead wrong

Belgian consultancy company Element 61 has just posted an opinion piece under the guise of a review of open source ETL.

What a load of utter nonsense.  Try reading this:

Instead of using SQL statements to transform data, an Open Source ETL tool gives the developer a standard set of functions, error handling rules and database connections. The integration of all these different components is done by the Open Source ETL tool provider. The straightforward transformations can be implemented very quickly, without the hassle of writing queries, connecting to data sources or writing your own error handling process. When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.

Well, Mr Jan Claes, we’re perfectly capable of handling quite complex transformations, with high performance too.  If Kettle isn’t capable of handling your ETL needs, neither is Informatica, DataStage, OWB or BODI.  If you prefer Oracle Warehouse Builder because it allows you to squeeze PL/SQL or SQL into your ETL tool, then that’s fine; just don’t use false arguments to diss open source ETL tools.  ETL tools should allow you to write LESS code and make it easier to maintain your transformations, not more.  Being open source has nothing to do with that.

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database. In the basic license, a few connections are available but when there is a need for extra connections, the customer has to pay an extra fee and/or for some platforms (like mainframe sources) nothing might be available.

You have to be kidding, right?  Kettle supports 34 database types plus generic ODBC, OCI and JNDI connections out of the box, for free.  On top of that we connect to legacy systems like SAP R/3 and obviously your mainframe as well if needed (very few people ever need to).  The painful truth is that we’re doing better, not worse.

Java & XML knowledge required for complex transformations.

This comment alone makes the article provably false, since you never need any Java or XML knowledge to use Pentaho Data Integration.  (I’m sure the same goes for Talend, by the way)

Lack of skills, knowledge & resources.

Pentaho has plenty of partners, even in Belgium. (Cronos for example)  We also have the lead developer of Pentaho Data Integration (me) working in Belgium, as well as Davy Nys, our sales representative.  Professional support and training (on site if needed) are offered as well.

In these turbulent financial times, open source ETL is exactly the answer to constantly shrinking budgets, and that is why Pentaho is doing better than ever before despite the credit crunch.

Element 61 in the meantime needs to get hit with a clue stick.  It’s one thing to accept money from the big boys; it’s a completely different thing to spread demonstrable lies.

Until next time,
Matt

Meme(me)

I guess it’s a new type of chain-blogging like we used to have chain-mails.
Just passing along the instructions for other bloggers….

1. Take a picture of yourself right now.
2. Don’t change your clothes, don’t fix your hair…just take a picture.
3. Post that picture with NO editing.
4. Post these instructions with your picture.

It’s kinda fun.

Until the next time,
Matt

Black holes : ban all matches!

Because of the construction work on our house we had to move to a different place.  We have a gas stove to cook our daily food on.

Today, when I wanted to light a match to start the fire, I quickly stopped when the following thought occurred to me: what if lighting that match created a black hole that sucked up the entire earth?

OK, I hear you say, it happens all the time, all over the world; people light matches constantly!  Yeah, but it would be just MY luck if a black hole happened to get created by MY match, right?

I mean, matches are highly explosive things; you never know what goes on in there!  Has this process been investigated long enough?  I certainly have no clue how a match ignites!

Personally, I think they should just switch to electronic lighters and ban all matches on the planet.  I’m sure most of you agree with me!

Until next time,

Matt

Pentaho and the iPhone

Today we announced that Pentaho can deliver BI solutions to the iPhone.  Will Gorman & the rest of the team at Pentaho again did a great job here!

The announcement is great news for all those people that have one … or can buy one.

Until Friday, that excluded me.  In Belgium there is a law against “coupled sales” of goods, which means that you cannot sell an iPhone together with a contract.  The same is true for a number of other European countries, by the way.  The minute an operator tried to sell the iPhone exclusively with a contract, the other operators would go to court and win.  As a result of this law, most if not all phones in Belgium are sold without a contract, and 100% of the phones are usable on all our networks.

This has led to Apple not even bothering to sell the iPhone in Belgium.  That in turn made our minister of enterprise, Vincent Van Quickenborne, angry. (Minister “Q” for his friends and voters)  So far, “Q” has been mostly known for failing to simplify the administration in Belgium.

Now he is angry because he feels it is such a shame that Apple “was unable” to launch the iPhone in Belgium because of the coupled sales law.  Things would be so much better without that law, he claims; people would be able to get their hands on all kinds of technological gadgets for very cheap prices.

Obviously this is not true.  Let’s compare prices, shall we?  The minimum contract for an iPhone in the US is $70/month for 2 years.  That will set you back at least $200 for the phone plus 24 × $70 = $1,680 for the contract.  That’s a whopping $1,880 in total, at a minimum.

In Belgium, the price of the same iPhone is set at 525 EUR, or about $820.  You will still have to buy a contract, but you are free to do as you please.  If you just want to toy with it on your own WiFi at home and use it as an otherwise regular phone/iPod/picture viewer/whatever, you’re free to do so.  That can cost you anything from next to nothing with pre-paid cards to a lot more than the roughly $44/month difference that remains ($1,880 minus $820, spread over 24 months).

Please note that I explicitly left out the data plan ($30/month) and text plan ($5/month) that AT&T has to offer, which crank up the total cost tremendously ($2,600 and $2,720 respectively).

Now, how many people would pony up close to (or well over) $2,000 if they had a big warning on the box, in the store or on the website?

Warning, will cost at least $1000 per year!

Exactly!  My feeling is that by charging the actual price of the device up front, instead of luring people into all kinds of shady contracts, costs for the customer are probably lower on average!

In short, I think “Q” is dead wrong and full of it.  He would do better to investigate the ongoing and prevailing illegal coupled sales of Microsoft Windows with computers.

Until next time,
Matt

P.S. All this still doesn’t mean I’m going to buy one of these overpriced gadgets, but I just might. 🙂