Kettle vs Oracle REF CURSOR

Dear Kettle fans,

PDI-200 has been out there for a while now.  Jens created the feature request a little over 3 years ago.  I guess the main thing blocking this issue was not so much a technical problem as a licensing and dependency one (the Oracle JDBC dependency and its distribution license).

However, now that we have the User Defined Java Class step, we can work around those pesky problems. That is because the Java code in there only gets compiled and executed at runtime, so it’s perfectly fine to create any dependency you like in there.

The following transformation reads a set of rows from a stored procedure as described on this web page.

In short, our UDJC step executes the following code:

begin ? := sp_get_stocks(?); end;

The result is a set of rows; the parameter, in this case, is a single numeric value.
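For the curious, here is a minimal, hypothetical JDBC sketch of the kind of call the step ends up making. The sp_get_stocks name and the single numeric input parameter come from the example above; the connection details and column access are made up for illustration, and in the actual step each row of the cursor is of course converted into a Kettle row rather than printed.

import java.math.BigDecimal;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import oracle.jdbc.OracleTypes;

public class RefCursorSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details, for illustration only.
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@//localhost:1521/ORCL", "scott", "tiger");
    try {
      CallableStatement cs = conn.prepareCall("begin ? := sp_get_stocks(?); end;");
      cs.registerOutParameter(1, OracleTypes.CURSOR); // the returned REF CURSOR
      cs.setBigDecimal(2, new BigDecimal(100));       // the single numeric parameter
      cs.execute();

      // The REF CURSOR comes back as a regular JDBC ResultSet.
      ResultSet rs = (ResultSet) cs.getObject(1);
      while (rs.next()) {
        // The real step turns each row into a Kettle row; here we just print column 1.
        System.out.println(rs.getString(1));
      }
      rs.close();
      cs.close();
    } finally {
      conn.close();
    }
  }
}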

The step contains mostly Java code, but thanks to the configuration options you only need to do two things to make this work for your own REF CURSOR-returning procedures…

First you need to specify the output fields of the rows…

And then you need to specify the parameters:

The source code for this sample transformation is over here and runs on Pentaho Data Integration version 4.x (or higher).  All in all, it only took a few hours to write these 150 lines of Java, so perhaps it can serve as inspiration for other, similar problems you might have with Oracle or other databases.

Until next time,

Matt

Google Goodies and Lego

Dear Kettle friends,

Will Gorman and Mike D’Amour, Senior Developers at Pentaho, are presenting Pentaho’s Google integration work at the Google I/O Developer Conference (in the Sandbox area, to be specific).  Pentaho announced as much yesterday.

Here are a few of the integration points:

  • Google maps dashboard (available in the Pentaho BI server you can download)
  • A new Google Docs step was created for Pentaho Data Integration Enterprise Edition
  • Running (AVI, 30MB) the Pentaho BI server on Android
  • A new Google Analytics step was created for Pentaho Data Integration Enterprise Edition
  • Since version 2.0, the Pentaho BI server depends heavily on Google Web Toolkit (GWT)

To top that off, Will twittered about this new Lego bar-chart + logo they created for the conference:

UPDATE: now with building instructions and action video!

We are all soooo proud of them!

Until next time,

Matt

Kettle at the MySQL UC 2009

Hello Kettle fans,

Like Roland, I got confirmation earlier this week that I could present my talk on “MySQL and Pentaho Data Integration in a cloud computing setting” at the next MySQL User Conference.

I’m very excited about the work we’ve done on the subject and it’s going to be great talking about it in April.

See you there!
Matt

Give MySQL a break please

In a unique display of mass hysteria, one blogger after another, and even Slashdot (no, I’m not going to link), managed to take the completely innocent message that certain new enterprise features might get released as closed source only and turn it into an ongoing onslaught of bad press about “MySQL closing down source code”.

Why don’t you all give MySQL a break here, please?  The rule is always the same for everybody: the one who writes the code gets to pick the license.  Listen, I believe 100% in open source and I consider myself a big advocate, but commercial open source companies like MySQL (and Pentaho) are commercial entities.  At least try to put yourself in their position for a second.  For example, if a customer asks you NOT to release a piece of software they paid for, you don’t release it; it’s that simple.

In the end, what MySQL is doing is simple: they are experimenting with a commercial open source (COS) model.  Why are they experimenting?  Because the concept of COS is very new and there are no clear guidelines.  It simply hasn’t been done before.  How do you keep growing?  How do you keep paying more open source developers?  How do you pay for the millions of web hits each day?  How do you pay for the millions of downloads, the terabytes of internet traffic?  How do you guarantee your long-term survival?  How do you strike a balance between commercial success and widespread open source adoption?  How do you keep your investors happy as well as your community?

I guess we learned one thing this past week: it’s easier to spout criticism than to give answers to these tough questions.

Matt

Rolling back transactions

Pentaho Data Integration (Kettle) never was a real transactional database engine, and never pretended to be one. It was designed to handle large data volumes and slam a commit in between every couple of thousand rows to prevent the databases from choking on the logging problem.

However, more and more people are using Kettle transformations in a transactional way. They want to have the option to roll back any change that happened to a database during the execution of a transformation in case anything goes wrong.

Well, we have been working on that in the past, but never quite got it right… until today actually. As part of bug report 724 I lifted the decision to commit or roll back all databases to the transformation level.

Take for example a look at this transformation:

What happens is that the first two steps will always finish execution before a single row hits the Abort step. That means that all rows from the “CSV file input” step will be inserted into the database table before the transformation fails. Well, in the past, even if you enabled “Unique connections”, this would have resulted in those rows remaining in the table.
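In plain JDBC terms the principle is roughly the following (a hypothetical sketch, not the actual Kettle code): with “Unique connections” enabled, all steps write through a single connection with auto-commit switched off, the transformation commits once at the very end, and any failure triggers a rollback so that nothing sticks in the table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransformationRollbackSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection and table, for illustration only.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "user", "password");
    conn.setAutoCommit(false); // no intermediate commits
    try {
      PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO customer (id, name) VALUES (?, ?)");
      ps.setInt(1, 1);
      ps.setString(2, "ACME");
      ps.executeUpdate(); // the "Table output" work happens here

      // Somewhere downstream the Abort step fires...
      boolean aborted = true;
      if (aborted) {
        throw new RuntimeException("Abort step triggered");
      }

      conn.commit(); // only reached when the whole transformation succeeds
      ps.close();
    } catch (Exception e) {
      conn.rollback(); // the inserted rows never become visible
    } finally {
      conn.close();
    }
  }
}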

To test it yourself, build from revision 6587 in trunk or download a nightly build tomorrow.

With a little luck (further tests and then more tests) we can back-port this fix to version 3.0.2 this week, ready for the 3.0.2GA release at the end of next week.

I’m hoping to extend this same principle to jobs as well in the (more distant) future.

Until next time,
Matt

Back to basics

A few days ago someone commented that Pentaho Data Integration (Kettle) was a bit too hard to use. The person on the chat was trying to load a text file into a database table and was having a hard time doing just that.

So let’s go back to basics in this blog post and load a delimited text file into a MySQL table.

If you want to see how it’s done, click on this link to watch a real-time (non-edited) Flash movie. It’s an 11MB download and about 2-3 minutes long.

Load customers flash demo
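For readers who prefer text over video, the hypothetical sketch below shows roughly the plumbing the transformation in the demo takes care of for you: read a delimited file, map the fields, and insert the rows into a MySQL table. The file layout, table and column names are made up for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadCustomersSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection and target table.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "user", "password");
    PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)");

    // Hypothetical delimited input file with a header row.
    BufferedReader reader = new BufferedReader(new FileReader("customers.csv"));
    String line = reader.readLine(); // skip the header row
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(";");
      ps.setLong(1, Long.parseLong(fields[0]));
      ps.setString(2, fields[1]);
      ps.setString(3, fields[2]);
      ps.addBatch();
    }
    ps.executeBatch();

    reader.close();
    ps.close();
    conn.close();
  }
}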

Until next time!

Matt

Making the case for Kettle

Dear data integration fans,

Once in a while, there are discussions on various blogs (usually with me smack in the middle of them) debating the differences between code generation and model-based execution, how this impacts the way we approach databases, the open nature of it all, etc.

With this blog entry I want to push the notion that Pentaho Data Integration (Kettle) didn’t just evolve by chance into what it is today: a streaming, metadata-driven, model-based engine. I made some careful design choices early on…

Open as possible

The goal of Kettle from the beginning was to be as open as possible. My definition of “as open as possible” included:

  • open source with an LGPL license (see this JBoss link [PDF] for a nice explanation)
  • open, readable XML format
  • open, readable relational repository format
  • open API
  • easy to set up (<2 minutes)
  • open to all kinds of databases
  • easy-to-use GUI
  • easy to pass data around
  • easy to convert data from/to any possible format

That list didn’t include any code generation. Although it was an option I seriously considered, I didn’t like the other code generators I had dealt with up to that point (mainly OWB and some home-brew stuff). The problem with those is that when things go wrong (and things always do go wrong), you pretty soon have to become an expert in the generated language/code.

That was most certainly the case in Oracle Warehouse Builder (OWB) at the time. Re-arranging operators in a certain order would yield inexplicably different results, and you would find yourself digging through megabytes of generated PL/SQL code. That was OK for simple exercises like loading a bit of data, but it became horrible for more complex transformations, including slowly changing dimensions.

As such, I considered this way of working not as open as possible, since the generated code is by definition not transparent and highly subject to change at re-generation. Specifically, re-generation invalidates the claim that you can change the code by hand “if you don’t like the generated code”.

All those considerations led to the decision to go for a stream-based, metadata-driven engine. That decision comes with many advantages and a few possible drawbacks. There are things to consider, implications for databases, etc. It’s only honest to say that there probably isn’t a perfect solution anyway.

Our engine is not perfect either, so let’s take a look at a few specific issue areas regarding databases and see how we handle those issues when they occur.

RDBMS / ELT

Even though splitting tasks up into different pieces allows you to run a lot of things in parallel, there is overhead because you need to pass data from one step to another.  We also suffer because we don’t have random access to all the data like an RDBMS does. Because of that, it is not unheard of for us to recommend that you run joins or sorts on the database server where the data is coming from. A database can typically sort faster because it does have random access to all the data: it can sort using indexes and doesn’t have to go through all the data.

By allowing the Kettle user/developer to execute whatever SQL or procedure he or she wishes, we are again as open as possible to this scenario.

Now, there are people who take this to the extreme and point to the miracles of ELT. Well, let me give another view on that. Suppose you wrote a nice ELT program and everything is being processed on the database server. While that might be a little faster on occasion, chances are very high that at least one of these conditions will occur:

  • the database server is not tuned for high CPU load (rather for I/O)
  • the source system you are reading from is already too occupied
  • parallelism is low or non-existing in a typical stored procedure
  • stored procedure languages are typically interpreted byte-code languages and CPU-wise pretty slow/costly.
  • debugging becomes very hard or impossible
  • you hit some limit on the database (there are many, even on Oracle), even long after you went live with your solution simply because tables and indexes grow.
  • deployment must be carefully managed or something goes wrong with it: stale or uncompiled procedures because tables changed, referenced functions changed, etc.

Personally I have faced all of these issues and a few more too. I had enough reasons to skip ELT completely at the time. Heck, I still don’t like it and personally I think for all the right reasons. This is going to sound harsh, but the reasoning I had and still have to some extent is that if you can’t write an SQL query, I don’t think you should be doing any heavy database lifting. Let the database do what it does best and do the rest outside of the database: pick the best tool for the job.

Latency

One other possible problem regarding databases is latency. Since we don’t run inside the database engine, we can feel it at times. Sometimes it’s not a problem and sometimes it is. Calls to the database have a certain turn-around time. We solve that for the better part by allowing you to launch multiple copies of a step in parallel; the “Database Lookup” step comes to mind here. This effectively opens multiple connections to the database and reduces overall latency.  Of course, a bigger reduction of latency is obtained with the use of in-memory lookups and large caches in PDI.  In a lot of cases it’s even faster to look up the data in an in-memory cache than to do a join on the database.  Again, you should pick the right tool for the job.
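As a hypothetical illustration of that last point, the sketch below loads a lookup table into a plain hash map once and then serves every incoming row from memory, so the per-row round-trip to the database disappears. The table and column names are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class CachedLookupSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/dwh", "user", "password");

    // One query up front to fill the cache...
    Map<String, Long> cache = new HashMap<String, Long>();
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery("SELECT customer_code, customer_id FROM dim_customer");
    while (rs.next()) {
      cache.put(rs.getString(1), rs.getLong(2));
    }
    rs.close();
    st.close();
    conn.close();

    // ...then every incoming row only pays for a hash lookup,
    // not for a network round-trip to the database.
    String[] incoming = { "ACME", "GLOBEX", "INITECH" };
    for (String code : incoming) {
      System.out.println(code + " -> " + cache.get(code));
    }
  }
}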

Conclusion

As we keep pushing our platform forward with many (many!) new things in the pipeline, we sometimes forget why certain choices were made and how we arrived where we are today. Well, I for one am happy to say that 6 years after the decision to go with the current Kettle architecture, I’m still very happy with it, and so should you be 😉

Until tomorrow,

Matt

P.S. This is a re-hash of some of my opinions in a post over on Marc Russel’s blog.