Help OpenMRS!!!

My friend and colleague Julian Hyde of Mondrian fame just blogged about this: help out the OpenMRS project, please!

The folks behind OpenMRS are helping to improve health-care systems in developing countries. In particular, they are fighting AIDS with this software. OpenMRS has certainly shown itself to be up to the task at hand: it is currently tracking the medical conditions of over a million people in 12 countries.

Because of the exponential growth in the number of users, this project is in urgent need of BI manpower. Julian and I have both agreed to help out with strategic advice for the BI part of OpenMRS.

If you want to be part of the team and you know a bit about the Pentaho Platform, Pentaho Data Integration (Kettle), Pentaho Analysis (Mondrian), MySQL and/or reporting, let us know (the sooner the better) and help us make the world a better place. Maybe you are a student or someone with limited income looking for an opportunity to brush up your BI skills; in that case the OpenMRS project can also offer you some money to help you out for a few months.

Please help Julian and me spread the word about this fantastic win-win opportunity.

Come work with us to help OpenMRS create a BI solution that saves lives.

Thank you in advance,

Matt

A new debugger for Kettle

We have been preparing Pentaho Data Integration for a feature freeze in a few weeks, when we'll release 3.0.0-RC1. However, before we do so, I wanted to write a (simple) debugger. It's important to get at least the API in there so that we can continue to build on top of it in the 3.x update releases.

How does it work? Well, suppose you have a simple transformation like this one:

A simple transformation

We just generate empty rows and add an id from 1 to 1000. Now we want to pause the transformation and see the content of the row where

id=387

Well, that is what we made possible. Simply click on the debug icon in the toolbar:

Debug icon in the toolbar

That will open up the debug dialog:

The new debug window

As you can see, we can specify a condition on which the transformation is paused. We can also specify that the last N rows before the condition was met should be kept in memory. Pressing OK and launching the transformation in the execution dialog will then show the requested rows:

Previewing rows

As you can see, the order of the rows is reversed for your convenience (most recent first). If you try this yourself, you will notice in the transformation log tab that the transformation you are debugging is paused. That means you can hit the resume button and the transformation will simply continue to run. If the condition is met again, the transformation will be paused again and another preview dialog will be presented.
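If you are curious about what happens behind the scenes, here is a minimal conceptual sketch in Java of the idea: keep the last N rows in a small buffer and pause the stream as soon as a row matches the condition. This is only an illustration of the mechanism, not the actual Kettle debugger API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Conceptual sketch only: a row-level breakpoint, not the actual Kettle API. */
public class RowDebugger<R> {

    private final int keepLastRows;               // "number of rows to retain"
    private final Predicate<R> condition;         // e.g. row -> row.getId() == 387
    private final Deque<R> buffer = new ArrayDeque<R>();
    private boolean paused = false;

    public RowDebugger(int keepLastRows, Predicate<R> condition) {
        this.keepLastRows = keepLastRows;
        this.condition = condition;
    }

    /** Called for every row passing through the step being debugged. */
    public synchronized void onRow(R row) throws InterruptedException {
        if (buffer.size() == keepLastRows) {
            buffer.removeFirst();                 // drop the oldest buffered row
        }
        buffer.addLast(row);
        if (condition.test(row)) {
            paused = true;                        // time to show the buffered rows...
            while (paused) {
                wait();                           // ...and block the stream until "resume"
            }
        }
    }

    /** Called when the user presses the resume button. */
    public synchronized void resume() {
        paused = false;
        notifyAll();
    }

    /** The rows to show in the preview dialog, oldest first. */
    public synchronized List<R> lastRows() {
        return new ArrayList<R>(buffer);
    }
}
```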

The old-style preview has also been converted to the new pause/resume capabilities.

One interesting observation is that the performance hit while running in debug or preview mode has been kept very low. The slowdown obviously depends on the number of conditions and the buffer sizes, but typically I think you will not experience any performance drop at all.

The Pentaho Data Integration development team and I really hope that these new capabilities will shorten the time you spend hunting down problems in complex transformations.

Until next time,

Matt

Back up a little…

Working with the Linux distribution Kubuntu full time for the last 4 months has been a pleasure compared to the years I worked with Windows (2000/XP). To pick one example I have always struggled with on Windows: backup and recovery.

Because I'm doing all sorts of things on my Linux laptop, I also have a LOT of files hanging around: test files, transformations, jobs, test code, branches, images, screen captures, screenshots, over 50,000 e-mails, the list goes on.

Originally I thought about scheduling a big “cp -pR” to an external hard disk, but I was pleasantly surprised that Kubuntu ships with Keep by default.

Keep backup

Using this program I have been synchronizing my home folder (/home/matt) to an external hard disk. It works using rsync and starts automatically around 23:00 every 2 days.

However… what about the software, the configuration files, the fixes, the updates and in general all the customizations that I did to my system? What about those? Well, I came across a piece of software called Mondo Rescue.

Mondo backup

Mondo Rescue can be installed on your Ubuntu system using the Adept software installer or with the command:

sudo apt-get install mondo

What Mondo does is create a backup of your complete system. Basically it puts all the information from your hard disk onto CD/DVD and even makes the first CD or DVD bootable. That way I can re-create my laptop from bare metal in case of a hard disk crash, theft, upgrade problems or plain old stupidity.

The mondoarchive command allows you to specify where you want to back up the data to (hard disk, CD, DVD) and which directories you don't want included.

Be aware that a mondoarchive backup will take many hours to run, so you typically fire it up during the night using cron; mondoarchive works fine from the command line. The backup can also go directly to .iso files, so you can burn those to CD/DVD the next morning for even greater peace of mind. Together with Keep (or rsync for the fans), this lets you make sure your system is safe from harm.

Until next time,

Matt

Back to basics

A few days ago someone commented that Pentaho Data Integration (Kettle) was a bit too hard to use. The person on the chat was trying to load a text file into a database table and was having a hard time doing just that.

So let’s go back to basics in this blog post and load a delimited text file into a MySQL table.

If you want to see how it's done, click on this link to watch a real-time (unedited) Flash movie. It's an 11MB download and about 2-3 minutes long.

Load customers flash demo
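If you would rather script it than click through Spoon, the same transformation can also be run from Java with the Kettle API. The sketch below assumes you saved the transformation as load_customers.ktr (an example file name); the class and method names are from the 3.x API as I remember them, so double-check them against the version you are running.

```java
import org.pentaho.di.core.util.EnvUtil;
import org.pentaho.di.trans.StepLoader;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

/** Minimal sketch: run a saved transformation outside of Spoon. */
public class RunLoadCustomers {
    public static void main(String[] args) throws Exception {
        EnvUtil.environmentInit();     // initialize the Kettle environment
        StepLoader.init();             // load the available step plugins

        // Parse the transformation XML that Spoon saved
        TransMeta transMeta = new TransMeta("load_customers.ktr");

        // Execute it and wait for all steps to finish
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.err.println("The transformation finished with errors.");
        }
    }
}
```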

Until next time!

Matt

Kettle 3 Milestone 2 is available

UPDATE: version 3.0 has been released!

Dear Kettle fans,

After a long period of bug-squashing and other frantic coding activities, we are happy to give you the second milestone of Kettle version 3.0.0 (a 77MB zip file).

What has changed since M1?

  • New icons!! This is the first release to include a new set of icons and as such a fresh new look.
  • A new Mondrian Input step to read from Pentaho Analysis (Mondrian) using MDX.
  • A new Regular Expression evaluation step
  • Access Input (don’t ask!)
  • Fixed / improved repository support
  • Improved database dialect handling (the SQL Server “..” problem and forcing identifiers to lower or upper case)
  • Vast improvements to the clustering/partitioning engine (dynamic partitioning, etc)
  • An “Explore DB” button for quick access to your databases
  • A complete set of internal changes with respect to the “look & feel” of Spoon
  • A series of improvements to the Lazy Conversion algorithms
  • The remaining steps that were missing in M1 got ported
  • Of course, a host of bugs got fixed as well

With this 3.0.0M2 release we're taking another big step towards stability and perhaps a GA release in a few months. However, please refrain from using this version in production situations!

Enjoy it!

Matt

Help us out and report bugs over here

Making the case for Kettle

Dear data integration fans,

Once in a while, there are discussions on various blogs (usually with me smack in the middle of them) debating the differences between code generation and model-based execution, how this impacts the way we approach databases, the open nature of it all, and so on.

With this blog entry I want to push the notion that Pentaho Data Integration (Kettle) didn't just evolve by chance into what it is today: a streaming, metadata-driven, model-based engine. I made some careful design choices early on…

Open as possible

The goal of Kettle from the beginning was to be as open as possible. My definition of “as open as possible” included:

  • open source with an LGPL license (see this JBoss link [PDF] for a nice explanation)
  • open, readable XML format
  • open, readable relational repository format
  • open API
  • easy to set up (<2 minutes)
  • open to all kinds of databases
  • easy-to-use GUI
  • easy to pass data around
  • easy to convert data from/to any possible format

That list didn't include any code generation. Although it was an option I seriously considered, I didn't like the code generators I had dealt with up to that point (mainly OWB and some home-brew stuff). The problem with those is that when things go wrong (and things always do go wrong) you soon have to become an expert in the generated language and code.

That was most certainly the case with Oracle Warehouse Builder (OWB) at the time. Re-arranging operators in a certain order would yield inexplicably different results, and you found yourself digging through megabytes of generated PL/SQL code. That was OK for simple exercises like loading a bit of data, but it became horrible for more complex transformations, including slowly changing dimensions.

As such, I didn't consider this way of working as open as possible, since the generated code is by definition not transparent and highly subject to change at re-generation. Specifically, re-generation invalidates the claim that you can change the code by hand "if you don't like the generated code".

All those considerations led to the decision to go for a stream-based, metadata-driven engine. That decision comes with many advantages and a few possible drawbacks. There are things to consider, implications for databases, and so on. It's only honest to say that there probably isn't a perfect solution anyway.

Our engine is not perfect either, so let's take a look at a few specific issue areas regarding databases and see how we handle those issues when they occur.

RDBMS / ELT

Even though splitting tasks up into different pieces allows you to run a lot of things in parallel, there is an overhead because you need to pass data from one step to another. We also suffer from not having random access to all the data the way an RDBMS does. Because of that, it is not unheard of for us to recommend that you run joins or sorts on the database server where the data is coming from. A database can typically sort faster because it does have random access to all the data: it can sort using indexes and doesn't have to go through all the data.

By allowing the Kettle user/developer to execute whatever SQL or procedure he or she wishes, we are again as open as possible to this scenario.
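To make that concrete, here is a small sketch (in plain JDBC rather than a Kettle step, with an invented connection URL, table and column names) of what "let the database do the sorting" means in practice: put the ORDER BY in the query that feeds the stream instead of sorting the rows afterwards.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Sketch of pushing a sort down to the database: the ORDER BY runs where the
 * data lives and we simply stream the already-sorted rows.
 */
public class SortedExtract {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/sales", "user", "password");
             Statement stmt = con.createStatement();
             // The database can use its indexes; no Sort rows step is needed downstream.
             ResultSet rs = stmt.executeQuery(
                 "SELECT customer_id, amount FROM orders ORDER BY customer_id")) {
            while (rs.next()) {
                process(rs.getLong("customer_id"), rs.getDouble("amount"));
            }
        }
    }

    private static void process(long customerId, double amount) {
        // hand the row to the next step in the stream
    }
}
```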

Now, there are people who take this to the extreme and point to the miracles of ELT. Well, let me give another view on that. Suppose you wrote a nice ELT program and everything is being processed on the database server. While that might be a little faster on occasion, chances are very high that one of these conditions is going to occur:

  • the database server is not tuned for high CPU load (rather for I/O)
  • the source system you are reading from is already too occupied
  • parallelism is low or non-existent in a typical stored procedure
  • stored procedure languages are typically interpreted byte-code languages and pretty slow/costly in CPU terms
  • debugging becomes very hard or impossible
  • you hit some limit in the database (there are many, even on Oracle), possibly long after you went live with your solution, simply because tables and indexes keep growing
  • deployment must be carefully managed or things go wrong: stale or uncompiled procedures because tables changed, referenced functions changed, and so on

Personally, I have faced all of these issues and a few more besides; I had enough reasons to skip ELT completely at the time. Heck, I still don't like it, and I think for all the right reasons. This is going to sound harsh, but the reasoning I had, and still have to some extent, is that if you can't write an SQL query, you shouldn't be doing any heavy database lifting. Let the database do what it does best and do the rest outside of the database: pick the best tool for the job.

Latency

One other possible problem regarding databases is latency. Since we don't run inside the database engine, we can feel it at times. Sometimes it's not a problem and sometimes it is: calls to the database have a certain turn-around time. We solve that for the most part by allowing you to launch multiple copies of a step in parallel; the “Database Lookup” step comes to mind here. This effectively opens multiple connections to the database and reduces overall latency. Of course, an even bigger reduction in latency is obtained with the use of in-memory lookups and large caches in PDI. In a lot of cases it's faster to look up the data in an in-memory cache than to do a join on the database. Again, you should pick the right tool for the job.
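As an illustration of why an in-memory cache can beat a round trip per row, here is a hedged Java sketch: load the lookup table once into a HashMap and resolve every incoming row from memory instead of firing a query for each one. The connection URL, table and column names are invented for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: replace one database round trip per row with a single query that
 * primes an in-memory cache, so per-row lookups become hash probes.
 */
public class CachedLookup {
    private final Map<String, Long> customerIdByCode = new HashMap<>();

    /** One query up front instead of one query per incoming row. */
    public void prime(Connection con) throws Exception {
        try (Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT code, customer_id FROM customer")) {
            while (rs.next()) {
                customerIdByCode.put(rs.getString("code"), rs.getLong("customer_id"));
            }
        }
    }

    /** Per-row lookup now costs a hash probe, not a network round trip. */
    public Long lookup(String code) {
        return customerIdByCode.get(code);
    }

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/sales", "user", "password")) {
            CachedLookup lookup = new CachedLookup();
            lookup.prime(con);
            System.out.println(lookup.lookup("ACME"));
        }
    }
}
```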

Conclusion

As we keep pushing our platform forward with many (many!) new things in the pipeline, we sometimes forget why certain choices were made and how we arrived where we are today. Well, I for one am happy to say that 6 years after the decision to go with the current Kettle architecture, I'm still very happy with it, and so should you be 😉

Until tomorrow,

Matt

P.S. This is a re-hash of some of my opinions in a post over on Marc Russel’s blog.