Data Mining

Perhaps you caught the news, perhaps not, but Pentaho just got a bit larger again with the acquisition of the Weka project.
I’m really excited about this because it means we can finally crank out a couple of new steps for Kettle without having to release the whole Kettle project under the GPL license. We could create new Weka plugins under GPL and then offer them to customers under a commercial license as well.

Some time ago I played around with Weka a bit and found it extremely hard to read data into the different available engines. The plan I have for the Kettle-Weka integration is to build a couple of steps that give you the best of both worlds: easy drag&drop data integration and state-of-the-art data mining modules. If it weren't for the heavy workload I'm under, I would get started on this right away. Unfortunately though, the new meta-data architecture for Pentaho (+GUI) is taking its fair share of time to develop.

Another option to aim for is the creation or inclusion of a data profiler in Pentaho Data Integration, to analyse source data.
All in all, these are very exciting times for Pentaho. It’s an honour to be able to take part in it.

data mining with weka GUI sample

Feature complete

OK, it’s about time I created another category on my blog. So I said to myself: “There must be thousands of people blogging about Linux, why not you?”
I really love my Linux laptop. I really do. Some people like to complain about this or that not working, and usually it concerns some multimedia feature. Up until recently I would have sighed and said that, yes indeed, Linux on the desktop is not really there yet, maybe next year.

Of course, we (the Linux zealots) have been saying "Maybe next year" for quite a few years now. Until something strange happened a couple of months ago: I bought a new laptop, in this case the cheapest dual core I could find, an Acer 5610, and I absolutely love running Linux on it.
The reason for buying it was a demo I had to do at Linuxtag in Germany, while my other laptop was stuck at Acer Support for 3 weeks.

Anyway, the strange thing about it is that it only took me 1 hour to install SuSE Linux on it and see that it worked absolutely fine. It took another hour to re-install it on the complete hard disk, because it was clear from that moment on that I wasn't going to get anything out of the Microsoft tax I had paid.
After installing SuSE Linux 10.1 there were only 3 things that didn't work. The first was GLX (still not considered production-ready by most distributions), but amazingly, that was fixed in half an hour by following this link. That left 2 things that were not working. The first of those was my wireless LAN, but because I'm using the laptop here in my office I didn't really need it.

The second problem was that nothing multimedia worked. Fortunately that was fixed almost instantly by the MPlayer wizards at Linuxtag. It was really only a matter of time anyway, because there are plenty of articles on the web these days explaining how to do it.

This week, almost by accident, I came across the solution for the wireless LAN: apparently I hadn't downloaded the 10.1 Add-on CD that contains the WLAN kernel modules and the 3945 daemon. I just installed it, changed the networking method to ifup, and it worked.

I’ve been using it for the last couple of months and this machine really rocks. Let me list some advantages:

– A complete build of the Kettle distribution zip file (ant zip) takes 31 seconds on Linux and 5 minutes on Windows XP Pro, even though the Windows box has a 2 GHz Pentium-M and is faster on paper. I bet those virus scanners have something to do with it…

– GLX/Compiz, once you turn off the wobbly windows, is really a joy to work with.

– Amarok!!! This media player is the best thing that could have happened to Linux. It's so good I hooked up the stereo to my laptop to listen to internet radio stations. No commercials for me anymore!
– Rock solid: even though at times you run really heavy stuff on it, the box stays responsive because of the extra CPU. I’m convinced that this is the way forward for heavy users like myself. My Windows box just locks up waaaay too often lately to be even funny anymore.

– It takes Evolution (my e-mail client) 5 seconds to retrieve 30 e-mails from the POP server. On Windows the arrival of a single e-mail locks up my machine for 5 seconds. Again, I bet that anti-virus software has something to do with it. That being said, if I turn off anti-spam on the Linux box, the retrieval is almost instant. It makes sense: 30 mails is usually not even a megabyte of data!

To conclude, I think that SuSE is doing a good job by offering pre-install images for laptops. I would be one of the first people to jump on this. I would pay good money to have laptop vendors take care of the steps mentioned above. In any case, enjoy the screenshot of my desktop below…

Until next time,


This is it...

Kettle webinar at MySQL!

Hi Kettle fans,

The 2.3.1 release has been dropped! These kinds of things are always a bit troublesome because of the testing I need to do to get it out, the documentation updates, etc, etc. It’s also the right time for me to do bug administration: clean up old stuff, etc. This is not the most enjoyable type of activity and I’m glad it’s over and done with.

Anyway, it’s about time we did something enjoyable again: a webinar! Yes, we have been invited to do a MySQL Webinar next Tuesday. At 10:00 am Pacific, 1:00 pm Eastern or 17:00 GMT for the people living in my timezone. The presentation will be approximately 45 minutes long followed by Q&A.

I think that this is a great opportunity to show off Kettle and MySQL. It has to be said, most questions and samples I see from the community run on MySQL, so they seem to be a great match indeed.

So, please do register (it’s free) and join us at the MySQL Web Seminar on Kettle next Tuesday!!

See you then,


Deploying transformations

A couple of days ago, Jens and I added about 30 samples to subversion and the Kettle distribution. (Jens is one of the more active Kettle committers and works for Pentaho partner Proratio.)
One of the problems we were facing with the introduction of more complex examples is the location of the input and output files. One way of handling this is by defining an environment variable that points to the right location. This requires manual work and since we are all as lazy as possible, Jens came up with a new idea. Clever as he is, he proposed to generate internal Kettle variables that point to the location of the transformation file. The files can then be stored relative to this transformation.

Normally I wouldn’t have bothered to include this feature in the upcoming 2.3.1 release but since it’s a new feature and it has little to no impact on the rest, I included it anyway. (My hands were itching too much to implement this :-))

It's a win for a lot of people, though, because deploying transformations and jobs in your organisation just became a whole lot easier, as you can see in the screenshot below.

These are the variables that were added:


Variable Name                  Sample value
Internal.Kettle.Build.Date     2006/09/06 18:01:39
Internal.Kettle.Build.Version  9
Internal.Kettle.Version        2.3.1

In a transformation

Variable Name                                 Sample value
Internal.Transformation.Filename.Directory    D:\Projects\SVN\javaForge\Kettle\trunk\samples
Internal.Transformation.Filename.Name         Denormaliser – 2 series of key-value pairs.ktr
Internal.Transformation.Name                  Denormaliser – 2 series of key-value pairs sample
Internal.Transformation.Repository.Directory  /

In a Job

Variable Name                      Sample value
Internal.Job.Filename.Directory    /home/matt/jobs
Internal.Job.Filename.Name         Nested jobs.kjb
Internal.Job.Name                  Nested job test case
Internal.Job.Repository.Directory  /
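To make the use of these variables concrete, here is a minimal sketch, in Python rather than Kettle itself, of how a ${...}-style variable in a file path gets resolved. The variable names are the real ones from the tables above; the resolver function and the sample value are just an illustration.

```python
import re

def resolve_variables(path, variables):
    """Replace ${Name} tokens in a path with values from a dictionary,
    leaving unknown tokens untouched."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  path)

# In Kettle the variable is set at runtime; this value is just an example.
variables = {"Internal.Transformation.Filename.Directory": "/home/matt/samples"}

print(resolve_variables(
    "${Internal.Transformation.Filename.Directory}/input.csv", variables))
# /home/matt/samples/input.csv
```

Referring to input files as ${Internal.Transformation.Filename.Directory}/input.csv means the files travel along with the transformation, wherever it gets deployed.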

Until next time,


Internal variable use case example

Key-value madness

People who write data integration solutions often have a tough job at hand. I can tell, because I get to see all the questions and bugs that get reported.

That is the main reason I committed a first collection of 30 examples to the codebase to be included in the next GA release (2.3.1 will be released next Friday, more on that later).

Today I would like to talk about one of the more interesting examples: the de-normaliser step. What the de-normaliser step does is help you look up key-value pairs so that you can assign the value, after lookup, to a certain field. The step is a fairly recent addition to Pentaho Data Integration, but it gets a lot of attention. I guess that's because key-value pairs are often used in situations where programmers need a very flexible way of storing data in a relational database.

The typical example is that you have a Customer table with a Customer ID, and all of a sudden 50 extra attributes need to be stored in the database somewhere. However, the main Customer table is not chosen to store these fields (usually because they would be empty in over 90% of cases); instead they are stored in a table containing key-value pairs. These key-value pair tables typically look like this:

CustomerID  Key                 Value
101         COUSINS_BIRTH_DATE  1969/02/14
101         COUSINS_INCOME      1723.86
101         …45 more keys…

OK, that's fine. The problem starts when the reporting and data warehouse developers want to get all 50 of these fields into the Customer dimension. That is where it would make sense to de-normalise the data and present it in an easy-to-retrieve fashion, next to the other customer data. That's the point of the data warehouse, remember.

Typically, what you would do is launch 50 lookups like this, one per key:

SELECT Value AS Cousins_First_Name
FROM Customer_Key_Value
WHERE CustomerID=101
  AND Key='COUSINS_FIRST_NAME'

Now, you can imagine the maintenance nightmare that results from such a system:

  • If key-value pairs get added we need to add extra lookups
  • If a key changes, we need to look in 50 lookups to find the right one
  • What if, for some reason, 2 or more of the same keys appear for the same CustomerID (a key collision)? What would you do: take the first, take the last, concatenate them separated by commas? Perhaps an error should be given, causing your transformation to fail?
  • Usually, not only strings but also dates, numbers and booleans are stuffed into the Value field, so all kinds of data conversions have to take place.

Also, the performance would go right down the drain because of the many individual calls to the database. That is why we created the de-normaliser step: to let you solve this problem once and for all. The way to use it is easy. First, you join with the rest of the customer dimension:

SELECT C.*, KV.Key, KV.Value
FROM Customer C, Customer_Key_Value KV
WHERE C.CustomerID=KV.CustomerID

This way we get all the key-value pairs in multiple rows and then apply the step as shown in this image:

Sample Denormaliser Step

As you can see, we have put all the key-value pair lookups, the conversions AND the aggregation logic into a single dialog making it very easy to do maintenance.
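For readers who want to see the mechanics outside of Kettle, here is a minimal Python sketch of what the de-normaliser step does with the joined rows: group the key-value pairs per customer and pivot each key into a typed target field. The field names and conversion rules are made up for this example.

```python
from datetime import datetime

# Key-value rows as they come out of the join: (CustomerID, Key, Value)
rows = [
    (101, "COUSINS_BIRTH_DATE", "1969/02/14"),
    (101, "COUSINS_INCOME", "1723.86"),
]

# Target field -> (key to look up, conversion to apply); illustrative only.
field_map = {
    "cousins_birth_date": ("COUSINS_BIRTH_DATE",
                           lambda v: datetime.strptime(v, "%Y/%m/%d").date()),
    "cousins_income": ("COUSINS_INCOME", float),
}

def denormalise(rows, field_map):
    # Group the values per customer; key collisions could be handled here
    # (take the first, take the last, concatenate, or raise an error).
    grouped = {}
    for customer_id, key, value in rows:
        grouped.setdefault(customer_id, {})[key] = value  # "take the last"
    # Pivot each group into a single record with typed fields.
    records = []
    for customer_id, pairs in grouped.items():
        record = {"customer_id": customer_id}
        for field, (key, convert) in field_map.items():
            record[field] = convert(pairs[key]) if key in pairs else None
        records.append(record)
    return records

print(denormalise(rows, field_map))
```

One database round trip plus an in-memory pivot replaces the 50 individual lookups, which is where the performance win comes from.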

Now, yesterday, someone asked me if the step could also de-normalise multiple value fields for the same key. Since the requirement never came up until yesterday, I had to say that it could not. However, this morning I added support for this as part of a fix for a bug report. So now you can indeed de-normalise different value fields using the same key (if you update to the latest kettle.jar library).

The relevant example can be found in the new release distribution in the samples directory. (or if you access the source code using subversion)

The file is called:

Denormaliser – 2 series of key-value pairs.ktr

Until next time,


Simpler reporting: make your data richer

A lot of the time, I hear discussions about which reporting tool is the easiest to use for certain special tasks. Most of the time I just ignore these "threads" because, as a developer of ETL solutions, it's not my cup of tea.
However, it has to be said, often the solution to complex reporting requirements is to be found in ETL.
When you find yourself struggling with complex reports that need any of the following:

  • compare different records
  • aggregate beyond simple sums and averages
  • report on non-existing records (report 0 sales, etc.)

Well, in those cases you need ETL.

Let's take, for example, the case of reporting on non-existing sales: how can you report that there have been 0 sales for a certain product during a certain week? Well, you can create an aggregate table in your ETL that contains the following:

  • Dimensions
    • Time (end of week)
    • Product (lowest level)
  • Facts
    • Sum of Sales for the product during the past week.

You will agree with me that if you create such a simple fact table, the report based on it will also be simple. That's not all, though: you get another benefit from this. You create the possibility of extending the fact table with other interesting data as it comes up. For example, in the same transformation that populates this aggregate fact table, you can add the following (examples):

  • Facts
    • Total Sales last month
    • Total Sales last quarter
    • Total Sales last half year
    • Total Sales Year-to-date
    • Total Sales ever
    • Difference in sales with last week

You see, the initial investment of building the ETL job is not thrown away over time, you build upon the earlier effort and grow with the needs.
Now you might think that these metrics cost a lot in terms of CPU and I/O to calculate, but of course that's not true: you can just look at the numbers for the previous week and do a simple addition.
Sample aggregation transformation
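To illustrate that "simple addition", here is a toy sketch (plain Python, not actual ETL code; the column names are invented for this example) of deriving this week's aggregate row from last week's row plus the new weekly total:

```python
def next_week_row(prev, sales_this_week, new_year=False):
    """Build this week's aggregate fact row from last week's row,
    instead of rescanning all of the sales history."""
    return {
        "sales_this_week": sales_this_week,
        # Year-to-date resets at the start of a new year, else accumulates.
        "sales_year_to_date": (0.0 if new_year
                               else prev["sales_year_to_date"]) + sales_this_week,
        "sales_ever": prev["sales_ever"] + sales_this_week,
        "diff_with_last_week": sales_this_week - prev["sales_this_week"],
    }

prev = {"sales_this_week": 120.0,
        "sales_year_to_date": 4200.0,
        "sales_ever": 98000.0}

print(next_week_row(prev, 150.0))
```

Each derived metric costs one addition or subtraction per product per week, no matter how much history sits behind it.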

The added advantage of doing this overnight in the ETL jobs is that reports running on this aggregate are bound to be very fast, because no calculations need to take place after the fact.

Finally, let me say that by constructing your data warehouse like this (one fact table derived from another) you're actually enriching your data, turning it into high-quality information… which brings me to this famous quote, as what we're doing here is exactly the opposite:

“Where is the wisdom? Lost in the knowledge.
Where is the knowledge? Lost in the information.”

T.S. Eliot

“Where is the information? Lost in the data.
Where is the data? Lost in the #@$%?!& database.”

Joe Celko

Until next time,