3D Video games

I just came across this older article with a priceless quote from Neil Raden in it:

Look, I’m playing a 3-D video strategy game with four people in China I don’t even know, while I’m downloading data to my iPod, while I’m answering messages in Yahoo messenger. Are you going to tell me I can’t have a report for three months because it has to go through QA?

Ain’t that true!  It reminded me as well of something I saw a while ago on another blog somewhere…

There are other gems in that interview, look for yourself.

Until next time,



ETL vs Reporting : more chatting

More chatting with a self-acclaimed “Noob”. (not a real nickname)  The soup was provided by yours truly.

I’m posting this as it might be interesting for others as well. 


(16:56:19)  Noob:  Hey man
(16:56:26)  Noob:  i wanted to ask about DWH
(16:56:31)  soup-nazi:  k
(16:56:53)  Noob:  after alot of reading and such am i right in assuming that most BI solutions are static as such
(16:57:09)  Noob:  consdering we have to decide on DWH dimensions and fact tables before we start
(16:57:46)  soup-nazi:  No, you are not right, in fact…
(16:58:19)  soup-nazi:  You do need to decide up-front, but the fact that people change and requirements change is absolute
(16:59:11)  soup-nazi:  So what you need is a water-fall project model: Gather requirements – Design model – make technical design – implement ETL – make reports / testing – handover / docs
(16:59:31)  soup-nazi:  Each time you go through the cycle your model is updated and the ETL changes.
(16:59:50)  soup-nazi:  A star schema is designed to allow this to happen.  It copes very well with changes to the requirements.
(17:00:27)  soup-nazi:  Your DWH can take a punch when you design it correctly and use technical (or surrogate) keys everywhere.
(17:01:09)  soup-nazi:  k?
(17:01:34)  Noob:  Thats interesting. so would you consider a customer as an ongoing project and if the customer required changes such as new reports you would have to add to your original development
(17:02:07)  soup-nazi:  Yes, absolutely.
(17:02:51)  soup-nazi:  Here is the unconvenient truth: users change their mind.  And they should be able to change their minds.
(17:03:09)  soup-nazi:  Business requirements change, everything changes.
(17:03:26)  soup-nazi:  In some organisations things change faster than in others, but changes are happening everywhere.
(17:03:59)  Noob:  thats very true. have you ever come up against a customer that wanted more control to devlop thier own reports as things change?
(17:04:22)  Noob:  as learning the ins and outs of the Pentaho BI platform isn’t an easy job
(17:05:16)  soup-nazi:  Yes, that’s why we developed Pentaho metadata and the Web Based Ad-hoc Querying and Reporting (WAQR) solution
(17:05:39)  soup-nazi:  That allows plain simple users to build their own reports.
(17:05:47)  soup-nazi:  online
(17:05:49)  soup-nazi:  on the web
(17:05:54)  soup-nazi:  available now
(17:06:00)  soup-nazi:  in version 1.5M3 or later
(17:06:08)  Noob:  actually build thier own reports? not just edit ETL transformations?
(17:06:41)  soup-nazi:  yes, one has nothing to do with the other.
(17:06:48)  Noob:  you will have to forgive me if i ask stupid questions about the pentaho products. ive been thrown in the deep end here 🙂
(17:06:54)  soup-nazi:  ETL (Kettle) is not meant for reporting.
(17:07:46)  Noob:  Okay, well i have this little dedicated linux box sitting next to me and the plan is to use it to build a BI solution
(17:07:55)  Noob:  ive started with what i know best, Kettle
(17:08:14)  Noob:  and transformed my data into a DWH for some basic reporting
(17:08:39)  Noob:  from then on i would use the other Pentaho tools such as report designer and the .xaction editor to create the rest?
(17:09:00)  soup-nazi:  Yes.
(17:09:21)  soup-nazi:  Most of these tools start out with a SQL query that defines your data.
(17:09:45)  soup-nazi:  The SQL query is usually simple because it reads from a star schema DWH.
(17:10:08)  soup-nazi:  With Pentaho metadata you can even eleminate that SQL writing by designing your data model;
(17:10:21)  soup-nazi:  WAQR then writes the SQL for you based on a user-selection.
(17:10:35)  Noob:  ahhhhh, very nice =)
(17:10:54)  Noob:  WAQR you say. is there a release? maybe ive already seen it
(17:11:47)  soup-nazi:  Download Pentaho 1.5 milestone 3, it’s in there. (Create report from the first page after login)
(17:12:01)  soup-nazi:  It’s some AJAX like frontend
(17:12:12)  Noob:  AJAX, i like! =)
(17:12:25)  soup-nazi:  1.5M3 is available from the Pentaho frontpage at the moment
(17:14:11)  Noob:  Okay, im downloading it just now to have a look.
(17:14:32)  soup-nazi:  Pentaho metadata 1.5M3 is also available for download BTW.
(17:14:40)  Noob:  but in all, we start with Kettle and place all our OLTP data into a DWH
(17:14:59)  Noob:  from there we do everything else. reporting, dashboards, charts ect ect
(17:15:17)  soup-nazi:  yes.
(17:15:23)  Noob:  all the data for these reports is taken from the DWH
(17:15:28)  soup-nazi:  yes
(17:15:37)  Noob:  okay. starting to get a better understanding
(17:15:57)  soup-nazi:  In fact, once you put the lowest level of data, the transactions, in the DWH, you never go back to the source system.
(17:16:19)  soup-nazi:  You can build what we call aggregates or analytical fact tables that are derived from the lowest level fact tables.
(17:16:50)  soup-nazi:  For example, if you want to know how long ago it was that each individual customer bought something, you need to do lookups.
(17:17:05)  soup-nazi:  Those lookups are easier in ETL than they are in reporting tools. (virtually impossible).
(17:17:12)  soup-nazi:  So you build an aggregate, etc…
(17:17:38)  soup-nazi:  You refine the data, you lift it until at the very end you get a highly top-level fact table to drive your dash-boards.
(17:18:36)  Noob:  wow, its alot to take in. just aswel im gonna keep a copy of this conversation. Also i appreciate your helping this *Noob* =)
(17:18:55)  soup-nazi:  np


Data vs Metadata : Kettle 3.0

A few weeks ago we started work on Kettle 3.0.
One of the notable changes is a redesign of the internal data engine.
More to the point, we’re aiming for a strict separation of data and metadata.

The main reason for doing so was the reduction of object allocation and also to allow us to extend the metadata without a performance impact.
We anticipated a performance gain here and there, and initial test-code gave us a 15-20% increase in performance to hope for.

Who could have guessed that reading this file, sorting it and writing it back to disk would turn out to become more than 5 times faster? (4.83 seconds!!!)  Granted, we take the opportunity to do a full code review and if we spot a possible performance problem, we fix it too.   

We’re also making a library of test-cases to run performance and regression tests again version 2.5 code.  The result with comparison (speedup calculation) is posted here.

One of the nice things about the code changes is that although it will break the API, it’s a Good Think(TM) for the project in the long run.  It will give us breathing room to keep innovating in the future.  Speeding up steps between 15 and 1700% is a good start for the first 20 steps that are converted.  It’s also nice to see a lot of test-cases double or triple in speed:

  • Select Values : x9 (Random select), x6 (Delete field), x5.5 (Change metadata)
  • Calculations: between x1.8 and 3.6 faster
  • Add sequence : up to 3x faster
  • Table Output : 15% faster up to x2
  • Add constant values: x5
  • Filter rows: x1.5
  • Row generator: x2.5 - x3 (up to 1.2M empty rows/s)
  • Sort rows: x1.15 - x5
  • Stream Lookup: x1.24 - x1.36 (and this step already got some serious tuning in the past)
  • Table Input: x1.2
  • Text File Input: x1.6 - x3.4
  • Text File Output: x1.75 - x2.9

On top of that, memory usage should have been reduced as well, although we don’t have any exact figures on that.

Until next time,



Spam, spam, lovely spam

About 10 months ago I started to receive a lot of spam messages on this bog. The site is powered by WordPress and so I installed Askimet.  So far, Askimet captured over 22.000 spam messages. 

Although I never found a false positive so far, I always make it a point to skim over the rejected messages, just to make sure. 

Usually the log is filled with ****, **** and ****, but the spammer below actually expresses profound regret for having being forced to send me a message.

Hello Sirs, I’m very sorry for my post
buy v*agra

Well, I guess they thought it wouldn’t be as bad if they made up excuses for it or something.   On the one hand this is a funny message, on the other hand it’s very sexist to assume you have to have at least 2 men to write a blog entry. 

Oh well.

Until next time,


Pot to Kettle : you’re version 2.5 now!

The Kettle development team is proud to release version 2.5.0 GA into the wild. After the very successful version 2.4.0 our expectations are very high for this release.

For a complete change log, see here: Kettle_2_5_0_change_log.pdf 
To download the source of binary distribution go to the sourceforge download section of Pentaho Data Integration 2.5.0 or download a torrent: Kettle-2.5.0.zip.torrent or Kettle-src-2.5.0.zip.torrent

As always, we are looking forward to reading your feedback on this version.

Until next time!


Kettle 2.5.0

Serving SQL

How do you like your SQL served? Personally I like this type:

SELECT   dc.state
       , dd.year
       , dd.month
       , sum(fs.sales) as total_sales
FROM     fact_sales     fs,
         dim_date       dd,
         dim_customer   dc
WHERE    fs.customer_tk = dc.customer_tk
AND      fs.date_tk     = dd.date_tk
AND      dd.year IN ( 2006, 2007 )
GROUP BY dc.state
       , dd.year
       , dd.month

Obviously, it’s inspired by simple star schema querying. Anything goes for me, but I kinda like a little bit of “structure” in my SQL (no pun intended), just a little bit of overview.

What’s your favorite SQL format?


P.S. We will soon be launching PDI version 2.5.0, stay tuned!