Going virtual

Development of Pentaho Data Integration has been making good headway lately.  One of the cool new things we recently implemented is the ability to reference source files, transformations and jobs from any location you like.

The underlying library we use to do that is Apache Commons VFS (Virtual File System).
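
To give you an idea of what happens under the hood, here is a minimal sketch of resolving that same job file through Commons VFS.  The package names are those of the VFS 1.x line; the demo class itself is made up for illustration:

    import org.apache.commons.vfs.FileObject;
    import org.apache.commons.vfs.FileSystemManager;
    import org.apache.commons.vfs.VFS;

    public class VfsDemo {
        public static void main(String[] args) throws Exception {
            // Ask VFS for its default file system manager.
            FileSystemManager fsManager = VFS.getManager();

            // Resolve a job file straight off a webserver; any scheme
            // VFS understands (file, http, ftp, zip, ...) works here.
            FileObject jobFile = fsManager.resolveFile("http://www.kettle.be/GenerateRows.kjb");

            // From here on, the content is available like any other file.
            System.out.println(jobFile.getName() + ": " + jobFile.getContent().getSize() + " bytes");
        }
    }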

Here is a simple example that you can try with the latest dev version:

sh kitchen.sh -file:http://www.kettle.be/GenerateRows.kjb

Let’s have a look at this job in Spoon.  To open it directly from the URL above, follow this procedure:

  • Select the “Open file from URL” option.
  • Type in the URL.
  • Select OK; the job will load in Spoon.

The transformation we are about to launch is also located on the webserver.  The internal variable holding the job’s filename directory is:

Internal.Job.Filename.Directory    http://www.kettle.be/

This allows us to reference the transformation as follows (the transformation filename here is an assumption on my part):
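
    ${Internal.Job.Filename.Directory}/GenerateRows.ktr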

Please note that if you try this yourself, you’ll find that you can’t save the job back to the webserver.  That is not because we don’t support that, but because you don’t have permission to do so.

Please have a quick look at the almost endless list of possibilities over here.  These include direct loading from zip files, gz files, jar files, RAM drives, SMB, (S)FTP, HTTP(S), etc.
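
To give you a taste, here are a few example URLs in the styles the Commons VFS documentation lists:

    zip:http://somehost/downloads/mysite.zip!/some/file.txt
    jar:../lib/classes.jar!/META-INF/manifest.mf
    tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt
    ram:///any/path/to/file.txt
    sftp://username:password@somehost/pub/downloads/somefile.tgz
    smb://somehost/home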

We will extend this list even further in the near future with our own drivers for the Pentaho solutions repository and later on for the Kettle repository (something like psr:// and pdi:// URIs).
As cool examples go, here is one style to end with: running a job straight out of a zip archive sitting on the webserver (the archive name is a made-up illustration):
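
sh kitchen.sh -file:zip:http://www.kettle.be/GenerateRows.zip!/GenerateRows.kjb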

Until next time,

Matt

Handling errors

In the next milestone build of Pentaho Data Integration (2.4.1-M1, expected around February 19th) we will be introducing advanced error handling features.
We looked hard for the easiest and most flexible way to implement this, and I think we have found a good solution.

Here is an example:

Error handling table output sample

The transformation above works as follows: it generates a sequence between -1000 and 1000.  The table is a MySQL table with a single “id” column defined as TINYINT.  As you all know, that data type only accepts values between -128 and 127.

So what this transformation does is insert the 256 rows that fit (-128 through 127) into the table and divert the other 1,745 of the 2,001 generated rows to a text file, our “error bucket”.

How can we configure this new feature?  Simple: right-click on the step where you want the error handling to take place, in this case the “Table output” step.  If the step supports error handling, you will see this popup menu appear:
Error handling popup menu

Selecting the highlighted option will present you with a dialog that allows you to configure the error handling:

Error handling dialog

As you can see, you can not only specify the target step to which you want to direct the rows that caused an error, you can also include extra information in those rows so that you know exactly what went wrong.  In this particular case, these are the extra fields that will appear in the error rows:

  • nrErrors: 1
  • errorDescription:  be.ibridge.kettle.core.exception.KettleDatabaseException:
    Error inserting row
    Data truncation: Out of range value adjusted for column ‘id’ at row 1
  • errorField: empty because we can’t retrieve that information from the JDBC driver yet.
  • errorCode: TOP001 (placeholder, final value TBD)
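
To make that concrete: a row landing in the text file “error bucket” might look something like this (values and layout are illustrative and depend on your text file output settings):

    id;nrErrors;errorDescription;errorField;errorCode
    128;1;Error inserting row: Data truncation: Out of range value adjusted for column 'id' at row 1;;TOP001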

At the moment we have only equipped the “Script Values” step (easy to cause errors with) and the “Table Output” step with these new capabilities.  However, in the coming weeks, more steps will follow suit.

Until next time,

Matt

ETL/Reporting deathmatch

To bring new life to the eternal ETL vs. Reporting deathmatch, Thomas Morgner (lead developer of Pentaho Reporting) and I wrote a plugin for Kettle to drive the shiny new reporting engine that is being developed.  It seems that each time we meet for a short while, some code gets written. 🙂
The new engine works a lot like an HTML/CSS processor and is one of the really cool new things on the block.  I’m sure you will hear a lot more about it once it’s actually finished.

This is what the plugin looks like:

Pentaho reporting plugin screenshot

I dumped the source code into Subversion over on Javaforge for the time being.

The Subversion checkout URL is http://svn.javaforge.com/svn/PentahoReportingPlugin/trunk (username: Anonymous, password: anon).
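
From the command line, that boils down to:

    svn checkout http://svn.javaforge.com/svn/PentahoReportingPlugin/trunk --username Anonymous --password anon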

If you want to join the coding effort, let me know and I’ll give you Subversion write access.   A binary version of the first plugin draft can be found over here.  Download the zip file and unzip it in the plugins/steps/ directory of your 2.4.0 Kettle distribution.  Restart Spoon and you’re set.
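
In other words, something along these lines (the exact zip file name will differ):

    cd /path/to/kettle-2.4.0
    unzip PentahoReportingPlugin.zip -d plugins/steps/
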
The easiest way to get started is by using the included “auto-start.xml” report definition.  You can simply send some rows to the plugin and it will generate a (one-page) report from them.  PDF seems to work already; the rest, I’m told, is a bit shaky.
NOTE: This is not production-grade software and a lot of functionality is still missing.  However, consider helping us make it better ;-)  That way, when the new Pentaho Reporting engine is ready, this plugin will be ready too.

Whatever the case, the recurring questions about “PDF Output”, “Excel Output”, etc. will soon all be gone.  Then you can create documents in whatever layout you want.
Until next time,

Matt