Re-Introducing UDJC

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things, we worked on the User Defined Java Class (UDJC) step. If you have a bit of Java experience, this step lets you quickly write your own plugin logic right inside a step. It is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work? Well, let’s take Roland Bouman’s example: the calculation of the date of Easter. In his blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript. OK, so what if you want this calculation to be really fast in Kettle? Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] r = getRow();
    if (r == null)
    {
        // No more input rows: signal that this step is done.
        setOutputDone();
        return false;
    }

    if (first) {
        // One-time setup: look up the field named by the YEAR parameter.
        yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
        if (yearIndex < 0) {
            throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
        }

        // Clear the calendar once so the time-of-day fields stay at zero.
        calendar = Calendar.getInstance();
        calendar.clear();

        first = false;
    }

    // Make room for the extra output field at the end of the row.
    Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
    int outputIndex = getInputRowMeta().size();

    Long year = getInputRowMeta().getInteger(r, yearIndex);
    outputRowData[outputIndex++] = easterDate(year.intValue());

    putRow(data.outputRowMeta, outputRowData);

    return true;
}

// Integer division already truncates, so no Math.floor() calls are needed.
private Date easterDate(int year) {
    int a = year % 19;
    int b = year / 100;
    int c = year % 100;
    int d = b / 4;
    int e = b % 4;
    int f = (8 + b) / 25;
    int g = (b - f + 1) / 3;
    int h = (19 * a + b - d - g + 15) % 30;
    int i = c / 4;
    int k = c % 4;
    int L = (32 + 2 * e + 2 * i - h - k) % 7;
    int m = (a + 11 * h + 22 * L) / 451;
    int n = h + L - 7 * m + 114;

    // Calendar months are zero-based, hence the "- 1".
    calendar.set(year, n / 31 - 1, n % 31 + 1);
    return calendar.getTime();
}

All you then need to do is specify a return field called “Easter” (a Date) in the Fields tab, and a parameter YEAR that names the input field containing the year.
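If you want to check the Easter arithmetic without starting Spoon, here is a minimal standalone sketch of the same calculation. It is not part of the step: the class name EasterSketch is made up for this example, and it returns a java.time.LocalDate (Java 8 or later) rather than the java.util.Date that the step produces, which keeps time zones out of the picture.

```java
import java.time.LocalDate;

// Standalone sketch of the Easter calculation used in the step above.
// Assumption: EasterSketch is a hypothetical helper for testing only, and
// LocalDate replaces the Calendar/Date pair used inside the UDJC step.
class EasterSketch {

    static LocalDate easterDate(int year) {
        int a = year % 19;
        int b = year / 100;
        int c = year % 100;
        int d = b / 4;
        int e = b % 4;
        int f = (8 + b) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4;
        int k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int n = h + l - 7 * m + 114;

        // LocalDate months are one-based, so no "- 1" here.
        return LocalDate.of(year, n / 31, n % 31 + 1);
    }

    public static void main(String[] args) {
        // Print Easter Sunday for a few sample years.
        for (int year : new int[] {2000, 2010, 2024}) {
            System.out.println(year + " -> " + easterDate(year));
        }
    }
}
```

Running it should give April 23 for 2000, April 4 for 2010 and March 31 for 2024, which is a quick way to confirm that the code pasted into the step does the right thing.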

(Screenshot of the UDJC step)

The performance on my machine (Dual Core 2 Duo 2.33GHz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version. That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those who want to give it a try. As you can see, the deployment issue of having a plugin around is completely gone: you can now do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt

Job drill down & sniff testing

Dear Kettle fans,

Besides refactoring and cleaning up code, I fortunately get to write some new code once in a while as well.

Today, I’m happy to demo job drill down and step sniff testing for you.

The first feature, job drill down, allows you to drill down into a running job entry, into sub-jobs, transformations and even mappings (sub-transformations). At every level, you’ll see the logging for that part of the root job as well as the usual metrics.

The second feature is also a lot of fun. It allows you to execute a transformation in Spoon and see the rows that are coming out of a step in real time. I called this feature a sniff test since that’s what it seems to be doing.

Hope you like these small usability features. If you want to try them out yourself, get Hudson build 1598 or later.

Until next time,

Matt