Announcing Pentaho 5.0 Community Edition

Dear Pentaho friends,

Today it’s been almost a year since we released Kettle 4.4 as part of the Pentaho 4.8.0 suite release community edition. Today we’re very happy to announce the availability of Pentaho 5.0 community edition.

Now obviously, since this is open source software which is always available though continuous integration builds, we’re talking about the availability of builds which have been sufficiently verified by Pentaho QA, community members, partners and so on to be considered a release.

All new is also the new community landing page:

On it you’ll find convenient links to the downloads, the source code, the CI builds and so on.

Thanks you all for your patience in delivering this new step forward.  In the coming weeks I’ll do a series of “New in Kettle 5.0” blog posts to inform you about what we did.  In short, these are the highlights:

  • Better previewing with dedicated window in run tab
  • Easier looping with transformation and job executor steps
  • Internal metrics gathering with Gantt chart
  • An improved plugin marketplace
  • Tons of Big Data improvements
  • Improved logging backend
  • Lots of new steps and job entries
  • A series of new plugin systems to make it easier to extend Kettle
  • Lots of new configuration options for tuning the engine back-ends
  • Integrated help system in the step and job entry dialogs

Even more important though:

  • Over 1500 bug fixes and feature requests
  • Metadata compatible with previous versions
  • API compatible with 4.x

I would like to thank and congratulate the whole community with this excellent release!

Happy downloading!

P.S. Keeping up with tradition let me share some statistics… The Kettle project, trimmed down to not include any plugins, now contains over 840,000 lines of code.  Kettle 4.4 was downloaded over 244,000 times in the last year, from SourceForge only.


Big Kettle News

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!


Streaming XML content parsing with StAX

Today, one of our community members posted a deviously simply XML format on the forum that needed to be parsed.  The format looks like this:

    <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE>

    <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE>

Typically we parse XML content with the “Get Data From XML” step which used XPath expressions to parse this content.  However, since the meaning of the XML content is determined by position instead of path, this becomes a problem.  To be specific, for each CONVERSION block you need to pick the last preceding EXPR and EXCH values.  You could solve it like this:

Unfortunately, this method requires a full parsing of your file 3 times and once extra for each additional preceding element.  The joining and all also slows things down considerably.

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue.  The solution using this step is the following:

Here’s how it works:

1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual SAX event like “start of element”, “characters”, “end of element”.  Every time you get the path, parent path, element value and so forth.  As mentioned in the doc this step is very fast and can handle files with just about any size with a minimal footprint.  It will appear in PDI version 4.2.0GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We filter out only the result lines (the end of the CONVERSION element).  What you get is the following desired output:

The usage of JavaScript in this example is not ideal but compared to the reading speed of the XML I’m sure it’s fine for most use-cases.

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also shown to work great with huge hierarchical XML structures, files of multiple GB in size.  The step was written by colleague Jens Bleuel and he documented a more complex example on his blog.

Have fun with it!


Real-time streaming data aggregation

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s lesser known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming“, “real-time“, “near real-time“, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instance it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of-course … Twitter!  Since Twitter is easily accessed it’s common for examples to pop up regarding it’s usage and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities wrt to processing streaming data.

Here’s what we want to do:

  1. Continuously read all the tweets that are being sent on Twitter.
  2. Extract all the hash-tags used
  3. Count the number of hash-tags used in a one-minute time-window
  4. Report on all the tags that are being used more than once
  5. Put the output in a browser window, continuously update every minute.

This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that Vincent from Open-BI described earlier this week on his blog in the sense that his Talend job finishes where ours will never end and where ours will do time-based aggregation in contrast to aggregation over a finite data set.

Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for “windowed” (time-based) joins and other nifty things.  We’ve seen very little demand for this sort of requirement in the past, perhaps because people don’t know it’s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at SQLStream, they can help you. SQLStream is co-founded by Pentaho’s Julian Hyde of Mondrian fame.

OK, let’s see how we can solve our little problem with Kettle instead…

1. Continuously read all the tweets that are being sent on Twitter.

For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:

Since the format of the output is never-ending and specific in nature I wrote a small “User Defined Java Class” script:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();

Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
client.getState().setCredentials(AuthScope.ANY, creds);

HttpMethod method = new PostMethod("");

// Execute request
InputStream inputStream=null;
BufferedInputStream bufferedInputStream=null;
try {
int result = client.executeMethod(method);

// the response
inputStream = method.getResponseBodyAsStream();
bufferedInputStream = new BufferedInputStream(inputStream, 1000);

StringBuffer bodyBuffer = new StringBuffer();
int opened=0;
int c;
while ( (!=-1  && !isStopped()) {
char ch = (char)c;
if (ch=='{') opened++; else if (ch=='}') opened--;
if (ch=='}' && opened==0) {
// one JSON block, pass it on!
Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
String jsonString = bodyBuffer.toString();

int startIndex = jsonString.indexOf("{");
if (startIndex<0) startIndex=0;

// System.out.print("index="+startIndex+" json="+jsonString.substring(startIndex));

r[0] = jsonString.substring(startIndex);
putRow(data.outputRowMeta, r);

} catch(Exception e) {
throw new KettleException("Unable to get tweets", e);
} finally {

return false;

As the experienced UDJC writers among you will notice: this step never ends as long as the twitter service keeps on sending more data.  Depending on the stability and popularity of twitter that can be “a very long time“.

You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.

2. Extract all the hash-tags used

First we’ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags…

3. Count the number of hash-tags used in a one-minute time-window

The counting is easy as you can simply use a “Group by”  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the “Single Threader” step which has the option to aggregate in a time-based manner so we might as well use this option:

This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn’t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.

4. Report on all the tags that are being used more than once

The filtering is done with a simple “Filter Rows” step.  However, thanks to the magic of the “Single Threader” step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It’s also interesting to note that if you have huge amounts of data, that you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.

5. Put the output in a browser window, continuously update every minute.

As shown in an earlier blog post, we can do this quite easily with a “Text File Output” step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:

The script to print the header looks like this:

var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");

The separator between each minute is simple too:

if (nr==1) {
var out = _step_.getTrans().getServletPrintWriter();

You can execute this transformation on a Carte instance (4.2.0) and see the following output:

'Real-time' twitter hashtag report, minute based

1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000


For reference, I used the following URL to start the streaming report:

http://cluster:cluster@ Stream%2FRead a twitter stream.ktr&USERNAME=MyTwitterAccount&PASSWORD=MyPassword

I placed the complete example over here in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later. Things you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting support for example gives you easy access to emerging new trends, events and memes.

For reference, take a look at this earlier blog post of mine where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.

Until next time,


Kettle data in a browser

Dear Kettle fans,

As you can tell from the Kettle JDBC driver project and also from the Talend job execution job entry (if you’re still wondering, that was NOT a joke) we announced a few weeks ago, we’re constantly looking for new and better ways to integrate Kettle into the wide world.

Today I’m blogging to spread the word about a new exciting possibility: expose Kettle data over a web service.

Here’s the situation: you have a 3rd party vendor that wants to read data from you.  However, you don’t want to spend a lot of time coding web services and what not to deliver the data in a language neutral format.  Anybody can read data from a simple web service.  You can use any programming language and it’s easy to test in a browser.

The way that it works arguably couldn’t be easier…  Let’s take a look.

We start with a simple transformation that reads some customer data.  We take the “Read customer data” example from the samples:

samples/transformations/CSV Input – Reading customer data.ktr

Next, we’re going to replace the dummy step with a “Text File Output” step (or “XML Output” if you prefer that):

Save the transformation in the same samples folder under the “Servlet Data Example.ktr” filename.  In my case the full filename is: (from the transformation settings dialog)

/home/matt/svn/kettle/trunk/samples/transformations/Servlet Data Example.ktr

Now you can use an instance of Carte that runs on your local host on port 8282 (execute 8282) to get the data in your browser.  The URL is easy to construct.

For file (XML) based transformations:


For transformations stored in a repository:


Please note that you need to replace forward slashes with %2F in your browser.  That leads to this URL in our example:

http://cluster:cluster@ Data Example.ktr

The result is can be tested with your browser:

For the script kiddies among you it is possible to get a hold of the servlet print writer in JavaScript too:

var out = _step_.getTrans().getServletPrintWriter();
out.println(“<H1>Hello, world!\n</H1>”);

Well, there you have it. I hope you like this idea.  You can try it out yourself if you download a new build of Pentaho Data Integration from our Jenkins CI build server.

Now excuse me while I put the same button in the “JSON Output” step…

Until next time,


UPDATE Apr 27: “JSON Output” now also supports the new “Servlet output” option.  Furthermore it is now also possible to pass parameters and variables through the URL.  Simply add &PARAM=value and the appropriate parameter or variable in the transformation will be set prior to execution.

Executing Talend Jobs in Kettle

Dear Kettle friends,

I know a number of you have been asking for this feature to facilitate your migration projects for a while. However, today the “Talend Job Execution” job entry has finally arrived in Kettle!

Take a look at this screen shot demonstrating how easy it is to integrate:

Executing a Talend Job in Kettle: screenshot

The way that this works is by first doing an export of your job in Talend Open Studio:

Then you can configure your new job entry:

The name of the class can be derived from the generated code in TOS “Code” tab (behind your Designer view).

I’m looking forward to your (non-hostile, non-flamebait, non-trolling as always) comments!

Until next time,


P.S. You can see this new job entry yourself in the upcoming PDI 4.2 milestone release or download it now from CI.

Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner IconDataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle.  That is because having DataCleaner available to our users was enough to push the priority of having our own data profiling tool far enough down.

Kasper worked on DataCleaner pretty much in his spare time in the past.  Now that Human Inference took over the project I was expecting more frequent updates and that’s what we got indeed.  Not only did version 2 come out recently, we also got versions 2.0.1 a few weeks back and today version 2.0.2.  All this indicates a fast-paced project.

DataCleaner was mentioned a few times in books about Pentaho software.  For example it was referenced in Pentaho Solutions as well as in Pentaho Kettle Solutions (chapter 6 – Data profiling).  This was done to allow folks that need to do a bit data profiling before they start with the data integration work, to get the job done.

So what’s happening with DataCleaner besides Kasper going all-out now that he works full time on the product? What purpose does it serve?

Let’s start with my favorite option: the “Quick Analysis” option.  You point it to a database table (or CSV file) and you let it fly.  Here’s the sort of thing it comes back with:

In essence it will give you most of what you need to know about the quality of your data before getting into the data integration work.  It’s offers a really nice and rich user interface.  In the previous screen shot you can for example click on the green arrows to display sample rows with that particular data characteristic.

Because not all profiling jobs are as easy as this one, DataCleaner has been featuring more “data integration” like features in version 2.0.  These will for example allow you to Filter certain rows based on a wide pallet of DQ oriented criteria such as dictionaries, JavaScript, Rules and much more.  The next screen shot shows the use of a filter to limit the number of analyzed rows:

Don’t expect any Kettle like drag&drop like data integration. This is specifically targeted towards on-line data quality and data profiling more specifically. However, that’s what the tool claims to be good at and it is good at that.

There’s obviously a lot more to tell about DataCleaner but I hope that this little blog post will make you at least interested and makes you want to give it a go yourself.

Since DataCleaner and Kettle are license-compatible I’ll be looking at creating a plugin to integrate DataCleaner into Spoon … once I find a bit of time to do so or if someone volunteers to jump right in.  Kasper wasn’t quite convinced it would be easy to do but not all things in life have to be easy.

You can download DataCleaner over here so download it now and make sure to let them know what you think of it.

Until next time,


Reading from MongoDB

How to read data from MongoDB using PDI 4.2

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCursor;

private Mongo m;
private DB db;
private DBCollection coll;

private int outputRowSize = 0;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
	DBCursor cur = coll.find();

	if (first) {
		outputRowSize = data.outputRowMeta.size();

	while(cur.hasNext() && !isStopped()) {
		String json =;
		Object[] row = createOutputRow(new Object[0], outputRowSize);
        	int index=0;
		row[index++] = json;

	    	// putRow will send the row on to the default output hop.
    		putRow(data.outputRowMeta, row);


    	return false;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
	try {
        	m = new Mongo("", 27017);
		db = m.getDB( "test" );
    		coll = db.getCollection("testCollection");

 		return parent.initImpl(stepMetaInterface, stepDataInterface);
	} catch(Exception e) {
	  	logError("Error connecting to MongoDB: ", e);
    		return false;

You can simply paste this code into a new UDJC step dialog. Change the parts in the init() method to server your needs. This code reads all the data from a collection in a Mongo database.  The output of this step is a set of rows contain each one JSON string. So make sure to specify one JSON String field as output of your step.  These JSON structures can be parsed with the new “JSON Input” step and then you can do whatever you want with it.

Please let us know what you think of this and whether or not you would like to see support for writing to MongoDB and/or dedicated steps for it.  I’m sorry to say I have no idea of the popularity of these new NoSQL databases.

Until next time,


UPDATE: The functionality described in this UDJC code is available in a new “MongoDB Input” step in 4.2.0-M1 or later.

UPDATE2: We also added authentication for MongoDB in PDI-6137

P.S. To install and run MongoDB on your Ubuntu 10.10 machine, do this:

sudo apt-key adv --keyserver --recv 7F0CEB10
sudo apt-get update
sudo apt-get install mongodb