The Single Threader step

Dear Kettle fans,

At the end of last year, while we were doing a lot of optimizations and testing with embedding Pentaho Data Integration in Hadoop, we came upon the brilliant idea to write a single threaded engine.

The idea back then was that since Hadoop itself was already using parallelism it might be more efficient for once to process rows of data in a single thread with minimal overhead.  This is very much like the approach that Talend has: they generate a single threaded Java application that has very little overhead for data passing.  So an engine was written, for the Java fans materialized in class SingleThreadedTransExecutor, to allow for that to happen.  The writing and testing was a lot of fun and for a single threaded engine the result is indeed very fast.  However, to make a long and tedious story short, the Pentaho Hadoop team tested the performance and found out that the regular parallel (multi-threaded) engine worked faster. (Duh!)  I guess it also has to do with the fact that if you use a single Hadoop node per server you indeed have multiple cores at your disposal.  So it might be the test-setup that plays a role as well.

Well, at that point we had an engine without a use-case which is always a bad place to end up.  So the engine risked being stuck on a one-way trip to Oblivion.

However, there is actually a use-case for the step.  Once every couple of months we get the question (from the sales-team usually, not from actual users) whether it is possible to limit the number of threads or processors used in a transformation.  Up until now the answer was “No, if you have 20 steps you’ll have 20 threads, end of story”.

The new “Single Threader” that we’re introducing and that uses the single threaded engine changes that.  The most pressing problem that this step solves is the reduction of data passing and thread context switching overhead.

Let’s take an example, a transformation with 100 steps.  To make matters worse, the dummy steps don’t do anything so all we’re measuring with this case is overhead:

Because this transformation uses over 100 threads on a 4-core system, a lot of thread context switching is taking place.  We also have over 100 row buffers and locks between the steps that lower performance.  Not by much, but as we’ll see it all adds up.
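To get a feel for where that overhead comes from, here is a minimal sketch in plain Java.  It is not Kettle code: the class StepOverheadSketch, its method names and the buffer size of 1000 rows are all assumptions made up for illustration.  It pushes the same rows through a chain of pass-through steps twice, once with a thread and a bounded row buffer per step, and once as a plain loop in a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; none of these names come from Kettle.
class StepOverheadSketch {
    // Marker object that tells each step there are no more rows.
    private static final Object END = new Object();

    // A "dummy" step: copy rows from the input buffer to the output buffer.
    private static void copyRows(BlockingQueue<Object> in, BlockingQueue<Object> out) {
        try {
            Object row;
            do {
                row = in.take();   // blocks while the buffer is empty
                out.put(row);      // blocks while the buffer is full
            } while (row != END);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // One thread per step, with a bounded row buffer between neighbours.
    static int runThreadPerStep(int steps, int rows) {
        List<BlockingQueue<Object>> buffers = new ArrayList<BlockingQueue<Object>>();
        for (int i = 0; i <= steps; i++) {
            buffers.add(new ArrayBlockingQueue<Object>(1000)); // assumed buffer size
        }
        List<Thread> threads = new ArrayList<Thread>();
        for (int i = 0; i < steps; i++) {
            final BlockingQueue<Object> in = buffers.get(i);
            final BlockingQueue<Object> out = buffers.get(i + 1);
            threads.add(new Thread(() -> copyRows(in, out)));
        }
        final BlockingQueue<Object> first = buffers.get(0);
        // A separate producer thread, so this thread can drain the last buffer.
        threads.add(new Thread(() -> {
            try {
                for (int r = 0; r < rows; r++) {
                    first.put(Integer.valueOf(r));
                }
                first.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        for (Thread t : threads) {
            t.start();
        }
        try {
            int received = 0;
            BlockingQueue<Object> last = buffers.get(steps);
            while (last.take() != END) {
                received++;
            }
            for (Thread t : threads) {
                t.join();
            }
            return received;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        }
    }

    // The same chain of pass-through steps, called in a plain loop on one thread.
    static int runSingleThreaded(int steps, int rows) {
        int received = 0;
        for (int r = 0; r < rows; r++) {
            Object row = Integer.valueOf(r);
            for (int s = 0; s < steps; s++) {
                row = passThrough(row);
            }
            if (row != null) {
                received++;
            }
        }
        return received;
    }

    private static Object passThrough(Object row) {
        return row;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int a = runThreadPerStep(100, 200000);
        long t1 = System.nanoTime();
        int b = runSingleThreaded(100, 200000);
        long t2 = System.nanoTime();
        System.out.println(a + " rows, thread per step: " + (t1 - t0) / 1000000 + " ms");
        System.out.println(b + " rows, single thread:   " + (t2 - t1) / 1000000 + " ms");
    }
}
```

The timings this prints depend heavily on the machine and the JVM, so treat them as an indication only.  The point is that in the first variant every row is handed over a lock-protected buffer once per step, while in the second it is just a method call.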

OK, now let’s put the 100 dummy steps in a sub-transformation:

For this we use 1 extra step, an Injector step that will accept the rows from this parent transformation:

Please note that we can execute the “Single Threader” step in multiple copies.  On my test-computer I have 4 cores so I can run in 4 different threads.  In the “Single Threader” step we can specify the sub-transformation we defined above as well as the number of rows we’ll pass through at once:
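The combination of “multiple copies” and “a number of rows at once” can be sketched roughly as follows in plain Java.  This is not how the step is implemented: BatchedCopiesSketch, runBatch and process are hypothetical names, a thread pool stands in for the step copies, and a list of functions stands in for the sub-transformation.  Each batch of rows runs through all the steps on one thread, and at most `copies` batches run at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; this is not the Single Threader implementation.
class BatchedCopiesSketch {

    // Runs one batch of rows through every step in order, all on the calling thread.
    static List<Object[]> runBatch(List<UnaryOperator<Object[]>> steps, List<Object[]> batch) {
        List<Object[]> out = new ArrayList<Object[]>(batch.size());
        for (Object[] row : batch) {
            Object[] current = row;
            for (UnaryOperator<Object[]> step : steps) {
                current = step.apply(current);
            }
            out.add(current);
        }
        return out;
    }

    // Splits the rows into batches and processes them on a pool of 'copies' threads.
    // Returns the number of rows that came out at the other end.
    static int process(List<Object[]> rows, List<UnaryOperator<Object[]>> steps,
                       int copies, int batchSize) {
        if (copies < 1 || batchSize < 1) {
            throw new IllegalArgumentException("copies and batchSize must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        try {
            List<Future<List<Object[]>>> pending = new ArrayList<Future<List<Object[]>>>();
            for (int start = 0; start < rows.size(); start += batchSize) {
                int end = Math.min(start + batchSize, rows.size());
                final List<Object[]> batch = rows.subList(start, end);
                Callable<List<Object[]>> task = () -> runBatch(steps, batch);
                pending.add(pool.submit(task));
            }
            int done = 0;
            for (Future<List<Object[]>> result : pending) {
                done += result.get().size();
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<Object[]>();
        for (int i = 0; i < 1000000; i++) {
            rows.add(new Object[] { Integer.valueOf(i) });
        }
        // 100 steps that do nothing, like the dummy steps in the example.
        List<UnaryOperator<Object[]>> steps = new ArrayList<UnaryOperator<Object[]>>();
        for (int s = 0; s < 100; s++) {
            steps.add(UnaryOperator.<Object[]>identity());
        }
        System.out.println(process(rows, steps, 4, 1000) + " rows processed");
    }
}
```

With 4 copies and 100 steps there are 4 worker threads instead of 100, and rows only cross a thread boundary once per batch instead of once per step.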

When we then look at the performance of both solutions we find that our original transformation runs in 105 seconds on my system.  The new solution completes the task in about 55 seconds: almost half the time, or almost twice as fast (105 / 55 ≈ 1.9).

Since this behaves very much like a Mapping or sub-transformation you can also use it as a way to execute re-usable logic. As an additional advantage it makes complex transformations perhaps a bit less cluttered.

Well, there you have it: another option to tune the performance of your transformations.  You can find this feature in new downloads from our Jenkins CI build server or later in 4.2.0-RC1.

Until next time,

Matt

Reading from MongoDB

How to read data from MongoDB using PDI 4.2

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCursor;

private Mongo m;
private DB db;
private DBCollection coll;

private int outputRowSize = 0;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
	if (first) {
		first = false;
		outputRowSize = data.outputRowMeta.size();
	}

	// Read every document in the collection and emit it as one JSON string per row.
	DBCursor cur = coll.find();
	try {
		while (cur.hasNext() && !isStopped()) {
			String json = cur.next().toString();
			Object[] row = createOutputRow(new Object[0], outputRowSize);
			row[0] = json;

			// putRow will send the row on to the default output hop.
			putRow(data.outputRowMeta, row);
		}
	} finally {
		// Release the cursor, also when the step is stopped early.
		cur.close();
	}

	// Everything was read in this single call, so signal the end of the output.
	setOutputDone();

	return false;
}

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
	try {
		// Adjust host, port, database and collection name to your environment.
		m = new Mongo("127.0.0.1", 27017);
		db = m.getDB("test");
		coll = db.getCollection("testCollection");

		return parent.initImpl(stepMetaInterface, stepDataInterface);
	} catch (Exception e) {
		logError("Error connecting to MongoDB: ", e);
		return false;
	}
}

You can simply paste this code into a new UDJC step dialog.  Change the parts in the init() method to serve your needs.  This code reads all the data from a collection in a Mongo database.  The output of this step is a set of rows, each containing one JSON string.  So make sure to specify one JSON String field as output of your step.  These JSON structures can be parsed with the new “JSON Input” step and then you can do whatever you want with them.

Please let us know what you think of this and whether or not you would like to see support for writing to MongoDB and/or dedicated steps for it.  I’m sorry to say I have no idea of the popularity of these new NoSQL databases.

Until next time,

Matt

UPDATE: The functionality described in this UDJC code is available in a new “MongoDB Input” step in 4.2.0-M1 or later.

UPDATE2: We also added authentication for MongoDB in PDI-6137

P.S. To install and run MongoDB on your Ubuntu 10.10 machine, do this:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
sudo apt-get update
sudo apt-get install mongodb

My new netbook…

Dear Linux fans,

Last weekend I saw an ad for a netbook in a Carrefour superstore leaflet that I guess was just too good to refuse.

Unlike other netbooks, this one was priced really low: €199.00 (including taxes, which makes it cost my company €164.46 or about 200 USD).  For me, that’s the price point where a netbook makes sense, not the €400-500 you see all over the place.

Now, for that low price, you get the following machine:

  • 1.6GHz VIA C7-M CPU
  • 512MB RAM (DDR2 667, shared with video, 384 available)
  • 120GB hard disk (2.5″, 7200rpm)
  • 1024×600 LCD screen (pretty good quality actually)
  • Webcam
  • WIFI b/g
  • 2xUSB 2.0
  • VGA port
  • a multi-format card reader (SD, SDHC, MMC)
  • Microphone
  • Sound in/out
  • Mandriva Linux 2009.1

It was very interesting to see that “Windows 2007 Home Premium” was offered at exactly the same price.  Talk about a total waste of money on the Microsoft side.

OK, back to the netbook.  The memory issue is not a problem.  I already ordered a 2GB DDR2 RAM module for the machine at €39.

UPDATE 10/27 : the RAM arrived, was installed in 5 minutes and all works fine now.  With 1.9GB available the machine is a lot snappier too.

Performance is obviously not stellar but I didn’t expect that either.  I paid less for it than for my current cell phone.  However, it plays full screen AVI without a glitch.

The only real problem the box has is that it comes with … Mandriva Linux.  Maybe I’m spoiled by years of Ubuntu use, but this distribution really sucks.  Can I please just install some software, customize the UI a bit?  Please?  I don’t recall the last time I couldn’t install a piece of software on Ubuntu because a package couldn’t be downloaded.  WTF?  And charge €28 just to get a couple of codecs to play audio/video? I can legally use these drivers in Europe without a problem.

Don’t get me wrong, all hardware is supported and works fine, including audio, the webcam, skype, flash, etc.

Anyway, I tried to put Ubuntu Netbook Remix 9.04 on it by booting from a USB stick.  Unfortunately, either the image or the stick has an issue since it freezes upon installer boot.  The live system boots but has a nasty video problem.  So I’m going to retry later next week.  Heck, maybe it’s better to just wait until Kubuntu 9.10 Netbook Remix comes out next week.

Feel free to leave advice on what distro to pick and how to best handle the install.  Also feel free to leave tips on how to explain to the kids that this is not a toy.

Thanks in advance!

Cheers,

Matt

The kindness of strangers

Dear Kettle fans,

There isn’t a week that goes by where I don’t find myself amazed by the number of contributions and help that the Pentaho Data Integration project receives in all kinds of forms.  There are people contributing anything from small patches to complete steps, folks helping out others on the forum, writing documentation, writing books, translating PDI, etc.  Without any question, this has been a truly amazing experience, not just for me but for the whole Kettle project.

It’s because of that overwhelmingly positive experience that I’ve always tried to be accessible and in contact with my community in all sorts of possible ways.  And because of that positive vibe I have refrained from commenting on the negative flip side to that story for the longest time.

The problem is really that lately things have been changing.  It’s probably caused in general by an increasing attention to open source and specifically by an increase in popularity of Kettle.  In any case, certain types of people do the following:

  • Send me personal email
  • IM me on skype/Yahoo!/MSN/AIM/…
  • Send me all sorts of messages and questions through the forums
  • Ask questions on this blog

Usually it’s a combination of any of the above.  Any time now I expect folks to be sending me direct twitter messages.  The questions are always the same:

I have an urgent Pentaho porblem.  I am incapable of using the forum for some stupid reason and so you have to help me, preferable now or within the next 15 minutes!!!!

This way, the meaning of “The kindness of strangers” becomes more and more like the one from the Nick Cave song.

I’ve just finished reading Linus’ book “Just for fun” (Thanks again Domingo!) and his approach to the problem of staying in reach for people to contribute code and at the same time allowing yourself to have a life and a job is simple: if it ain’t fun, don’t do it.  Well, the barrage of this sort of question stopped being fun for me a long time ago.

As such, I’m going to try this approach: any question that could or should be asked on the forum is from now on silently ignored and deleted from my mailbox.  Any person that is not part of my “community” and that needlessly contacts me over IM gets blocked indefinitely.  And yes, that goes for twitter as well.  Off-topic questions on this blog go to the spam folder as well.  I will simply refuse to spend time on non-interesting topics.

I thought about creating a standard response e-mail, but any sort of reply is simply an encouragement to certain types of people and will only make matters worse (been there, done that).

I’m sure everyone understands that this is the only way to free up time to work on the real problems at hand.  Thank you for your understanding in any case.

Until next time,

Matt

Google Goodies and Lego

Dear Kettle friends,

Will Gorman and Mike D’Amour, Senior Developers at Pentaho, are presenting Pentaho’s Google integration work at the Google I/O Developer Conference (in the Sandbox area, to be specific).  Yesterday, Pentaho announced that much.

Here are a few of the integration points:

  • Google maps dashboard (available in the Pentaho BI server you can download)
  • A new Google Docs step was created for Pentaho Data Integration Enterprise Edition
  • Running the Pentaho BI server on Android (AVI, 30MB)
  • A new Google Analytics step was created for Pentaho Data Integration Enterprise Edition
  • Since version 2.0, the Pentaho BI server depends heavily on Google Web Toolkit (GWT)

To top that off, Will twittered about this new Lego bar-chart + logo they created for the conference:

UPDATE: now with building instructions and action video!

We are all soooo proud of them!

Until next time,

Matt

Canonical: take my money

Dear Canonical,

You claim that there is little money in the desktop software business and more in services.  Well here is something I would pay money for:

Take the top-selling business laptops from Dell, Acer, HP and Lenovo and offer customized distributions for them.

I would pay for that in an instant.  All too often people confuse open source with free of charge.  I’m perfectly capable of making that distinction.  In fact, I use my machines for my work and don’t want to spend days configuring all the devices on them.  As such, I would pay something like 50 USD for a customized (K)Ubuntu or perhaps 150-200 USD if it came with some sort of (e-mail) support contract for a year.

I don’t use Linux / Ubuntu because it costs less, I use it because I prefer it over Windows to do my job.  I would pay that kind of money because I would save time and money in the long run.

Until the major hardware vendors offer decent (worldwide) support for Linux on their machines (out of the box that is), I think this is an idea with potential and I hope at least someone picks it up.  Go ahead, let me spend money on it!

Until next time,
Matt

25 Years GNU

Just as I was reading more news from nowhere, I came across this great Stephen Fry video over at gnu.org.

After 25 years of GNU, their principles are as relevant as ever, still untainted and fresh.  My congratulations go out to Richard Stallman and the whole team behind the Free Software Foundation.  Thanks for hanging in there for so long.  It seems the more people ridicule Richard, the stronger his ideas become in the world.  Keep going man, don’t let the trolls get you.

I bargained for salvation and they gave me a lethal dose
I offered up my innocence and got repaid with scorn
“Come in” she said
“I’ll give you shelter from the storm”.

Bob Dylan, just to stay even remotely in the right time-frame. 🙂

Until next time,
Matt