Database partitioning

We’ve been experimenting lately with database partitioning (in version 2.3.2-dev, make sure to update your kettle.jar to the latest snapshot). In our context, database partitioning means that we divide our data over a cluster of several databases.

A typical way of doing this is to divide the customer_id by the number of hosts in the cluster and take the remainder. If the remainder is 0, you store the data on the first host in the cluster; if it is 1, on the second; if 2, on the third; and so on.
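As a sketch, the rule above is just a remainder calculation. (This is a hypothetical helper for illustration, not actual Kettle code.)

```java
// Sketch of remainder-based (modulo) partitioning: the partition index
// is the remainder of the key divided by the number of hosts.
public class ModuloPartitioner {

    /** Returns the zero-based index of the host that should store this row. */
    public static int partitionFor(long customerId, int hostCount) {
        return (int) (customerId % hostCount);
    }

    public static void main(String[] args) {
        int hosts = 3;
        for (long customerId : new long[] { 100, 101, 102, 103 }) {
            System.out.println("customer " + customerId
                    + " -> host " + partitionFor(customerId, hosts));
        }
    }
}
```

With 3 hosts, customers 100, 101 and 102 land on hosts 1, 2 and 0 respectively, so the load spreads evenly as long as the IDs do.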

This sort of thing is what we’ve been implementing in Kettle over the last couple of weeks. The reasoning is simple: if one database is not up to the task, split the load over 2, 5 or 10 databases on any number of hosts. (Now imagine all the PCs at work running an in-memory database 🙂 )
Besides small changes to the Kettle transformation engine, we also added the meta-data that lets you specify how the partitioning should work. Let’s take a look at a sample transformation:

Database Partitioning Sample

As you can see from the image above, we have added some text to the top of the partitioned steps. “P” indicates that the step is partitioned; “x5” is the number of step copies that will be launched, one for each partition. The fieldname that we partition on is shown between brackets, and finally the partitioning schema name is shown.

Let’s take a look at what the transformation does:

  1. It performs a “SELECT * FROM T2” on a cluster. In this case that means that the same statement is executed 5 times on 5 different databases/hosts.
  2. It adds a sequence “new_id” in a partitioned way: using 5 different steps.
  3. We re-partition the data and store it on a different cluster with 3 hosts, called “Cluster with 3”.
  4. We store the data in 3 text files using the new internal variable ${Internal.Step.Partition.ID}
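To illustrate step 4: with 3 partitions, substituting the partition ID into a filename template yields one output file per partition. The variable name is the real one from the transformation above; the expansion helper below is a hypothetical sketch, not Kettle internals.

```java
// Sketch: expanding ${Internal.Step.Partition.ID} in a filename template.
// One step copy runs per partition, each producing its own file.
public class PartitionFilenames {

    /** Replaces the partition-ID variable in a filename template. */
    public static String expand(String template, String partitionId) {
        return template.replace("${Internal.Step.Partition.ID}", partitionId);
    }

    public static void main(String[] args) {
        String template = "output-${Internal.Step.Partition.ID}.txt";
        // Hypothetical partition IDs, as defined in a partitioning schema:
        for (String id : new String[] { "P0", "P1", "P2" }) {
            System.out.println(expand(template, id));
        }
    }
}
```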

Now, how do we enter the meta-data for this?

Well, to start off, we added a new tab to the database dialog:

Reference database connection

As you can see, not much has changed: the information entered in the usual place (hostname/database/port) is our reference database. This is the database you explore and reference in general. However, all DDL is now executed against all the hosts in the cluster, as specified on the clustering tab:

Database clustering

As you can see, we can add the hosts that participate in the cluster as well as the database name and port. Note that I don’t have a lot of machines to play with, so I’m just using 5 different databases on the same host.

What we need to do then is create a partitioning schema for the transformation. You can enter the partitioning schema in the transformation settings dialog (CTRL-T):

Schema editor

The “Get partitions” button allows you to copy the partition IDs you specified in one of the partitioned database connections.

With this information we can now configure the steps we want to partition. Right-click on the step in the graphical view:

Partitioning step menu

This will allow you to select the partitioning method:

Partitioning method

As you can see, only rest-of-division is implemented at the moment, but in the future (with your help) we could add any kind of partitioning, including mirroring, etc. For the rest-of-division partitioning method we need to specify the fieldname to get the ID from:

Partitioning field

Finally, we need to specify the partitioning schema: this allows the step to “map” onto the correct partitions. Also, when the schema of the previous step differs from that of the next step, we need to “re-partition”: we calculate a new partition for each row instead of sticking with the old one.
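The re-partitioning case can be sketched as follows: the same key field is run through the remainder rule again, but against the partition count of the new schema. (Hypothetical helper for illustration, not Kettle code; the 5-to-3 sizes match the clusters in the sample transformation.)

```java
// Sketch of re-partitioning: when the next step uses a different
// partitioning schema, the partition index is recomputed from the key
// field rather than reused from the previous step.
public class Repartitioner {

    public static int partitionFor(long key, int partitionCount) {
        return (int) (key % partitionCount);
    }

    public static void main(String[] args) {
        long customerId = 13;
        int upstream = partitionFor(customerId, 5);   // schema with 5 partitions
        int downstream = partitionFor(customerId, 3); // schema with 3 partitions
        System.out.println("upstream partition:   " + upstream);
        System.out.println("downstream partition: " + downstream);
    }
}
```

For customer 13 the upstream schema yields partition 3 while the downstream schema yields partition 1, which is why the old partition index cannot simply be carried over.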

Partition schema selection

That’s all there is to it to enable database partitioning in Kettle.

We are currently experimenting with this new engine and would very much like to get feedback and test results. Let us know what you think and whether you can see other uses for these new features. Nothing is definite yet and there is plenty of room for improvement.

Until next time,


Just say no…

I just came across this interesting article on Software Development Times.  In it, the author (Allen Holub) launches a full-frontal attack on XML.  I find the article interesting because I agree with most, if not all, of what he says…

“Surely”, you say, “that can’t be right, you use XML all over the place in Kettle?”

Yes, absolutely, but I never expose this XML to users or developers of Kettle unless they really want to see what’s going on.

The article is also interesting because at Pentaho we have been trying as hard as possible to limit the exposure of XML to our users (end-users and developers alike).  Our solution is not to create a new programming language as Mr. Holub suggests, but to create all kinds of wizards and rich GUIs that translate the XML into a human-readable format.

We even created a button for it:

No XML button
Until next time,

P.S. Just to be clear about this: in situations where the XML is strictly limited to what it was designed to do, namely being an interface, it really is the best format to use.  Support for internationalisation alone warrants its use.