Clustering/clouds made easy

Dear Kettle fans,

The last main item on my agenda before we could put out a release candidate of version 3.2 was the inclusion of a number of features to make dynamic clustering easier.

It was already possible to make things happen, but thanks to Sven Boden’s parameters we can now take all of that up a level.  Let me explain what we did with a small, simple example…

The “Slaved” step takes the place of one or more steps that you would like to see run clustered (optionally partitioned) on a number of machines.

So let’s say this transformation is part of a job…

We want to have this run on Amazon EC2.  So I created an AMI just for you to test with:

IMAGE   ami-f63ed99f    kettle32/PDI32_CARTE_CLUSTER_V4.manifest.xml    948932434137    available       public          i386    machine
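If you want to double-check that the image is visible to your account, the listing above can be reproduced with the standard EC2 API tools; something along these lines should do it:

ec2-describe-images ami-f63ed99f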

The input of that AMI is a piece of XML that configures the Carte instance that is started automatically when the image boots:

<slave_config>
  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>  <!-- OPTIONAL -->
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>

This file, let’s call it carte-master.xml, is passed in when we run our instance:

ec2-run-instances -f carte-master.xml -k <your-key-pair> ami-f63ed99f
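As an aside, outside of EC2 you would start Carte with such a configuration file yourself.  Assuming you launch it from the PDI directory with the file sitting next to it, that would look roughly like this:

sh carte.sh carte-master.xml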

When it has booted, we take the internal Amazon EC2 IP address of this server and pass it into a second document, let’s call it carte-slave.xml:

<slave_config>
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>Internal IP address</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>

  <report_to_masters>Y</report_to_masters>

  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>  <!-- OPTIONAL -->
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>
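If you script these steps, one way of looking up the master’s internal address and dropping it into the slave document might be the following sketch.  The instance ID, the template file name and the awk field are placeholders that depend on your EC2 API tools version:

# grab the private DNS name of the master instance (here field 5 of the INSTANCE line)
MASTER_IP=$(ec2-describe-instances i-xxxxxxxx | awk '/^INSTANCE/ { print $5 }')
# replace the "Internal IP address" placeholder in a template copy of the slave document
sed "s/Internal IP address/$MASTER_IP/" carte-slave-template.xml > carte-slave.xml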

Then we fire up 5 slaves with that configuration…

ec2-run-instances -f carte-slave.xml -k <your-key-pair> ami-f63ed99f -n 5

These 5 slaves will report to the master and tell it where they can be reached.  So all we need to do in our PDI job/transformation is create a master slave configuration:

To top it off, we define MASTER_HOST and MASTER_PORT as parameters in the job and transformation…

So all that’s left to do is specify these parameters when you execute the job…
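If you drive the job from the command line rather than from the Spoon dialog, the same parameters can be passed to Kitchen with the -param option.  A sketch, where the job file name and hostname are just placeholders:

sh kitchen.sh -file='/path/to/Cluster job.kjb' \
  -param:MASTER_HOST=ec2-xx-xx-xx-xx.compute-1.amazonaws.com \
  -param:MASTER_PORT=8080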

As you can see from the dialog, we pass the complete job (including sub-jobs and sub-transformations) over to the “Cluster Master” slave server prior to execution, because it is neither possible nor needed for Spoon to contact the various slave servers directly: they report to the master with their internal IP addresses.  We wouldn’t want it any other way, since the internal network offers the best performance (and costs less).

These goodies are soon to be had in a 3.2.0-RC1 near you…

Until next time,
Matt

Resource exporter

Dear Kettle fans,

One of the things that’s been on my TODO list for a while was the creation of a resource exporter.

Resource exporter?

It’s called “Resource exporter” and not “Job exporter” or “Transformation exporter” because it is intended to export more than just a single job or transformation.  It exports all linked resources of a job or transformation.

That means that if you have a job with 5 transformation job entries, you will be exporting 6 resources (1 job and 5 transformations).  If those transformations use 3 sub-transformations (mappings), you will export 9 resources in total.

The whole idea behind this exercise is to be able to create a package (for example to send to someone) that has all needed resources contained in a single zip file.

Let’s look at an example!  We have a job to load/update a complete data warehouse.  It loads source files and updates dimensions and a fact table: 31 transformations and jobs in total.

The top level job we want to export is “Load data warehouse.kjb” (it can live in a repository too!).  Thanks to the very recently added “export” option in Kitchen, we can run this:

sh kitchen.sh -file='/parking/TDWI/PDI/Load data warehouse.kjb' -export=/tmp/foo.zip

This generates the file “/tmp/foo.zip” that contains all the used resources.  Please note you can also do this in Spoon under the “File” menu.

What about job and transformation filenames?

If you look in the ZIP file with “unzip -l” you will notice entries like this one:

    33107  03-10-09 18:31   Update_Customer_Dimension_023.ktr
Originating file : /parking/TDWI/PDI/Update Customer Dimension.ktr (file:///parking/TDWI/PDI/Update Customer Dimension.ktr)

This resource gets called in the “Update dimensions” job, so let’s look inside of the generated XML to see how this is solved.  We see that the <filename> entries have been replaced by the correct link:

<filename>${Internal.Job.Filename.Directory}/Update_Customer_Dimension_023.ktr</filename>

This is interesting, because the originating transformation could have been located anywhere.  Once it’s exported to the zip file, it’s referenced with a relative path (using PDI internal variables).  That in turn means you can locate the zip file anywhere, even on a remote web server and it would still be executable.  In fact, Kitchen gives us advice on how to run the “Load data warehouse” job in the ZIP file:

This resource can be executed inside the exported ZIP file without extraction.
You can do this by executing the following command:                           

sh kitchen.sh -file='zip:file:///tmp/foo.zip!Load_data_warehouse_001.kjb'

What about input file names?

Obviously, you can’t go about zipping input files that can sometimes be quite large.  So we opted to create a set of named parameters that you can use to define the location of the input files.

In our example, we have a set of files read with “CSV Input” and “Text File Input” steps that are located in 2 folders:  “/parking/TDWI/” and “/parking/TDWI/Source Data”.  During the export, the step metadata will be changed to read:

${DATA_PATH_x}/<filename>

In this specific case we then create 2 parameters in the job, sub-jobs and sub-transformations:

DATA_PATH_1, default=/parking/TDWI/Source Data
DATA_PATH_2, default=/parking/TDWI

These named parameters can then be used during execution with Kitchen.  If you send the “foo.zip” file to someone else along with the data in “/bar” and “/bar/Source Data”, you can execute the job as follows:

sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' \
  -param:DATA_PATH_1="/bar/Source Data" \
  -param:DATA_PATH_2="/bar"

The subject of named parameters is worthy of a complete article all by itself.  It’s the brainchild of Kettle star Sven Boden.  It would take us too far to explain all the details here, but you can see which parameters are defined for the job like this:

sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' -listparam

Parameter: DATA_PATH_1=, default=/parking/TDWI/Source Data : Data file path discovered during export
Parameter: DATA_PATH_2=, default=/parking/TDWI : Data file path discovered during export            

Because the default values are set, you can in fact test the job before you send it over.
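In other words, a quick smoke test with no -param options at all should run against the original locations, since the defaults listed above kick in (same zip and entry name as before):

sh kitchen.sh -file='zip:file:///tmp/foo.zip!Load_data_warehouse_001.kjb'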

What’s next?

Next on the agenda (after the 3.2 release) is to make this functionality available in the execution dialog so that we can more easily do remote execution.  Another interesting option is to store the generated zip files in a folder or even in a database, so that we can always see exactly what was executed at any given time.

Until next time,
Matt

Pentaho Partner Summit ’09

Dear reader,

In a little over 3 weeks, on April 2nd and 3rd, we’re organizing a Pentaho Partner Summit at the Quadrus Conference Center in Menlo Park, near San Francisco.

If you are (as the invitation describes) an “Executive, luminary, current or prospective partner from around the world” and you come over, you’ll meet me, Julian Hyde and perhaps a couple of other architects as well.  That’s in addition to a host of other interesting people like Zack Urlocker (MySQL) and of course Richard Daley, our CEO.  We’ll be doing a couple of lengthy sessions on Kettle and Mondrian, among other things.

See you there!

Matt