Digging Mondrian

On Friday I committed code to the 3.0 trunk that allows people to execute an MDX query on a Mondrian server and get the result back in a tabular format. This particular code to “flatten” an OLAP cube into rows was written by Julian Hyde, the lead developer and founder of Mondrian OLAP, a.k.a. Pentaho Analysis.

If you run the Pentaho demo on your box and then look at the Analysis sample, you will see something like this:

JPivot sample

Suppose you wanted to get this exact data to work with, to create analytical data sets, exports, … Well, now you have the option of doing that in Kettle:

Mondrian input

What you do is create a database connection to the database that Mondrian reads from and hand it the location of the Mondrian schema (the catalog). You then click the MDX button in JPivot and copy the MDX query into the “Mondrian Input” step in Kettle. That’s all it takes.
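Roughly speaking, this boils down to Mondrian’s Java API executing the query and the step flattening the result into rows. Here is a sketch of that call sequence; the connect string is illustrative, the MDX is the usual FoodMart sample query, and the exact Mondrian signatures vary by version:

```java
import java.io.PrintWriter;

import mondrian.olap.Connection;
import mondrian.olap.DriverManager;
import mondrian.olap.Query;
import mondrian.olap.Result;

public class MondrianQuerySketch {
    public static void main(String[] args) {
        // Illustrative connect string: the JDBC connection Mondrian reads from,
        // plus the location of the Mondrian schema (catalog).
        String connectString = "Provider=mondrian;"
            + "Jdbc=jdbc:mysql://localhost/foodmart;"     // assumed local FoodMart database
            + "JdbcDrivers=com.mysql.jdbc.Driver;"
            + "Catalog=file:FoodMart.xml;";               // assumed schema location

        // The kind of MDX you would copy out of JPivot:
        String mdx = "select {[Measures].[Unit Sales], [Measures].[Store Cost],"
            + " [Measures].[Store Sales]} ON COLUMNS,"
            + " {([Promotion Media].[All Media], [Product].[All Products])} ON ROWS"
            + " from [Sales] where [Time].[1997]";

        Connection connection = DriverManager.getConnection(connectString, null);
        Query query = connection.parseQuery(mdx);
        Result result = connection.execute(query);
        result.print(new PrintWriter(System.out, true)); // the step flattens this into rows
        connection.close();
    }
}
```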

You can then preview the output of the Dummy step, as shown in this screenshot:

Mondrian input step preview results

As you can see, integration of the different Pentaho software pieces is continuing…

Until next time!

Matt

Being lazy

Dear Kettle fan,
Since our code is open, we have to be honest: in the past, the performance of Kettle was less than stellar in the “Text File” department. It’s true that we did offer some workarounds with respect to database loading, but there are cases where people don’t want to touch any database at all. Let’s take a closer look at that specific problem…

Reading and writing text files…
Let’s take a look at this delimited (CSV) file (28MB). Unzipped, the file is around 89MB in size.

Suppose you read this file using version 2.5.1 (soon to be out) with a single “Text File Input” step. On my machine, that process consumes about as much CPU power as it can get and takes around 57 seconds to complete (roughly 1M rows/minute, or 60M rows/hour).

When we analyze what’s eating the CPU resources during that minute, there are a number of bottlenecks: metadata object allocation and garbage collection, byte[]-to-String conversion (by Java), and data type conversion (String to Date, String to Integer, etc.).

Now suppose we write that data back to another, mostly identical file. On my machine, that transformation takes around 80 seconds. Again, it’s String-to-byte[] conversion that’s eating CPU, as well as the normal data type conversions taking place (Date to String, Integer to String, etc.).

So we made some improvements in the 3.0 version…

First of all, the metadata object allocation issue was solved by separating the data from the metadata. Before, we saw the JVM spending 40% or more of its time doing garbage collection. That performance hit is gone, and this has led to great improvements all around.
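To make that concrete, here is a deliberately simplified sketch of the before/after row design; the class names are hypothetical, not the actual Kettle classes:

```java
// Deliberately simplified sketch; hypothetical names, not the actual Kettle classes.

// 2.5-style: every field of every row carries its own metadata, so reading a
// million rows allocates (and garbage-collects) millions of short-lived objects.
class OldStyleValue {
    String name;   // metadata, duplicated for every single row
    int    type;   // metadata, duplicated for every single row
    Object data;   // the actual value
}

// 3.0-style: one metadata object is shared by the entire stream of rows...
class RowMetaSketch {
    String[] fieldNames;  // allocated once per stream
    int[]    fieldTypes;  // allocated once per stream
}

// ...and a row itself is nothing more than a bare array of values:
// Object[] row = { "PRODUCT-1", new java.util.Date(), Long.valueOf(123) };
```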

Then, during 3.0 development, we came up with a new way of handling the specific Textfile-to-Textfile problem. In a lot of cases we’re doing the same conversions twice (see the Java sketch after this list):

  • byte[] (in a file) to Java String (UTF-8), then Java String (UTF-8) back to byte[]
  • String to Date using a certain mask, then Date to String
  • String to Integer using a certain mask, then Integer to String again
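In Java terms, that round trip looks something like this (a minimal sketch; the date mask is an assumed example):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class RoundTripSketch {
    public static void main(String[] args) throws Exception {
        byte[] rawBytes = "2007/07/20".getBytes("UTF-8"); // the bytes as read from the file

        // Reading: byte[] -> String -> Date (using an assumed date mask)
        String field = new String(rawBytes, "UTF-8");
        Date date = new SimpleDateFormat("yyyy/MM/dd").parse(field);

        // Writing the same layout back out: Date -> String -> byte[]
        String out = new SimpleDateFormat("yyyy/MM/dd").format(date);
        byte[] outBytes = out.getBytes("UTF-8"); // identical to rawBytes: all that work for nothing

        System.out.println(new String(outBytes, "UTF-8"));
    }
}
```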

There is a symmetry to be found here. As it turns out, UTF-8 (Unicode) encoding is very expensive, CPU-wise. Unfortunately, it’s also an important cornerstone of Java and something we can’t really do without: it allows us to read files from countries around the globe in all sorts of languages and codepages. Throwing this out is not an option.

However… lazy bums that we are, we did come up with the concept of “Lazy Conversion”. Lazy conversion delays the conversion of data as long as possible, in the hope that it never has to occur at all. Obviously, in the case where you read from a text file and write back to another one in the same format, conversion never occurs: we read bytes and dump them to file again without ever looking at the actual content (unless we have to).
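A minimal sketch of the idea, with a hypothetical class rather than the actual Kettle implementation: keep the raw bytes around and only pay for a conversion the first time somebody actually asks for the typed value.

```java
// Hypothetical sketch of lazy conversion; the real Kettle classes differ.
class LazyValue {
    private final byte[] raw;      // the bytes exactly as they were read from the file
    private final String encoding; // e.g. "UTF-8"
    private String decoded;        // cached result of the first (and only) conversion

    LazyValue(byte[] raw, String encoding) {
        this.raw = raw;
        this.encoding = encoding;
    }

    // Writing to a file in the same layout: hand back the raw bytes, zero conversions.
    byte[] getBinary() {
        return raw;
    }

    // Only if a step really needs the value do we pay the decode cost, and only once.
    String getString() throws java.io.UnsupportedEncodingException {
        if (decoded == null) {
            decoded = new String(raw, encoding);
        }
        return decoded;
    }
}
```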

Let’s see how the 3.0 engine handles this.

  • When we run the exact same transformation we ran in version 2.5.1 in version 3.0, it takes 39 seconds (about half the time).
  • When we use the new CSV Input step and turn on Lazy Conversion, it takes about 10 seconds (8 times faster: 6M rows/minute or 360M rows/hour). The transformation used to create the image below can be found here: lazy-conversion-3.ktr

Lazy conversion sample

  • If we simply read the file, without writing it back, the CSV Input step is completely I/O bound. Since the file fits in the cache of my system, the transformation completes in about 2 seconds (reading is obviously faster than writing).

Fast, faster, fastest...

Conclusion

Our “lazy conversion” algorithms show great promise in those situations where a lot of data goes through the transformation untouched. That usually includes sorts, aggregates on certain keys, etc. Because the data stays in its raw binary format all the time, serialization of rows (during sorts, to temporary files, over sockets, etc.) is also a lot faster.
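For instance, writing a value that is still in raw binary form to a temporary file or a socket reduces to a length-prefixed byte copy. This is a hypothetical sketch, not the actual Kettle serialization code:

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: serializing a value that is still in raw binary form
// is just a length prefix plus a byte copy; there is nothing to decode or format.
class LazySerializeSketch {
    static void writeLazy(DataOutputStream out, byte[] raw) throws IOException {
        out.writeInt(raw.length); // length prefix
        out.write(raw);           // the bytes go straight through
    }
}
```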

For the data warehouse architects and Kettle users out there, it’s another option to help you process your data as fast as possible.

Until next time,

Matt

Clustering & partitioning

Let’s have a quick look at some clustering examples in the new 3.0 engine:

Clustering combined with multiple copies of a step

This example runs all the steps in the transformation in clustered mode. That means there are 4 slave transformations running in parallel with each other.
The interesting part is, first of all, that the “Fixed Input” step runs in parallel, each copy reading a certain part of the file.
The second thing to mention is that we now allow you to run multiple copies of a single step on a cluster. In this example we run 3 copies of the sort step per slave transformation, so in total there are 12 copies of the sort step in action in parallel.

IMPORTANT: this is a test transformation; a real-world sorting exercise would also include a “Sorted Merge” step to keep the data sorted. I was too lazy to redo the screenshots though 🙂

The clustered transformations now also support logging to databases:

Slave transformations logging to database

As you can see, we added the name of the transformation to the dialog that shows the log of all the executed slave transformations.
This feature makes it easier to see the execution log on the remote server, after the run.

It is now also possible to use the execution stats in a JavaScript job entry (previously called Evaluation):

Evaluating runtimes

As such, you can validate that the number of input rows is the same as the number of output rows. The numbers returned are the totals: the sum of the stats for all the slave transformations.

To finish up for today, here is a sample where we partition data in a certain fashion:

Clustering and partitioning

This sample reads a file on the master, then sends it off to 2 slave servers (over TCP/IP sockets).
The data then needs to be partitioned: rows arriving on one slave server may need to go to any of the others. For this, N×(N-1) sockets are opened for each partition running on the slave servers, where N is the number of slave servers. In this case N=2, so each partition needs 2×1 = 2 sockets; with 2 partitions running, we open 4 extra sockets to make sure the data can find its way to the correct partition.
In the picture above, the “Text File Output” and “Get the partition Nr” steps are partitioned using the same partition schema, and as such the data stays in the same “swimming lane” without any performance loss. (A sketch of the partitioning rule itself follows below.)
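The rule that assigns a row to a partition can be as simple as a modulo over a hash of the partitioning key. Kettle 3.0 ships its own partitioning methods; this is just a sketch of the standard “mod” idea, not the actual code:

```java
// Simplified illustration of modulo partitioning: rows with the same key
// always land in the same partition (and hence the same "swimming lane").
class ModPartitionSketch {
    static int partitionNr(Object key, int nrPartitions) {
        // Mask off the sign bit so the result always falls in [0, nrPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % nrPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionNr("customer-42", 4)); // same key -> same partition, every time
    }
}
```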

The Dummy Step (note that the icon has changed since I took the screenshot) is doing the re-partitioning here. If you open the generated slave transformations, you will see the form these transformations take:

A typical slave transformation

A slave transformation sample using 20 extra sockets to re-partition the data.

We now also set a number of internal variables correctly, for your convenience. This allows you to read/write data from/to partitioned files, etc.

The Dx2 or Dx10 means that the partitioning is dynamic: 2 or 10 partitions per slave server in the cluster (a small arithmetic sketch follows the image):

Dynamic partitioning
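In other words, the actual number of partitions is only known at run time, once the cluster size is known. A trivial sketch of that arithmetic (hypothetical method name, not a Kettle API):

```java
class DynamicPartitionSketch {
    // Dynamic partitioning: the schema says "partitions per slave server",
    // the cluster says "number of slaves"; the runtime partition count is the product.
    static int totalPartitions(int partitionsPerSlave, int nrSlaveServers) {
        return partitionsPerSlave * nrSlaveServers;
    }

    public static void main(String[] args) {
        System.out.println(totalPartitions(2, 4)); // Dx2 on a 4-slave cluster -> 8 partitions
    }
}
```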

The advances in 3.0 shown above make sure that the Kettle engine allows you to scale beyond a single box. In fact, they will allow you to get the maximum out of your favorite disk subsystem by attacking your data in parallel.

Until next time,

Matt

P.S. Don’t you just love these new icons 😉