Pentaho 5.0 blends right in!

Dear Pentaho friends,

Ever since a number of projects joined forces under the Pentaho umbrella (over 7 years ago), we have been looking for ways to create more synergy across the complete software stack.  That is why today I’m exceptionally happy to announce not just version 5.0 of Pentaho Data Integration, but a new way to integrate Data Integration, Reporting, Analysis, Dashboarding and Data Mining through one single interface called Data Blending, available in Pentaho Business Analytics 5.0 (Commercial Edition).

Data Blending allows a data integration user to create a transformation capable of delivering data directly to our other Pentaho Business Analytics tools (and even non-Pentaho tools).  Traditionally, data is delivered to these tools through a relational database.  However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can’t wait until the database tables are updated.  This leads, for example, to a new kind of big data architecture with many moving parts:

Evolving Big Data Architectures


From what we can see in use at major deployments with our customers, mixing Big Data, NoSQL and classical RDBMS technologies is more the rule than the exception.

So, how did we solve this puzzle?

The main problem we faced early on was that the default language used under the covers, in just about any user-facing business intelligence tool, is SQL.  At first glance the worlds of data integration and SQL seem incompatible.  In DI we read from a multitude of data sources, such as databases, spreadsheets, NoSQL and Big Data sources, XML and JSON files, web services and much more.  However, SQL is a mini-ETL environment in its own right: it selects, filters, counts and aggregates data.  So we figured that it might be easiest to translate the SQL used by the various BI tools into Pentaho Data Integration transformations.  This way, Pentaho Data Integration does what it does best, directed not by manually designed transformations but by SQL.  This is at the heart of the Pentaho Data Blending solution.
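To make the idea concrete, here is a minimal sketch, in Python, of how the clauses of a simple single-table SQL query could be mapped onto data integration steps.  This is purely illustrative and not Pentaho's actual translator; the `plan_for` helper and the step names it emits are hypothetical:

```python
# Conceptual sketch: map the clauses of a simple single-table SQL query
# onto the PDI steps that could implement them.  This is NOT Pentaho's
# real SQL translator; the step names below are illustrative only.
import re

def plan_for(sql):
    """Return an ordered list of (step, detail) pairs for a single-table query."""
    pattern = re.compile(
        r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
        r"(?:\s+WHERE\s+(?P<where>.+?))?"
        r"(?:\s+GROUP\s+BY\s+(?P<group>.+?))?"
        r"(?:\s+ORDER\s+BY\s+(?P<order>.+?))?$",
        re.IGNORECASE | re.DOTALL,
    )
    m = pattern.match(sql.strip())
    if not m:
        raise ValueError("unsupported SQL")
    # The service transformation produces the raw rows of the virtual table;
    # each SQL clause then maps onto one existing transformation step.
    steps = [("Service transformation", m.group("table"))]
    if m.group("where"):
        steps.append(("Filter Rows", m.group("where")))
    if m.group("group"):
        steps.append(("Memory Group By", m.group("group")))
    if m.group("order"):
        steps.append(("Sort Rows", m.group("order")))
    steps.append(("Select Values", m.group("select")))
    return steps

for step, detail in plan_for(
    "SELECT country, SUM(sales) FROM Service WHERE year = 2013 GROUP BY country"
):
    print(step, "->", detail)
```

The point is that each SQL clause corresponds naturally to an existing transformation step, so a generated transformation can serve the query without a relational database in the middle.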

The data blending service architecture

To ensure that the “automatic” part of the data chain doesn’t become an impenetrable “black box”, we once more made good use of existing PDI technologies.  We log all executed queries on the Data Integration server (or Carte server), so you have a full view of all the work being done:

Data Blending Transparency

In addition to this, the statistics from the queries can be logged and viewed in the operations data mart giving you insights into which data is queried and how often.

We sincerely hope that you like these new powerful options for Pentaho Business Analytics 5.0!



– If you want to learn more about the new features in this 5.0 release, Pentaho is hosting a webinar and demonstration on September 24th.  Two options to register: EMEA & North America time zones.

– Also check out the exciting new capabilities we deliver today on top of MongoDB!

Matt Casters
Chief of Data Integration at Pentaho, Kettle project founder


Join the Pentaho Benelux Meetup in Antwerp!

Dear Benelux Pentaho fans,

On October 24th, Bart Maertens and I are organizing the revival of the Pentaho Benelux User Group with a meetup in Antwerp.  We’re doing this in the fantastic setting of the historical Antwerp Central Station building, smack in the center of Antwerp.  The location is easy to reach by public transport and there’s ample parking space, so we hope it will be within reach for everyone.

Antwerp Central Station

Here is the entrance:

PBUG on the map
Atrium entrance, room is upstairs

And the view from the balcony:

The view from above

The final agenda hasn’t been completed yet, but as a good user group we would like to focus on sharing Pentaho experiences (good or bad) among the attendees.  Because of that, this meetup will center on use-cases and practical information, giving feedback to Pentaho (questionnaire), and so on.

Registration has opened today and is completely free for all.

Register now at:

We will update the schedule on the Eventbrite page as soon as more speakers are known.

We sincerely hope that you will like this PBUG initiative and that you find the time to join us.  To keep this event completely free of charge and accessible to all, a Pentaho partner has agreed to pay for the room and Pentaho will pay for pizza and beer.

See you all in Antwerp!

Matt (& Bart)


The Pentaho Big Data Forum

Dear friends,

If you’re in the Washington DC area next Tuesday, April 23rd, why not drop in on our complimentary Big Data Forum:

Come and listen to us and our partners Cloudera, 10gen and Unisys and see what we can do for you in the Big Data space.

See you soon in DC!


Celebrating 10 Years of Kettle Coding

Dear Kettle friends,

The other week Jens and I were wondering how long it had been since I first started coding the current version of Kettle.  So I started a thorough computer forensics investigation, leading to the discovery of a backup of the first ever version of Kettle.

The date that comes up from that backup is March 4th, 2003, just about 10 years ago.  The development of Kettle started earlier, with analysis documents (most probably lost, but nothing much was actually lost, if you know what I mean) and even a version written in C, as that was the main programming language I used back then to get things done.

Java was at mainstream version 1.3 and 1.4, but lots of “Applets” still ran on 1.1 or 1.2, generics didn’t exist, computers in general had 1 CPU and 512MB RAM… and I had a book called something like “Java in 21 Days” to teach me how to get going.  From there it took another two and a half years, lots of re-factoring and lots of help to get to the open sourcing of version 2.2 in December 2005.

While going back to the beginning of Kettle’s history it’s easy to understate the importance of Pentaho. After all, of those 10 years of the current code-base, over 7 have been spent working with the rest of the Pentaho team to build the best data integration tool on the planet.  Programming alone is fine but in general you get more things done in a team.  It’s absolutely fantastic to see the whole team chip in alongside the community on things like bug fixing, builds, continuous integration, UI, design, plugins, website, forums, JIRA triage, product management, marketing, events, sales, …

Thank you all for making Kettle the awesome tool it is today and the incredible tool that Kettle5 will be.



10 Years of Kettle Pie

Data federation

Dear Kettle friends,

For a while now we’ve been getting requests from users to support a system called “Data Federation”, a.k.a. a “Virtual Database”.  Even though it has been possible for a while to create reports on top of a Kettle transformation, that system could hardly be considered virtual anything, since the Pentaho reporting engine runs the transformation on the spot to get to the data.

The problem?  A real virtual database would have to understand SQL, and a data transformation engine typically doesn’t.  It’s usually great at generating SQL, but not so great at parsing it.

So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.

The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.

Here is the architecture we need:

In other words, here’s what the user should be able to do:

  • He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
  • It should be possible to use any kind of software that understands the JDBC and SQL standards.
  • It should have a minimal set of dependencies as far as libraries are concerned.
  • Data should be streamed, to allow massive amounts of data to be passed from server to client.
  • The driver should understand basic SQL, including advanced WHERE, GROUP BY, ORDER BY and HAVING clauses (anything that an OLAP engine needs).
Not for the first time, I thought to myself (and told the patient ##pentaho community on IRC): “This can’t be that hard!!”.  After all, you only need to parse SQL that gets data from a single (virtual) database table, since joining and so on can be done in the service transformation.
So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…
You can read all details about it on the following wiki page:
The cool thing about Kettle data federation is that anyone can test it in half an hour by following a few simple steps:
  • Download a recent 5.0-M1 development build from our CI system (any remaining failed unit tests are harmless, but an indication that you are in fact dealing with non-stable software in development)
  • Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
  • Create a Carte configuration file as described in the Server Configuration chapter on the driver page specifying
    • The name of the service (for example “Service”)
    • the transformation file name
    • the name of the step that will deliver the data
  • Then start Carte
  • Then configure your client as indicated on the driver page.
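For reference, here is a sketch of what such a Carte configuration file could look like.  The element names below are indicative only (a hand-written approximation, not copied from the product); the Server Configuration chapter on the driver page has the authoritative schema:

```xml
<slave_config>
  <!-- The Carte slave server itself -->
  <slaveserver>
    <name>master</name>
    <hostname>localhost</hostname>
    <port>8082</port>
  </slaveserver>
  <!-- Transformations exposed as virtual database tables -->
  <services>
    <service>
      <name>Service</name>                      <!-- virtual table name -->
      <filename>/path/to/service.ktr</filename> <!-- the transformation -->
      <service_step>Output</service_step>       <!-- step delivering the rows -->
    </service>
  </services>
</slave_config>
```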
For example, I created a transformation to test with that delivered some simple static data:
I have been testing with Mondrian on the EE BI Server 4.1.0-GA, and as indicated on the driver page, simply replaced all the kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.
Create the data source database connection:
Follow the data source wizard, select “Reporting and Analyses” at the bottom:
Select one table only and specify that table as the fact table:
Then you are about ready to start the reporting & analyses action.  Simply keep the default model (you can customize it later)…
You are now ready to create interactive reports…
… and analyzer views:
So get started on this and make sure to give us a lot of feedback: your success stories as well as your failures.  You can comment on the driver page or in the corresponding JIRA case, PDI-8231.
The future plans are:
  • Offer easy integration with the unified repository for our EE users, so that they won’t have to enter XML or restart a server when they want to add or change the services list. (Arguably an important requirement for anyone seriously considering running this in production.)
  • Implement service and SQL data caching on the server.
  • Allow writable services and “insert into” statements on the JDBC client

Better Data for Better Analytics

Dear Kettle friends,

Thursday May 10th, in a few days, I’ll be joining my friend Kasper Sørensen (the founder and lead architect of DataCleaner, a Human Inference data profiling project) in our web seminar (webinar).  We’ll be going over a bit of history, our cooperation model as well as the architecture behind the new data quality features.

Register here

Kasper will also be doing 3 cool live demos on the subjects of data profiling and data quality.

I hope you’ll be able to join the crowd this Thursday May 10th, 10am PST (Los Angeles), 1pm EST (New York) or 7pm CET (Brussels).

We’ll be doing our best to answer your data quality questions simultaneously with the presentation.

See you there!


Big Kettle News

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes being announced today compare in importance to reaching Kettle version 1.0 some 9 years ago, reaching version 2.0 with plugin support, or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of today’s big news allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop, the time it takes to implement a MapReduce job is really small, taking the sting out of costly and lengthy training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.

The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to broaden our portfolio even further. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is on our wiki, to emphasize our intention to work closely with the various communities to make Pentaho Big Data a success. You can find all the information over there, including a set of videos, a PDI 4.3.0 preview download (including the Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!


Streaming XML content parsing with StAX

Today, one of our community members posted a deviously simple XML format on the forum that needed to be parsed.  The format looks like this:

    <DATE>Fri, 01 Jun 2001 22:50:00 GMT</DATE>

    <DATE>Fri, 01 Jun 2001 22:50:02 GMT</DATE>

Typically we parse XML content with the “Get Data From XML” step, which uses XPath expressions.  However, since the meaning of this XML content is determined by position instead of path, that becomes a problem.  To be specific, for each CONVERSION block you need to pick up the last preceding EXPR and EXCH values.  You could solve it like this:

Unfortunately, this method requires fully parsing your file 3 times, plus once more for each additional preceding element.  All the joining also slows things down considerably.

So this is another case where the new “XML Input Stream (StAX)” step comes to the rescue.  The solution using this step is the following:

Here’s how it works:

1) The output of the “positional element.xml” step flattens the content of the XML file so that you can see the output of each individual StAX event, like “start of element”, “characters” and “end of element”.  For every event you get the path, parent path, element value and so forth.  As mentioned in the documentation, this step is very fast and can handle files of just about any size with a minimal memory footprint.  It will appear in PDI version 4.2.0-GA.

2) With a bit of scripting we collect information from the various rows that we find interesting.

3) We filter out only the result lines (the end of the CONVERSION element).  What you get is the following desired output:

Using JavaScript in this example is not ideal, but compared to the reading speed of the XML I’m sure it’s fine for most use-cases.
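The same positional logic can also be sketched outside PDI with any streaming XML parser.  Here is a minimal Python sketch; the element names EXPR, EXCH and CONVERSION are taken from the description above, while the RATE child and the sample document are invented for illustration:

```python
# Streaming (event-based) parse: pair each CONVERSION with the last
# preceding EXPR and EXCH values, in a single pass over the document.
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<ROOT>
  <EXPR>USD</EXPR>
  <EXCH>EUR</EXCH>
  <CONVERSION><RATE>1.1</RATE></CONVERSION>
  <EXCH>GBP</EXCH>
  <CONVERSION><RATE>0.9</RATE></CONVERSION>
</ROOT>"""

def conversions(stream):
    """Yield (expr, exch, rate) tuples: each CONVERSION block combined
    with the last preceding EXPR and EXCH values."""
    expr = exch = rate = None
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "EXPR":
            expr = elem.text          # remember the most recent EXPR
        elif elem.tag == "EXCH":
            exch = elem.text          # remember the most recent EXCH
        elif elem.tag == "RATE":
            rate = elem.text
        elif elem.tag == "CONVERSION":
            yield expr, exch, rate    # emit one result row per block
        elem.clear()                  # keep the memory footprint minimal

for row in conversions(io.BytesIO(SAMPLE)):
    print(row)
```

As in the PDI solution, the file is read exactly once regardless of how many preceding elements you need to track; adding another remembered element is just one more branch.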

Both examples are up for download from the forum.

The “XML Input Stream (StAX)” step has also been shown to work great with huge hierarchical XML structures, files of multiple GB in size.  The step was written by my colleague Jens Bleuel, and he documented a more complex example on his blog.

Have fun with it!