Data federation

Dear Kettle friends,

For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”.   Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.

The problem?  A real virtual database would have to understand SQL and a data transformation engine typically doesn’t.  It’s usually great at generating it, parsing it not so.

So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.

The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.

Here is the architecture we’re in need of:

In other words, here’s what the user should be able to do:

  • He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
  • It should be possible to use any kind of software that understands the JDBC and SQL standards
  • It should have a minimal set of dependencies as far as libraries are concerned
  • Data should be streamed to allow for massive amounts of data to be passed from server to client
  • The SQL should be able to understand basic SQL including advanced WHERE, GROUP BY, ORDER BY, HAVING clauses. (anything that an OLAP engine needs)
Not for the first time, I though to myself (and the patient ##pentaho community on IRC) : “This can’t be that hard!!”.  After all, you only need to parse SQL that gets data from a single (virtual) database table since joining and so on can be done in the service transformation.
So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…
You can read all details about it on the following wiki page:
The cool thing about Kettle data federation is that anyone can test this in half an hour time following the next few simple steps:
  • Download a recent 5.0-M1 development build from our CI system (any left failed unit tests are harmless but an indication that you are in fact dealing with non-stable software in development)
  • Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
  • Create a Carte configuration file as described in the Server Configuration chapter on the driver page specifying
    • The name of the service (for example “Service”)
    • the transformation file name
    • the name of the step that will deliver the data
  • Then start Carte
  • Then configure your client as indicated on the driver page.
For example, I created a transformation to test with that delivered some simple static data:
I have been testing with Mondrian on the EE BI Server 4.1.0-GA, and as indicated on the driver page, simply replaced all the kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.
Create the data source database connection:
Follow the data source wizard, select “Reporting and Analyses” at the bottom:
Select one table only and specify that table as the fact table:
Then you are about ready to start the reporting & analyses action.  Simply keep the default model (you
can customize it later)…
You are now ready to create interactive reports…
… and analyzer views:
So get started on this and make sure to give us a lot of feedback, your success stories and failures as well.  You can comment on the driver page or in the corresponding JIRA case PDI-8231
The future plans are:
  • Offer easy integration with the unified repository for our EE users so that they won’t have to enter XML or have to restart a server when they want to add or change the services list. (arguably an important requisite for anyone seriously considering this to be run in production)
  • Implement service and SQL data caching on the server.
  • Allow writable services and “insert into” statements on the JDBC client