Data Mining

Perhaps you caught the news, perhaps not, but Pentaho just got a bit larger again with the acquisition of the Weka project.
I’m really excited about this because it means we can finally crank out a couple of new steps for Kettle without having to release the whole Kettle project under the GPL license. We could create new Weka plugins under GPL and then offer them to customers under a commercial license as well.

Some time ago I played around with Weka a bit and found it extremely hard to read data into the different available engines. The plan I have for the Kettle-Weka integration is to build a couple of steps that provide you with the best of both worlds: easy drag&drop data integration and state-of-the-art data mining modules. If it wasn’t for the very big workload I’m under, I would get started on this right away. Unfortunately though, the new meta-data architecture for Pentaho (+GUI) is taking its fair amount of time to develop.

Another option to aim for is the creation/inclusion of a data profiler in Pentaho Data Integration to analyze source data.
All in all, these are very exciting times for Pentaho. It’s an honour to be able to take part in it.

[Screenshot: data mining with the Weka GUI — sample]

3 thoughts on “Data Mining”

  1. Hi, this is Daniel. I like to work with fact tables. How can I capture slowly changing dimensions in fact tables using Kettle? Please suggest an approach.


  2. Hi Daniel,

    You can create, populate and update a slowly changing dimension using the “Dimension lookup/update” step.

    For a detailed description of what a slowly changing dimension is (or what a fact is) you should look elsewhere. (STFW) These concepts are not overly complex, but basic knowledge of them is recommended when trying to create a data warehouse. That’s the case for any ETL tool, not just Kettle. 🙂
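    In case it helps to see what the step does conceptually: a Type 2 slowly changing dimension keeps one row per version of a member, closing off the old version and inserting a new one when a tracked attribute changes. Here is a rough plain-Python sketch of that versioning logic (this is an illustration of the general technique, not Kettle’s internal API; all names here are made up):

    ```python
    from datetime import date

    # Sentinel "valid until" date for the current version of a dimension member.
    END_OF_TIME = date(9999, 12, 31)

    def scd2_upsert(dim_rows, natural_key, attrs, load_date):
        """Type 2 SCD update against an in-memory dimension table.

        dim_rows: list of dicts with 'sk' (surrogate key), 'nk' (natural key),
                  'attrs' (tracked attributes), 'valid_from', 'valid_to'.
        Returns the surrogate key of the row valid on load_date.
        """
        # Find the currently open version for this natural key.
        current = next((r for r in dim_rows
                        if r['nk'] == natural_key and r['valid_to'] == END_OF_TIME),
                       None)
        if current is not None and current['attrs'] == attrs:
            return current['sk']            # nothing changed: reuse the version
        if current is not None:
            current['valid_to'] = load_date  # close off the old version
        # Insert a new version with a fresh surrogate key.
        new_sk = max((r['sk'] for r in dim_rows), default=0) + 1
        dim_rows.append({'sk': new_sk, 'nk': natural_key, 'attrs': attrs,
                         'valid_from': load_date, 'valid_to': END_OF_TIME})
        return new_sk
    ```

    The surrogate key this returns is what you would store in the fact table, so each fact row points at the dimension version that was valid when the fact occurred.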

    All the best,

