Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle: having DataCleaner available to our users pushed our own data profiling tool far enough down the priority list.

In the past, Kasper worked on DataCleaner pretty much in his spare time.  Now that Human Inference has taken over the project I was expecting more frequent updates, and that’s indeed what we got.  Not only did version 2 come out recently, we also got version 2.0.1 a few weeks back and version 2.0.2 today.  All this indicates a fast-paced project.

DataCleaner was mentioned a few times in books about Pentaho software.  For example it was referenced in Pentaho Solutions as well as in Pentaho Kettle Solutions (chapter 6 – Data profiling).  This was done so that folks who need to do a bit of data profiling before they start with the data integration work can get the job done.

So what’s happening with DataCleaner besides Kasper going all-out now that he works full time on the product? What purpose does it serve?

Let’s start with my favorite option: the “Quick Analysis” option.  You point it to a database table (or CSV file) and you let it fly.  Here’s the sort of thing it comes back with:

In essence it will give you most of what you need to know about the quality of your data before getting into the data integration work.  It offers a really nice and rich user interface.  In the previous screen shot you can for example click on the green arrows to display sample rows with that particular data characteristic.
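To give a feel for the kind of numbers a quick profiling pass produces, here is a minimal Python sketch. It is not how DataCleaner works internally; the function name `quick_profile`, the sample data and the chosen statistics (row count, empty values, distinct values, string lengths) are my own assumptions for illustration.

```python
import csv
import io
from collections import Counter


def quick_profile(rows):
    """Collect simple per-column statistics from an iterable of dict rows.

    Hypothetical helper for illustration only, not part of DataCleaner.
    """
    profile = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(column, {
                "rows": 0,
                "nulls": 0,
                "values": Counter(),
                "min_length": None,
                "max_length": 0,
            })
            stats["rows"] += 1
            # Treat missing and blank fields as "null" for profiling purposes.
            if value is None or value.strip() == "":
                stats["nulls"] += 1
                continue
            stats["values"][value] += 1
            length = len(value)
            stats["max_length"] = max(stats["max_length"], length)
            if stats["min_length"] is None or length < stats["min_length"]:
                stats["min_length"] = length
    return profile


# Made-up sample data standing in for a CSV file.
SAMPLE_CSV = """id,name,city
1,Alice,Arnhem
2,Bob,
3,Alice,Orlando
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = quick_profile(rows)

for column, stats in profile.items():
    print(column, "rows:", stats["rows"], "nulls:", stats["nulls"],
          "distinct:", len(stats["values"]))
```

A real profiling tool adds much more on top of this (pattern detection, numeric and date analysis, drill-down to sample rows), but counts like these are usually the first things you want to see.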

Because not all profiling jobs are as easy as this one, DataCleaner has gained more “data integration”-like features in version 2.0.  These will for example allow you to filter certain rows based on a wide palette of DQ-oriented criteria such as dictionaries, JavaScript, rules and much more.  The next screen shot shows the use of a filter to limit the number of analyzed rows:
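As a rough idea of what a dictionary-based filter does conceptually, here is a small Python sketch. Again, this is an assumption-laden illustration rather than DataCleaner's implementation: the function `filter_by_dictionary`, the case-insensitive matching and the sample records are all made up.

```python
def filter_by_dictionary(rows, column, dictionary):
    """Split rows into (valid, invalid) depending on whether row[column]
    appears in the dictionary. Matching ignores case and surrounding spaces.

    Hypothetical helper for illustration only, not part of DataCleaner.
    """
    allowed = {entry.strip().lower() for entry in dictionary}
    valid, invalid = [], []
    for row in rows:
        value = (row.get(column) or "").strip().lower()
        if value in allowed:
            valid.append(row)
        else:
            invalid.append(row)
    return valid, invalid


# Made-up sample records.
customers = [
    {"name": "Alice", "country": "Netherlands"},
    {"name": "Bob", "country": "netherlands "},
    {"name": "Carol", "country": "Nederland"},
    {"name": "Dave", "country": ""},
]

valid, invalid = filter_by_dictionary(
    customers, "country", ["Netherlands", "Belgium"]
)
print("valid:", [row["name"] for row in valid])
print("invalid:", [row["name"] for row in invalid])
```

Only the rows that pass the filter would then go on to the analysis, which is the same idea as limiting the number of analyzed rows.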

Don’t expect Kettle-style drag-and-drop data integration. The tool is specifically targeted at on-line data quality, and data profiling in particular. However, that’s what it claims to be good at, and it is good at it.

There’s obviously a lot more to tell about DataCleaner, but I hope that this little blog post at least makes you interested enough to give it a go yourself.

Since DataCleaner and Kettle are license-compatible I’ll be looking at creating a plugin to integrate DataCleaner into Spoon … once I find a bit of time to do so or if someone volunteers to jump right in.  Kasper wasn’t quite convinced it would be easy to do but not all things in life have to be easy.

You can download DataCleaner over here so download it now and make sure to let them know what you think of it.

Until next time,


4 thoughts on “Data Cleaner 2”

  1. Thanks for this post, time to have a look at Datacleaner again.

    Another nice tool in my personal toolbox is Google Refine, formerly known as Gridworks:

    It currently only supports xls or csv files but provides many nice features for profiling, cleansing and enriching data.

  2. I also use Google Refine. Its ability to interact with Google Translate is particularly useful. It is also very simple to use in the main, though some of the expressions are difficult to get to grips with. I came across DataCleaner through the guys at (they do an excellent job there), as I did Google Refine. DataCleaner is a very, very, very good DQ tool. It struggled a little last night with 200K records, but it didn’t crash and after all, it’s only time, so I had a cuppa. I can’t wait for some decent documentation to be published (they are working on it) but from what I have seen, I will be using it in ALL future data projects.
