Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle: having DataCleaner available to our users pushed the priority of building our own data profiling tool far enough down.

Kasper worked on DataCleaner pretty much in his spare time in the past.  Now that Human Inference took over the project I was expecting more frequent updates and that’s indeed what we got.  Not only did version 2 come out recently, we also got version 2.0.1 a few weeks back and today version 2.0.2.  All this indicates a fast-paced project.

DataCleaner was mentioned a few times in books about Pentaho software.  For example it was referenced in Pentaho Solutions as well as in Pentaho Kettle Solutions (chapter 6 – Data profiling).  This was done so that folks who need to do a bit of data profiling before they start with the data integration work can get the job done.

So what’s happening with DataCleaner besides Kasper going all-out now that he works full time on the product? What purpose does it serve?

Let’s start with my favorite option: the “Quick Analysis” option.  You point it to a database table (or CSV file) and you let it fly.  Here’s the sort of thing it comes back with:

In essence it will give you most of what you need to know about the quality of your data before getting into the data integration work.  It offers a really nice and rich user interface.  In the previous screen shot you can for example click on the green arrows to display sample rows with that particular data characteristic.
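For readers who have never profiled a column, here is a rough idea of the kind of numbers a profiling pass collects.  This is a minimal sketch in plain Java and not DataCleaner’s implementation or API: the class ColumnProfileSketch, the Profile fields and the choice of metrics are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: none of these names come from DataCleaner.
final class ColumnProfileSketch {

    /** Basic quality metrics for one column of string values. */
    static final class Profile {
        int rowCount;
        int nullCount;
        int blankCount;      // non-null values that are empty or whitespace only
        int distinctCount;   // distinct non-null values (case sensitive)
        int minLength = -1;  // -1 means "no non-null value seen"
        int maxLength = -1;
    }

    /** Walks the column once and collects the metrics. */
    static Profile profile(List<String> values) {
        Profile p = new Profile();
        Set<String> distinct = new HashSet<>();
        for (String v : values) {
            p.rowCount++;
            if (v == null) {
                p.nullCount++;
                continue;
            }
            if (v.trim().isEmpty()) {
                p.blankCount++;
            }
            distinct.add(v);
            int len = v.length();
            if (p.minLength < 0 || len < p.minLength) {
                p.minLength = len;
            }
            if (len > p.maxLength) {
                p.maxLength = len;
            }
        }
        p.distinctCount = distinct.size();
        return p;
    }

    public static void main(String[] args) {
        // A tiny "city" column with the usual suspects: a null, a blank, inconsistent casing.
        List<String> city = Arrays.asList("Arnhem", "arnhem", null, "", "Arnhem");
        Profile p = profile(city);
        System.out.println("rows=" + p.rowCount
                + " nulls=" + p.nullCount
                + " blanks=" + p.blankCount
                + " distinct=" + p.distinctCount
                + " length=" + p.minLength + ".." + p.maxLength);
    }
}
```

Numbers like these are what tell you, before building a transformation, that a column needs trimming, case normalization or null handling.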

Because not all profiling jobs are as easy as this one, DataCleaner has been featuring more “data integration” like features in version 2.0.  These will for example allow you to filter certain rows based on a wide palette of DQ-oriented criteria such as dictionaries, JavaScript, rules and much more.  The next screen shot shows the use of a filter to limit the number of analyzed rows:

Don’t expect any Kettle-like drag-and-drop data integration. The tool is targeted at on-line data quality work, and at data profiling in particular. However, that’s what the tool claims to be good at and it is good at that.

There’s obviously a lot more to tell about DataCleaner but I hope that this little blog post will at least get you interested and make you want to give it a go yourself.

Since DataCleaner and Kettle are license-compatible I’ll be looking at creating a plugin to integrate DataCleaner into Spoon … once I find a bit of time to do so or if someone volunteers to jump right in.  Kasper wasn’t quite convinced it would be easy to do but not all things in life have to be easy.

You can download DataCleaner over here so download it now and make sure to let them know what you think of it.

Until next time,

Matt

Reading from MongoDB

How to read data from MongoDB using PDI 4.2

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCursor;

private Mongo m;
private DB db;
private DBCollection coll;

private int outputRowSize = 0;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
	// Query every document in the collection.
	DBCursor cur = coll.find();

	if (first) {
		first=false;
		outputRowSize = data.outputRowMeta.size();
	}

	while(cur.hasNext() && !isStopped()) {
		String json = cur.next().toString();
		Object[] row = createOutputRow(new Object[0], outputRowSize);
		int index=0;
		row[index++] = json;

		// putRow will send the row on to the default output hop.
		//
		putRow(data.outputRowMeta, row);
	}

	setOutputDone();

	// Everything was read in this single call, so signal that we are done.
	return false;
}

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
	try {
		// Change host, port, database and collection to suit your setup.
		m = new Mongo("127.0.0.1", 27017);
		db = m.getDB( "test" );
		coll = db.getCollection("testCollection");

		return parent.initImpl(stepMetaInterface, stepDataInterface);
	} catch(Exception e) {
		logError("Error connecting to MongoDB: ", e);
		return false;
	}
}

You can simply paste this code into a new UDJC step dialog. Change the parts in the init() method to serve your needs. This code reads all the data from a collection in a Mongo database.  The output of this step is a set of rows, each containing one JSON string. So make sure to specify one JSON String field as output of your step.  These JSON structures can be parsed with the new “JSON Input” step and then you can do whatever you want with them.
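If you want to see the row flow of that step without a Kettle or MongoDB install at hand, the following plain-Java sketch mimics it.  Everything in it is a stand-in and an assumption on my part: the Iterator plays the role of the DBCursor, the returned List plays the role of the output hop, and the names JsonRowFlowSketch and readAll are hypothetical.  The check on outputRowSize only exists in this sketch, as a reminder that the JSON field has to be declared; it says nothing about how the real step behaves when the field is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the UDJC step above; no Kettle or MongoDB classes are used.
final class JsonRowFlowSketch {

    /**
     * Drains the cursor and emits one row per document.
     * Every row has outputRowSize slots and slot 0 holds the JSON string.
     */
    static List<Object[]> readAll(Iterator<String> cursor, int outputRowSize) {
        if (outputRowSize < 1) {
            // Sketch-only guard: there is no output field to hold the JSON string.
            throw new IllegalArgumentException("Declare one String output field for the JSON");
        }
        List<Object[]> outputHop = new ArrayList<>();
        while (cursor.hasNext()) {
            Object[] row = new Object[outputRowSize];
            row[0] = cursor.next();
            outputHop.add(row);
        }
        return outputHop;
    }

    public static void main(String[] args) {
        // Two fake documents, already rendered as JSON strings.
        List<String> docs = Arrays.asList(
                "{ \"name\" : \"Matt\" }",
                "{ \"name\" : \"Kasper\" }");
        for (Object[] row : readAll(docs.iterator(), 1)) {
            System.out.println(row[0]);
        }
    }
}
```

The point to take away is the shape of the output: one row per document and a single String field per row, which is exactly what a JSON parsing step downstream needs as its input.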

Please let us know what you think of this and whether or not you would like to see support for writing to MongoDB and/or dedicated steps for it.  I’m sorry to say I have no idea of the popularity of these new NoSQL databases.

Until next time,

Matt

UPDATE: The functionality described in this UDJC code is available in a new “MongoDB Input” step in 4.2.0-M1 or later.

UPDATE2: We also added authentication for MongoDB in PDI-6137

P.S. To install and run MongoDB on your Ubuntu 10.10 machine, do this:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
sudo apt-get update
sudo apt-get install mongodb

My Android tablet

I’ve always been a fan of gadgets and so when it came time to buy my dad a replacement for his 6-year-old Palm Pilot that recently broke down, we (my sisters and I) bought him an Apple iPad.  Just to make this clear and to get this out of the way, it was €600 well spent since he loves this device a lot.  Mostly he watches television on it and reads his newspaper.

I had the iPad about a week before we wrapped it up and in that small time frame I was impressed with the device’s user friendliness but also frustrated by it.  I felt a lot of frustration because that device is as closed down as you can possibly close down a computer.  My biggest gripe was of course iTunes.  For my dad it must have felt quite natural to “synchronize” a dumb terminal hand-held device like a Palm or an iPad.  To me, it felt really awkward coming from the Android platform where mail and calendar are held in the cloud, where you can install applications from a web interface and where you can hook devices up to your computer to transfer files.

So naturally I wanted to buy an Android tablet for myself.  Having played with the 10-inch iPad I’m convinced that this is a really nice form factor for a tablet so I wanted one of those.  However, the problem there is that you basically have 2 main variations of Android tablets at the moment: the really expensive or the really cheap.  First let’s take a look at the really expensive. I don’t know about you but to me forking over €600 or more for what is basically a gadget is too much.  In that respect I think that devices like the Galaxy Tab, the upcoming Xoom and many more are simply missing the mark with prices beyond €800.  I already have a laptop and a smartphone.  This gadget will be used to browse the web, play games, read books… on my couch or in bed.

The dirt cheap category is filled with all sorts of equipment that looks nice but either has old versions of Android, a lousy single-touch screen, not enough memory (256MB), a battery life of half an hour or a dog-slow 5-year-old processor.  Sure, it only costs €100 but you know you’re never going to use it for more than a bit of testing.

Unfortunately, there are very few (Android or iOS) tablets to be found in the space in between, the price range between €250 and €450 where netbooks did so well.  It almost looks like all manufacturers want to make a quick buck from this new tablet hype.

At the end of my search I heard about the Point of View Mobii Tegra 2 tablet that came out around the end of last year.  A month ago I bought it and priced at a reasonable €350 it comes with the latest NVidia Tegra 250 mobile chipset which is (it has to be said) bloody fast with a dual-core Cortex A9 processor at 1GHz.  It has 512MB of RAM to work with and a multi-touch capacitive 10.2″ screen (1024×600).  While the screen itself is by far the weakest part of the deal, the device also comes with a MicroSD card reader, a 1.3-megapixel front-facing webcam, a USB port (host mode, meaning you can hook up your 1TB hard drive to it, or your keyboard & mouse) and an HDMI port (to play Angry Birds on your 50″ HD TV set).

As such, this device would be nearly perfect if it weren’t for the fact that the software that runs on it, Android 2.2 with some customizations from the manufacturer, is pretty bad.  After a month of usage this tablet has been a lot of fun to test and play with and it’s gained its spot in my life.  To me the tablet would be perfect with a better OS (Android 3?) and with a better screen (the viewing angle isn’t great but it’s not bad enough to become an issue).  Something tells me that this will very soon become possible at the same price-point.

Fortunately though, software is something that can be fixed rather easily these days.  After all, this is open source Android we’re talking about here.  To save you folks out there that just bought this machine the trouble, I’m going to explain to you what you need to do to get it up to speed (literally).

UPDATE: before you start you might want to consider updating your device firmware (from February 11th) to v1.0.9.  To check go to the root screen, Menu, Settings, About the phone.  If you update you can skip the installation of the flash player and the screen calibration.  This update erases your tablet so make sure to back up your important data.

We’re going to install “some” extra software to make it usable… The information I found almost exclusively comes from Tweakers.net, a Dutch site.  Since Point of View is a Dutch company I guess that makes sense.  It also allows for a very hackable and upgradeable machine 🙂

Before we start, make sure to insert a MicroSD card into your new PoV tablet.  It will help performance and in general I think the software expects to find something there.  Since these things are dirt cheap and range from 1GB to 32GB, take your pick.  I put 4GB in there and I still think that’s plenty for this device given the fact you can plug in your USB HD/Stick to watch movies.

To begin with, the machines usually ship with a badly calibrated screen making typing and touching the screen a bad experience.  So first download and install the “Module AP” application.  Before running, make sure your screen is clean and your device is placed (screen up) flat on your desk.  Note that after trying certain older Android games I had to re-run this app to re-calibrate the screen. (once or twice I think)

Next we want to obviously install Adobe Flash. Download and install that, not from the Android Market (we will install the Market later) but from this location.  For the Apple fans out there dissing Flash: too many websites use it, it’s not going away.  Most news sites play video using Flash and it works great and fast (even full screen high resolution) on a Tablet.  Let me spell it out for you: banning it from iOS was a big mistake.

Next the Android Market.  For the zip file and explanation see this blog post from Oudmaijer.  It should be fairly straightforward, I didn’t spend too much time on it.  Oudmaijer also has advice on installing alternative ROMs, the Google apps and much more for the adventurous.

Personally I already have a calendar and e-mail in every place I can possibly think of so I erased all these apps from my tablet again.  The machine is now strictly non-work related and I like it that way.  (Note that I can use the browser to read mail and see my calendar while on the road just fine if I need to.)

The lack of 3G is no longer relevant in my opinion now that I have a smart-phone that can Wifi-tether.  If I plug the phone into the USB port of the tablet it charges too.  If you read the various websites of the PoV Mobii you can read up on how to connect your 3G dongle or how to change the USB port from Host mode to Client mode.  I never bothered with either.

Here is some other cool software I installed on the device:

  • Amazon Kindle & Aldiko for a bit of reading.  Having my own book with me is great 🙂
  • Firefox 4 Mobile (Beta 5 or later): Excellent browser with Sync support.  The new beta 5 release of v4 mobile is an absolute pleasure to work with.  Too bad it lacks Flash support but I’m sure it will arrive sooner or later.
  • Astro File Manager so you can browse your file system and so on.  Make sure to also install the networking, SMB, SSH/SFTP modules so you can browse your remote (Linux) PCs.
  • Advanced Task Killer Froyo: Especially in the beginning when you’re testing every new piece of software out there when none of it was designed to run on a dual core or on a 10″ screen it can be quite handy to kill all programs with one finger swipe.  Kill everything before running intensive games like a few of the ones listed below.
  • Better terminal emulator Pro: It doesn’t make a lot of sense to have on a phone but with a tablet and a USB keyboard at hand it can be surprisingly useful. BTEP also has support for scp, ssh and other useful commands so you can copy files the way you know you like it.  Heck, I was a Unix sys-admin for Volvo in a previous life and Android is a Unix machine.
  • QuickPic : For picture viewing. The built-in software works for file viewing and looks very fancy.  It just seems to choke on large volumes of images.
  • z4root (see download link at the bottom of the post) to root your device.  You know you want to 🙂  Goes nicely with “Uninstaller for root” to remove unwanted built-in apps like “e-Mail” and so on.
  • Google Maps : I installed it of course, but I don’t use it that much.
  • Rock Player: This media player handles all possible file formats so you don’t have to convert the HD movies on your hard disk or thumb drive.  Copy them over and watch them when you feel like it.
  • Adobe Reader: Nicely renders PDFs with very good performance. (almost immediately with quick page-turning)
  • Skype: works great but unfortunately without video. I removed it because people would ping me while reading a book.
  • Seesmic can be used for Twittering.  Tweetdeck thinks for some reason it can’t run on the tablet. (lazy developers I tell you)  Again, I un-installed Twitter from this device.  Just because you can install this sort of thing doesn’t mean you should 🙂

These were some apps you can start with.  Please note that Android 2.2 isn’t very good for multi-tasking and multi-media yet.  It tries to do too many background tasks.  If you don’t want to be disturbed with any of that (which is what the default should be), turn off background synchronization in the settings.  It helps out a lot with responsiveness in the games listed next.  Updates for this device were promised in the form of Android 2.3 in the coming months so I’m sure that this situation will improve soon.

  • Asphalt 5 : This HD game fully supports the capabilities of your dual core tablet and is just a lot of fun to play.  It nicely shows off the potential of your tablet.
  • Angry Birds & Angry Birds Seasons: if you have kids they’ll want it, they’ll need it. Both game engines can stall on your brand new CPU (once or twice, not often).  If this happens, hit the power button for a while, select Home screen and kill the game with your favorite task killer.  Rovio should release an update for the new dual cores to fix this.
  • Glow Hockey and Air Hockey: HD full screen air hockey games. The first one is very flashy, the second plain.  I think I prefer the first one.
  • Fruit Ninja: Just a lot of full screen fun on a tablet.
  • TurboFly 3D: More fast-paced 3D racing fun.
  • Robo Defense 2.0: This recently released version supports HD screens and tablets just fine.  Still a lot of fun, now with more extensions and upgrades.
  • Penguin Skiing: my son’s favorite, clone of the open source (Linux) variant.
  • Radiant: Excellent game with a retro look and feel. Think big pixels.  Originally purchased for my phone it works great full screen too. There is a HD variant as well but I didn’t try it yet.
  • Krazy Kart Racing: more high speed full screen 3D racing fun. Another favorite of Sam.
  • 3D Invaders: Even though it’s a beta it’s playable and fun.
  • Android Shogi: Great program with large opening book that is downloaded on request. Tactically less strong in the middle game but with a brutal mating engine.  I enjoyed playing Shogi again after all these years 🙂
  • PewPew: Tough multi-touch 2D game with a really cool (vector graphics) retro look.
  • Klondike Solitaire: In case you still have time left on that looong flight across the Atlantic.
  • Pinball Deluxe: See me, hear me, feel me. More full screen smooth graphics fun with a virtual pinball machine.
  • xkcd: On and off-line viewer of the well known geek comic.  Use this one, the others don’t support the 10″ screen.
  • Spaghetti Marshmallows: fun physics engine game. Again full screen high resolution support on this tablet.
  • Tank Hero: Nice fast-paced tank-busting game.
  • New! NVIDIA Tegra Zone: A new app in the market that lists Tegra2 optimized games and apps like Fruit Ninja THD and the upcoming Galaxy on Fire 2 THD awesomeness.

For all these games the same is true: don’t use the built-in G-Sensor.  Even though it’s possible, 10″ devices like this one or the iPad are simply too heavy for it.  After 2 minutes of pretending your screen is a steering wheel, the fun is over.  All the games listed above have a touch-steering mode which is actually a lot more fun. (If you think your G-Sensor is broken, unlock it with the switch at the top of your tablet :-))

The only app I can think of that I’m still missing is a nice video-chat application since the tablet does indeed have a front-facing webcam.  In the future I could then give a tablet to the remote family members (grand-parents and so on) so we can all video chat with them.  I know it’s technically possible now.

Finally, a word on the battery life of this thing.  As far as I can tell the battery life of this tablet is about the same as or better than the iPad’s.  Perhaps that is because I no longer have any background connections taking place all the time checking for eMail, calendar appointments or Google Talk/Skype connections.  In any case, I think it lasts around 5-6 hours tops if you are non-stop doing intensive gaming.  I haven’t tried to run HD movies yet but I’m sure the thing could last a movie or 2 easily.  For normal web-browsing and mostly stand-by usage I think the PoV Mobii Tegra lasts for days. (I never tried since I usually charge it overnight.)  Obviously, just as is the case with the iPad, turning off the wireless LAN actually makes a huge difference in battery life.

There you are.  I hope you liked this “little” review. You’ll be up and running on your new tablet in no time.  While the Android tablet market space is just opening up, it’s already really interesting to be using it.

Just try not to have too much fun!

Until next time,

Matt

P.S. Before anyone asks: yes you can access your Pentaho BI server with the built-in browser and yes, open flash charts work great.  Haven’t tried 3.7.1 and Analyzer yet but I will do that soon.

P.P.S. If you somehow managed to install experimental software (I’m guilty of trying everything) and you can’t use the touch screen anymore… simply hook up a standard USB (US) keyboard and use the cursor keys to navigate and re-calibrate the screen.  Keep pressing the “Back” button to unlock your screen.