Reading from MongoDB

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCursor;

private Mongo m;
private DB db;
private DBCollection coll;

private int outputRowSize = 0;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
	if (first) {
		first = false;
		outputRowSize = data.outputRowMeta.size();
	}

	// An unfiltered find() returns a cursor over every document in the collection.
	DBCursor cur = coll.find();

	while (cur.hasNext() && !isStopped()) {
		// Serialize each document to a JSON string and store it in the first output field.
		String json = cur.next().toString();
		Object[] row = createOutputRow(new Object[0], outputRowSize);
		int index = 0;
		row[index++] = json;

		// putRow will send the row on to the default output hop.
		//
		putRow(data.outputRowMeta, row);
	}

	setOutputDone();

	// Returning false signals that this step has no more rows to process.
	return false;
}

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
	try {
		// Change the host, port, database and collection names to match your setup.
		m = new Mongo("127.0.0.1", 27017);
		db = m.getDB("test");
		coll = db.getCollection("testCollection");

		return parent.initImpl(stepMetaInterface, stepDataInterface);
	} catch (Exception e) {
		logError("Error connecting to MongoDB: ", e);
		return false;
	}
}

You can simply paste this code into a new UDJC step dialog. Change the host, port, database and collection names in the init() method to suit your needs. This code reads all the data from a collection in a Mongo database.  The output of this step is a set of rows, each containing one JSON string, so make sure to specify one String field as the output of your step.  These JSON structures can be parsed with the new “JSON Input” step and then you can do whatever you want with them.
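To make the shape of the output a bit more concrete, here is a minimal stand-alone sketch of the same row-building loop in plain Java. It is not part of the UDJC code: the class name JsonRowSketch and the method toRows are made up for illustration, and a plain Iterator of strings stands in for the MongoDB cursor so the sketch runs without the driver or PDI on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: mimics how the step turns documents into rows.
class JsonRowSketch {

    // Builds one row per JSON document. Each row is a fixed-size Object[]
    // with the JSON string in the first slot, like the single String field
    // defined as output of the step.
    static List<Object[]> toRows(Iterator<String> jsonDocs, int outputRowSize) {
        List<Object[]> rows = new ArrayList<>();
        while (jsonDocs.hasNext()) {
            Object[] row = new Object[outputRowSize];
            int index = 0;
            row[index++] = jsonDocs.next();
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for the documents a cursor would return (sample data, not real output).
        Iterator<String> docs = Arrays.asList(
                "{ \"name\" : \"a\" }",
                "{ \"name\" : \"b\" }").iterator();

        for (Object[] row : toRows(docs, 1)) {
            System.out.println(row[0]);
        }
    }
}
```

With an output row size of 1, every row carries exactly one value, which is why the step needs exactly one String field defined as output.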

Please let us know what you think of this and whether or not you would like to see support for writing to MongoDB and/or dedicated steps for it.  I’m sorry to say I have no idea of the popularity of these new NoSQL databases.

Until next time,

Matt

UPDATE: The functionality described in this UDJC code is available in a new “MongoDB Input” step in 4.2.0-M1 or later.

UPDATE2: We also added authentication for MongoDB in PDI-6137.

P.S. To install and run MongoDB on your Ubuntu 10.10 machine, do this:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
sudo apt-get update
sudo apt-get install mongodb

20 thoughts on “Reading from MongoDB”

  1. Pingback: ehcache.net
  2. Hi Matt

    Thanks for the post. We are planning to move from MySQL to MongoDB very soon. It will be great if PDI can provide full support for MongoDB in Spoon. We need that badly.

    Thanks!

  3. This is great news. I am just starting a project which will require pulling data from MongoDB into a central data mart, and I was starting to dread asking for java resources to help build a custom plugin. I’m looking forward to the new steps you plan to create.

    Thanks,
    Kaushal

  4. Hi Matt,

    Firstly can I say you are an absolute star for working on the Mongo stuff!

    Couldn’t be better timing for us. Any update on the new steps? Would be ideal if they are ready soon.

    Cheers,
    Greg

  5. Hi Matt,

    Thanks for adding a MongoDB step! However, I can’t seem to find where to download the 4.2+ version. Will you point me in the right direction?

    Thanks,
    Shannon

  6. Hi Matt,

    Could we get a Pentaho component like the mongodb one that reads from couchDB? Being that they are similar shouldn’t take too much effort to develop? There will be tremendous call for this as couchDB becomes widely use…

    Regards,
    David

  7. There’s no “MongoDB Output” step but if MongoDB has an API to do bulk loading I’m willing to write it. In that case please create a feature request (http://jira.pentaho.com) with details on how you would like to see this step work.

    If you need it faster you could use the “User Defined Java Class” step to drive the MongoDB Java API.

    Thanks in advance,

    Matt
