Figuring out the layout of a delimited file

Hey Kettle friends,

Recently I was asked whether PDI would be able to automatically figure out, at runtime, what the layout of a delimited text file is, simply by looking at the header of the file and by scanning the file to determine the delimiter, the data types of the columns and so on.

Now obviously you could also do without a header row, but then the columns would have to be called something like “field1” to “fieldN”.

First, we want to see which delimiter, from a list of possible delimiters, is used in the file.  We do this by reading every line of the file in question and doing a character-by-character analysis of each line.  We’ll count the number of occurrences of all possible delimiters and then, at the end, pick the most frequently used one and set it as a variable.

Detecting the delimiter

The hardest part of this is doing the counting in JavaScript.  We’re using an associative array:

// A simple algorithm: count every ',', ';', '/' and tab, see which one is used most often
//
var delimiters = [ ',', ';', '/', '\t' ];
var counters = new Array();
var i, d;

// Count the occurrences of every candidate delimiter on this line...
for (i = 0; i < line.length; i++) {
  var c = line.charAt(i);
  for (d = 0; d < delimiters.length; d++) {
    if (c == delimiters[d]) {
      if (counters[c] == null) {
        counters[c] = 1;
      } else {
        counters[c]++;
      }
    }
  }
}

// ...then pick the candidate with the highest count.
var maxC = 0;
var max = 0;
for (i = 0; i < delimiters.length; i++) {
  if (counters[delimiters[i]] > max) {
    maxC = i;
    max = counters[delimiters[i]];
  }
}

var delimiter = "" + delimiters[maxC];
var count = counters[delimiters[maxC]];

So the above script populates an array of counters, which is then evaluated to see which delimiter wins.  Obviously it should be easy to modify the list of possible participants in this delimiter game.
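
The winning delimiter still has to end up in the ${DELIMITER} variable that a later script reads back with getVariable("DELIMITER", "").  That part isn’t shown above; a typical approach is to aggregate the per-line winners and then use a “Set Variables” step, but as a minimal JavaScript sketch on the final, single winning row (keeping in mind that a variable set in a transformation only becomes visible to subsequent transformations in the parent job, not to other steps of the same transformation):

// Sketch: promote the winning delimiter to a variable for the next transformations.
// "r" sets the variable at root job level; adjust the scope to your job layout.
setVariable("DELIMITER", delimiter, "r");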

Now that we found the delimiter of the file, we can go ahead and determine the field names from the header of the file:

Detect field names

Again, we’re using a bit of JavaScript in the step above:

// Split the header line on the detected delimiter and output one row per field name.
var fields = line.split(getVariable("DELIMITER", ""));
var i;
for (i = 0; i < fields.length; i++) {
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();
  newRow[rowIndex++] = fields[i];
  putRow(newRow);
}

var field = null;

As you can see, this script converts the header line into separate rows, one per field name.

Previewing the field names detection

OK, but this still doesn’t solve the complete puzzle, as we still need to figure out the hardest part: the data types, lengths, precisions and conversion masks of the various fields.  To do this we pass the column names we detected into another transformation:

metadata-detector-parser

So the “CSV file input” step is configured automatically by the parent transformation to read all fields as Strings.  Then another piece of JavaScript lets us tap into the same functionality that the “Get Fields” buttons of the “Text File Input” and “CSV file input” step dialogs use to automatically detect data types.  That work is done by a class called “StringEvaluator”, which we’ll re-use here:

// For every incoming field, feed the String value to that field's evaluator.
var i;
for (i = 0; i < getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);
}
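
One thing not shown above is where the evaluators list comes from.  It needs to be created once, before the first row is processed, for example in the step’s start script.  A minimal sketch, assuming one org.pentaho.di.core.util.StringEvaluator per incoming field (the boolean constructor argument, which asks the evaluator to also consider trimmed values, is an assumption on my part):

// Start script sketch: create one StringEvaluator per incoming field,
// re-used for every row of the file.
var evaluators = new java.util.ArrayList();
var i;
for (i = 0; i < getInputRowMeta().size(); i++) {
  // The 'true' argument (also evaluate trimmed values) is an assumption.
  evaluators.add(new org.pentaho.di.core.util.StringEvaluator(true));
}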

At the end of the step’s life, when the script ends, we have another block of code:

// write the results as new rows

var i;
for (i = 0; i < getInputRowMeta().size(); i++) {
  var stringValue = getInputRowMeta().getString(row, i);
  var evaluator = evaluators.get(i);
  evaluator.evaluateString(stringValue);

  // Ask the evaluator for its best guess and the corresponding conversion metadata.
  var evalResult = evaluator.getAdvicedResult();
  var valueMeta = evalResult.getConversionMeta();

  // Output one row per column: name, type, length, precision, mask and symbols.
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();

  newRow[rowIndex++] = getInputRowMeta().getValueMeta(i).getName();
  newRow[rowIndex++] = valueMeta.getTypeDesc();
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getLength());
  newRow[rowIndex++] = java.lang.Integer.toString(valueMeta.getPrecision());
  newRow[rowIndex++] = valueMeta.getConversionMask();
  newRow[rowIndex++] = valueMeta.getDecimalSymbol();
  newRow[rowIndex++] = valueMeta.getGroupingSymbol();

  putRow(newRow);
}

After some cleanup we finally have the data types and everything we need to properly read in the file.
You can find the complete example over here.  You can use this on Kettle version 4.4.0 or above.

The output of the example is an XML file called layout.xml.  Obviously you can store it in different formats and/or inject it into another transformation.  You could then, for example, completely normalize every column value and store it in a database table.  Not that I would recommend such a generic structure, but it’s possible.

Enjoy!

Matt

Pentaho 5.0 blends right in!

Dear Pentaho friends,

Ever since a number of projects joined forces under the Pentaho umbrella (over 7 years ago) we have been looking for ways to create more synergy across this complete software stack.  That is why today I’m exceptionally happy to be able to announce not just version 5.0 of Pentaho Data Integration, but also a new way to integrate Data Integration, Reporting, Analysis, Dashboarding and Data Mining through one single interface called Data Blending, available in Pentaho Business Analytics 5.0 (Commercial Edition).

Data Blending allows a data integration user to create a transformation capable of delivering data directly to our other Pentaho Business Analytics tools (and even non-Pentaho tools).  Traditionally data is delivered to these tools through a relational database.  However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can’t wait until the database tables are updated.  This for example leads to a new kind of big data architecture with many moving parts:

Evolving Big Data Architectures

From what we can see in use at major deployments with our customers, mixing Big Data, NoSQL and classical RDBMS technologies is more the rule than the exception.

So, how did we solve this puzzle?

The main problem we faced early on was that the default language used under the covers, in just about any user-facing business intelligence tool, is SQL.  At first glance it seems that the worlds of data integration and SQL are not compatible.  In DI we read from a multitude of data sources, such as databases, spreadsheets, NoSQL and Big Data sources, XML and JSON files, web services and much more.  However, SQL is a mini ETL environment in its own right, as it selects, filters, counts and aggregates data.  So we figured that it might be easiest to translate the SQL used by the various BI tools into Pentaho Data Integration transformations.  This way, Pentaho Data Integration does what it does best, directed not by manually designed transformations but by SQL.  This is at the heart of the Pentaho Data Blending solution.

The data blending service architecture

To ensure that the “automatic” part of the data chain doesn’t become an impossible-to-figure-out “black box”, we once more made good use of existing PDI technologies.  We log all executed queries on the Data Integration server (or Carte server), so you have a full view of all the work being done:

Data Blending Transparency

In addition to this, the statistics from the queries can be logged and viewed in the operations data mart, giving you insight into which data is queried and how often.

We sincerely hope that you like these new powerful options for Pentaho Business Analytics 5.0!

Enjoy!

Matt

– If you want to learn more about the new features in this 5.0 release, Pentaho is hosting a webinar and demonstration on September 24th – two options to register: EMEA & North America time zones.

– Also check out the exciting new capabilities we deliver today on top of MongoDB!

Matt Casters
Chief of Data Integration at Pentaho, Kettle project founder

Join the Pentaho Benelux Meetup in Antwerp!

Dear Benelux Pentaho fans,

On October 24th, Bart Maertens and I are organizing the revival of the Pentaho Benelux User Group with a meetup in Antwerp.  We’re doing this in the fantastic setting of the historic Antwerp Central Station building, smack in the center of the city.  Since this location is easy to reach by public transport and there’s ample parking space, we hope it will be within reach for everyone.

Antwerp Central Station

Here is the entrance:

PBUG on the map
Atrium entrance, room is upstairs

And the view from the balcony:

The view from above

The final agenda hasn’t been completed yet, but as a good user group we would like to focus on sharing Pentaho experiences (good or bad) among the attendees.  Because of that, this meetup will center on use cases and practical information, giving feedback to Pentaho (via a questionnaire) and so on.

Registration has opened today and is completely free for all.

Register now at: http://pbug13.eventbrite.com

We will update the schedule on the Eventbrite page as soon as more speakers are known.

We sincerely hope that you will like this PBUG initiative and that you find the time to join us.  To keep this event completely free of charge and accessible to all, Pentaho partner know.bi has agreed to pay for the room and Pentaho will pay for pizza and beer.

See you all in Antwerp!

Matt (& Bart)