What’s new in 4.2.0

Dear Kettle fans,

Instead of pointing you to the impressive list of changes in JIRA, I took the time to put together a high-level overview of the big-ticket items coming in version 4.2 of Kettle (Pentaho Data Integration). Allow me to share it with you:

  • The Excel Writer step offers advanced Excel output functionality to control the look and feel of your spreadsheets.
  • Graphical performance and progress feedback for transformations
  • The Google Analytics step downloads statistics from your Google Analytics account.
  • The Pentaho Reporting Output step makes it possible for you to run your (parameterized) Pentaho reports in a transformation. It allows for easy report bursting of personalized reports.
  • The Automatic Documentation step generates (simple) documentation of your transformations and jobs using the Pentaho Reporting API.
  • The Get repository names step retrieves job and transformation information from your repositories.
  • The LDAP Writer step
  • The Ingres VectorWise (streaming) bulk loader step
  • The Greenplum (streaming) bulk loader step (for gpload)
  • The Talend Job Execution job entry
  • Health Level 7 (HL7): HL7 Input step, HL7 MLLP Input and HL7 MLLP Acknowledge job entries
  • The PGP File Encryption, Decryption & validation job entries facilitate encryption and decryption of files using PGP.
  • The Single Threader step for parallel performance tuning of large transformations
  • Allow a job to be started at a job entry of your choice (continue after fixing an error)
  • The MongoDB Input step (including authentication)
  • The ElasticSearch bulk loader
  • The XML Input Stream (StAX) step reads huge XML files with optimal performance and flat memory usage by flattening the structure of the data.
  • The Get ID from Slave Server step allows multi-host or clustered transformations to get globally unique integer IDs from a slave server: http://wiki.pentaho.com/display/EAI/Get+ID+from+Slave+Server
  • Carte improvements:
    1. reserve next value range from a slave sequence service
    2. allow parallel (simultaneous) runs of clustered transformations
    3. list (reserved and free) socket reservations service
    4. new options in XML for configuring slave sequences
    5. allow time-out of stale objects using environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
  • Memory tuning of the logging back-end with KETTLE_MAX_LOGGING_REGISTRY_SIZE, KETTLE_MAX_JOB_ENTRIES_LOGGED and KETTLE_MAX_JOB_TRACKER_SIZE, allowing flat memory usage for never-ending ETL in general and jobs specifically.
  • Repository Import/Export
    1. Export at the repository folder level
    2. Export and Import with optional rule-based validations
    3. The Import command line utility allows for (optional) rule-based import of lists of transformations, jobs and repository export files: http://wiki.pentaho.com/display/EAI/Import+User+Documentation
  • ETL Metadata Injection:
    1. Retrieval of rows of data from a step to the “metadata injection” step
    2. Support for injection into the “Excel Input” step
    3. Support for injection into the “Row normaliser” step
    4. Support for injection into the “Row Denormaliser” step
  • The Multiway Merge Join step (experimental) allows any number of data sources to be joined on one or more keys using an inner or a full outer join algorithm.
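
To give a feel for the range reservation idea behind the Get ID from Slave Server step and the Carte slave sequence service listed above, here is a minimal in-process sketch in Python. It is not Kettle's implementation: the class names, method names and range size are made up for illustration, and in PDI the reservation would be a request to a Carte slave server rather than a local method call.

```python
import threading


class SequenceService:
    """Hypothetical stand-in for a slave sequence service."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def reserve_range(self, size):
        # Hand out [first, last] and advance the counter atomically,
        # so two callers can never receive overlapping ranges.
        with self._lock:
            first = self._next
            self._next += size
        return first, first + size - 1


class IdClient:
    """Hypothetical client: serves IDs locally from a reserved range."""

    def __init__(self, service, range_size=1000):
        self._service = service
        self._range_size = range_size
        self._current = 0
        self._last = -1  # empty range forces a reservation on first use

    def next_id(self):
        # Only contact the service when the local range is used up.
        if self._current > self._last:
            self._current, self._last = self._service.reserve_range(
                self._range_size
            )
        value = self._current
        self._current += 1
        return value


service = SequenceService()
host_a = IdClient(service, range_size=3)
host_b = IdClient(service, range_size=3)

ids_a = [host_a.next_id() for _ in range(4)]
ids_b = [host_b.next_id() for _ in range(4)]
```

Each client only goes back to the service once its range is exhausted, so a larger range means fewer round trips at the cost of gaps in the numbering when a client stops before using up its range.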

Beyond this list there is, as mentioned, a long list of bug fixes and small improvements to the various steps and job entries. It’s impossible to thank everyone in the community individually for all the contributions they’ve made to make this release a smashing success. If you think it feels more like a 5.0 version, please remember that we’re pretty conservative about version numbering. As long as we don’t break our own Java API, we won’t go to another major version.

Also remember that you can try out all these new features right now by using a CI build, or with the RC1 build once it is posted on SourceForge. Please help our QA team by posting any issues you find in JIRA.

Last but certainly not least let’s not forget to mention the upcoming exciting features of the new Pentaho BI Server version 4.  I won’t spoil the surprise for you but I can tell you that certain things in that new release are looking really (really!) nice.  Next Thursday (Europe – 13:00 GMT/UTC, 9:00am EST, Americas – 1:00pm EST, 10:00am PST) you can join us for a web conference with live demo.  Please register here if you are interested.

Have fun with the new Pentaho software releases!


5 thoughts on “What’s new in 4.2.0”

  1. Hi Matt!
    Congratulations on the new PDI 4.2. I’m testing out the RC1 version and it looks very nice. Does the auto-documentation step work with Kettle DB repositories and Enterprise repositories? (I remember the plugin only worked with file-based repos.)

  2. Hi Sebastian,

    The auto-doc step should also work with DB and EE repositories.
    We’re tracking a couple of issues with that support but those should be fixed this week. The general idea is that it works for all sorts of PDI ETL metadata.

