Good or bad looks: does it matter?

Computers in general and software in particular have always been a hobby of mine, almost a passion. I’ve been playing with them since the days of the Sinclair ZX-81 “computer”. I (ab)used a Commodore 64 until it literally fell apart and worked on exotic old machines such as those with 2 MB of RAM and an 80286.

Those were the days when GUIs made their entrance. It was way back when I was still studying (…hard to become a “beer expert”) and when animals could still talk.

It was a time when my friends and I looked down on software that was too shiny and/or had too much “bling”. In those days, because CPU power was very limited, that was pretty much the same thing as bloated software.

I’m explaining this background to you so that you understand why the Kettle GUI, called Spoon, is as minimalistic as possible. If you don’t see 10 different views and 50+ widgets in there, it’s because I don’t want that in my software. All too often when working in IDEs like Eclipse, I find myself closing as many of the views I don’t need as possible. I like my software clean, with a nice overview. Information overload is not a good thing, even though it can look impressive at times.

This is what version 2.4.0 looked like:

However, lately, while building the 2.5.0 version, we got a few questions from the community to add some graphics and some color, just to make Spoon look “more professional”. It all started because I replaced the core objects tree with a new ExpandBar SWT widget. Some folks in the community really like the extra images, others really hate them, and some probably couldn’t care less.
What about you? Which version do you like best? And does it really matter?

Without extra graphics:

With:

Until next time,
Matt

P.S. Just to make this clear, I added an option to turn off the graphics because I think that after 5 minutes they seriously get in the way 😉

Meet me at MySQL Santa Clara

Dear Kettle fan,

Next month, the MySQL Conference & Expo 2007 takes place in Santa Clara. I was invited to do 2 talks over there:

For a complete overview of all the great sessions that will take place, go here.

Feel free to join us for a Birds of a Feather session on Tuesday evening, right after my talk. Joining me there are Julian Hyde (Mondrian founder & lead developer), who will also be doing a presentation, and BI star Nicholas Goodman himself (recently resurrected from the dead after a long vacation), who is doing Pentaho training in the San Francisco area. It’s going to be informal & fun, and I hereby promise that I will try to smuggle some Belgian beer into the country especially for this BoF event.
I’ll be in the SFO area a few days longer (21st-28th), so feel free to contact me to grab a beer and/or chat. It’s my first time in California and I know very little about the place (except the Governator, of course), so feel free to show me around if you’re from the area.
Until next time,

Matt

Good old file handling

In a heavily webbed, automated, interconnected world with most data stored in relational databases, we can sometimes forget that there are indeed many situations where you simply want to FTP a file from one place to another.

That process in itself holds many dangers, as I pointed out to someone on the forum today.  Let me recap that post here on the blog…

Suppose your files are coming in via FTP to a local directory.

A file is being written, let’s call it FILE_20070328.txt.
You don’t know the size of that file in advance. Let’s say it’s 10MB and takes 30 seconds to FTP.
In your transformation you detect this file and start working. Chances are very high that you’ll be reading an incomplete file.  (See also this technical tip on variables and file handling)

There are 2 ways to solve this problem:

  1. You write to FILE_20070328.txt.tmp and rename the file when the FTP transfer is done.  A rename is atomic on the filesystem and is therefore safe to use.
  2. You write all files you need to transfer and then FTP a small file, called a trigger file or sentinel file.
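
To make the first option a bit more concrete, here is a minimal sketch (in Java) of the rename trick on the sending side. The directory and file names are just examples:

```java
import java.io.File;

public class AtomicRenameExample {

    public static void main(String[] args) {
        // The sender writes to a temporary name first...
        File tmp  = new File("/incoming/FILE_20070328.txt.tmp");
        File done = new File("/incoming/FILE_20070328.txt");

        // ...and only renames it when the upload is finished.  Because a
        // rename on the same filesystem is atomic, a reader that picks up
        // FILE_20070328.txt never sees a half-written file.
        if (!tmp.renameTo(done)) {
            System.err.println("Rename failed, file left as " + tmp.getPath());
        }
    }
}
```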

The first option creates a problem on a different level though. Suppose you have a data warehouse to update and you expect 10 files from a remote system (CUSTOMER_20070328.txt, SALES_20070328.txt, etc.)

In that case you also have to count the number of available files, evaluate complex wildcards, etc., just to make sure you got all the files. When you also need to handle half-complete FTP attempts, partial files and so on, it becomes messy pretty fast.

The trigger/sentinel option is by far the easier and more elegant one to use, and that is why we included that option in Pentaho Data Integration.
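
On the receiving side, waiting for the trigger file can be as simple as the little Java sketch below. The file name is made up; the point is that you only start processing once the trigger has arrived:

```java
import java.io.File;

public class WaitForTriggerFile {

    public static void main(String[] args) throws InterruptedException {
        // Made-up name: the sender FTPs all the data files first and only
        // then sends a tiny trigger/sentinel file to signal completeness.
        File trigger = new File("/incoming/LOAD_20070328.trg");

        // Poll until the trigger file shows up, then start processing the
        // data files; at that point we know they are all complete.
        while (!trigger.exists()) {
            Thread.sleep(5000); // check every 5 seconds
        }
        System.out.println("Trigger file found: all files for 20070328 arrived.");
    }
}
```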

Obviously, things become more complex if you have multiple systems writing files to your input directory all of the time.  In that case, try to remember to keep things as simple as possible.  Here is some extra advice for those poor souls who need to deal with such situations:

  • create different directories per source system (if you can)
  • consider generating text files where the header row is the same as the footer row. Then evaluate the completeness of a file by comparing its first and last row (head/tail -1 file.txt); a small sketch of this check follows the list.
  • use locks to indicate that files are being processed (touch /tmp/locks/kettle_running.lock, rm /tmp/locks/kettle_running.lock)
  • move files that are being processed to a different directory
  • archive files that were successfully processed to a different directory
  • don’t ever delete files until your disks start to run full; give yourself a break: that way your warehouse can be reloaded if you make a mistake.
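
The header/footer comparison from the second tip could look something like this in Java (the file name is just an example):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FirstLastRowCheck {

    // Returns true when the first and last line of the file are identical,
    // which we take as a sign that the sender finished writing the file.
    public static boolean isComplete(String fileName) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(fileName));
        try {
            String first = reader.readLine();
            String last  = first;
            String line;
            while ((line = reader.readLine()) != null) {
                last = line;
            }
            return first != null && first.equals(last);
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isComplete("/incoming/CUSTOMER_20070328.txt"));
    }
}
```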

If you have other tips, just shoot!  Also feel free to add feature requests if you have any idea for a job entry to help out with your file handling problems.
Until next time,

Matt

MySQL Bulk export to file

A few days ago I had some good news on the new MySQL Bulk loader for which we added support in Kettle.

Today, French i18n/translation hero Samatar checked in the code for the exact opposite, built on the “SELECT … INTO OUTFILE …” statement that MySQL supports.

Bulk load to file

As you can see, this job entry allows you to export the content of a MySQL database table to a flat file. Again, this is done completely by MySQL and therefore works at optimal speed (really fast!)
We added all kinds of helpers in the GUI so that you can easily select the table and the columns to export. All you need to do is give it a filename and a separator and off you go! Not only that, you can use variables to specify almost all parameters of the job entry.
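
For those who have never seen it at work, here is a rough sketch of the kind of statement that ends up being executed. The connection settings, table, columns, file name and separator below are made-up examples; in the job entry you simply pick them in the GUI:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkExportToFile {

    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
        Statement stmt = con.createStatement();

        // MySQL writes the file itself (on the database server), which is
        // why this runs at optimal speed: no rows pass through the client.
        stmt.execute(
            "SELECT customer_id, name, city" +
            "  INTO OUTFILE '/tmp/CUSTOMER_20070328.txt'" +
            "  FIELDS TERMINATED BY ';'" +
            "  LINES TERMINATED BY '\\n'" +
            "  FROM customer");

        stmt.close();
        con.close();
    }
}
```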

In short: another great option for those of us that work with MySQL and are in need of some extra firepower.

Until next time!

Matt

A nice chat

Earlier today I had a nice IM chat with someone.  He or she is referred to below as Question and I’m Answer.  There were interesting questions, and perhaps others will find the answers interesting as well.  It seemed a shame to let the information in the chat log go to waste, so I’m posting it here on my blog.
Question: I have a question for you about the possibility of creating custom transformations.
Answer: sure
Question: my company already has quite a bit of business logic that is coded in C and/or C++ and this logic then calls some CORBA services.  Would it be possible for us to integrate that logic into Kettle?
Answer: Not directly.  However, it’s not that hard to write wrappers in JNI (Java Native Interface) and create a plugin for it.
Question: That was my idea.  I was just wondering if there were any other ways
Answer: What you need is some .dll or .so shared library to talk to and then you can go from there.
Answer: If these are standalone applications and you can talk to them over sockets, http, what not, that is possible as well.
Question: you’ve given me some good ideas.
Answer: In any case, you always need to write some java code.  I would go with JNI as it’s the easiest to do.
Question: i agree
Answer: Then you can even call those wrapped methods from Javascript.
Question: i totally agree
Question: when i was asked if it was possible, my first answer was JNI, and I was told to look for some alternatives (if there were any)
Answer: Well like i said, it highly depends.  If the only way for the application to communicate with the outside world is through C/C++ API, then you need JNI.
Question: I’ve used JNI to call C methods before.  It is pretty simple.
Answer: yeah
Question: I have some generic transformation questions for you, if you don’t mind my asking.
Answer: ok
Question: is there any limit on the number of steps and hops in a transformation?
Answer: no, I’ve had reports of (crazy!) people that put 400 steps in a single transformation and they claimed it worked fine 😉
Question: lol, that is impressive.  honestly, i can’t think of 400 things i would want to do to Data in a single transformation
Answer: yeah, I had some screenshots a while back, can’t find them anymore.
Answer: That’s right, it’s bad practice to say the least 🙂
Question: i see too much room for error (human error, not kettle) in 400 steps
Answer: Yes, it’s better to separate the logic a bit.  We even have mappings (subtransformation) to solve that these days.  Then again…  Oh well.
Answer: I guess they thought it was cool or something.
Question: the data that comes out could look nothing like the data that went in
Answer: The thing was one giant case switch: 30 filters in a row all splitting off to different functionality.
Question: I see that kettle handles flat file and Database connections.  What is the largest flat file that you have used as input for a transformation?
Answer: Hard to say, probably a couple of hundred MB or so.
Question: what if I wanted to run a GB file?
Answer: Hey, we only read one line at a time, it will be OK 😉
Question: 1 line at a time, do you mean 1 record at a time?
Answer: Yes.
Question: does that 1 record pass all the way through the transformation before the next record is read?
Answer: no, all steps work as separate threads at the same time in parallel.
Question: ok
Answer: The text file input step just puts rows of data (records if you like) in a buffer.  The next step(s) read from that.
Question: ok.  So the next step reads rows from a buffer.  Does it read them 1 at a time from the buffer or does it grab several rows at a time?
Answer: It depends on the step, but usually it’s one row at a time. (that’s how I prefer it in any case to ensure data isolation)
Question: you say it depends, does that mean that it is possible to configure steps to read more than 1 at a time, or that some steps are defined to read more than 1 at a time?
Answer: It is possible for a step to own more than one row at a time, yes.  For example, the sort step reads all available rows before it starts to write to the next step.  However, the API provides a method to read a single row.
Question: let’s say i have a reading step, some other step, and a writing step.  The reading step reads a row and places it in the buffer.  How does the next step know that there is a record in the buffer?
Answer: It blocks when the input buffer is empty or the output buffer is full.  A row is returned to the step when there is one available.
Answer: This is all synchronized of course.
Question: that’s good
Question: so the reading step puts rows into the buffer.  When that buffer is full, the next step begins to read records from the buffer.  is that correct?
Answer: No, not really, even if the buffer is half full (or half empty if you are a pessimist) the next step(s) read from it.
Question: ah
Answer: All steps work completely independently from each other; there is no interaction or communication between the steps at all.  That is very important.
Question: does that mean that the step is constantly checking the buffer for rows?
Answer: Actually, yes it is.  Every couple of nanoseconds it sees if there is anything in there.  The sleep time between checks does become larger if the buffer stays empty for longer periods of time.
Question: ok
Question: that is the information i was looking for.
Answer: It’s very lightweight though.  It’s just checking the size of a list and whether that list has been flagged as “done” by the previous step.
Question: that would have been my next question.  When the reader reads the last row, how does it tell the next steps that there are no more records?  You say it flags the buffer as “done”?
Answer: yeah, just a boolean: class RowSet (setDone())
Question: i love this architecture.  The more i hear about kettle, the more i like it.
Answer: Listen, can I post this conversation on the web somewhere?  I’m sure other people would love this information as well.  They were good questions.
Question: yeah, that is fine with me
Answer: I’ll clean it up, remove the names of the involved parties 😉
Question: i would appreciate that
Answer: np
Answer: The thing about the architecture is that it’s kept extremely simple.  It works by letting go of control.
Answer: That, however, is also the biggest drawback.
Answer: It’s the reason why we will go to a different architecture in version 3.0.

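To make the buffer mechanism from the chat a bit more concrete, here is a heavily simplified sketch of the idea: a list of rows plus a “done” flag, polled by the reading step. This is not the actual Kettle RowSet code, just an illustration:

```java
import java.util.LinkedList;
import java.util.List;

// A heavily simplified row buffer between two steps: a list of rows plus a
// "done" flag.  The real Kettle RowSet does more (buffer sizing, metadata);
// this only illustrates the mechanism discussed in the chat.
public class SimpleRowSet {

    private final List<Object[]> buffer = new LinkedList<Object[]>();
    private boolean done = false;

    // The writing step adds rows as it produces them.
    public synchronized void putRow(Object[] row) {
        buffer.add(row);
    }

    // The writing step flags the set as done after its last row.
    public synchronized void setDone() {
        done = true;
    }

    private synchronized Object[] pollRow() {
        return buffer.isEmpty() ? null : buffer.remove(0);
    }

    private synchronized boolean isDone() {
        return done;
    }

    // The reading step keeps checking the buffer.  It only returns null
    // when the previous step is done and there is nothing left to read.
    public Object[] getRow() throws InterruptedException {
        while (true) {
            Object[] row = pollRow();
            if (row != null) {
                return row;
            }
            if (isDone()) {
                return pollRow(); // last check: a row may have just arrived
            }
            Thread.sleep(1); // back off a little while the buffer is empty
        }
    }
}
```
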
I hope you found this informative or interesting.  If so, let me know and I’ll post more on ETL internals in the future.

Until next time,
Matt

MySQL bulk load

Pretty Sick Slick

Last week I was under the weather, and a year ago that would have meant that development of Pentaho Data Integration (PDI) would pretty much stop. These days I’m happy to say that this is absolutely not true anymore. In fact, hundreds of commits were done in the last week.

MySQL bulk load

To pick one example, Samatar Hassan added a new job entry that allows you to configure a MySQL bulk load:

MySQL Bulk load

This job entry loads data as fast as possible into a MySQL database by using the LOAD DATA SQL command. It’s not as flexible as the Text File Input step, but it sure is fast. In certain cases, it might actually be up to ten times as fast. In short: another great job by Samatar!
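
Under the hood, the statement that gets executed looks roughly like the one below. The connection settings, table and file name are made-up examples; the job entry assembles the statement from what you configure in the GUI:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoadFromFile {

    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password");
        Statement stmt = con.createStatement();

        // MySQL reads and parses the file itself, which is what makes this
        // so much faster than pushing the rows through a client one by one.
        stmt.execute(
            "LOAD DATA LOCAL INFILE '/tmp/SALES_20070328.txt'" +
            "  INTO TABLE sales" +
            "  FIELDS TERMINATED BY ';'" +
            "  LINES TERMINATED BY '\\n'" +
            "  IGNORE 1 LINES");

        stmt.close();
        con.close();
    }
}
```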

I’m being told that Samatar is also writing a bulk loader for Microsoft SQL Server and that Sven is working on an Oracle SQL*Loader wrapper.

Wait, there’s more…

In addition to that, I saw the following job entries appear in the last couple of weeks: File Compare, SFTP Put, Display Messagebox Info, Wait for, Zip File and last but not least: XSLT Transformation. We also added the Formula and Abort steps. I’ll get back to you on the Formula step later as it’s an interesting option, although far from complete.

Evil Voices

Evil voices among my readership might suggest that I should get sick a bit more often. However, because of the highly modular nature of PDI, it is perfectly possible to develop code in parallel in a safe way. I can assure you all that it is not that I’m now forced to allow other developers to contribute. Everyone who has a great idea and wants to donate code to the PDI project is welcome to do so at any time. The latest avalanche of code is just more proof that open source works and that by opening up as a project you gain a lot in the long run.

Today there are around 48 people that have write access to the Subversion code repository, and around 5-15 people commit code in any given month.

Release management

That is all great, but it does make release management a bit more difficult. I think that we should probably take into account a 2 to 3 week delay to get all the new stuff translated, documented and tested a bit more. Of course, you can help out with that as well. Or you can just let us know how you feel about all these new developments.

Another small problem is that by adding all these new features it’s almost ridiculous to do a (2.4.1) point release now.
Until next time,

Matt