Getting ready for MySQL Santa Clara

Phew!  After a long trip (a 10-hour flight) I’m spending all the time left on preparations for my talks on Tuesday and Wednesday.

Hard work.  You know, like going to a baseball game, watching the San Francisco Giants beat the Arizona Diamondbacks with a nice home run by Barry Bonds.

Zito, the other Barry, pitched a really nice game to help win it 1-0.
Aside from all that fun, from a geek viewpoint, I think the video screens in the ballpark are simply awesome.

Until next time,


Handling 500M rows

We’ve been doing some tests with medium-sized data sets lately.  We extracted around half a year of data (514M rows) from a warehouse where we’re doing a database partitioning and clustering test.
Below is an example where we copy 500M+ rows from one database to another one that is partitioned (MS SQL Server to MySQL 5.1).  This is done using the following transformation.  Instead of using just one partitioned writer, we used 3 to speed up the process (this lowers latency).
Partitioned copy

Copying 500M rows is just as easy as copying a thousand; it just takes a little longer…

The transformation would have completed the task a lot faster if we hadn’t been copying to a single table on DB4 at the same time (yep, again 500M rows).  This slowed down the transformation to the maximum speed of DB4.  That being said, if you still had any doubt about Pentaho Data Integration being able to copy large volumes of data, this blog post should pretty much clear those doubts from your mind.
I’m posting these examples to boost your interest in my afternoon talk at the MySQL conference in Santa Clara.  I’m going to present some query performance results on the partitioned database, showing near-linear scalability.
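
For readers who want a feel for the idea behind the partitioned copy, here is a minimal Python sketch.  It is not how Pentaho Data Integration implements it; the function names, the modulo partitioning rule and the in-memory lists standing in for the target partitions are all assumptions made for illustration.  Each row is routed to one of three writers by its key, and each writer drains its own queue in parallel.

```python
import queue
import threading

NUM_WRITERS = 3   # assumption: mirrors the three partitioned writers in the post
STOP = object()   # marker telling a writer that no more rows will arrive


def partition_for(key, num_partitions):
    """Pick a partition for a row key (hypothetical rule: simple modulo)."""
    return key % num_partitions


def writer(in_queue, target):
    """Drain one queue into one target partition (a list stands in for a table)."""
    while True:
        row = in_queue.get()
        if row is STOP:
            break
        target.append(row)


def partitioned_copy(rows, num_writers=NUM_WRITERS):
    """Route each (key, payload) row to the writer that owns its partition."""
    queues = [queue.Queue(maxsize=1000) for _ in range(num_writers)]
    targets = [[] for _ in range(num_writers)]
    threads = [
        threading.Thread(target=writer, args=(q, t))
        for q, t in zip(queues, targets)
    ]
    for thread in threads:
        thread.start()
    for row in rows:
        queues[partition_for(row[0], num_writers)].put(row)
    for q in queues:
        q.put(STOP)
    for thread in threads:
        thread.join()
    return targets


if __name__ == "__main__":
    source = [(i, "row-%d" % i) for i in range(10)]
    for number, partition in enumerate(partitioned_copy(source)):
        print(number, [key for key, _ in partition])
```

The bounded queues keep the reader from running far ahead of a slow writer, which is the same effect described above: the whole copy runs at the pace of the slowest target.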

Until then!


Mind the gap

When you have been part of the ICT BI community for a long time, you tend to be aware of the problems that are still waiting to be solved in a lot of companies.

The article at The Register titled “Too many users fending for themselves on BI” again makes it painfully clear that we (the ICT BI guys) still have a long way to go before we finally bridge the gap between ICT and the Business User.

In the past, when I was still a well-dressed BI consultant, I was many times in situations where an ICT manager would hire me to save him from the “angry users mob”.  A lot of times it’s exactly as stated in the article: users are asking and fending for themselves, and at the same time the ICT department is doing its best to keep all this under control by denying as many requests as possible.
At this point, FWIW, I would also like to offer my view on a few of the causes of this problem:

  • ICT is not doing enough in terms of requirements analysis and is not executing fast enough to keep these requirements relevant.
  • After all these years, after sooooo many Kimball modelling success stories, no slowly changing dimensions and stars are being built.  There is always one hot-shot in every organisation that will claim they “need it to be 3rd normal form”.  In all (100% and at least 10) of the emergency “build it now or we’re dead” cases that I have experience with, there was no real data warehouse in place.
  • BI Software in general is too expensive to acquire and maintain.  OK, I should have said “was too expensive to maintain” here :-)  As a result companies expect a lot, but it’s just software, not a solution.  You can’t buy a warehouse.  On top of that, most of the time, having users create reports for themselves only works for the very simple cases.  Self-service BI is for the most part a myth created by the established BI vendors to sell more software.
  • Companies are not listening when you say (see the article!) that they need liaison people to communicate.  It reduces the gap, it’s good, it works, OK?  It doesn’t finish when a solution is in place.  Corporations change, people change, views change… BI requirements change and most certainly ICT changes very rapidly.  It only makes sense to do some coordination here!

I was glad to see that my views on the situation are reflected, but at the same time I was somewhat saddened by the fact that the more things change, the more they stay the same…  As the lead developer of an ETL tool, I have to acknowledge that not everything can be handled by technology.  This is probably the main cause of frustration for lots of ICT people out there in the field right now 🙂
Until next time!