Handling 500M rows

We’ve been running some tests with medium-sized data sets lately.  We extracted about half a year of data (514M rows) from a warehouse where we’re running a database partitioning and clustering test.
Below is an example where we copy over 500M rows from one database to another, partitioned one (MS SQL Server to MySQL 5.1), using the following transformation.  Instead of a single partitioned writer, we used three to speed up the process by lowering per-row latency.
[Screenshot: the partitioned copy transformation]
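The idea behind the transformation above can be sketched in a few lines. This is not Kettle itself, just a minimal illustration of the pattern: a reader fans rows out to several writer threads, each owning one partition, so writes proceed in parallel instead of through a single bottleneck. The partition count of 3 and the hash-by-key scheme are assumptions for the sketch.

```python
import queue
import threading

N_WRITERS = 3  # assumption: mirrors the three writers used in the post

def partition_for(row_key, n=N_WRITERS):
    """Pick a partition by hashing the row key (mod n)."""
    return row_key % n

def writer(q, sink):
    """Drain one queue into its partition's sink (stands in for a DB table)."""
    while True:
        row = q.get()
        if row is None:  # sentinel: reader is done
            break
        sink.append(row)

def partitioned_copy(rows, n=N_WRITERS):
    """Copy rows to n partitions via n parallel writer threads."""
    queues = [queue.Queue(maxsize=1000) for _ in range(n)]
    sinks = [[] for _ in range(n)]
    threads = [threading.Thread(target=writer, args=(queues[i], sinks[i]))
               for i in range(n)]
    for t in threads:
        t.start()
    for row in rows:  # the "reader" step: route each row to its partition
        queues[partition_for(row)].put(row)
    for q in queues:  # tell every writer there are no more rows
        q.put(None)
    for t in threads:
        t.join()
    return sinks

if __name__ == "__main__":
    print([sorted(s) for s in partitioned_copy(range(10))])
```

In a real copy each sink would be a batched `INSERT` into one MySQL partition; the routing and fan-out logic stays the same.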

Copying 500M rows is just as easy as copying a thousand, it just takes a little longer…

It would have completed the task a lot faster had we not been copying to a single table on DB4 at the same time (yes, another 500M rows), which throttled the transformation to DB4’s maximum write speed.  That said, if you still had any doubt about Pentaho Data Integration’s ability to copy large volumes of data, this post should clear it up.
I’m posting these examples to boost your interest in my afternoon talk at the MySQL conference in Santa Clara, where I’ll present query performance results on the partitioned database showing near-linear scalability.

Until then!


5 thoughts on “Handling 500M rows”

  1. Excuse me, I don’t want a link to download Kettle, I want a link to download your process (the .ktr, .kjb, .sql… all I need to test on my platform!)
