Loading BigQuery data into Neo4j

Hi Kettle & Neo4j fans,

One of my colleagues asked how to read from Google BigQuery using Kettle and then do load the data into Neo4j. The BigQuery connection type is listed so it should be fairly straightforward but it’s worth mentioning how to do it here…

First thing you need to do is download the JDBC driver since it’s not included in your Kettle download. Go to the Simba JDBC driver downlaod website and get it under the JDBC 4.2-compatible label.

From zip files you copy all the libraries (.jar files) to your Kettle distribution under the lib/ folder… except the commons-* and jackson-* files as these are already present.

Then start up Spoon and create a new Database Connection. For the hostname and port you specify https://www.googleapis.com/bigquery/v2 and 443 respectively. The project ID is simply the ID of your GCP project. You can see it in the URL if you go to your BQ project.

Authentication

Authentication is something which is set up separately and can be done in various way. In my case I’m pointing to a JSON file with a key:

GOOGLE_APPLICATION_CREDENTIALS=/home/matt/parking/google-key.json

So for the connection I’m setting OAuthType option to 3 as detailed in the driver documentation. Please refer to the documentation and authentication procedures at GCP for more details.

Timeout

Straight away I hit the problem that the default timeout is 10 seconds and you often need a bit more if you’re hitting larger data sets.

So in my case I also had to set a Timeout option. I set it to 60 seconds. So the options tab in my Database Connection looks like this:

Querying BigQuery

Now we’re finally ready to use our connection to get data from BigQuery. Since BQ has a particular way of specifying data sets and tables make sure to test your queries in the BQ console and then simply copy it into a Table Input step:

Loading into Neo4j

This is the easy part. We can simply use the Neo4j steps to do this. In our case if we simply want to create customer nodes with their properties we can use a Neo4j Output step. Use a static Customers label and the helper button to grab the properties from the input.

That’s all there is to it. Enjoy BigQuery and Neo4j!

Cheers,
Matt

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.