From Rags to Riches with “Big Data”

Who is the One that sits on the Throne and decides when just “data” becomes “Big Data”?  It seems like there’s a new “Big Data expert” around every corner, even though no one agrees on what Big Data actually is.

As fun as it is to sit down with a glass of bourbon and contemplate the ambiguous topic with fellow data geeks, I myself just sit down and start powering through whatever amount of data gets thrown at me.  I’ve worked with data sets of varying size, from 1 GB up to the 11+ GB in my current project.  Regardless of the size, complexity, or nature of the data, I spin up my new faithful, Keboola Connection, and let it take my limited skill set and process it all into something meaningful.

It really is a rags-to-riches story.  I have no qualifications as a data scientist, as I’m sure Statistics and Comp Sci majors and PhDs would unanimously agree; I’d argue I do the same thing, just much less glamorously.  The rags part of the story: here I am, a business major who would have flunked his Access course had it not been for my partner, who did the entire class project for me.  I’ve hacked together some SQL knowledge through Google searches and started playing with my client data.  The riches part of the story is the insights and added value I’ve created from the raw data.  The “transformation”, if you will, is Keboola Connection.  Now, I won’t give you a BS spiel that it’s some tool you shove data into and it spits out data dollars.  But when you’re building transformations to get meaningful insights, it’s a heck of a lot easier when the platform comes with the tools and raw power to do the job.
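To give you a sense of what those hacked-together transformations look like, here’s a minimal sketch of the kind of SQL step I’d write in Keboola Connection running on the Redshift backend; the table and column names are made up for illustration, not from my actual project:

    -- Hypothetical example: roll raw order lines up to one row per customer per day
    CREATE TABLE out_daily_customer_revenue AS
    SELECT
        customer_id,
        DATE_TRUNC('day', order_timestamp) AS order_day,
        COUNT(*)                           AS order_lines,
        SUM(line_amount)                   AS revenue
    FROM in_raw_order_lines
    GROUP BY customer_id, DATE_TRUNC('day', order_timestamp);

Nothing fancy, just the kind of aggregate a Google-taught SQL hack can write, which is exactly the point.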

Above is a screenshot of the current project I’m working on, totalling about 30M rows of raw data (the rest is meaningful output).  Below is a screenshot of a full run of the chain of transformations, which finishes in 15 minutes.  Let me say it again: I have no Comp Sci background.  The queries here are largely inefficient, and I’ve probably added about 50% more steps than I actually needed.  But the value of efficiency dwindles when we’re talking about a difference of just minutes while processing 30M rows.  Who cares?  It only needs to be updated once per day!
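For a rough sense of scale: 30M rows in 15 minutes works out to about 2 million rows a minute, or roughly 33,000 rows a second, so even my sloppy extra steps only cost a few minutes on a job that runs once a day.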

So when someone says “I crunch Big Data”, where does “Big” begin?  After a certain processing-time threshold?  Or at a certain data set size?  Well, KBC handles both exceptionally well.  The saying “work smarter, not harder” comes true when you pick the right tools for the right job.

I’ll probably keep running this project in Redshift, as it seems to suffice.  But maybe once the project grows 20x in size, I’ll consider upgrading to Snowflake and go grab a glass of bourbon to contemplate Big Data with fellow Keboola geeks, while the data scientists are stuck waiting for their data to finish processing.