The non-glamorous part of data science

Data scientist seems like a glamorous job: a 21st-century magician who can find the missing link and predict the future. That is all true, but only once you have the data in a manageable format.

I remember that in my statistics classes back in college, data cleaning was never covered in detail, apart from an introduction to regular expressions. Later, I took an advanced machine learning course offered by the CS department.

By a commonly cited estimate, you will spend about 80% of your time as a data scientist cleaning data.

In those classes we were always given nice CSV files. At most we needed to parse some strings to extract numeric values, which took no more than 10 lines of code.

That, I think, is the biggest gap between a statistician and a data scientist. In a traditional non-big-data environment, the data has already been organized into SQL tables or something similar, so you can start with the fun part right away.

One experience I had was analyzing web log files delimited with the pipe symbol. Some of the fields contained pipes as well, so standard parsing produced misaligned columns and index errors. Situations like that require some thought and more complex code (either regex or post-processing) to clean the data.
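To make the pipe problem concrete, here is a minimal sketch in Python of one way to handle it. It is not the code I actually used. It assumes a made-up five-field layout (timestamp, IP, method, URL, status) and assumes that only one field, the URL, can contain stray pipes. The function name parse_line and its parameters are hypothetical.

```python
def parse_line(line, n_fields, messy_index):
    """Split a pipe-delimited line in which only ONE field may contain pipes.

    n_fields    -- number of columns the line is supposed to have (assumed known)
    messy_index -- 0-based position of the field that may contain '|' (assumed known)
    """
    n_after = n_fields - messy_index - 1  # clean fields to the right of the messy one

    # Peel off the clean fields on the left; the remainder stays in one piece.
    head = line.split("|", messy_index)
    if len(head) != messy_index + 1:
        raise ValueError("too few fields: %r" % line)
    rest = head.pop()

    # Peel off the clean fields on the right; what is left is the messy field.
    tail = rest.rsplit("|", n_after) if n_after else [rest]
    if len(tail) != n_after + 1:
        raise ValueError("too few fields: %r" % line)

    return head + tail


# Hypothetical log line: the URL (field 3) contains two extra pipes.
sample = "2015-03-01T12:00:00|10.0.0.1|GET|/search?q=a|b|c|200"
fields = parse_line(sample, n_fields=5, messy_index=3)
```

A plain sample.split("|") would give seven columns here instead of five. Splitting from both ends keeps the URL in one piece, but it only works when a single field is messy. If two or more fields can contain the delimiter, you are back to regex or to fixing the logging format itself.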

I think if you are Yann LeCun or Geoff Hinton (both among the biggest names in data science), there will be “data assistants” who clean the data at your request. That is not the case for entry-level data scientists. You will need to prove that you can wrangle the data and deliver some meaningful insights.

I recently bought the book Mastering Regular Expressions, and I am studying awk in depth. I think regex and awk can be the two main tools for handling some of these tasks at the Unix / Linux command-line level, before anything is processed in a database or in memory.