It is projected that by 2018, 50,000 GB of data will be created every second. At 1,024 GB per TB, that works out to roughly 175,781 TB per hour and about 1.5 million petabytes each year. Unfortunately, none of this data is worth the cost of the servers it is stored on without the right people to process it, analyze it, and build the models needed to separate the nuggets of valuable insight from the torrents of noise they are buried in.
However, McKinsey predicted a shortage of 140,000 to 190,000 analytics practitioners and managers in the US alone by 2018 and, if you’ve ever tried to hire a Data Scientist or Analyst, you have probably felt that supply pinch already. Given this high demand, the Harvard Business Review labeled Data Science the sexiest job of the 21st century.
As history would have it, Data Science is more than 30 years old. However, the term “data science” was initially used as a substitute for “computer science”. In 2001, William S. Cleveland proposed that Data Science be made an academic discipline which links Computer Science with data. A few years later, the National Science Board report “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century” defined data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of digital data collections”.
Although this description fits the overall purpose of a Data Scientist, it fails to capture what Data Science is about today. Data Science has stepped out of the academic hallway into nearly every industry, and is becoming the next big thing that these industries depend on to forge ahead. Simply put, big data is the heartbeat that no modern company or industry can survive without, and Data Scientists are the ones who can make sense of it all and translate it into meaningful business insight.
Given the ever-increasing demand for people to do the sexiest job of the century, why are Data Scientists so rare as to be likened to unicorns? The practice of Data Science requires math, statistics, engineering, design, forecasting, analysis, communication, and business management. A Data Scientist is a practitioner of Data Science, and is therefore expected to have knowledge in all, or at least most, of those fields. Much like spotting a mythical creature, finding one person with all of these skills is pretty close to impossible. Finding 140,000 of them? Well, that might take a while.
According to Ricardo Vladimiro, a Game Analytics and Data Science Lead at Miniclip, Data Scientists create data products. In other words, they are the engineers behind the interfaces used by human beings and machines for data. This explanation can be further broken down into five specific tasks that Data Scientists typically do on a daily basis. So that you get a clear picture of these tasks, here is a quick look into what each of them entails.
Extract, transform, load (ETL) is the process of extracting data from various sources, transforming it into the format required for analysis, and loading it into an end target, such as a data warehouse, for processing. The data can be extracted from a number of sources, including public APIs, clickstream capture, web scraping and third-party vendors. The heterogeneous data is then transformed so that it can be loaded into a data store on a Hadoop cluster and queried homogeneously.
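To make the extract, transform, and load steps concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the two tiny inputs (`CSV_EXPORT`, `API_RESPONSE`), the `events` table, and the use of an in-memory SQLite database as a stand-in for a real warehouse or Hadoop-backed store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs standing in for two heterogeneous sources:
# a CSV export from a vendor and a JSON payload from a public API.
CSV_EXPORT = """user_id,event,ts
1,click,2016-05-01T10:00:00
2,view,2016-05-01T10:05:00
"""
API_RESPONSE = '[{"uid": "3", "action": "CLICK", "time": "2016-05-01T10:07:00"}]'


def extract():
    """Extract: read raw rows from each source as plain dictionaries."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
    api_rows = json.loads(API_RESPONSE)
    return csv_rows, api_rows


def transform(csv_rows, api_rows):
    """Transform: map both layouts onto one (user_id, event, ts) schema."""
    unified = [(int(r["user_id"]), r["event"].lower(), r["ts"]) for r in csv_rows]
    unified += [(int(r["uid"]), r["action"].lower(), r["time"]) for r in api_rows]
    return unified


def load(rows, conn):
    """Load: write the unified rows into a single queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT, ts TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real target store
load(transform(*extract()), conn)

# Once loaded, rows from both sources can be queried the same way.
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
clicks = conn.execute(
    "SELECT COUNT(*) FROM events WHERE event = 'click'"
).fetchone()[0]
print(total, clicks)
```

The point of the transform step is visible in the last query: the API source spelled its event "CLICK" under a different field name, yet after normalization a single query covers both sources.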
Hadoop is a massively scalable storage and batch data processing system that is widely used by companies across many industry sectors. If you are preparing for a Data Science interview, there is a good chance you will be asked about your familiarity with Hadoop and the various technologies in its ecosystem. This book offers a comprehensive overview of many Hadoop technologies, and is a great starting point for beginners or those preparing for tech or Data Science interviews.
Apache Spark is another tool used for ETL that is in very high demand right now; in O'Reilly's 2016 salary survey, its use was correlated with significantly higher data scientist salaries. Spark is a fast, in-memory data processing engine with powerful and easy-to-use development APIs that allow for efficient data streaming, machine learning, and SQL workloads over very large datasets. To get started learning Spark, this guide written by the authors of the project is currently the best intro book on the market. While the content could be better organized, this book does a good job of covering Spark and its complementary language, Scala, with detailed code examples in Scala, Java and Python.
Exploratory data analysis (EDA) is an important step in the data science cycle. The purpose of EDA is to begin exploring the data and to form hypotheses that will guide your collection of new data or design of new experiments for further analysis. Basically, this step allows you to test your intuition about what you might find, as you begin to scratch the surface of the data in front of you.
At this stage, you will start to see patterns in the data, try different modeling techniques, design experiments to get a better understanding of the data, and come up with an approach for continued analysis.
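As a rough illustration of that first pass over the data, the sketch below summarizes two columns and checks how strongly they move together. The numbers are made up for the example, and the column names (`sessions`, `purchases`) and helper functions are hypothetical; in practice you would likely reach for a dedicated data analysis library, but the standard library is enough to show the idea.

```python
import statistics
from math import sqrt

# Made-up sample: (daily_sessions, purchases) per user, for illustration only.
observations = [(1, 0), (2, 1), (3, 1), (5, 2), (8, 4), (13, 6)]
sessions = [s for s, _ in observations]
purchases = [p for _, p in observations]


def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(summarize(sessions))
print(summarize(purchases))

# A strong positive value suggests a hypothesis worth testing further,
# e.g. "more sessions lead to more purchases"; it does not prove causation.
r = pearson(sessions, purchases)
print(round(r, 3))
```

A result like this is a starting point, not a conclusion: it tells you which experiment or additional data collection is worth designing next.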
A good place to get started learning EDA techniques in Python is Joel Grus’ seminal book Data Science from Scratch: First Principles with Python, which gives thorough explanations of the statistics and machine learning concepts used, and easy-to-follow code examples in Python. While this book is a great intro to data science, it does assume some familiarity with programming in Python. If you are just starting out with Python programming, I strongly recommend Zed Shaw’s Learn Python the Hard Way, which, despite its name, is hands down the easiest introduction to Python for beginners that I have come across.
Not all data is useful for analysis. In fact, “big data” is rarely, if ever, analyzed or modeled as-is. Part of a data scientist’s job is to clean and effectively sample big data into more relevant, smaller data, which is then mined for insight.
Data in the real world is incomplete, noisy and inconsistent, and the quality of any analysis strongly depends on the quality of the inputs. Here is a free, comprehensive guide to get started with data cleaning and preprocessing.
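Here is a small sketch of what cleaning and sampling can look like in code. The records, the field names, the `max_age` plausibility threshold, and the helper functions `clean` and `sample` are all assumptions invented for this example; real cleaning rules depend entirely on the dataset at hand.

```python
import random

# Made-up raw records showing the three problems named above:
# incomplete (missing values), noisy (implausible ages, stray whitespace),
# and inconsistent (mixed casing for the same country code).
raw_records = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": " us ", "age": 29},
    {"id": 3, "country": "DE", "age": None},
    {"id": 4, "country": "de", "age": 412},
    {"id": 5, "country": "FR", "age": 51},
    {"id": 5, "country": "FR", "age": 51},
    {"id": 6, "country": "", "age": 40},
]


def clean(records, max_age=120):
    """Drop incomplete, implausible, and duplicate rows; normalize country codes."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        country = (rec["country"] or "").strip().upper()
        age = rec["age"]
        if not country or age is None:
            continue  # incomplete row
        if not 0 <= age <= max_age:
            continue  # noisy value outside the assumed plausible range
        if rec["id"] in seen_ids:
            continue  # duplicate of a row already kept
        seen_ids.add(rec["id"])
        cleaned.append({"id": rec["id"], "country": country, "age": age})
    return cleaned


def sample(records, k, seed=0):
    """Draw a reproducible simple random sample of at most k records."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


cleaned = clean(raw_records)
subset = sample(cleaned, k=2)
print(len(raw_records), len(cleaned), len(subset))
```

Fixing the random seed keeps the sample reproducible, which matters when a colleague needs to rerun the same analysis on the same subset.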
Machine learning is not just a part of what Data Scientists do; it is typically one of the top factors that differentiate Data Scientists from Data Analysts. Machine learning is a complex subject that requires a lot of effort to master, but it is incredibly powerful for deriving real value out of big data.
Have you ever wondered how Google ranks your search results, how Amazon decides which products to recommend on your homepage, or how Match.com matches you to potential partners? Machine learning algorithms are everywhere on the web nowadays, and Data Scientists are the ones responsible for building and maintaining them. While there are plenty of web resources about the various ML algorithms, Toby Segaran’s Programming Collective Intelligence is still the best and most comprehensive resource I’ve found for thorough explanations of the most widely used ML techniques out there.
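To give a feel for the recommendation problem mentioned above, here is a toy user-based collaborative filter. It is a sketch of one classic technique, not a description of how Amazon or any other company actually builds its recommendations. The `ratings` data and the function names are made up for the example.

```python
from math import sqrt

# Made-up ratings (user -> {item: score from 1 to 5}), for illustration only.
ratings = {
    "ana": {"laptop": 5, "mouse": 4, "desk": 1},
    "ben": {"laptop": 4, "mouse": 5, "monitor": 5},
    "chloe": {"desk": 5, "lamp": 4, "laptop": 1},
    "dev": {"laptop": 5, "mouse": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating vectors (missing = 0)."""
    dot = sum(score * b[item] for item, score in a.items() if item in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank unseen items by the similarity-weighted average of other users' scores."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, score in other_ratings.items():
            if item in ratings[user]:
                continue  # only suggest items the user has not rated yet
            totals[item] = totals.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [item for _, item in ranked[:top_n]]


suggestions = recommend("dev", ratings)
print(suggestions)
```

Users whose past ratings look like yours get more say in what is suggested to you. Production systems add a great deal on top of this (implicit feedback, scale, freshness), but the core intuition is the same.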
Data visualization, along with the story told by the findings of an analysis, is another critical piece of a Data Scientist’s work. Whether a Seaborn graph, a d3.js dynamic visualization or a well-designed infographic, pictures help convey meaning and insight instantaneously. They also form a necessary part of the story, which is the vehicle delivering the value of every Data Science project. The rule of thumb here is to present the value of your findings in a way that is plainly accessible to lay people and that enables your audience to rally around your recommendations and implement them so that they drive real business value.
While there are many tutorials on the web to help you get up to speed with tools like Seaborn and d3.js, learning to weave results into a cohesive and compelling story is more of an art than a science, and good learning aids are scarce. However, one book that stands out from the pack is Wired for Story by Lisa Cron. This book is aimed primarily at writers and other practitioners of the written word, but I believe that Data Scientists could easily use the story structures and techniques that Cron describes to craft more effective presentations of their work and more readily reach their stakeholders.
There you have it: five tasks that data scientists do on a daily basis, and a brief reading list for getting better at each of them. As DJ Patil, Chief Data Scientist of the USA, put it, the dominant trait among Data Scientists is their passionate curiosity to unearth the best solution to a problem, ask relevant questions, and refine them into hypotheses to be tested until a valuable piece of insight is found. The world today and tomorrow is all about data, and Data Scientists are sorely needed as trusted advisors to the executive team of any company that wants to stay relevant in this ever-changing landscape.