Table of Contents
- What is Data Merging and Cleaning?
- What is Data Exploration and Hypothesis Formation?
- What is Hypothesis Testing?
- How to Interpret Results for Your Data Science Project
Here is a sample plan for crafting a Data Science portfolio project:
What is Data Merging and Cleaning?
If your data comes from multiple sources or arrives in different formats, now is the time to merge it all together and resolve issues such as missing data, incoherent or inconsistent entries, and outliers.
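To make this step concrete, here is a minimal sketch that uses only the Python standard library; in practice a library such as pandas is commonly used for this kind of work. The two sources, the field names (`customer_id`, `region`, `spend`), and every value are hypothetical. Filling gaps with the median and flagging outliers with the 1.5 * IQR rule are just one reasonable set of choices, not the only way to clean data.

```python
from statistics import median, quantiles

# Hypothetical records from two sources; all names and values are made up.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "South "},   # inconsistent casing/whitespace
    {"customer_id": 3, "region": "north"},
    {"customer_id": 4, "region": "south"},
    {"customer_id": 5, "region": "north"},
    {"customer_id": 6, "region": "south"},
    {"customer_id": 7, "region": "North"},
    {"customer_id": 8, "region": "south"},
]
orders = [
    {"customer_id": 1, "spend": 120.0},
    {"customer_id": 2, "spend": None},        # missing value
    {"customer_id": 3, "spend": 95.0},
    {"customer_id": 4, "spend": 110.0},
    {"customer_id": 5, "spend": 5000.0},      # suspiciously large
    {"customer_id": 6, "spend": 105.0},
    {"customer_id": 7, "spend": 130.0},
    {"customer_id": 8, "spend": 100.0},
]


def merge_and_clean(customers, orders):
    """Join the two sources on customer_id, then tidy the result."""
    spend_by_id = {row["customer_id"]: row["spend"] for row in orders}
    merged = [
        {
            "customer_id": c["customer_id"],
            "region": c["region"].strip().lower(),  # normalize entries
            "spend": spend_by_id.get(c["customer_id"]),
        }
        for c in customers
    ]

    # Fill missing spend with the median of the observed values.
    observed = [r["spend"] for r in merged if r["spend"] is not None]
    fill_value = median(observed)
    for r in merged:
        if r["spend"] is None:
            r["spend"] = fill_value

    # Flag (rather than drop) outliers using the 1.5 * IQR rule.
    q1, _, q3 = quantiles([r["spend"] for r in merged], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    for r in merged:
        r["is_outlier"] = not (low <= r["spend"] <= high)
    return merged


cleaned = merge_and_clean(customers, orders)
```

Flagging outliers instead of deleting them keeps the decision visible: you can inspect the flagged rows during exploration and decide later whether they are errors or genuine extremes.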
What is Data Exploration and Hypothesis Formation?
When starting a new project, the natural first step is to explore the data in order to gain an intuitive understanding of each variable and of the relationships between them. In data science, this step is known as exploratory data analysis, or EDA for short.
EDA includes, but is not limited to, data summarization, aggregation, visualization, and correlation. This is the stage when you start asking questions of the data and forming hypotheses of what the answers might be. At the end of this stage, you pick 1-2 questions or hypotheses and proceed to delve deeper into the advanced analytics needed to obtain an answer.
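As a rough illustration of summarization, aggregation, and correlation, here is a small standard-library sketch. The records, the column names (`region`, `visits`, `spend`), and the helper functions are all hypothetical, and visualization is left out because it needs a plotting library.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical cleaned records; the column names and values are made up.
rows = [
    {"region": "north", "visits": 2, "spend": 40.0},
    {"region": "north", "visits": 4, "spend": 90.0},
    {"region": "south", "visits": 1, "spend": 30.0},
    {"region": "south", "visits": 3, "spend": 50.0},
    {"region": "south", "visits": 5, "spend": 100.0},
]


def summarize(values):
    """Basic summary statistics for one numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


def mean_by_group(rows, key, value):
    """Aggregate: average of `value` within each level of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


visits = [r["visits"] for r in rows]
spend = [r["spend"] for r in rows]

spend_summary = summarize(spend)
spend_by_region = mean_by_group(rows, "region", "spend")
visits_spend_corr = pearson(visits, spend)
```

A strong correlation like the one between `visits` and `spend` in this toy data is the kind of observation that turns into a hypothesis worth testing, for example "customers who visit more often spend more".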
What is Hypothesis Testing?
At this step in the project, you have some intuition about the data, and you have a clear hypothesis or question for further research. Now, it’s time to do a deep dive and look for evidence that answers your question or supports your hypothesis. At this step, you do statistical analysis, feature engineering and selection, model selection, model training and prediction, cross-validation analysis, and model performance analysis.
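For the statistical analysis part, one simple option is a permutation test for a difference in means. The sketch below is only an illustration under stated assumptions: the two groups and their values are hypothetical, the function name `permutation_test` is made up for this example, and a different test may suit your data better.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value: the share
    of random relabelings whose difference is at least as extreme.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical outcome measurements under two strategies.
strategy_a = [12.1, 11.8, 12.9, 13.4, 12.6, 13.0, 12.2, 13.1]
strategy_b = [10.9, 11.2, 10.4, 11.6, 10.8, 11.1, 11.5, 10.7]

observed, p_value = permutation_test(strategy_a, strategy_b)
```

A small p-value says that a gap this large would rarely appear if the group labels were meaningless. It does not say how large or how important the effect is, so report the observed difference alongside it.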
Did your chosen modeling approach do well on the test data? Can you do better? This step is typically not a one-off process. You will often want to go back to the beginning and try different approaches until you come up with a satisfactory and defensible answer to your research question, or evidence for (or against) your hypothesis.
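One way to check whether a model is actually doing well is to compare it against a trivial baseline under k-fold cross-validation. The following is a minimal sketch, assuming a single numeric feature, a made-up dataset, and interleaved folds; the helper names (`fit_mean`, `fit_line`, `cross_val_mae`) are hypothetical, and with real data you would usually shuffle before splitting.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x on one feature."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x


def cross_val_mae(fit, xs, ys, k=5):
    """Mean absolute error averaged over k held-out folds."""
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))  # every k-th point is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        errors = [abs(predict(xs[i]) - ys[i]) for i in sorted(test_idx)]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


# Hypothetical data: a roughly linear relationship with small fixed offsets.
offsets = [0.5, -0.3, 0.2, -0.4]
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + offsets[int(x) % 4] for x in xs]

baseline_mae = cross_val_mae(fit_mean, xs, ys)
line_mae = cross_val_mae(fit_line, xs, ys)
```

If the candidate model does not clearly beat the baseline on held-out data, that is the signal to go back and try different features or a different approach.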
If you need a refresher on statistics, this book is a great source:
How to Interpret Results for Your Data Science Project
Finally, once you are satisfied that your science is sound, and your results are tested and validated, it’s time to communicate your results and make business recommendations based on them. Whether you are trying to convince your CMO that one marketing strategy is far superior to another, or your CEO that the new recommender system you have built will deliver a much better user experience and increased profits compared to the old one, now is your time to shine!
While all the hard work that you have put in so far is very important to the success of your project, this step can make it or break it. Why? Because all your work so far is very likely either not visible or not accessible to the decision makers in your company, and people tend to mistrust what they don’t understand. Present your results to your stakeholders in the wrong way, and your project has a good chance of not gaining any traction or never seeing the light of day for its intended purpose.
Telling a compelling story with data and getting buy-in for your recommendations is a very broad and important topic, but here are a few bullet points to steer you in the right direction:
Plain English, Please!
When communicating your results to business stakeholders and making recommendations, use plain, jargon-free language, and steer clear of technical details as much as possible.
Start with the bottom line of your results or recommendations, and with how they would best be put into practice for maximum impact.
Evidence, Evidence, Evidence!
Link the recommendations you make to specific results from your analysis. Give some background on your thinking and the logic that brought you to that particular conclusion.
The more evidence you can provide, the better.
If It Sounds Too Good to Be True, It Probably Is
No data science project is ever perfect. As a data scientist, you make assumptions and inferences, and you define thresholds for success for your projects based on them. There are always caveats to your findings and limitations to your analyses.
When communicating results to stakeholders, it is important to be upfront about the assumptions you’ve made in your work, and to be specific about the scope of your recommendations.
Check All Angles Before You Move On to the Next Project
While you have done your best to test and document the solution to your chosen research question or hypothesis, a data story is never finished.
If you look back to your EDA step, there are probably several other interesting questions and hypotheses that you uncovered in the data, and that you were not able to pursue in the current project. What are they? What other directions or angles are left to be pursued, that would add to the company’s understanding of the subject being researched?
What's Next? Share with the World!
Congratulations! You have completed your first data science portfolio project!
After many hours of work, and hopefully many rounds of feedback from friends and mentors, your data story is informative and compelling. The next step at this point is to share your work and insights with the world. At a minimum, post your work and results on GitHub, or Kaggle (if using data from there). Even better, write down your project's story, from motivation to results, and post it on your personal blog, or on LinkedIn. For an even wider (and more specialized) audience, consider submitting your story to sites like KDnuggets or Data Science Central.
Answering relevant questions and engaging with the community on Stack Overflow or Quora will also help establish you as an expert in the field, while also increasing your visibility in all these professional channels that recruiters and hiring managers often go to when looking to add data scientists to their teams.
As with anything, the first project in a data science portfolio is always the hardest, but you’ve got this: if anyone can do it, it’s you!