The most impactful errors in data science projects are not computational, as you might expect. Rather, experimental design errors are what most often skew your conclusions. When building a data science project, designing the experiment is always the first step. Experiments are generally designed to identify a causal relationship between variables. Without a good experimental design, no causality can be inferred from your results; they may simply indicate correlation between variables. And as any good scientist knows, correlation does not imply causation.
What are the Components of a Good Experiment?
1. A hypothesis
The hypothesis of your experiment is the statement you want to test. The null hypothesis is the default statement, which assumes there is no relationship between the variables being compared in your test.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an educated guess.”- Isaac Asimov, Book of Science and Nature Quotations, 1988
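As a concrete sketch, a two-sample t-test evaluates the null hypothesis that two groups have the same mean outcome. The scores below are simulated purely for illustration; the group sizes, means, and spreads are assumptions, not data from any real study:

```python
import random
import statistics

random.seed(42)

# Simulated outcome scores for two hypothetical groups (illustrative only).
treatment = [random.gauss(75, 10) for _ in range(50)]
control = [random.gauss(70, 10) for _ in range(50)]

# Null hypothesis: the two groups have the same mean outcome.
# Welch's t-statistic compares the difference in group means
# relative to the spread within each group.
mean_t, mean_c = statistics.mean(treatment), statistics.mean(control)
var_t, var_c = statistics.variance(treatment), statistics.variance(control)
se = (var_t / len(treatment) + var_c / len(control)) ** 0.5
t_stat = (mean_t - mean_c) / se

print(f"t = {t_stat:.2f}")
# As a rough rule of thumb, |t| > 2 with samples this size is
# evidence against the null hypothesis at about the 5% level.
```

In practice you would use a library routine (e.g. a t-test from a statistics package) rather than computing the statistic by hand; the point here is only the logic of testing against a null.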
2. The applied treatment
The treatment in your experiment is applied to one of two distinct sample groups:
- The treatment group, or the group to which the treatment is applied, and
- The control group, or the group to which no treatment is applied. The control group should otherwise be statistically and substantially identical to the treatment group.
For example, you may want to identify the effect of a particular teacher intervention on a group of students’ progress during a career development course. In this case, the treatment group is represented by the students who received the teacher’s intervention, and the control group comprises the same number of students who did not receive the teacher’s intervention.
3. Random assignment
In the given example, a well-designed experiment would randomly assign students to the treatment and control groups.
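Random assignment can be sketched in a few lines. The roster below uses hypothetical student IDs, and the even split into two groups is an assumption for illustration:

```python
import random

random.seed(0)

# Hypothetical roster of 20 students (IDs are made up for illustration).
students = [f"student_{i}" for i in range(20)]

# Shuffle a copy of the roster, then split it in half: the first half
# becomes the treatment group and the second half the control group.
shuffled = students[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

print("Treatment:", treatment_group)
print("Control:", control_group)
```

Because assignment depends only on the shuffle, each student is equally likely to land in either group, which is what prevents systematic differences between the groups from biasing the result.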
The Ideal World Versus the Real World
The following list includes characteristics of a perfect data science experiment:
- You define a clear hypothesis a priori—in other words, based on theoretical reasoning rather than intuition or experience.
- You attentively design your experiment.
- You randomly sample from the population of interest to make sure that your final conclusion can be generalized.
- The data that you already have directly addresses your hypothesis, and there is no missing data.
- Data set creation, merging, and data pulls go smoothly.
- The analyses that you conduct are robust.
- The conclusions are clear and parsimonious (that is, simple to explain).
- You have gained knowledge through the experiment.
- Your results are clearly communicated through a report or a data product.
The main goal of any experiment is to find causality. Causality is hard to establish in the real world, which is exactly why you need to be attentive to your experiment’s design. However, there is no need to fall into the trap of perfectionism, because there is no such thing as a perfect data science experiment.
In the real world, many issues may keep your data science experiment from perfection. For instance, ideal experiment design and randomization are not always possible. Often you end up taking whatever results you can. You also may find that the sample population you chose is biased from the beginning, which does not allow you to generalize your conclusions. For example, in clinical trials, participants are more likely to follow the course of the treatment than they would under normal circumstances. Getting the exact data set features you might be looking for is also complicated in the real world. In these situations, you may need to make assumptions from other features you can reasonably access. For instance, in nutrition research where you may want to measure caloric intake, you typically must rely on self-reported data, which is commonly known to be imperfect.
Let’s take a look at what’s happening within an experiment done by a mid-sized e-commerce company. This case study is a good example of some experimental design mistakes that can happen in real life.
The company’s centralized data science team is working on a project in collaboration with a business partner from the marketing team. The business partner manages the video ads and wants to know how much money the company is gaining through their video marketing.
Together with a person from the data science team, they design an experiment with the following plan:
- Cut off all online advertisement spending, including video ads, for certain groups for one month. Groups are determined by location, using US Census tracts.
- After one month, restart spending on video ads for YouTube, specifically for a group of users based on their US census tract (tracts were determined using IP addresses).
- No other online ads will be run during this experiment.
- Include a control group consisting of users from substantially similar census tracts (by socioeconomic measures). The control group will not be shown any ads.
Can you spot what is wrong with this experiment?
Let’s dig in with a little background. First, census tracts, whose boundaries are known to be imprecise, nest within larger geographic zones called metropolitan statistical areas, or MSAs. In the case study, some treatment group MSAs bordered the control group MSAs. Because of the likelihood of MSA border inaccuracies, some of the control group subjects may have been exposed to ads. Moreover, determining location by IP address is a potentially misleading method, as IP addresses can be manipulated by users to mask their geographic location. While the company may think it’s showing ads to people in its selected tracts, it may be testing someone on the internet across the country! Finally, the assumption that all online ads for control group MSAs were suspended may also be misleading. If even one manager supervising transactional channels is not notified of the ongoing experiment, the control group may be shown ads.
So what happened with this experiment? After about two months, the data scientists observed no apparent effect on revenue from showing ads to the treatment group, and naturally, they started questioning the experimental design and the initial assumptions. While their experiment assumed that all other online ad spending was turned off for MSAs involved in the experiment, in real life other online ads from the brand may have been running simultaneously, and the control group could have been exposed to them. If this occurred, these unwanted ads would act as a confounder: a variable that distorts the apparent relationship between the variables of interest.
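The effect of a confounder can be shown with a small simulation. Here a hidden variable drives both ad exposure and revenue, so the two appear correlated even though exposure has no direct effect on revenue at all. All quantities are illustrative assumptions, not real company data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hidden confounder, e.g. overall brand ad spend in a region.
confounder = rng.normal(size=n)

# Ad exposure and revenue each depend on the confounder,
# but exposure has NO direct effect on revenue in this model.
exposure = confounder + rng.normal(size=n)
revenue = confounder + rng.normal(size=n)

# The two still appear correlated, purely through the confounder.
corr = np.corrcoef(exposure, revenue)[0, 1]
print(f"correlation = {corr:.2f}")  # close to 0.5 despite no direct effect
```

A naive analysis of this data would conclude that ads drive revenue, which is exactly the kind of mistake a randomized, well-controlled design is meant to prevent.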
In this article, we’ve discussed what an ideal data science experiment looks like, and we’ve given examples showing how many things can go wrong in a real-world experiment. Hopefully you can use what you’ve learned to build an effective experiment of your own, including understanding the mistakes you’ll want to avoid.