How To Scope a Data Science Project

Scoping data science projects involves various approaches, depending on the organization’s goals, needs, and priorities.
By Claudia Virlanuta • Updated on Mar 2, 2023

Data science is about using data to solve real-world problems that affect society and organizations. Many talented, passionate, and intelligent people with data science skills are called upon to help tackle these problems, but not all of them know how to formulate a well-scoped data science project.

Scoping data science projects may involve various approaches, depending on the organization’s goals, needs, and priorities. By applying some core concepts and best practices, you can achieve the basic criteria for scoping out high-impact data science projects.

Let’s start with an initial screening process. 

 

 

What is the Initial Screening Process?

Every project starts with knowing what problem you are trying to solve and how solving it adds value to the organization. The following criteria can help you decide whether a project is worth pursuing:

 

  • Impactful - The problem you are solving is real, significant, and has a valuable impact.
  • Solvable - The organization has access to the correct data.
  • Actionable - The organization prioritizes the problem and is willing to take action and commit resources to carry out the project.

Once the project meets the initial screening criteria, you can begin the scoping process. 

 

What Problem is Your Project Addressing?

The first step in the project scoping process involves understanding the problem, its cost and impact, the gaps in how it is being handled today, and whether it is a priority. Furthermore, the exercise of determining a problem opens up opportunities for growing the business or addressing issues where data science can be applied.

 

When scoping a problem, first ask the organization to describe it. Sometimes they cannot define exactly what the problem is or what they want to achieve, so you will have to help them work out what they want.

For example, an organization or community group that has started selling new products made from recycled materials wants to predict whether sales will increase in the next three months, so that it can decide whether to produce more of the same product. The business is growing slowly and gained no significant traction over the last quarter, which forced it to cut costs in staffing and in some areas of its operations. At this point, you can investigate the root cause of the problem by probing into the characteristics, behavior, and dimensions of the target market where the product is being sold. What type of customers are they selling to (age, status, etc.)? What do they think about the product? Where are they located?

 

Afterward, you can follow up by asking why the problem is a priority right now and how they have been handling it. What is the value of selling the new products over the old products that are selling well? Why is it important? How are they selling it? For instance, you may find out that the new products made from recycled materials have helped bring more jobs to underprivileged communities. If the business becomes successful, it will help alleviate poverty while reducing waste at the same time. By paying attention to how the organization handles the problem and what is important to them, you can identify the gaps and gain insight into the actions that would achieve the goals more effectively.

Besides understanding the problem, it is also important to identify the groups of people or stakeholders within or outside of the organization who need to be involved in the project activities.

 

 

Who Are the Stakeholders?

The stakeholders are the people who understand the problem, know what data is available, and then use the system’s output to take action. These people are the ones who are essentially affected by the outcomes of the project.

Normally, the stakeholders involved in data science projects come from inside the organization, such as managers, process owners, and IT professionals. For example, an executive in the organization may be the project sponsor who has the authority to assign resources, impose rules, or make decisions to implement the project. Stakeholders can also come from outside the organization: a customer, contractor, supplier, government official, or community group may be affected by the project, or could intervene if they found out that its outcome impacts them. It is important to ensure that all stakeholders are engaged early and aligned with the objectives and activities, as they can make or break the success of the project.

Figure: Example of stakeholders in a data science project

Image Source: https://www.analyticsvidhya.com/blog/2019/08/data-science-leader-guide-managing-stakeholders/

After the problem and stakeholders are identified, you can then define goals, set actions, gather data, and conduct analysis.

 

What You Need to Define Clear Goals

The project goals must be clear, specific, measurable, and relevant in order to achieve the most desirable outcome. Some projects may start with ambiguous and abstract goals, and then get a little more refined until they become definite, realistic, and achievable as the project progresses. Going back to the example earlier, an organization wants to predict an increase in sales over a certain period to determine whether to produce more of the same product. However, it is not clear whether the product is doing well in the current market. Are the current sales and marketing channels effective? Is the cost to produce more products within budget? Is it worth investing more in sales and marketing campaigns? Do they need to switch to a different target market or location? Depending on the conditions or circumstances of the organization, you can explore various approaches to defining goals.

One way of defining a goal is to state the outcome or metric you want to increase, maximize, or reduce relative to the problem. For example, increasing efficiency by [x]% or reducing cost per unit by [y]% over a period of time. Some implications, conflicts, or other types of issues may arise during the scoping process. At this point, it is important to also define the constraints, limitations, and risks explicitly. Then, you can prioritize them according to significance or feasibility, and make trade-offs where necessary.
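
To make the "reduce cost per unit by [y]%" style of goal concrete, here is a minimal sketch of how such a goal could be written down and checked. The `Goal` class, its field names, and the numbers are all hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A measurable goal: move one metric by a target percentage."""
    metric: str
    baseline: float            # value of the metric before the project
    target_change_pct: float   # positive = increase, negative = reduce

    def target_value(self) -> float:
        # Apply the percentage change to the baseline.
        return self.baseline * (1 + self.target_change_pct / 100)

    def is_met(self, observed: float) -> bool:
        # For an increase goal the observed value must reach the target;
        # for a reduction goal it must fall to the target or below.
        if self.target_change_pct >= 0:
            return observed >= self.target_value()
        return observed <= self.target_value()


# Hypothetical numbers for illustration only.
cost_goal = Goal(metric="cost_per_unit", baseline=4.00, target_change_pct=-10)
print(round(cost_goal.target_value(), 2))
print(cost_goal.is_met(3.50))
```

Writing the baseline next to the target forces the conversation about how the metric is measured today, which is often where an ambiguous goal becomes a specific one.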

After the goals are clear and defined, the next step is to list the set of actions to take in order to achieve the goals.

 

 

How to Set Actions for the Project

The actions may vary depending on the organization’s tools, processes, resources, and limitations. For example, an organization or business that is inclined to a traditional or conservative approach may not be open to investing aggressively in marketing. They believe that cutting costs and focusing on maintaining current sales is more beneficial to keep the business going.

Ideally, you start with existing actions and build up from there, because that is easier to implement than an entirely new set of actions that the organization is not familiar with. However, in the slow-growth example, it makes more sense to shift to a more aggressive approach. A data science project for digital marketing, instead of traditional (non-digital) marketing, may be more effective and less costly if it can be shown that the target market is also reachable on digital platforms. As digital marketing is a broad endeavor in itself, breaking down the actions into smaller work items or using a phased approach can help make the project more attainable within a short amount of time.

In addition, you must take into account who will perform the actions and who will be impacted by them. Also consider the channel or medium through which an action is taken, since it may have capacity limitations. Make sure that the actions are within the organization’s control so that they do not add complications to the problem or lead to undesirable outcomes.

Once you understand the problem and determine the goals and actions, the next step is to figure out what data sources to use and which of them the organization has access to.

 

How to Gather Precise Data 

First, make a list of data sources that are available and accessible within the organization. Some organizations may not have a comprehensive list of data sources or they may not be readily available. Thus, you must assess and investigate further how to obtain more data by reaching out to specific individuals, groups, or departments within the organization.

Once you have the list of data sources and have access to them, collect detailed information that can be useful for the analysis. Below are some of the characteristics of data that you might need:

 

  • Granularity or Detail - The scale or level of detail in a set of collected data. For example, collecting data about customer demographics, the platform they use, or where they are located.
  • Frequency - The rate at which changes in the data occur or how often the data is updated. For example, is the data collected in real-time, daily, weekly, monthly, or annually?
  • History - The historical record of data or how far past events were captured. For example, a record of sales data in the past year, quarter, or month.
  • Identifiers - The unique identifiers of data that allow you to link them to other data sources, such as social security numbers, ID numbers, addresses, etc.
  • Owner - The organization, department, group, or individuals who control access to the data. For example, software engineers, sales managers, or business owners who have access and the authority to restrict or allow access to the data systems.
  • Storage - The space or location where data is stored, such as databases, spreadsheet files, hard copy files, or other data stores.
  • Ethics - The possible ethical issues associated with accessing data sources, which may require consent from the person whose data is being used. Some organizations enforce security protocols that are required for accessing and using the data.

Finally, once you have access to all the necessary data, you must determine how the data matches the actions to perform in order to achieve the goals. For example, if the actions are influenced by activities that change once a day, then you need your data to be updated every day. By finding out the characteristics of the data required to achieve the goals, you can decide which data to use and be ready to move forward with the analysis.
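
The characteristics above can be recorded in a simple inventory, which also makes it easy to check whether a source is updated often enough for the action it supports. This is only a sketch: the `DataSource` fields mirror the list above, while the source names, the `UPDATE_INTERVAL_DAYS` mapping, and the `fresh_enough` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# Approximate number of days between updates for each frequency label.
UPDATE_INTERVAL_DAYS = {
    "real-time": 0,
    "daily": 1,
    "weekly": 7,
    "monthly": 30,
    "annually": 365,
}


@dataclass
class DataSource:
    name: str
    granularity: str       # level of detail, e.g. one row per order
    frequency: str         # a key of UPDATE_INTERVAL_DAYS
    history_months: int    # how far back the records go
    identifiers: list      # fields that link this source to others
    owner: str             # who controls access
    storage: str           # where the data lives
    needs_consent: bool    # ethical or legal flag to follow up on


def fresh_enough(source: DataSource, action_frequency: str) -> bool:
    """True if the source updates at least as often as the action it supports."""
    return (UPDATE_INTERVAL_DAYS[source.frequency]
            <= UPDATE_INTERVAL_DAYS[action_frequency])


# Hypothetical inventory for illustration only.
sources = [
    DataSource("online_orders", "one row per order", "daily", 12,
               ["customer_id"], "sales manager", "database", True),
    DataSource("store_survey", "one row per respondent", "annually", 24,
               ["email"], "marketing", "spreadsheet", True),
]

# Which sources can support an action that changes once a day?
usable = [s.name for s in sources if fresh_enough(s, "daily")]
print(usable)
```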

 

 

How to Conduct Analysis on the Project

The final step of the scoping process is establishing the analyses that will be conducted according to the actions, data, and goals. While a data science project is prominently associated with analysis, you must keep in mind that the goal is not the analysis itself. The purpose of the analysis is to be able to come up with a conclusion, explain why the problem happened, and recommend a course of action to solve it.

Figure: Data science workflow diagram

Image Source: https://images.squarespace-cdn.com/content/v1/55fdfa38e4b07a55be8680a4/1615903072761-9H9XREOMPJFBGI8QAW1U/Data+Science+Workflow+Image.jpg

 

There are several methods and tools to use for conducting analysis, which can be broken down into five types:

 

  • Description - This type of analysis focuses on understanding events and behaviors that have happened in the past, typically known as a descriptive analysis. For example, exploring data around past sales and finding out which types of products have sold well and during which period or season of the year.
  • Detection - This type of analysis focuses more on ongoing events rather than historical data, such as detecting fraudulent payment transactions or other types of anomalies.
  • Prediction - This type of analysis focuses on future behaviors or events, also known as predictive analysis. For example, predicting which types of products would sell in a particular season of the year, or predicting the likelihood of getting more sales from a certain customer age group.
  • Optimization - This type of analysis focuses on putting together the outputs of other types of analysis and transforming them in a way that helps make decisions more effective. For example, identifying the age group of top customers who buy the greatest number of products, and then combining them with the location that they are buying from.
  • Behavior Change - This type of analysis focuses on the cause of change in behaviors of individuals, groups, or organizations to draw a conclusion. For example, customers from a younger age group may be inclined to buy products made from recycled materials due to the increasing awareness of the climate change impact on future generations.  
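
As an illustration of the first type, a descriptive analysis of past sales can be as simple as totaling units by product and season and picking the best seller in each season. The records below are made up for the example, and a real project would read them from the data sources identified earlier.

```python
from collections import defaultdict

# Hypothetical sales records: (product, season, units_sold).
sales = [
    ("tote bag", "summer", 120),
    ("tote bag", "winter", 40),
    ("notebook", "summer", 60),
    ("notebook", "winter", 95),
    ("tote bag", "summer", 80),
]

# Total units sold for every (product, season) pair.
units_by_product_season = defaultdict(int)
for product, season, units in sales:
    units_by_product_season[(product, season)] += units

# Best-selling product in each season, with its total units.
best_by_season = {}
for (product, season), units in units_by_product_season.items():
    if season not in best_by_season or units > best_by_season[season][1]:
        best_by_season[season] = (product, units)

print(best_by_season)
```

The same totals could later feed a prediction or optimization step, which is why it helps to decide on the types of analysis during scoping rather than afterward.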

 

What Ethical Considerations Affect a Data Project?

Data privacy, confidentiality, and security issues are only a few of the common risks when dealing with data. For this reason, it is essential to ensure that proactive measures are in place to prevent or mitigate the impact of such issues that affect people’s lives. Rather than an afterthought, ethical considerations must be embedded in every phase of a project to avoid causing harm to society, communities, or the public at large.

 

Privacy, Confidentiality, and Security

Data privacy is a growing concern in the applications and effects of data science, especially in industries where high-stakes decisions are made, such as healthcare, government, and education. It is the organization’s obligation to protect and secure confidential information from unauthorized use, disclosure, theft, or any kind of unlawful act. Apart from legal requirements on how data can be used and the systems in place to protect it, another important consideration is how individuals might feel about their personal data being used, and how publicly the data is made available.

 

Transparency

Another possible ethical issue to consider is the transparency of whose data is being used, whether they know that their data is being used, and how it affects them. When building and deploying data science systems, make sure to keep in mind the perspectives of the different stakeholders who will be affected by the decisions that take place as the systems are implemented.

 

Discrimination, Equity, and Fairness

It is equally important to understand the possible disparities that may result from your project and how they might impact people or specific groups. To mitigate the impact, organizations can explore different tools and approaches to reduce inequities and biases in the system, such as using Aequitas and the Fairness Tree to audit machine learning models for discrimination and unintended bias. When a data science project incorporates machine learning models, it is essential to evaluate the data, processes, and outcomes for fairness based on established ethical principles.
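
To give a sense of what such an audit measures, the sketch below compares how often a model selects people from two groups. It does not use the Aequitas API. It is a hand-rolled illustration of one common disparity measure, and the group names, predictions, and choice of reference group are all hypothetical.

```python
from collections import Counter

# Hypothetical model outputs: (group, selected), where selected is 1 if the
# model chose this person and 0 otherwise.
predictions = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]

totals = Counter(group for group, _ in predictions)
selected = Counter(group for group, chosen in predictions if chosen == 1)

# Share of each group that the model selected.
selection_rate = {group: selected[group] / totals[group] for group in totals}

# Ratio of each group's selection rate to the reference group's rate.
# A value far from 1.0 signals a disparity worth investigating.
# (This assumes the reference group has a non-zero selection rate.)
reference = "group_a"
disparity = {
    group: rate / selection_rate[reference]
    for group, rate in selection_rate.items()
}

print(selection_rate)
print(disparity)
```

Which measure matters, and how large a gap is acceptable, depends on the project and the people affected, so the threshold is a decision for the stakeholders rather than for the code.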

 

Accountability

Because data science can significantly impact society, organizations must establish who is accountable and responsible for every aspect of the project. These responsibilities may include keeping the data safe, monitoring unforeseen consequences, knowing what system behaviors to disclose to the public, handling inaccurate data, and making decisions on how to respond to the issues. To mitigate the impact of potential harms or ethical issues, the person accountable within the organization must have contingency plans in place, provide assistive technology interventions, and make necessary improvements to the models or systems.

 

Social License

In addition to the ethical considerations for transparency and accountability, organizations must also think about how people might respond to the project if it reaches widespread coverage. For example, the organization uses the information that the products they sell are recycled and made by underprivileged people as part of their marketing strategy. Would people react positively or negatively? Are there any groups or certain individuals among the overall population who might object? What possible concerns might they have?

I have covered the most critical ethical considerations to apply when implementing a data science project. Additional considerations may arise in the future, so keep in mind that proper evaluation, thorough assessment, and completed staff work are needed to carry out a successful data science project.

The scoping process is iterative and can be refined during or after executing the project. Hence, it is beneficial to keep an adaptable and flexible mindset when implementing a project, while keeping an eye on opportunities to improve the process and ethical considerations to incorporate along the way. In other words, when scoping a project, set realistic targets to prevent long-term costs and failures, but also be ready to make quick and controllable adjustments in between. However, it is crucial to start with a well-defined problem as well as achievable and actionable goals by utilizing just enough information and resources available to mitigate unnecessary risks.

 

 

Here is a recap of the key points you need to scope a data science project:

  • Problem - What is the problem? What is its impact? What is its priority?
  • Stakeholders - Who are the people involved?
  • Goals - What are the goals? How will you know if the project is successful?
  • Actions - What actions do you need to take to achieve the goals?
  • Data - What data do you need? What information do you have access to?
  • Analysis - What analysis tools and methods should you use? How will you validate the output?
  • Ethical Considerations - What are the possible ethical issues? How will you address them?

 

 

Claudia Virlanuta

CEO | Data Scientist

Claudia is a data scientist, consultant and trainer. She is the CEO of Edlitera, a data science and machine learning training and consulting company helping teams and businesses futureproof themselves and turn their data into profits.

Before Edlitera, Claudia taught Computer Science at Harvard, and worked in biotech (Qiagen), marketing tech (ZoomInfo), and ecommerce (Wayfair). Claudia earned her degree in Economics from Yale, with a focus on Statistics and Computer Science.