If you are exploring how to use machine learning and AI-based solutions in your business, you are in good company. According to Gartner, despite the global impact of the Covid-19 pandemic, 47% of organizations polled kept their AI investments unchanged since the start of the pandemic, while 30% actually planned to increase them.
When considering an AI solution to a business problem, you usually have two options to explore. The first is buying an off-the-shelf solution. If you can find an adequate ready-made option, this is a fine solution.
The second option is building a custom solution. Companies that are serious about AI often prefer this option, because a custom solution can be tailored precisely to the problem at hand. Custom solutions are also typically more cost-effective than managed ones in the long run, even though they require a larger upfront investment.
There is just one teeny-tiny problem with enterprise AI projects: about 90% of them never make it to production. You read that right - nine times out of ten, months of hard work go to waste. The reason is usually that, in the excitement of getting something off the ground, steps are skipped and corners are cut.
In this article, we will discuss the eight steps that usually go into building and deploying a custom AI-based solution to a business problem. Follow all of these steps in order, and you will maximize your chances of building an AI solution that gets deployed, is used productively, and helps you maximize your return on investment.
The first and most important thing to do when building an AI solution is to describe exactly what problem you are trying to solve.
What does your ideal solution look like? In other words, imagine what the world looks like after a solution for this problem has been deployed. Will your team have to do less or no manual work to complete a task? Will you get new data and insight from a particular process faster than before?
Next, define how your solution will be used:
Once you know how the solution will be used, it is time to figure out how to measure its performance on the task at hand. In addition to deciding which performance metric makes the most sense for the problem being addressed, you will also need to determine what performance threshold you must pass in order to have a useful solution.
The next step is understanding what kind of data you need, and where you can find it. From a business perspective, you will need to understand what data could be useful for building your solution, where to find this data, and how to negotiate access to it.
From a technology perspective, you will want to create a workspace for your model training needs. Ideally, you want to have enough space to hold the data in one place. If this is not possible, you can build the needed ETL data pipelines to get data from your data lake as needed, and convert it to a useful format. You will also want to make sure that you have installed all the required software and packages in your machine learning workspace.
Once you have lined up your workspace and your data sources, it is time to explore the data and answer the following questions:
If possible, it is a good idea to build some simple visualizations of your data (or a sample of it). For instance, binned histograms will help you see how your data is distributed. For supervised learning, scatter plots of pairs of variables will help you study correlations between the variables. This, in turn, will help you identify the variables that could be most predictive of your target.
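As a sketch of the computations behind those plots, numpy can produce the binned counts a histogram would display and the pairwise correlation a scatter plot would reveal. The variable names and the synthetic data here are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
age = rng.normal(loc=50, scale=10, size=1_000)      # hypothetical feature
income = 3 * age + rng.normal(scale=5, size=1_000)  # hypothetical correlated target

counts, edges = np.histogram(age, bins=10)          # the data a binned histogram shows
r = np.corrcoef(age, income)[0, 1]                  # the correlation a scatter plot hints at
print(f"bin counts: {counts.tolist()}, correlation: {r:.2f}")
```

A correlation close to 1 or -1 flags a variable worth keeping an eye on; a plotting library such as matplotlib can render these same arrays directly.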
As any data professional will tell you, data cleaning is the least sexy and most time-consuming part of the machine learning process.
It is usually a good idea to not modify the original data. Instead, make a copy of the data and modify the copy. Also, instead of writing ad-hoc code each time, try to write functions for data transformations, as they are more reusable.
Data cleaning can be a bit of an art, and also highly specific to each data set. However, at a minimum, you should remove outliers (or otherwise deal with them), fill in or remove missing values, and remove duplicates.
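A minimal pandas sketch of that baseline might look as follows. The tiny data set, the median fill, and the 1.5-IQR outlier fence are illustrative assumptions, not prescriptions - note that the original frame is copied, never modified, and the transformation lives in a reusable function:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age": [25.0, 31.0, 31.0, np.nan, 200.0],   # 200 is an outlier, NaN is missing
    "income": [40_000, 52_000, 52_000, 52_000, 58_000],
})

def clean(frame, column):
    """Return a cleaned copy: drop duplicates, fill missing values, drop IQR outliers."""
    df = frame.copy()                            # never modify the original data
    df = df.drop_duplicates()
    df[column] = df[column].fillna(df[column].median())
    q1, q3 = df[column].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    df = df[df[column].between(q1 - fence, q3 + fence)]
    return df

cleaned = clean(raw, "age")
print(cleaned)
```

The original `raw` frame is left untouched, so you can re-run the function with different choices (e.g., mean fill instead of median) and compare the results.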
Next comes the feature selection step. At this point, you should only keep those features in your data set which are relevant to your target (i.e., the thing that you want to predict). You can go ahead and drop all other features from the data - it will save space and speed up the model training process.
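One simple way to do this is to keep only features whose absolute correlation with the target clears some threshold. The feature names, the synthetic data, and the 0.3 cutoff below are arbitrary illustrations - in practice the threshold (and whether correlation is even the right relevance measure) depends on your problem:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "signal": rng.normal(size=100),   # hypothetical relevant feature
    "noise": rng.normal(size=100),    # hypothetical irrelevant feature
})
df["target"] = 2 * df["signal"] + rng.normal(scale=0.1, size=100)

# absolute correlation of each candidate feature with the target
corr = df.drop(columns="target").corrwith(df["target"]).abs()
selected = corr[corr > 0.3].index.tolist()
print(selected)
```

Dropping the rejected columns (`df[selected + ["target"]]`) then shrinks the data set before training.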
You will probably need to do some feature engineering, and possibly feature scaling, at this point. This might include converting continuous features to discrete ones, or creating derived features by transforming existing ones (e.g., getting ‘age’ from ‘date-of-birth’). In terms of feature scaling, standardization means transforming your data to have a mean of 0 and a standard deviation of 1. Normalization (a.k.a. min-max normalization) scales the data to values between 0 and 1. However, min-max normalization is sensitive to outliers: a single extreme value can compress all the other values into a narrow band. Standardization does not bound values to a fixed range, so outliers distort it less, which is why it is usually preferred for scaling.
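The two scaling transformations are one line each in numpy (the sample values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

standardized = (x - x.mean()) / x.std()           # mean 0, standard deviation 1
normalized = (x - x.min()) / (x.max() - x.min())  # values rescaled to [0, 1]
print(standardized, normalized)
```

Libraries such as scikit-learn wrap the same arithmetic in `StandardScaler` and `MinMaxScaler`, which also remember the training-set statistics so you can apply identical scaling to new data later.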
If you still have lots of data at this point, it might be a good idea to sample some smaller training sets. This will allow you to test many different models faster. That said, you may still want to train neural networks and random forest algorithms on as much data as possible - a smaller training set may put these algorithms at a disadvantage.
Now, you need to measure and compare the performance of the various trained models. Which are the most significant variables for each algorithm? Do they make sense? Analyze the errors made. Is there any other data that you could add to the training set that would help to avoid these errors? Select the top couple of models that look promising. The trick here is to pick those models that make different kinds of errors.
Use cross-validation to fine-tune hyperparameters. Hyperparameters are model-level parameters, such as the number of trees in a random forest.
Also, try treating your data transformations as hyperparameters that you can switch on or off, or whose values you can change. For instance, one hyperparameter can be the way you handle missing data. That way, you can retrain your model to see how it performs if you fill in missing data with, for example, all zeroes versus averages, versus just dropping the rows with missing values. When you are satisfied with the model’s performance on the training set, it is time to measure its performance on a test set and estimate the generalization error.
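One way to sketch this idea, assuming scikit-learn is in your stack, is to put the imputation step inside a pipeline so that its strategy is cross-validated alongside the model's own hyperparameters. The synthetic data and the grid values here are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan           # knock out ~10% of values
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)    # toy target for illustration

pipe = Pipeline([
    ("impute", SimpleImputer()),                # how to fill missing data...
    ("model", RandomForestClassifier(random_state=0)),
])

param_grid = {
    "impute__strategy": ["mean", "median", "constant"],  # ...is itself tuned
    "model__n_estimators": [50, 100],                    # a model hyperparameter
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid search tries every imputation strategy with every model setting, so "how do I handle missing data?" is answered by cross-validated scores rather than by guesswork.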
Congratulations! You now have a working model that more or less solves the problem you set out to solve. You might think you are done at this point, but not quite.
When it comes to deploying, monitoring and maintaining a machine learning solution in production, automation is your friend. Ideally, in addition to automating the solution itself, you should also automate the model deployment process, as well as a system to monitor the performance of your model in production. Over time, a model's performance can degrade as the data it sees in production drifts away from the data it was trained on. Set up a production monitoring system that will alert you when performance drops below a certain threshold. At that point, you can retrain the model to bring its performance back to optimal levels.
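As a toy sketch of such an alert - the function name, the metric, and the tolerance are assumptions made for illustration, and a real system would compute the current score from live predictions and ground-truth labels:

```python
def needs_retraining(current_score: float, baseline_score: float,
                     tolerance: float = 0.05) -> bool:
    """Alert when a production metric drops more than `tolerance`
    below the baseline measured at deployment time."""
    return current_score < baseline_score - tolerance

print(needs_retraining(0.84, 0.92))  # degraded well past tolerance: True
print(needs_retraining(0.90, 0.92))  # still within tolerance: False
```

In practice this check would run on a schedule inside your monitoring pipeline and page the team, or trigger an automated retraining job, when it fires.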
It is estimated that as much as 90% of machine learning models never make it to production. This is usually due to one or more of the steps above being skipped, or not followed properly. By following the steps outlined above, you will significantly improve the odds that your machine learning-based solution will be a useful one, as well as one that is used by its intended audience frequently and productively. If you want to learn more about the ins and outs of managing AI projects, you may find our AI for Leaders course interesting.