10 common mistakes in Data Science projects and how to avoid them
While Data Science is an established and highly esteemed profession with strong foundations in science, it is important to remember that it is still a craft and, as such, it is susceptible to errors coming from processes that may not be suited to the problem at hand, the amount of data made available by the customer, or industry standards. Before diving straight into crafting a new solution, make sure you are not making one of these 10 common errors.
1. Lack of clear objectives
You are ready to start your analysis, but feel that something is missing? It might be that you were just planning to tinker with the data, look for correlations, see what results you can get and suggest some ideas to the customer. The problem is that you are starting the wrong way around. If you do not know what to do, stop and spend more time with your customer to learn what their business need is when it comes to Data Science. This is often the hardest step, but it pays off tremendously in the long run.
2. Not solving the biggest pain point of the client
Even if you know more or less what the customer wants, it is still possible that, in the process of going for the lowest-hanging fruit, you miss the chance to create the most significant improvement. In the presales phase, this may be the difference between winning a project and losing an opportunity, because the customer may decide that the cost of the solution is too high relative to the impact it will make. The chances of that happening decrease drastically if you make sure that solving the issue is, for example, the difference between making their original idea usable and dropping the idea altogether.
3. Ignoring data privacy, compliance and biases
Diving straight into the solution may have another unpleasant result: you have built a model, but cannot sell it to your customer because it does not comply with the regulations. There may be many reasons for that: the data you used is not publicly available, the dataset has intrinsic biases, or the model weights are not available for commercial use. It may also be more complicated than that: the model may be available for commercial use, but each modification has to be published under the same licence, making it impossible to use in your customer's system.
4. Using overly complex models
We are living in the era of deep and large models, and it is easy to forget that machine learning comprises more than neural networks and LLM APIs. When handcrafting a tailored solution, these are often overkill: knowing your data and the goal may lead you to a well-performing, simple, interpretable and fast model based on classical techniques. Interestingly, this is one of the consequences of the No Free Lunch Theorem: averaged over all possible problems, every algorithm performs as well as any other, and only domain knowledge can help you choose the right solution for yours.
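To make this concrete, here is a minimal sketch of the habit worth building: benchmark a simple, interpretable baseline before reaching for anything larger. It uses scikit-learn, with one of its bundled datasets as a stand-in for real project data.

```python
# A minimal baseline sketch: the dataset is a stand-in for real project data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Logistic regression trains in seconds and is interpretable via its coefficients.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, X, y, cv=5, scoring="f1")
print(f"Baseline F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a more complex model cannot clearly beat such a baseline, its extra cost in training time, serving infrastructure and explainability is hard to justify.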
5. Insufficient feature engineering
In recent years, we have observed a shift in machine learning from a model-centric to a data-centric approach. It stems from the realisation that many models can work in any given scenario, so instead of tinkering with the model, it is often better to keep it fixed and change the way the data is prepared. This is useful even for deep learning models, despite the fact that one of their main strengths is learning the right features on their own. For example, when detecting a particular kind of movement in videos, it might be better to first extract movement indicators with a classical algorithm and then feed the model single images instead of whole videos, which can in turn reduce the number of parameters in the model.
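As a rough illustration of the video example (assuming OpenCV; the helper function below is hypothetical), classical dense optical flow can turn each pair of consecutive frames into a single motion image that a downstream image classifier can consume.

```python
import cv2

def motion_frames(video_path):
    """Yield one optical-flow magnitude image per consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"Cannot read video: {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow: per-pixel motion between two frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        # One single-channel "motion image" per frame pair, ready for a classifier.
        yield cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
        prev_gray = gray
    cap.release()
```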
6. Ignoring model explainability
Even if explainability is not crucial to the end goal of the solution (compared to, for example, the predictions being 100% correct), it is still important to know how to interpret the results in order to improve the model further. A novice error is to rely solely on numerical metrics when working on a solution: while they are essential for model comparison, they give no insight into why the model makes particular kinds of errors. For example, when working on a classification problem in computer vision, knowing that the model makes most of its errors in particular categories is only the first step towards improving it. The next steps would involve looking at individual examples and comparing the misclassified samples with correct ones, possibly using techniques like Grad-CAM or saliency maps. Without insights of this kind, a Data Scientist is bound to enter an infinite retraining loop without tangible results.
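For instance, a basic saliency map takes only a few lines in PyTorch. In the sketch below, the model and input are placeholders for your own; Grad-CAM works analogously, but computes gradients at a convolutional layer rather than at the input pixels.

```python
# A minimal saliency-map sketch: the gradient of the predicted class score
# with respect to the input pixels shows which regions drove the decision.
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(image)
logits[0, logits.argmax()].backward()  # backprop the top-class score

# Max over colour channels gives a per-pixel importance map to overlay on the image.
saliency = image.grad.abs().max(dim=1).values.squeeze()
```

Overlaying such maps on the misclassified images often reveals whether the model attends to the object itself or to spurious background cues.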
7. Overlooking model deployment challenges
It is always necessary to constrain the solution to the final requirements of the system that will run in production, and this is especially evident in the era of Large Language Models. These models not only come in various sizes, each of which yields different results, but are also made available in different ways, for example through an API or as weights to download from Hugging Face. It would be a great pity to build a shiny, complex RAG application on the newest GPT-4 version only to learn, at the end of the project, that the LLM actually needs to run locally and cannot exceed 8B parameters.
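One way to avoid this trap is to prototype against the production constraints from day one. Below is a minimal sketch using the transformers library; the model name is purely illustrative, chosen as an example of an open-weights model that fits under a hypothetical 8B-parameter cap.

```python
from transformers import pipeline

# Phi-3-mini (~3.8B parameters) is used here purely as an example of an
# open-weights model that would satisfy a hypothetical "local, under 8B" constraint.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",  # requires the accelerate package
)
out = generator("Summarise our returns policy in one sentence:", max_new_tokens=64)
print(out[0]["generated_text"])
```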
8. Lack of model monitoring
Like software projects, Data Science projects have a maintenance phase, and it is called model monitoring. It is needed for a very important reason: future data may differ from what you used in the exploration and training phases, and this is something you simply cannot fully account for when developing a solution. Consider a simple spam detector. Something considered ham at one point may become spam later (concept drift), perhaps because of the sheer number of similar messages you subsequently receive. Or completely new types of spam are invented by people trying hard to get into your inbox (data drift). If you finish your project without a clear idea of model maintenance, there is a high chance that the client will not use your solution for long.
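Monitoring does not have to start sophisticated. Below is a minimal sketch of a data drift check that compares a window of recent production data against a reference sample from training time, using a two-sample Kolmogorov-Smirnov test; the sample sizes and significance threshold are illustrative choices, and the synthetic data merely simulates a drifting feature.

```python
# A minimal drift-check sketch: compare the live distribution of one feature
# against a reference sample saved at training time.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time sample
live = rng.normal(loc=0.4, scale=1.0, size=1_000)       # recent production window

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:  # distributions differ: investigate, and possibly retrain
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.4f})")
```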
9. Poor communication of results
At the beginning of the project, it is crucial to correctly identify the client's needs in order to make a start, and, at the end of the project, it is equally crucial to communicate the results in an actionable way. After months of development, it is easy to think about the project in terms of delivering an optimised metric, such as precision or recall, but a non-technical person needs an explanation that they can understand and relate to.
10. Neglecting cross-functional collaboration
As Data Scientists, we often do not need to have (and realistically cannot have) extensive knowledge in all possible domains, but we have to know what questions to ask and of whom. After establishing the project goal with the stakeholders, it is vital to work closely with them in order to gather the domain knowledge needed to complete the task and to make the solution useful.
As you can see, there are many kinds of errors that a Data Scientist has to be careful to avoid. Interestingly enough, many of them are not strictly technical, but require a broader understanding of managing a commercial project and the ability to elicit requirements from stakeholders. Avoiding such mistakes is just as important as technical development when you want to take your Data Science skills to the next level.