ML Ops Fundamentals
Overview
Week 1
Current challenges data scientists face when operationalizing their models and making them available in production:
- Keeping track of the many models that have been trained is difficult.
- Data scientists want to keep track of the different versions of the code, the values they chose for the different hyperparameters, and the metrics they are evaluating (see the experiment-tracking sketch after this list).
- Keeping track of which ideas have been tried, which ones worked, and which ones did not.
- They need to pinpoint the best model, which may have been trained two weeks earlier, reproduce it, and run it on the full production data.
- Reproducibility is a major concern, because data scientists want to be able to re-run the best model with a more thorough parameter sweep.
- Putting a model into production is difficult unless it can be reproduced, because many companies have reproducibility as a policy or requirement.
- For a production application, the model needs to be updated on a regular basis as new data comes in, so traceability becomes paramount.
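These tracking and reproducibility pains are what experiment-tracking tools (for example MLflow or Vertex AI Experiments) are built to solve. As a minimal, hand-rolled sketch of the idea in Python: each training run appends one record with the code version, hyperparameter values, and evaluation metrics, so the best run can be pinpointed and reproduced later. The file name `experiments.jsonl`, the helpers `log_run` and `best_run`, and the example values are illustrative assumptions, not part of the course material.

```python
# Minimal, hand-rolled experiment tracker: appends one JSON record per
# training run so that code version, hyperparameters, and metrics can be
# looked up (and the best run reproduced) weeks later.
import json
import time
from pathlib import Path

TRACKING_FILE = Path("experiments.jsonl")  # hypothetical location

def log_run(code_version: str, params: dict, metrics: dict) -> None:
    """Append one experiment record (code version, hyperparameters, metrics)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "code_version": code_version,   # e.g. a git commit hash
        "params": params,               # hyperparameter values chosen for this run
        "metrics": metrics,             # evaluation metrics for this run
    }
    with TRACKING_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def best_run(metric: str, higher_is_better: bool = True) -> dict:
    """Scan all logged runs and return the one with the best value of `metric`."""
    runs = [json.loads(line) for line in TRACKING_FILE.read_text().splitlines()]
    return max(runs, key=lambda r: r["metrics"][metric] * (1 if higher_is_better else -1))

# Example usage: record two runs, then pinpoint the best one by validation AUC.
log_run("a1b2c3d", {"learning_rate": 0.05, "max_depth": 6}, {"val_auc": 0.91})
log_run("a1b2c3d", {"learning_rate": 0.10, "max_depth": 4}, {"val_auc": 0.88})
print(best_run("val_auc"))
```

In practice, a managed tracking service would also store the artifacts needed to fully reproduce a run, such as the trained model file and a snapshot of the training data.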
What can be done to mitigate these challenges?
- Consider the whole system in terms of time, resources, and quality.
- How do you reduce the time between analyzing the problem, creating the models, and deploying the solution, while maintaining the quality of the output?
- In software engineering, this approach is called DevOps. We can borrow the term for machine learning and call it MLOps.
Continuous integration of source code, unit testing, integration testing, and continuous delivery of the software to production are important processes in Machine Learning Operations too. But there is another important aspect to MLOps: data. Unlike conventional software that can be relied on to do the same thing every time, an ML model can go ‘off.’ By this we mean that its predictive power wanes as data profiles change, which they inevitably do. So we can build on continuous integration and continuous delivery and introduce a new term, Continuous Training, or CT. Continuous training is the process of monitoring, measuring, retraining, and serving the models.

MLOps differs from DevOps in important ways too. Continuous integration is no longer only about testing and validating code and components, but also about testing and validating data, data schemas, and models. It is no longer about a single software package or service, but a system - the ML training pipeline - that should automatically deploy another service: the model prediction service. Uniquely, ML is also concerned with automatically monitoring, retraining, and serving the models.

Another concept that transfers well from software development to machine learning is ‘technical debt.’ Software developers are familiar with time, resources, and quality trade-offs. They talk about technical debt, which is the backlog of re-work that builds up because sometimes they have to compromise on quality in order to develop code quickly. They understand that, although there may have been good reasons to do this, they have to go back and fix things later. This is an engineering version of the common saying, ‘putting off until tomorrow what is better done today.’ There is a price to pay.

Machine learning could arguably be considered ‘the high-interest credit card of technical debt.’ This means that developing and deploying an ML system can be relatively fast and cheap, but maintaining it over time can be difficult and expensive. The real challenge isn’t building an ML model; it is building an integrated ML system and continuously operating it in production. Just like a high-interest credit card, the technical debt with machine learning compounds, and it can be incredibly expensive and difficult to pay down.

Machine learning systems can be thought of as a special type of software system, so operationally they have all the challenges of software development, plus a few of their own:
- Multi-functional teams: ML projects will probably have developers and data scientists working on data analysis, model development, and experimentation, and multi-functional teams can create their own management challenges.
- Machine learning is experimental in nature. You must constantly try new approaches with the data, the models, and the parameter configuration. The challenge is tracking what worked and what didn’t, and maintaining reproducibility while maximising code reusability.
- Testing an ML system is more involved than testing other software systems, because you’re validating data, parameters, and code together in a system instead of unit-testing methods and functions.
- ML systems deployment isn’t as simple as deploying an offline-trained ML model as a prediction service. ML systems can require you to deploy a multi-step pipeline to automatically retrain and deploy models.
- Concerns with concept drift and consequent model decay must be addressed. Data profiles constantly change, and if something changes in the data input, the predictive power of the model in production will likely change with it. Therefore, you need to track summary statistics of the data and monitor the online performance of your model, so you can send notifications or roll back when values deviate from your expectations (see the monitoring sketch below).

Technical debt builds up in the ML system for many reasons, so we’ll be looking at ways to mitigate that throughout this course.
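To make the last bullet concrete, here is a minimal monitoring sketch in Python (pandas): it snapshots per-feature summary statistics at training time and flags features whose serving-time mean drifts too far from the training mean. The column names, example values, and the three-standard-deviation tolerance are illustrative assumptions; a production system would use richer statistics and hook the check into notification or rollback tooling.

```python
# Minimal drift check: compare summary statistics of serving data against a
# snapshot taken at training time and flag features whose mean has shifted
# by more than a chosen tolerance.
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Capture simple per-column summary statistics (mean and std)."""
    return df.describe().loc[["mean", "std"]]

def drifted_columns(train_stats: pd.DataFrame,
                    serving_stats: pd.DataFrame,
                    tolerance: float = 3.0) -> list[str]:
    """Return columns whose serving mean deviates from the training mean
    by more than `tolerance` training standard deviations."""
    drifted = []
    for col in train_stats.columns:
        shift = abs(serving_stats.loc["mean", col] - train_stats.loc["mean", col])
        if shift > tolerance * train_stats.loc["std", col]:
            drifted.append(col)
    return drifted

# Example usage with made-up feature values: snapshot stats at training time,
# then check each new batch of serving data and alert (or roll back) on drift.
train_df = pd.DataFrame({"age": [25, 32, 40, 51], "income": [30e3, 45e3, 52e3, 61e3]})
serving_df = pd.DataFrame({"age": [70, 75, 72, 80], "income": [31e3, 44e3, 50e3, 62e3]})

train_stats = summarize(train_df)
if cols := drifted_columns(train_stats, summarize(serving_df)):
    print(f"Data drift detected in columns: {cols} - consider retraining or rolling back.")
```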