Models Decay. Systems Don’t Have To

Building an ML model in a notebook is the easy part. This article walks through what actually happens when you try to ship that model to real users: the practices, tools, and mindset shifts that fall under the umbrella of MLOps (and now LLMOps).

ML in research vs in production

Understanding ML systems helps in designing and developing them. In this section, we’ll go over how ML systems differ from both ML in research (or as it is often taught in academia) and traditional software.

As the use of ML in industry is still fairly new, most people with ML expertise have gained it through academia: taking courses, doing research, reading papers. If that describes your background, you might find it hard to understand the challenges of using ML systems in the real world and to navigate the many solutions to those challenges. ML in production is very different from ML in research. The table below shows the major differences.

	Research	Production
Objectives	Model performance (accuracy, etc.)	Different `stakeholders` have different objectives
Computational priority	Fast training, high `throughput`	Fast `inference`, low `latency`
Data	Static	Constantly changing over time
Fairness	Good to have (sadly)	Important
Interpretability	Good to have	Important

Stakeholders and their typical objectives:
- ML team: highest accuracy
- Product: fastest inference
- Sales: sell more
- Manager: maximize profits

Moving from research to production isn’t just a technical step; it is a cultural shift. According to NVIDIA [2], MLOps is often confusing because it is:

Broad: it covers everything from data management to business culture
Diverse: there isn’t one “standard” way to do it yet, so different companies use different terms to describe the same problem
Complex: practitioners often focus on the “accidental complexity” of their specific tools rather than the essential complexity of the problem itself

MLOps is a set of practices that helps data scientists and engineers manage the ML life cycle more efficiently. It aims to bridge the gap between development and operations for ML. The goal of MLOps is to ensure that ML models are developed, tested, and deployed in a consistent and reliable way.

The “Ops” in MLOps comes from DevOps, short for Development and Operations. To operationalize something means to bring it into production, which includes deploying, monitoring, and maintaining it.

The main differences between MLOps and DevOps

DevOps is a set of practices that helps organizations bridge the gap between software development and operations teams. MLOps is a similar set of practices that specifically addresses the needs of ML models. The key differences include:

Scope: DevOps focuses on the software development life cycle, while MLOps focuses on the ML life cycle
Complexity: ML models are more complex than traditional software applications and require specific tools and techniques for development and deployment
Data: ML models rely on data for training and inference, which introduces challenges for managing and processing data
Regulations: ML models are sometimes subject to regulatory requirements, which may impact the development and deployment process

Even though they differ, MLOps and DevOps share many principles, such as the importance of automation, continuous improvement, and collaboration. Google [1] notes that organizations that have adopted DevOps practices can leverage those practices when implementing MLOps.

Benefits of MLOps

MLOps offers numerous benefits to organizations that adopt it, including:

Improved efficiency: automates the ML life cycle, reducing time and effort
Increased scalability: enables organizations to scale their ML operations more effectively, handling larger datasets
Improved reliability: reduces the risk of errors, ensuring that ML models are reliable and accurate in prod
Enhanced collaboration: provides a common framework and set of tools for data scientists, engineers and operations teams
Reduced costs: less manual effort and fewer production errors lower the cost of running ML

Vocabulary:

CI (Continuous Integration): testing code + data + model
CD (Continuous Delivery): deploying the model as a service
CT (Continuous Training): the ML-specific addition

Basic components of MLOps [1]

MLOps consists of several components that work together to manage the ML life cycle, including:

1. Exploratory data analysis (EDA)

Data visualization
Data cleaning
Feature engineering

2. Data prep and feature engineering

Data preparation and feature engineering are critical steps in the MLOps process. Data preparation involves cleaning, transforming, and formatting the raw data to make it suitable for model training.

A feature store is a centralized repository that stores, versions, and serves features for both training and inference. It solves a classic MLOps headache: making sure the features your model sees in production are computed the exact same way as during training (avoiding training-serving skew).
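One common way to reduce training-serving skew, with or without a full feature store, is to route both the offline and the online path through a single feature function. The sketch below is illustrative only: `RawOrder`, `compute_features`, and the three features are hypothetical names, not part of any specific feature store product.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RawOrder:
    """Hypothetical raw record, as it might arrive from a database or an API."""
    amount: float
    created_at: datetime


def compute_features(order: RawOrder) -> dict[str, float]:
    """Single source of truth for feature logic, shared by training and serving."""
    return {
        "log_amount": math.log1p(order.amount),
        "hour_of_day": float(order.created_at.hour),
        "is_weekend": float(order.created_at.weekday() >= 5),
    }


def build_training_rows(orders: list[RawOrder]) -> list[dict[str, float]]:
    """Offline path: batch-compute features for historical records."""
    return [compute_features(order) for order in orders]


def features_for_request(order: RawOrder) -> dict[str, float]:
    """Online path: compute features for one live request."""
    return compute_features(order)
```

Because both paths call `compute_features`, a change to the feature logic reaches training and inference together. A feature store adds storage, versioning, and low-latency serving on top of this idea.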

3. Model training and tuning

Selecting the right ML algorithm
Training the model
Tuning the model
Evaluating the model

4. Model review and governance

Model validation
Model fairness
Model interpretability
Model security

5. Model inference and serving

Model registry: a versioned catalog of trained models with their metadata (training data, hyperparameters, metrics, lineage). It answers the question “which exact model is running in production right now, and can I reproduce it?”
Model deployment
Model serving
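To make the model registry idea concrete, here is a minimal in-memory sketch. It is an assumption-laden toy, not the API of any real registry: `ModelVersion`, `InMemoryRegistry`, and their fields are hypothetical, and a production registry would persist entries and store artifacts externally.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    artifact_uri: str        # where the serialized model lives
    training_data_hash: str  # fingerprint of the training data, for lineage
    hyperparameters: dict
    metrics: dict


class InMemoryRegistry:
    """Toy registry; a real one would persist entries in a database."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ModelVersion]] = {}
        self._production: dict[str, int] = {}

    def register(self, name: str, artifact_uri: str, training_data: bytes,
                 hyperparameters: dict, metrics: dict) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(
            name=name,
            version=len(versions) + 1,
            artifact_uri=artifact_uri,
            training_data_hash=hashlib.sha256(training_data).hexdigest(),
            hyperparameters=dict(hyperparameters),
            metrics=dict(metrics),
        )
        versions.append(entry)
        return entry

    def get(self, name: str, version: int) -> ModelVersion:
        for entry in self._versions.get(name, []):
            if entry.version == version:
                return entry
        raise KeyError(f"{name} v{version} is not registered")

    def promote(self, name: str, version: int) -> None:
        self.get(name, version)  # raises KeyError for unknown versions
        self._production[name] = version

    def production(self, name: str) -> ModelVersion:
        """Answers: which exact model is running in production right now?"""
        return self.get(name, self._production[name])
```

The entry returned by `production(name)` carries the data hash and hyperparameters, which is the minimum needed to attempt a reproduction of the deployed model.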

6. Model monitoring

Our models are bound to decay: the data will shift and performance will drop. And yes, we can fight back. You can’t fix what you can’t see, so we need monitoring: continuously tracking the performance and behavior of the ML model in production. Tasks may include:

Logging: capture events and errors
Metrics: track key performance indicators (KPIs)
Dashboards: visualize system health
Alerting: notify on anomalies and model issues

See the section “Myths of ML deployment” for more on data and concept drift.
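As a rough illustration of how logging, metrics, and alerting fit together, the sketch below tracks accuracy over a sliding window and fires an alert when it drops below a threshold. The class name, window size, and thresholds are hypothetical, and it assumes ground-truth labels eventually arrive, which in many production systems happens with a delay.

```python
import logging
from collections import deque
from typing import Optional

logger = logging.getLogger("model_monitor")


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent labelled predictions."""

    def __init__(self, window_size: int = 100, alert_threshold: float = 0.8,
                 min_samples: int = 20) -> None:
        self._outcomes: deque = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def accuracy(self) -> Optional[float]:
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def record(self, prediction, label) -> bool:
        """Log one labelled prediction; return True if an alert fired."""
        self._outcomes.append(prediction == label)
        accuracy = self.accuracy()
        # Metric: emitted on every call so a dashboard could plot it.
        logger.info("rolling_accuracy=%.3f n=%d", accuracy, len(self._outcomes))
        # Alert: only once enough samples exist to make the number meaningful.
        if len(self._outcomes) >= self.min_samples and accuracy < self.alert_threshold:
            logger.warning("accuracy %.3f fell below threshold %.3f",
                           accuracy, self.alert_threshold)
            return True
        return False
```

In a real deployment the warning would typically be routed to a paging or chat system rather than only written to a log.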

7. Automated model retraining

Automated model retraining involves retraining the ML model when its performance degrades or when new data becomes available. Automated model retraining includes:

Triggering model retraining
Retraining the model
Evaluating the retrained model
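The three steps above (trigger, retrain, evaluate) can be sketched as a small decision loop. Everything here is hypothetical: the `RetrainPolicy` thresholds are placeholders, and `train_fn` and `evaluate_fn` stand in for whatever training and evaluation jobs a team actually runs.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RetrainPolicy:
    min_accuracy: float = 0.85     # retrain if live accuracy drops below this
    min_new_rows: int = 10_000     # ...or if this much new data has arrived
    min_improvement: float = 0.0   # candidate must beat production by more than this


def should_retrain(live_accuracy: float, new_rows: int, policy: RetrainPolicy) -> bool:
    """Trigger: degraded performance or enough fresh data."""
    return live_accuracy < policy.min_accuracy or new_rows >= policy.min_new_rows


def should_promote(candidate_metric: float, production_metric: float,
                   policy: RetrainPolicy) -> bool:
    """Gate: only replace the production model if the candidate is better."""
    return candidate_metric - production_metric > policy.min_improvement


def retraining_cycle(live_accuracy: float, new_rows: int, production_metric: float,
                     train_fn: Callable[[], Any],
                     evaluate_fn: Callable[[Any], float],
                     policy: RetrainPolicy) -> str:
    if not should_retrain(live_accuracy, new_rows, policy):
        return "skipped"
    candidate = train_fn()                    # retrain the model
    candidate_metric = evaluate_fn(candidate)  # evaluate on a held-out set
    if should_promote(candidate_metric, production_metric, policy):
        return "promoted"
    return "rejected"
```

The evaluation gate matters as much as the trigger: retraining on shifted or corrupted data can produce a worse model, so the candidate should not be promoted automatically.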

From Notebooks to Production-Grade Code

Real-world MLOps requires moving beyond messy notebooks to structured software. To ensure the reliability mentioned above, engineers should adopt robust design patterns such as type-safe data contracts (Pydantic), abstract interfaces for swappable models (ABCs), proper testing, and modular pipelines. I’ll dedicate a future blog post to these patterns.
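As a small preview of two of these patterns, here is a dependency-free sketch. Note the swap: it uses a standard-library dataclass with manual validation as a stand-in for a Pydantic model, so it illustrates the idea of a data contract rather than Pydantic’s actual API. `PredictionRequest`, `Model`, and `ThresholdModel` are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionRequest:
    """Data contract: invalid inputs are rejected before reaching the model."""
    user_id: str
    amount: float

    def __post_init__(self) -> None:
        if not isinstance(self.user_id, str) or not self.user_id:
            raise ValueError("user_id must be a non-empty string")
        if isinstance(self.amount, bool) or not isinstance(self.amount, (int, float)):
            raise ValueError("amount must be a number")
        if self.amount < 0:
            raise ValueError("amount must be non-negative")


class Model(ABC):
    """Abstract interface: serving code depends on this, not on a concrete model."""

    @abstractmethod
    def predict(self, request: PredictionRequest) -> float:
        ...


class ThresholdModel(Model):
    """Trivial baseline that can later be swapped for a trained model."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, request: PredictionRequest) -> float:
        return 1.0 if request.amount > self.threshold else 0.0


def score(model: Model, request: PredictionRequest) -> float:
    """Serving entry point; works with any Model implementation."""
    return model.predict(request)
```

Because `score` only knows about the `Model` interface, replacing `ThresholdModel` with a gradient-boosted model or an LLM-backed scorer does not require touching the serving code.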

Traditional baseline: the core MLOps lifecycle

The arrows point in one direction, but in reality the process is much more complex than that.

The next workflow shows seven human processes, each of which informs the next:

Notice the colored dots: different personas own different stages. Let’s meet them

Human-Centric ML Workflow

An example of a human-centric machine learning discovery and development workflow, with the relevant tasks, personas, and interactions

Who are the players in this game?

As AI matures, roles have become more specialized. According to NVIDIA, here is how the team usually looks:

Data Scientists: The “experimentalists” who find patterns in data
Data Engineers: The “plumbers” who make data available, secure, and organized at scale
Machine Learning Engineers: The “builders” who focus specifically on developing and optimizing the production infrastructure
Application Developers: The “integrators” who take the ML services and bake them into the final product

Gartner ML Pipeline

Gartner’s view of the machine-learning pipeline. Source: NVIDIA blog

Myths of ML deployment

The main myths and their realities are summarized below; see the Stanford MLSys Seminar video “Machine learning production myths” [5].

Myth	Reality
#1 You only deploy one or two ML models at a time	Enterprise applications require compound systems and ensembles of dozens of specialized models and tools
#2 If we don’t do anything, model performance remains the same.	Models degrade due to data `distribution shifts`; `continuous monitoring` is mandatory.
#3 You won’t need to update your models as much.	Continuous learning and online evaluation drive weekly, if not daily, updates to context and prompts.
#4 Most ML engineers don’t need to worry about scale.	LLMs introduce unprecedented computational bottlenecks and latency challenges in production.
#5 (LLMOps-related) LLMs are plug-and-play and don’t need ‘Ops’.	LLMs are non-deterministic; they require prompt versioning and LLM-as-a-Judge evaluations to ensure that a simple prompt change doesn’t break the entire system’s logic.

Myth #1: Uber has thousands of models in production. At any given moment, Google [1] has thousands of models training concurrently, with hundreds of billions of parameters in size. Booking.com has 150+ models. A 2021 study by Algorithmia shows that among organizations with over 25,000 employees, 41% have more than 100 models in production.

The figure below shows the wide range of tasks that leverage ML at Netflix [3].

Different tasks that leverage ML at Netflix. Source: [Video]

Netflix ML Tasks

Myth #2 mentions “data distribution shifts”. That umbrella term actually splits into two distinct phenomena:

Concept Drift: The statistical relationship between the input features and the target variable changes over time, meaning the “logic” of the model’s predictions is no longer valid.
Data Drift: The underlying distribution of the input data changes (e.g., due to seasonal shifts or sensor changes), even if the fundamental relationship between those features and the output remains the same.
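Data drift on a single numeric feature can be checked without labels by comparing its distribution in production against a reference sample from training. The sketch below uses the Population Stability Index (PSI) with equal-width bins; the function name, bin count, and smoothing constant are choices made for this example. It detects changes in the inputs only, so it will not catch concept drift, which requires labels or a downstream performance signal.

```python
import math


def population_stability_index(reference: list[float], current: list[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def bucket_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_shares = bucket_shares(reference)
    cur_shares = bucket_shares(current)
    return sum((cur - ref) * math.log(cur / ref)
               for ref, cur in zip(ref_shares, cur_shares))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than guarantees and should be tuned per feature.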

Matching the Solution to the Data

Not all MLOps is the same. The requirements change based on what you are processing.

Tabular Data: Often allows for automated labeling and simpler pipelines
Unstructured Data (Video, Audio, Language): Requires much more manual human effort to label and complex infrastructure to process
High-Stakes Systems: If the ML controls financial portfolios or medical decisions, it requires intense simulation and safety validation before it ever touches a user

The 4 stages of ML maturity

Scalability is not a feature you can add; it is a state you reach by evolving through the four stages of continual learning (framework adapted from Chip Huyen’s Designing Machine Learning Systems [4]). Moving from manual scripts to full automation is a cultural transformation that bridges the gap between the data scientist’s focus on patterns and the ML engineer’s focus on infrastructure.

Stage 1: manual, script-based retraining. Training is stateless: data is extracted manually and models are trained in notebooks
Stage 2: automated retraining. Retraining is scheduled or triggered by time or metrics, but deployment of the resulting artifacts is still manual
Stage 3: automated deployment. The pipeline now includes automated testing and canary releases, and the system is allowed to update the model in prod without human intervention
Stage 4: full automation (continual learning). The system transitions to stateful training, where the model updates continuously in response to performance triggers or new data streams

ML Maturity Stages

“To be deployed, your model will have to leave the development environment. Your model can be deployed to a staging environment for testing or to a production environment to be used by your end users.”

The LLMOps pivot: hallucination and prompt versioning

LLMOps is a specialized subset of MLOps that focuses specifically on the challenges and requirements of managing LLMs. While MLOps covers the general principles and practices of managing machine learning models, LLMOps addresses the unique characteristics of LLMs, such as their large size, complex training requirements, and high computational demands [1].

NVIDIA [2] defines it as a subset of the broader GenAIOps paradigm, focused on operationalizing transformer-based networks for language use cases in production applications. Language is a foundational modality that can be combined with other modalities to guide AI system behavior. For example, multimodal systems combine text and image data for visual content production.

The LLMOps workflow

According to Google [1] LLMOps involves a number of different steps, including:

Data collection and preparation: LLMs require large amounts of data to train. This data must be collected and prepared in a way that is suitable for training the model.
Model development: LLMs are developed using a variety of techniques, including unsupervised learning, supervised learning, and reinforcement learning.
Model deployment: Once an LLM has been developed, it must be deployed to a production environment. This involves setting up the necessary infrastructure and configuring the model to run on a specific platform.
Model management: LLMs require ongoing management to ensure that they are performing as expected. This includes monitoring the model’s performance, retraining the model as needed, and making sure that the model is secure.

Benefits of LLMOps [1]

Performance: tunes and optimizes LLMs so they respond faster and more accurately in production.
Scalability: gives you a flexible framework to handle growing traffic, bigger models, and shifting requirements without rewriting everything.
Risk reduction : catches issues early through monitoring, guardrails, and security checks before they turn into outages or data leaks.
Efficiency: automates the boring parts of the LLM lifecycle (data prep, deployment, monitoring) so your team spends less time on glue code and more time on what matters.

As NVIDIA [2] puts it, GenAIOps encompasses MLOps, code development operations (DevOps), data operations (DataOps), and model operations (ModelOps), for all generative AI workloads from language, to image, to multimodal. Data curation and model training, customization, evaluation, optimization, deployment, and risk management must be rethought for generative AI.

GenAIOps capabilities include:

synthetic data management
embedding management
agent/chain management
guardrails
prompt management

The transition to LLMs means we need to change how we monitor changes in data and focus more on understanding the subtleties of natural language and human intent. The unit of logic has moved from code to the prompt. The main problem with GenAI is hallucination, which we mitigate using RAG (retrieval-augmented generation): it grounds the model in facts by fetching relevant context from enterprise documents. In this architecture the LLM is not a database but a reasoning engine.

New lifecycle: RAG

In this new paradigm, the lifecycle expands. We are no longer just managing a model; we are managing the context. This introduces vector-retrieval skew: a situation where your vector database index falls out of sync with your primary data sources, leading the “reasoning engine” to provide outdated or incorrect facts. Managing this requires the same iterative process used for traditional pipelines, but applied to embedding quality and retrieval accuracy [read more about RAG here].
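One simple way to detect this kind of skew is to store a content hash alongside each indexed document and periodically compare it with the source of truth. The sketch below assumes such hashes are available as plain dictionaries; `find_index_skew` and the three categories it reports are hypothetical, and a real vector database would expose this metadata through its own API.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_index_skew(source_docs: dict[str, str],
                    indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare source documents with the hashes stored at indexing time.

    source_docs: doc_id -> current text in the primary data source
    indexed_hashes: doc_id -> content hash recorded when the doc was embedded
    """
    stale = sorted(
        doc_id for doc_id, text in source_docs.items()
        if doc_id in indexed_hashes and indexed_hashes[doc_id] != content_hash(text)
    )
    missing = sorted(set(source_docs) - set(indexed_hashes))   # never embedded
    orphaned = sorted(set(indexed_hashes) - set(source_docs))  # deleted at source
    return {"stale": stale, "missing": missing, "orphaned": orphaned}
```

Stale and missing documents would be re-embedded, and orphaned entries removed from the index, before the retriever can serve outdated context.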

To manage GenAI in production, architects deploy a GenAI API gateway, as shown in the figure below. This infrastructure layer handles the operational overhead: prompt versioning, rate limiting, cost tracking across different providers, and the enforcement of guardrails to filter adversarial inputs or unsafe outputs.
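The responsibilities listed above can be sketched in a few dozen lines. This is a deliberately simplified toy, not a real gateway: providers are plain Python callables, the rate limit is a fixed per-client quota with no time window, cost is a flat per-call figure, and the guardrail is a naive keyword filter. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class GenAIGateway:
    def __init__(self, providers: dict[str, Callable[[str], str]],
                 cost_per_call: dict[str, float], max_calls_per_client: int,
                 blocked_terms: list[str]) -> None:
        self.providers = providers
        self.cost_per_call = cost_per_call
        self.max_calls_per_client = max_calls_per_client
        self.blocked_terms = [term.lower() for term in blocked_terms]
        self.prompts: dict[tuple[str, int], str] = {}   # (name, version) -> template
        self.calls: defaultdict = defaultdict(int)      # client_id -> calls made
        self.spend: defaultdict = defaultdict(float)    # provider -> total cost

    def register_prompt(self, name: str, version: int, template: str) -> None:
        """Prompt versioning: templates are immutable once registered under a version."""
        self.prompts[(name, version)] = template

    def _violates_guardrail(self, text: str) -> bool:
        lowered = text.lower()
        return any(term in lowered for term in self.blocked_terms)

    def complete(self, client_id: str, provider: str, prompt_name: str,
                 prompt_version: int, **variables: str) -> str:
        if self.calls[client_id] >= self.max_calls_per_client:
            raise RuntimeError("rate limit exceeded")
        prompt = self.prompts[(prompt_name, prompt_version)].format(**variables)
        if self._violates_guardrail(prompt):
            raise ValueError("input blocked by guardrail")
        self.calls[client_id] += 1
        self.spend[provider] += self.cost_per_call[provider]
        output = self.providers[provider](prompt)
        if self._violates_guardrail(output):
            return "[response withheld by guardrail]"
        return output
```

Because callers name a provider and a prompt version rather than embedding either in application code, switching models or rolling back a prompt becomes a configuration change.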

In LLMOps, evals are the new unit tests. Every prompt change or model swap must be benchmarked against them to ensure the system is growing rather than just changing. The main evaluation strategies are:

Golden datasets: curated input/output pairs representing ideal behavior, run on every change
LLM-as-a-Judge: using a stronger model (like GPT-4 or Claude) to score outputs on criteria like relevance, faithfulness, and tone
Human evaluation: domain experts reviewing samples, expensive but still the gold standard
Red-teaming: adversarial prompts designed to break guardrails and surface unsafe outputs
“For a software engineer to become an AI engineer, the mental approach has to be different: Shift from code mindset to data mindset. It’s a game of experimentation, iterations and growing a system. It’s not a one-time build.”
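The golden-dataset strategy above can be sketched as a small regression harness that runs on every prompt or model change. The dataset, the substring-based check, and the pass-rate threshold are all placeholders for illustration; real evals usually need richer scoring, such as an LLM judge or task-specific metrics.

```python
from typing import Callable

# Hypothetical golden set: (prompt, text the answer is expected to contain).
GOLDEN_SET = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]


def run_evals(generate: Callable[[str], str],
              golden_set: list[tuple[str, str]],
              min_pass_rate: float = 0.9) -> dict:
    """Run every golden example through `generate` and report the pass rate."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    failures = []
    for prompt, expected in golden_set:
        output = generate(prompt)
        # Naive check: case-insensitive substring match.
        if expected.lower() not in output.lower():
            failures.append({"prompt": prompt, "expected": expected, "output": output})
    pass_rate = 1 - len(failures) / len(golden_set)
    return {
        "pass_rate": pass_rate,
        "passed": pass_rate >= min_pass_rate,
        "failures": failures,
    }
```

Wired into CI, a `passed` value of `False` would block the prompt change from shipping, in the same way a failing unit test blocks a code change.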

Architecture shift: orchestrating compound systems

GenAI Architecture

We are moving away from simple REST APIs that serve static predictions. Today’s GenAI architecture requires an API gateway to handle routing, caching, and guardrails, which then feeds into an orchestrator that interacts with vector databases and foundation models dynamically.

GenAI API Gateway: for ease of development, ease of switching models, logging, monitoring metrics, alerting, cost tracking…

The core idea of agents is to use a language model to choose a sequence of actions to take.

LLMOps vs MLOps

Category	MLOps	GenAIOps / LLMOps
Core Focus	Model Training & Tuning	Model Interaction & Context
Key Asset	Feature Stores	Vector Databases & Embeddings
Data Management	Tabular & Structured business data	Unstructured data (Text, Video, Audio)
Evaluation	Standard deterministic metrics (Accuracy, F1 Score)	‘LLM-as-a-Judge’, output safety, hallucination rates, and prompt versioning

Traditional MLOps was all about structured data, managing feature stores, and tracking metrics like accuracy.

LLMOps is a whole different beast: now we are dealing with unstructured text, managing prompts like code, using vector databases, and building guardrails that keep these powerful models from going off the rails.

It brings a new set of problems to solve. We are not tuning models anymore; we are managing embeddings for RAG systems, versioning our prompts, and detecting exactly when our model hallucinates. How do we build a reliable system when the outputs are not guaranteed to be the same twice?

LLMOps is not replacing MLOps; it is building right on top of it.

All those fundamentals about data quality, monitoring, and automation are more important than ever.

As these AI systems become more creative and unpredictable, the job changes. The challenge is not just managing the lifecycle of the model; the real task is designing systems that let us not just manage AI but trust it.

In just a few short years we have moved from XGBoost to transformers, and the next architecture might be on the horizon. However, the fundamentals of robust system design are constant. Reliability, scalability, and maintainability are the only things that survive a model’s obsolescence.

As we evaluate our pipelines today, we can ask ourselves: is our current infrastructure designed to facilitate the iterative process required for long-term survival, or have we built a static monument to a model that will be irrelevant in six months?

Consider this the foundation. There’s a lot more to dig into, piece by piece.

References

[1] Google Cloud: What is LLMOps
[2] NVIDIA: Mastering LLM Techniques
[3] Human-centric ML infra @ Netflix
[4] Chip Huyen: Designing Machine Learning Systems (O’Reilly, 2022)
[5] Stanford MLSys Seminar
