This post is mainly based on the MLOps overviews from Wikipedia and Databricks, and the paper “Hidden Technical Debt in Machine Learning Systems” (Sculley et al., NeurIPS 2015).

MLOps

  • Wikipedia - MLOps
  • Databricks - MLOps
  • A set of best practices
  • Motivation: productionizing ML models is difficult and requires collaboration between multiple teams
  • Goal: deploy and maintain machine learning models in production; improve efficiency and scalability, and reduce risk
  • Approach: abstract workflows and increase automation
  • ML lifecycle management
    • Model generation (Software development lifecycle, CI/CD)
    • Orchestration
    • Deployment
    • Health, diagnostics, governance, and business metrics
  • One way to break down the ML lifecycle
    • Data collection, data processing, feature engineering, data labeling
    • Model design, model training and optimization
    • Endpoint deployment and endpoint monitoring
  • MLOps Tools
    • Model metadata storage and management
      • MLflow (see the tracking sketch after this list)
    • Data and pipeline versioning
      • DVC
    • Orchestration and workflow pipelines
      • Kubeflow
      • Polyaxon
      • SageMaker Pipelines
    • Production model monitoring
      • SageMaker Model Monitor
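
As a concrete illustration of model metadata storage, here is a minimal sketch of logging an experiment with MLflow's tracking API: hyperparameters, a validation metric, and the trained model artifact. The dataset, run name, and parameter values are made up for illustration.

```python
# Minimal sketch of experiment tracking with MLflow.
# The dataset, run name, hyperparameters, and metric values are made up for illustration.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-logreg"):
    params = {"C": 1.0, "max_iter": 200}
    mlflow.log_params(params)                                      # hyperparameters
    model = LogisticRegression(**params).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))   # evaluation metric
    mlflow.sklearn.log_model(model, "model")                       # model artifact for later deployment
```

Runs logged this way can later be compared in the MLflow UI or queried programmatically when deciding which model to promote.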

Why MLOps?

  • Technical Debt
    • A metaphor introduced by Ward Cunningham
    • Refers to the long-term costs incurred by moving quickly in software engineering
    • Technical debt may be paid down by
      • Refactoring code
      • Improving unit tests
      • Deleting dead code
      • Reducing dependencies
      • Tightening APIs
      • Improving documentation
    • The goal is not to add new functionality, but to
      • Enable future improvements
      • Reduce errors
      • Improve maintainability
  • Technical Debt in ML
    • Developing and deploying ML systems is relatively fast and cheap
    • However, maintaining ML systems over time is difficult and expensive, due to
      • Abstraction boundary erosion
      • Entanglement
      • Data dependency > Code Dependency
      • Feedback loops
      • System-level anti-patterns
      • Configuration debt
      • Changes in the external world

Abstraction Boundary Erosion

  • Traditional software engineering
    • Enforce strong abstraction boundaries using encapsulation and modular design
    • This helps create maintainable code in which it is easy to make isolated changes and improvements
  • ML system
    • Difficult to enforce strict abstraction boundaries or prescribe specific intended behavior
    • ML is used precisely for problems where the desired behavior cannot be effectively expressed in software logic without depending on external data
    • The real world does not fit into tidy encapsulation
    • Traditional abstractions and boundaries may be subtly corrupted or invalidated by these data dependencies

Entanglement

  • Machine learning systems mix signals together, entangling them and making isolation of improvements impossible.
  • Example
    • Consider an ML system that makes predictions based on features $x_1, \dots, x_n$
    • Mathematically, an ML model is a mapping $f: x \rightarrow y$ that minimizes some loss $\mathcal{L}(\hat{y}, y)$
    • Suppose new data flows in and contains anomalies that change the distribution of $x_1$, and we update the model (retrain or train from a checkpoint)
    • The new mapping $f'$, trained with the anomalies, may no longer minimize the loss $\mathcal{L}(\hat{y}, y)$ under the original distribution (see the numerical sketch after this list)
  • Possible Mitigation
    • Isolate models: the final output is based on an ensemble of explainable features
    • Detect changes in prediction behavior as they occur
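
Below is a minimal numerical sketch of this entanglement (the paper's CACE principle: Changing Anything Changes Everything). A hypothetical linear data-generating process is fit by ordinary least squares; the feature correlation and the anomaly in $x_1$ are assumptions chosen purely for illustration.

```python
# Sketch: a shift in one feature's distribution changes the whole learned mapping, so the
# retrained model is worse on data that still follows the original distribution.
# All distributions, weights, and the anomaly are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

def make_data(n):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # x2 is correlated with x1
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = X @ w_true + rng.normal(scale=0.1, size=n)
    return X, y

def fit(X, y):
    # ordinary least squares: minimize L(y_hat, y) = ||X w - y||^2
    return np.linalg.lstsq(X, y, rcond=None)[0]

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

X_train, y_train = make_data(5_000)
w_old = fit(X_train, y_train)            # original model f

X_anom = X_train.copy()
X_anom[:, 0] += 3.0                      # anomaly: the logged x1 values drift by a constant
w_new = fit(X_anom, y_train)             # updated model f', trained on the anomalous data

X_test, y_test = make_data(5_000)        # data still from the original distribution
print("f  weights:", np.round(w_old, 2), " loss:", round(loss(w_old, X_test, y_test), 3))
print("f' weights:", np.round(w_new, 2), " loss:", round(loss(w_new, X_test, y_test), 3))
# Because x1 and x2 are entangled, the anomaly in x1 also shifts the weight learned for x2,
# and f' no longer minimizes the loss under the original distribution.
```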

Data Dependency > Code Dependency

  • Code dependencies can be identified via static analysis (e.g., compilers and linkers)
  • Data dependencies can be hard to analyze without appropriate tools
  • Unstable Data Dependencies
    • The model consumes input signals produced by other systems, and those signals may change behavior over time
    • Examples
      • One ML model consumes the output of another ML model
      • One ML model consumes data-dependent features (e.g., a population mean or TF-IDF scores), which shift as the underlying data shifts
    • Mitigation: create versioned copies of signals, and only move to a new version after it has been vetted
  • Underutilized Data Dependencies
    • The model consumes features that provide little incremental modeling benefit
    • Examples
      • Legacy Features
      • Bundled Features: a group of features is evaluated and added together, so features that add little value slip in with the bundle
      • $\epsilon$-Features: features with high complexity but very small accuracy gain
      • Correlated Features: the model may credit a correlated but non-causal feature, which becomes brittle if the correlation later changes
    • Mitigation: regularly detect and remove underutilized features, e.g., via leave-one-feature-out evaluations (see the sketch after this list), so that the ML system is less vulnerable to change
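
One way to surface underutilized features is an exhaustive leave-one-feature-out evaluation: retrain without each feature and check how much the validation metric actually drops. The sketch below uses a synthetic dataset, a logistic regression model, and an arbitrary 0.002 accuracy threshold, all chosen for illustration only.

```python
# Sketch: detect underutilized features with a leave-one-feature-out evaluation.
# Dataset, model, and the 0.002 threshold are made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=4, n_redundant=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def val_accuracy(feature_mask):
    model = LogisticRegression(max_iter=1_000)
    model.fit(X_train[:, feature_mask], y_train)
    return model.score(X_val[:, feature_mask], y_val)

baseline = val_accuracy(np.ones(X.shape[1], dtype=bool))
for i in range(X.shape[1]):
    mask = np.ones(X.shape[1], dtype=bool)
    mask[i] = False                        # drop feature i and retrain
    drop = baseline - val_accuracy(mask)
    # Features whose removal barely hurts (or even helps) are candidates for deletion,
    # e.g. legacy, bundled, epsilon-, or correlated features.
    flag = "<- candidate for removal" if drop < 0.002 else ""
    print(f"feature {i}: accuracy drop without it = {drop:+.4f} {flag}")
```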

Feedback loops

  • Live ML systems that are updated regularly over time often end up influencing their own behavior
  • Direct feedback loops
    • A model’s prediction directly affects the data it will collect in the future
    • Example: an algorithm choosing which ad to present to a consumer
    • Analyzing these feedback loops requires knowledge of bandit problems / reinforcement learning; a minimal simulation of such a loop follows this list
  • Hidden feedback loops
    • Two or more models influence each other indirectly through the real-world data they collect
    • Example: two independent stock-market prediction models; improvements (or bugs) in one may influence the trading behavior, and therefore the input data, of the other
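
The sketch below simulates a direct feedback loop in a hypothetical ad-selection setting: the system only observes click feedback for the ad it chose to show, so its own predictions bias the data it collects. The click-through rates, the optimistic prior, and the purely greedy policy are all made-up assumptions.

```python
# Sketch of a direct feedback loop: click feedback is only collected for the ad that was
# shown, so the system's click-rate estimates are biased by its own past decisions.
# The true CTRs and the greedy policy are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.05, 0.06])        # unknown to the system; ad 2 is actually best
clicks = np.zeros(3)
shows = np.zeros(3)

for t in range(10_000):
    # Optimistic prior of 0.1 for ads that have never been shown.
    estimates = np.where(shows > 0, clicks / np.maximum(shows, 1), 0.1)
    ad = int(np.argmax(estimates))             # the prediction decides which data we collect
    shows[ad] += 1
    clicks[ad] += rng.random() < true_ctr[ad]  # feedback observed only for the shown ad

print("impressions per ad:", shows)
print("estimated CTRs    :", np.round(clicks / np.maximum(shows, 1), 3))
print("true CTRs         :", true_ctr)
# A purely greedy policy can lock onto an early "winner" and never collect enough data to
# correct itself; bandit-style exploration is one way to break this loop.
```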

System-level anti-patterns

  • Only a tiny fraction of the code in a real-world ML system is devoted to model specification (learning or prediction)
  • Glue Code & Pipeline Jungles
    • Supporting code/pipelines that transform data into a format the ML model can consume
    • Often caused by incremental model improvements / adding features one at a time
    • Problems
      • Difficult to maintain
      • Requires lots of error-detection and failure-recovery logic, plus expensive end-to-end integration tests
      • Produces many intermediate file outputs
    • Mitigation
      • Wrap open-source / black-box packages behind common, project-owned APIs (see the sketch after this list)
      • Reduce the separation between the “research” and “engineering” arms
  • Dead Experimental Codepaths
    • Alternative codepaths left over from past experiments / ablation studies accumulate over time
    • Mitigation: periodic review
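
As a sketch of the "wrap packages behind common APIs" mitigation, the example below hides an sklearn estimator behind a small project-owned interface so downstream code never touches package-specific glue. The `Predictor` protocol and `SklearnPredictor` adapter are hypothetical names invented for illustration.

```python
# Sketch: hide a third-party package behind a small project-owned interface so the rest of
# the pipeline depends on one stable API instead of package-specific glue code.
# Predictor and SklearnPredictor are hypothetical names.
import numpy as np
from typing import Protocol
from sklearn.linear_model import LogisticRegression

class Predictor(Protocol):
    def predict(self, features: np.ndarray) -> np.ndarray: ...

class SklearnPredictor:
    """Adapter that makes an sklearn estimator look like our Predictor interface."""
    def __init__(self, estimator):
        self._estimator = estimator

    def predict(self, features: np.ndarray) -> np.ndarray:
        return self._estimator.predict_proba(features)[:, 1]   # positive-class scores

def serve(model: Predictor, batch: np.ndarray) -> np.ndarray:
    # Serving code is written against Predictor only; swapping in a model from another
    # package means writing one new adapter, not rewriting the pipeline.
    return model.predict(batch)

# Usage sketch with made-up data
X = np.random.default_rng(0).normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
model = SklearnPredictor(LogisticRegression(max_iter=1_000).fit(X, y))
print(serve(model, X[:5]))
```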

(Figure: ML code is only a small fraction of a real-world ML system; most of the system is the surrounding infrastructure.)

Configuration debt

  • An ML system exposes a very large number of configuration options
    • Feature selection
    • Data selection / train-test split
    • Model architecture
    • Hyperparameters
    • Various pre- / post-processing
  • Many researchers and engineers treat configuration as an afterthought
  • Principles of good configuration systems (illustrated in the sketch after this list)
    • Easy to specify a small change from a previous configuration
    • Easy to see the difference between two configurations
    • Easy to automatically assert and verify basic facts about the configuration
    • Configurations should undergo a full code review and be checked into a repository
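
The sketch below illustrates these principles with configuration treated as ordinary, reviewable code: a hypothetical dataclass config supports deriving a small change from a base configuration, diffing two configurations, and asserting basic facts. All field names and values are made up.

```python
# Sketch: configuration as code. A dataclass config makes it easy to (1) derive a small
# change from a previous config, (2) diff two configs, and (3) assert basic facts.
# The fields and values are made up for illustration.
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    features: tuple = ("age", "income", "clicks_7d")
    train_split: float = 0.8
    model: str = "logistic_regression"
    learning_rate: float = 0.01
    epochs: int = 10

def diff(a: TrainConfig, b: TrainConfig) -> dict:
    """Return only the fields that differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}

def validate(cfg: TrainConfig) -> None:
    """Automatically assert basic facts about a configuration."""
    assert 0.0 < cfg.train_split < 1.0, "train_split must be a fraction"
    assert cfg.learning_rate > 0, "learning_rate must be positive"
    assert len(cfg.features) > 0, "at least one feature is required"

base = TrainConfig()
experiment = replace(base, learning_rate=0.001, epochs=20)   # small change from the base config

validate(experiment)
print(diff(base, experiment))   # {'learning_rate': (0.01, 0.001), 'epochs': (10, 20)}
```

Checking such files into the repository and reviewing them like any other code makes configuration changes visible and auditable.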

Changes in the external world

  • ML systems often interact directly with the external world, which is rarely stable
  • Fixed Thresholds in Dynamic Systems
    • Decision thresholds / prediction ranges are often chosen manually; when the model is retrained or the data shifts, the old thresholds can become invalid
    • Mitigation: learn thresholds automatically by evaluating on a held-out validation set
  • Monitoring and Testing
    • Unit tests alone may be inadequate for an online system
    • Comprehensive live monitoring is required to ensure that a system is working as intended
    • Monitor
      • Prediction distribution shift (see the PSI sketch after this list)
      • Action Limits
      • Upstream Producers
    • External changes occur in real time, so responses must also occur in real time
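
As one way to monitor prediction distribution shift, the sketch below compares live prediction scores against a reference window using the population stability index (PSI), computed with plain numpy. The reference/live distributions, the ten bins, and the 0.2 alert threshold are assumptions; 0.2 is a common rule of thumb rather than a standard.

```python
# Sketch: monitor prediction-distribution shift by comparing live scores against a
# reference window with the population stability index (PSI).
# The score distributions, bin count, and alert threshold are made up for illustration.
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference window so both windows use the same bins.
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.digitize(reference, cuts), minlength=bins) / len(reference)
    live_frac = np.bincount(np.digitize(live, cuts), minlength=bins) / len(live)
    ref_frac = np.clip(ref_frac, 1e-6, None)     # avoid log(0) and division by zero
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=50_000)   # e.g., last week's prediction scores
live_scores = rng.beta(2, 3, size=5_000)         # today's scores after the world changed

value = psi(reference_scores, live_scores)
if value > 0.2:                                  # assumed alert threshold
    print(f"ALERT: prediction distribution shifted (PSI={value:.2f})")
else:
    print(f"OK (PSI={value:.2f})")
```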

Useful questions to consider

  • How easily can an entirely new algorithmic approach be tested at full scale?
  • What is the transitive closure of all data dependencies?
  • How precisely can the impact of a new change to the system be measured?
  • How quickly can new members of the team be brought up to speed?