This post is mainly based on the MLOps overviews from Wikipedia and Databricks, and the paper “Hidden Technical Debt in Machine Learning Systems” (Sculley et al., NeurIPS 2015).

MLOps

  • Wikipedia - MLOps
  • Databricks - MLOps
  • A set of best practices
  • Motivation: productionizing ML models is difficult and requires collaboration between multiple teams
  • Goal: deploy and maintain machine learning models in production; improve efficiency and scalability, and reduce risk
  • Approach: abstract workflows and increase automation
  • ML lifecycle management
    • Model generation (Software development lifecycle, CI/CD)
    • Orchestration
    • Deployment
    • Health, diagnostics, governance, and business metrics
  • One way to break down the ML lifecycle
    • Data collection, data processing, feature engineering, data labeling
    • Model design, model training and optimization
    • Endpoint deployment and endpoint monitoring
  • MLOps Tools
    • Model metadata storage and management
      • MLflow (see the tracking sketch after this list)
    • Data and pipeline versioning
      • DVC
    • Orchestration and workflow pipelines
      • Kubeflow
      • Polyaxon
      • SageMaker Pipelines
    • Production model monitoring
      • SageMaker Model Monitor
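
As a concrete illustration of model metadata storage, here is a minimal sketch of logging an experiment with MLflow's tracking API: hyperparameters, a validation metric, and the trained model artifact. The dataset, run name, and parameter values are made up for illustration.

```python
# Minimal sketch of experiment tracking with MLflow.
# The dataset, run name, hyperparameters, and metric values are made up for illustration.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-logreg"):
    params = {"C": 1.0, "max_iter": 200}
    mlflow.log_params(params)                                      # hyperparameters
    model = LogisticRegression(**params).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))   # evaluation metric
    mlflow.sklearn.log_model(model, "model")                       # model artifact for later deployment
```

Runs logged this way can later be compared in the MLflow UI or queried programmatically when deciding which model to promote.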

Why MLOps?

  • Technical Debt
    • A metaphor introduced by Ward Cunningham
    • Refers to the long-term costs incurred by moving quickly in software engineering
    • Technical debt may be paid down by
      • Refactoring code
      • Improving unit tests
      • Deleting dead code
      • Reducing dependencies
      • Tightening APIs
      • Improving documentation
    • The goal is not to add new functionality, but to
      • Enable future improvements
      • Reduce errors
      • Improve maintainability
  • Technical Debt in ML
    • Developing and deploying ML systems is relatively fast and cheap
    • However, maintaining ML systems over time is difficult and expensive, due to
      • Abstraction boundary erosion
      • Entanglement
      • Data dependency > Code Dependency
      • Feedback loops
      • System-level anti-patterns
      • Configuration debt
      • Changes in the external world

Abstraction Boundary Erosion

  • Traditional software engineering
    • Enforce strong abstraction boundaries using encapsulation and modular design
    • This helps create maintainable code in which it is easy to make isolated changes and improvements
  • ML system
    • Difficult to enforce strict abstraction boundaries or prescribe specific intended behavior
    • ML is used precisely for problems where the desired behavior cannot be effectively expressed in software logic without depending on external data
    • The real world does not fit into tidy encapsulation
    • Traditional abstractions and boundaries may be subtly corrupted or invalidated by these data dependencies

Entanglement

  • Machine learning systems mix signals together, entangling them and making isolation of improvements impossible.
  • Example
    • Consider an ML system that makes predictions based on features $x_1, \dots, x_n$
    • Mathematically, an ML model is a mapping $f: x \rightarrow y$ that minimizes some loss $\mathcal{L}(\hat{y}, y)$
    • Suppose new data flows in and contains anomalies that change the distribution of $x_1$, and we update the model (retrain or train from a checkpoint)
    • The new mapping $f'$, trained with the anomalies, may no longer minimize the loss $\mathcal{L}(\hat{y}, y)$ under the original distribution (see the numerical sketch after this list)
  • Possible Mitigation
    • Isolate models: the final output is based on an ensemble of explainable features
    • Detect changes in prediction behavior as they occur
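
Below is a minimal numerical sketch of this entanglement (the paper's CACE principle: Changing Anything Changes Everything). A hypothetical linear data-generating process is fit by ordinary least squares; the feature correlation and the anomaly in $x_1$ are assumptions chosen purely for illustration.

```python
# Sketch: a shift in one feature's distribution changes the whole learned mapping, so the
# retrained model is worse on data that still follows the original distribution.
# All distributions, weights, and the anomaly are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

def make_data(n):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # x2 is correlated with x1
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = X @ w_true + rng.normal(scale=0.1, size=n)
    return X, y

def fit(X, y):
    # ordinary least squares: minimize L(y_hat, y) = ||X w - y||^2
    return np.linalg.lstsq(X, y, rcond=None)[0]

def loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

X_train, y_train = make_data(5_000)
w_old = fit(X_train, y_train)            # original model f

X_anom = X_train.copy()
X_anom[:, 0] += 3.0                      # anomaly: the logged x1 values drift by a constant
w_new = fit(X_anom, y_train)             # updated model f', trained on the anomalous data

X_test, y_test = make_data(5_000)        # data still from the original distribution
print("f  weights:", np.round(w_old, 2), " loss:", round(loss(w_old, X_test, y_test), 3))
print("f' weights:", np.round(w_new, 2), " loss:", round(loss(w_new, X_test, y_test), 3))
# Because x1 and x2 are entangled, the anomaly in x1 also shifts the weight learned for x2,
# and f' no longer minimizes the loss under the original distribution.
```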

Data Dependency > Code Dependency

  • Code dependencies can be identified via static analysis (e.g., compilers and linkers)
  • Data dependencies can be hard to analyze without appropriate tools
  • Unstable Data Dependencies
    • The model consumes input signals produced by other systems, and those signals may change behavior over time
    • Examples
      • One ML model consumes the output of another ML model
      • One ML model consumes data-dependent features (e.g., a population mean or TF-IDF scores), which shift as the underlying data shifts
    • Mitigation: create versioned copies of signals, and only move to a new version after it has been vetted
  • Underutilized Data Dependencies
    • The model consumes features that provide little incremental modeling benefit
    • Examples
      • Legacy Features
      • Bundled Features: a group of features is evaluated and added together, so features that add little value slip in with the bundle
      • $\epsilon$-Features: features with high complexity but very small accuracy gain
      • Correlated Features: the model may credit a correlated but non-causal feature, which becomes brittle if the correlation later changes
    • Mitigation: regularly detect and remove underutilized features, e.g., via leave-one-feature-out evaluations (see the sketch after this list), so that the ML system is less vulnerable to change
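
One way to surface underutilized features is an exhaustive leave-one-feature-out evaluation: retrain without each feature and check how much the validation metric actually drops. The sketch below uses a synthetic dataset, a logistic regression model, and an arbitrary 0.002 accuracy threshold, all chosen for illustration only.

```python
# Sketch: detect underutilized features with a leave-one-feature-out evaluation.
# Dataset, model, and the 0.002 threshold are made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=4, n_redundant=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def val_accuracy(feature_mask):
    model = LogisticRegression(max_iter=1_000)
    model.fit(X_train[:, feature_mask], y_train)
    return model.score(X_val[:, feature_mask], y_val)

baseline = val_accuracy(np.ones(X.shape[1], dtype=bool))
for i in range(X.shape[1]):
    mask = np.ones(X.shape[1], dtype=bool)
    mask[i] = False                        # drop feature i and retrain
    drop = baseline - val_accuracy(mask)
    # Features whose removal barely hurts (or even helps) are candidates for deletion,
    # e.g. legacy, bundled, epsilon-, or correlated features.
    flag = "<- candidate for removal" if drop < 0.002 else ""
    print(f"feature {i}: accuracy drop without it = {drop:+.4f} {flag}")
```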

Feedback loops

  • Live ML systems that are updated regularly over time often end up influencing their own behavior
  • Direct feedback loops
    • A model’s prediction directly affects the data it will collect in the future
    • Example: an algorithm choosing which ad to present to a consumer
    • Analyzing these feedback loops requires knowledge of bandit problems / reinforcement learning; a minimal simulation of such a loop follows this list
  • Hidden feedback loops
    • Two or more models influence each other indirectly through the real-world data they collect
    • Example: two independent stock-market prediction models; improvements (or bugs) in one may influence the trading behavior, and therefore the input data, of the other
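
The sketch below simulates a direct feedback loop in a hypothetical ad-selection setting: the system only observes click feedback for the ad it chose to show, so its own predictions bias the data it collects. The click-through rates, the optimistic prior, and the purely greedy policy are all made-up assumptions.

```python
# Sketch of a direct feedback loop: click feedback is only collected for the ad that was
# shown, so the system's click-rate estimates are biased by its own past decisions.
# The true CTRs and the greedy policy are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.04, 0.05, 0.06])        # unknown to the system; ad 2 is actually best
clicks = np.zeros(3)
shows = np.zeros(3)

for t in range(10_000):
    # Optimistic prior of 0.1 for ads that have never been shown.
    estimates = np.where(shows > 0, clicks / np.maximum(shows, 1), 0.1)
    ad = int(np.argmax(estimates))             # the prediction decides which data we collect
    shows[ad] += 1
    clicks[ad] += rng.random() < true_ctr[ad]  # feedback observed only for the shown ad

print("impressions per ad:", shows)
print("estimated CTRs    :", np.round(clicks / np.maximum(shows, 1), 3))
print("true CTRs         :", true_ctr)
# A purely greedy policy can lock onto an early "winner" and never collect enough data to
# correct itself; bandit-style exploration is one way to break this loop.
```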

System-level anti-patterns

  • Only a tiny fraction of the code in a real-world ML system is devoted to model specification (learning or prediction)
  • Glue Code & Pipeline Jungles
    • Supporting code/pipelines that transform data into a format the ML model can consume
    • Often caused by incremental model improvements / adding features one at a time
    • Problems
      • Difficult to maintain
      • Requires lots of error-detection and failure-recovery logic, plus expensive end-to-end integration tests
      • Produces many intermediate file outputs
    • Mitigation
      • Wrap open-source / black-box packages behind common, project-owned APIs (see the sketch after this list)
      • Reduce the separation between the “research” and “engineering” arms
  • Dead Experimental Codepaths
    • Alternative codepaths left over from past experiments / ablation studies accumulate over time
    • Mitigation: periodic review
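
As a sketch of the "wrap packages behind common APIs" mitigation, the example below hides an sklearn estimator behind a small project-owned interface so downstream code never touches package-specific glue. The `Predictor` protocol and `SklearnPredictor` adapter are hypothetical names invented for illustration.

```python
# Sketch: hide a third-party package behind a small project-owned interface so the rest of
# the pipeline depends on one stable API instead of package-specific glue code.
# Predictor and SklearnPredictor are hypothetical names.
import numpy as np
from typing import Protocol
from sklearn.linear_model import LogisticRegression

class Predictor(Protocol):
    def predict(self, features: np.ndarray) -> np.ndarray: ...

class SklearnPredictor:
    """Adapter that makes an sklearn estimator look like our Predictor interface."""
    def __init__(self, estimator):
        self._estimator = estimator

    def predict(self, features: np.ndarray) -> np.ndarray:
        return self._estimator.predict_proba(features)[:, 1]   # positive-class scores

def serve(model: Predictor, batch: np.ndarray) -> np.ndarray:
    # Serving code is written against Predictor only; swapping in a model from another
    # package means writing one new adapter, not rewriting the pipeline.
    return model.predict(batch)

# Usage sketch with made-up data
X = np.random.default_rng(0).normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
model = SklearnPredictor(LogisticRegression(max_iter=1_000).fit(X, y))
print(serve(model, X[:5]))
```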

(Figure: ML code is only a small fraction of a real-world ML system; most of the system is the surrounding infrastructure.)

Configuration debt

  • An ML system exposes a very large number of configuration options
    • Feature selection
    • Data selection / train-test split
    • Model architecture
    • Hyperparameters
    • Various pre- / post-processing
  • Many researchers and engineers treat configuration as an afterthought
  • Principles of good configuration systems (illustrated in the sketch after this list)
    • Easy to specify a small change from a previous configuration
    • Easy to see the difference between two configurations
    • Easy to automatically assert and verify basic facts about the configuration
    • Configurations should undergo a full code review and be checked into a repository
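
The sketch below illustrates these principles with configuration treated as ordinary, reviewable code: a hypothetical dataclass config supports deriving a small change from a base configuration, diffing two configurations, and asserting basic facts. All field names and values are made up.

```python
# Sketch: configuration as code. A dataclass config makes it easy to (1) derive a small
# change from a previous config, (2) diff two configs, and (3) assert basic facts.
# The fields and values are made up for illustration.
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    features: tuple = ("age", "income", "clicks_7d")
    train_split: float = 0.8
    model: str = "logistic_regression"
    learning_rate: float = 0.01
    epochs: int = 10

def diff(a: TrainConfig, b: TrainConfig) -> dict:
    """Return only the fields that differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}

def validate(cfg: TrainConfig) -> None:
    """Automatically assert basic facts about a configuration."""
    assert 0.0 < cfg.train_split < 1.0, "train_split must be a fraction"
    assert cfg.learning_rate > 0, "learning_rate must be positive"
    assert len(cfg.features) > 0, "at least one feature is required"

base = TrainConfig()
experiment = replace(base, learning_rate=0.001, epochs=20)   # small change from the base config

validate(experiment)
print(diff(base, experiment))   # {'learning_rate': (0.01, 0.001), 'epochs': (10, 20)}
```

Checking such files into the repository and reviewing them like any other code makes configuration changes visible and auditable.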

Changes in the external world

  • ML systems often interact directly with the external world, which is rarely stable
  • Fixed Thresholds in Dynamic Systems
    • Decision thresholds / prediction ranges are often chosen manually; when the model is retrained or the data shifts, the old thresholds can become invalid
    • Mitigation: learn thresholds automatically by evaluating on a held-out validation set
  • Monitoring and Testing
    • Unit tests alone may be inadequate for an online system
    • Comprehensive live monitoring is required to ensure that a system is working as intended
    • Monitor
      • Prediction distribution shift (see the PSI sketch after this list)
      • Action Limits
      • Upstream Producers
    • External changes occur in real time, so responses must also occur in real time
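
As one way to monitor prediction distribution shift, the sketch below compares live prediction scores against a reference window using the population stability index (PSI), computed with plain numpy. The reference/live distributions, the ten bins, and the 0.2 alert threshold are assumptions; 0.2 is a common rule of thumb rather than a standard.

```python
# Sketch: monitor prediction-distribution shift by comparing live scores against a
# reference window with the population stability index (PSI).
# The score distributions, bin count, and alert threshold are made up for illustration.
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference window so both windows use the same bins.
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.digitize(reference, cuts), minlength=bins) / len(reference)
    live_frac = np.bincount(np.digitize(live, cuts), minlength=bins) / len(live)
    ref_frac = np.clip(ref_frac, 1e-6, None)     # avoid log(0) and division by zero
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=50_000)   # e.g., last week's prediction scores
live_scores = rng.beta(2, 3, size=5_000)         # today's scores after the world changed

value = psi(reference_scores, live_scores)
if value > 0.2:                                  # assumed alert threshold
    print(f"ALERT: prediction distribution shifted (PSI={value:.2f})")
else:
    print(f"OK (PSI={value:.2f})")
```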

Useful questions to consider

  • How easily can an entirely new algorithmic approach be tested at full scale?
  • What is the transitive closure of all data dependencies?
  • How precisely can the impact of a new change to the system be measured?
  • How quickly can new members of the team be brought up to speed?