Tuesday, December 10, 2024

Application of Machine Learning

Application of ML
  • Key - identifying the right ML problems to work on
  • Most ML projects fail! (87% - VentureBeat, 2019)
  • Key questions:
    • Is there a problem?
      • Listen to users
      • Observe in context - field experiment, shadow a user, etc.
    • Can ML help this problem?
      • Is it a good fit for ML - easy / hard / impossible?
      • Is data available?
    • Does someone really care if this problem is solved?
      • Business impact vs. Feasibility of providing a solution
      • Target - high business value and feasible to achieve
 
Understanding the problem
 
Validating Product Ideas - Brainstorming
  • Formulate hypothesis
  • Test the hypothesis
    • POC or Mock Up
    • Not an actual working model
  • Analyze the findings
  • Decide to continue or not
  • Refine the hypothesis and repeat
 
ML use considerations
Human factor:
  • Automation - replacing a human
  • Augmentation - supporting a human
 
Before rushing into ML, consider heuristics:
  • Hard-coded business rules
  • Built on previous experiences
    • Ex: take the mean of sales across recent days to predict future sales (see the sketch after this list)
  • Can use heuristics to build a baseline and then move into ML
  • Compare ML performance against the baseline
  • Pros
    • Minimal computational cost, easy to maintain, easy to read
  • Cons compared to ML
    • ML typically performs better
    • ML can be re-trained on new data
    • ML can solve more complex problems
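 
A minimal sketch of the mean-of-recent-days heuristic above, used as a baseline any ML model should beat; the sales numbers and window size are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical daily sales series; in practice this would come from the data warehouse.
sales = pd.Series([120, 135, 128, 140, 150, 145, 160, 155, 170, 165], dtype=float)

def heuristic_forecast(history: pd.Series, window: int = 5) -> float:
    """Baseline: predict the next day's sales as the mean of the last `window` days."""
    return history.tail(window).mean()

# Walk forward through the series, predicting each day from the days before it.
predictions, actuals = [], []
for i in range(5, len(sales)):
    predictions.append(heuristic_forecast(sales.iloc[:i], window=5))
    actuals.append(sales.iloc[i])

baseline_mae = np.mean(np.abs(np.array(actuals) - np.array(predictions)))
print(f"Heuristic baseline MAE: {baseline_mae:.1f}")

Any candidate ML model can then be compared against this baseline MAE before taking on the extra complexity.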
 
Project Organization
  • ML project challenges vs. traditional software dev
  • Defining the Process
    • Don't jump straight into a solution
    • Don't spend resources and effort fixing a poorly defined problem
    • Organize the team well
  • CRISP-DM
    • https://www.ibm.com/docs/zh/spss-modeler/18.0.0?topic=dm-crisp-help-overview
    • https://medium.com/@avikumart_/crisp-dm-framework-a-foundational-data-mining-process-model-86fe642da18c
    • Cross-industry standard process for data mining
    • Developed in 1996; flexible and industry-agnostic
      • Business understanding
        • Define the problem
          • Target users
          • Write problem statement
          • Why it matters
          • How is it solved today
          • Gaps in current state
        • Define Success
          • Quantify expected business impact
          • Identify constraints
          • Translate impact into metrics
          • Define success targets for metrics
        • Identify factors
          • Gather domain expertise
          • Identify potentially relevant factors
      • Data understanding
        • Gather data
          • Identify data sources
          • Label data
          • Create features
        • Validate data
          • Quality control of data - time consuming
          • Resolve data issues
        • Explore data
          • Stat analysis and visualization
          • Reduce dimensions
          • Identify relationships and patterns
      • Data preparation
        • Split data - training and test sets (see the sketch after this outline)
        • Define features
        • Prepare for modeling
      • Modeling
        • Model selection
          • Evaluate algorithms
          • Documentation, versioning
        • Model tuning
          • Hyperparameter optimization
          • Documentation, versioning
          • Model re-training
      • Evaluation
        • Evaluate results
          • Run model on test set
          • Interpret outputs and performance
        • Test solution
          • Software unit and integration testing
          • Model unit testing
          • User testing - alpha/beta
      • Deployment
        • Deploy
          • APIs
          • Product integration
          • Scale the infrastructure
          • Security
          • Software deployment
        • Monitor
          • Observe performance
          • Re-train the model
      • ITERATIVE - go through additional iterations of these phases as needed
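 
A minimal sketch of the "split data" step in data preparation, using scikit-learn; the table and column names are placeholders:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table; column names are placeholders.
df = pd.DataFrame({
    "feature_a": [1.0, 2.5, 3.1, 4.7, 5.2, 6.8, 7.3, 8.9],
    "feature_b": [0, 1, 0, 1, 1, 0, 1, 0],
    "target":    [10, 14, 12, 20, 22, 18, 25, 24],
})

X = df[["feature_a", "feature_b"]]
y = df["target"]

# Hold out a test set so evaluation later runs on data the model never saw in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")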
 
Team Organization
  • Business Sponsor
  • Product
    • Owner
    • Manager
  • Data Science
    • Scientists
      • Stat/data background
      • Gets insights out of the data
      • Determines ML approach
      • Involved heavily early in the project
  • Engineering
    • Data Engineer
    • Software Engineer
    • ML Engineer
      • CS background
      • Integration of ML into product
    • QA/DevOps
 
 
Agile - Iterative experiments
  • Business understanding
    • Mock up of solution
    • Get feedback from customer
  • Business understanding / Data understanding
    • Collect data, feed into model
    • Collect customer feedback
  • Business understanding / Data understanding / Data processing
    • Try real data, heuristic model
    • Collect customer feedback
  • Business understanding / Data understanding / Data processing / Modeling
    • Try real data, simple ML model
    • Collect customer feedback
  • Circle again if needed
 
Measure Performance
  • Outcome Metrics
    • Business impact, $ usually
    • No technical performance metrics here
    • Internal - not customer facing
  • Output Metrics
    • Customer facing - often tested together with customers
    • A/B or beta testing (see the sketch after this list)
  • Non-performance considerations
    • Explainability and interpretability
      • Debug-friendliness
      • Resilience
    • Cost - data and compute
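 
A minimal sketch of comparing an output metric from an A/B test (conversion rate with and without the model); statsmodels is one possible tool choice, and all numbers are fabricated for illustration:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test: conversions and visitors for control vs. ML-backed variant.
conversions = [480, 540]     # [control, variant]
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
lift = conversions[1] / visitors[1] - conversions[0] / visitors[0]

print(f"Absolute lift in conversion rate: {lift:.2%}")
print(f"p-value: {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the lift is unlikely to be noise;
# the business decision should still weigh the lift against cost (outcome metric).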
 
Data Needs
  • Historical and Real time
  • Training Data
    • Subject matter experts
    • Customers
    • Temporal and geospatial characteristics
  • How much/many
    • Start with a small number of features
    • Add more and evaluate
  • Training Data
    • Use Labels
  • How much data
    • More = better
    • Depends on
      • Number of features
      • How complex feature/target relationship is (linear or not)
      • Data quality
      • Target model performance
  • When collecting data, keep in mind:
    • Obtain only relevant data
    • Introduction of bias
    • Need to update data regularly / retrain model
    • Document data sources
    • Flywheel effect - users interact with the AI, and that data is fed back into the model
 
Governance and Access
  • https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/
  • Key barrier - siloed and inaccessible data
  • Break down the barriers and silos first:
    • Cultural Change
      • Executive sponsor
      • Education
    • Technology
      • Centralized DWH (data warehouse)
      • Query tools
    • Data Access
      • Responsibility
      • Permissions
  • Data Cleaning
    • Issues - missing or incomplete data
      • Missing data
      • Anomalous data
      • Mis-mapped data
    • Types of missing data
      • Missing Completely at Random
        • No pattern in missing
        • Low bias - not great concern
      • Missing at Random
        • Missing due to another feature of the data
        • High bias potential
      • Missing Not at Random
        • Missing due to values of the feature itself
        • High bias potential
    • Options for dealing with missing data (see the sketch below)
      • Remove rows or columns
      • Flag it to be treated as a special case
      • Replace with mean/median
      • Backfill / Forward fill
      • Infer it - use a simpler model to predict the missing values
    • Outliers
      • Can greatly influence the result
      • Use visualizations and statistical methods to identify
      • Understand the root cause
    • Preparing data
      • EDA - Exploratory Data Analysis
        • Understanding the trends in data
        • Catch issues
      • Feature engineering
        • Selecting the right features
      • Feature selection methods
        • Filter Methods
          • Statistical tests based on data characteristics
          • Used to remove irrelevant features
          • Computationally inexpensive
        • Wrapper Methods
          • Train on subset of features
          • Often infeasible in the real world
          • Computationally expensive
        • Embedded methods
          • Select features that contribute the most
          • Selection happens as part of model training
      • Transform data for modeling
        • Ensure data is in the format required by the model
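 
A minimal sketch of the missing-data options and a simple statistical outlier check described above, using Pandas; the DataFrame and thresholds are illustrative:

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and one suspicious income value.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, np.nan, 29],
    "income": [48_000, 52_000, 61_000, np.nan, 950_000, 55_000, 50_000],
})

# Option 1: remove rows with any missing value (safest when data is missing completely at random).
dropped = df.dropna()

# Option 2: replace missing values with the column median (robust to outliers).
imputed = df.fillna(df.median(numeric_only=True))

# Option 3: forward fill, common for time-ordered data.
ffilled = df.ffill()

# Simple IQR rule to flag potential outliers for manual review and root-cause analysis.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)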
 
  • Reproducibility
    • Ability to reproduce results
    • Helps debugging
    • Helps learning (team hand off)
    • Best practices:
      • Documentation
      • Data lineage
        • Tracking data from source to use in a model
        • Adding visualizations to illustrate data relationships
        • Helps meet compliance if required
      • Proper versioning
        • Version both the code and the model itself (see the sketch below)
        • Helps rollback
        • Champion/challenger tests - running different versions of model in parallel
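 
A minimal sketch of versioning a trained model with a metadata sidecar to support reproducibility and rollback; joblib, the file names, and the metadata fields are illustrative choices, not prescribed by these notes:

import hashlib
import json
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

# Hypothetical trained model and training data file.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
training_data_path = Path("training_data.csv")  # placeholder path

version = "2024-12-10-01"
joblib.dump(model, f"model_{version}.joblib")

# Record enough metadata (code version, data fingerprint) to reproduce or roll back.
metadata = {
    "model_version": version,
    "git_commit": "abc1234",  # placeholder; in practice read from CI or git
    "data_sha256": hashlib.sha256(training_data_path.read_bytes()).hexdigest()
                   if training_data_path.exists() else None,
    "algorithm": type(model).__name__,
}
Path(f"model_{version}.json").write_text(json.dumps(metadata, indent=2))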
 
Technology Selection
  • ML systems consist of:
    • UI
    • Data
    • Model
    • Infra
  • Key decisions driving the tech selection:
    • Cloud or Edge
      • Cloud - needs network connectivity; allows for high throughput
      • Edge - primary benefits are latency and security, since data is not exposed to the network; needs sufficient compute and memory locally
      • Hybrid
        • Use Edge AI to trigger Cloud AI (local events captured, data sent to Cloud)
        • Cache common predictions at the edge or the nearest DC
        • Key questions:
          • Is latency a concern?
          • How critical is connectivity?
          • Is security of sending data to cloud a concern?
    • Offline or Online Learning
      • Determine if Model re-training and prediction needs to happen in real-time
        • Model re-training: Scheduled - Offline; Realtime - Online
        • Prediction:  Scheduled - Batch; Realtime - Online
      • Offline - re-training is done on a scheduled basis; easier to implement and evaluate but slower to adapt to changes
      • Online - re-train as new data comes in (minutes/hours); adapts in near real-time but harder to implement
    • Batch or Online Predictions
      • Batch predictions - predict on batched observations on a scheduled basis; efficient, but predictions are not immediately available for new data
      • Online predictions - real-time, on demand; results available immediately; latency can be an issue, model can drift
  • Technology decisions
    • Programming language
      • Python
        • Pandas + NumPy for data manipulation, scikit-learn for modeling, Matplotlib for visualization, NLTK + spaCy for text processing (see the sketch below)
      • R, C/C++
    • Data processing tools
    • Modeling tools
      • TensorFlow (Google), PyTorch (Facebook), Keras (TensorFlow API) - libraries for deep learning
      • AutoML
        • Enables building models quickly
        • Available from Google, Microsoft, AWS, H2O
        • Consumes data, trains the model, exposes over API
    • API and interface
    • Factors that affect these
      • Open source vs prop
      • Learning curve
      • Documentation
      • Community support
      • Talent availability
      • Simplicity and flexibility
      • Long term cost
      • Trade-off between cost and time to implement
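 
A minimal sketch of the Python stack above (Pandas for the data, scikit-learn for the model) producing batch predictions; the data and column names are synthetic:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic table standing in for real customer data loaded with Pandas.
df = pd.DataFrame({
    "tenure_months": [1, 24, 36, 3, 48, 6, 60, 2, 30, 12],
    "monthly_spend": [20, 80, 95, 25, 120, 30, 150, 22, 90, 45],
    "churned":       [1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
})

# Fit a scikit-learn model on the feature columns.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(df[["tenure_months", "monthly_spend"]], df["churned"])

# Batch prediction: score a whole table on a schedule rather than one request at a time.
new_customers = pd.DataFrame({"tenure_months": [5, 40], "monthly_spend": [28, 110]})
print(model.predict(new_customers))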
 
Challenges of running and deploying models
  • System Failures
    • Over time, model performance often decreases
    • Training-serving skew
      • Mismatch between data trained on and data received in prod
      • Ex: train on clear images, receive blurry images in prod
    • Excessive latency
      • Latency in generating predictions
      • Due to volume of input data, pipeline, choice of model
    • Data drift
      • Changes in the environment cause changes in the input data (see the drift-check sketch below)
    • Concept drift
      • Changes in the relationship between the input features and the target
  • System Monitoring
    • Important to monitor the performance and issues to prevent disruption in service
    • Input data monitoring
      • Quality checks
      • Changes in distribution
      • Changes in correlation between features and targets
      • Perform manual audits
        • data pipeline, model outputs, target labels
    • Audit the model
      • Split the data into groups - monitor the performance across groups to identify bias
      • Check the impact of features to ensure the prediction results make sense:
        • LIME - Local Interpretable Model-agnostic Explanations
        • SHAP - SHapley Additive exPlanations
  • Model Maintenance Cycle
    • Monitor
    • Retrain and Update
      • Retrain - keep old data, just change weights
        • can be scheduled or triggered
      • Update - use new data to re-model; allows for model adjustments
    • Evaluate
    • Deploy
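 
A minimal sketch of an input-data drift check that could trigger retraining, using a Kolmogorov-Smirnov test; the data, feature, and threshold are illustrative, and a real monitor would combine several such checks:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values the model was trained on vs. values arriving in production.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted: simulated drift

statistic, p_value = ks_2samp(training_feature, production_feature)

DRIFT_P_VALUE = 0.01  # illustrative threshold
if p_value < DRIFT_P_VALUE:
    print(f"Drift detected (KS={statistic:.3f}, p={p_value:.2e}) - trigger model retraining.")
else:
    print("No significant drift detected.")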
 
