Application of Machine Learning
- Key - identifying the right ML problems to work on
- Most ML projects fail! (87% - VentureBeat, 2019)
- Key questions:
- Is there a problem?
- Listen to users
- Observe in context - field experiment, shadow a user, etc.
- Can ML help this problem?
- Is it a good fit for ML - easy / hard / impossible
- Is data available?
- Does someone really care if this problem is solved?
- Business impact vs. Feasibility of providing a solution
- Target: high business value and feasible to achieve
Understanding the problem
Validating Product Ideas - Brainstorming
- Formulate hypothesis
- Test the hypothesis
- POC or Mock Up
- Not an actual working model
- Analyze the findings
- Decide to continue or not
- Refine the hypothesis and repeat
ML use considerations
Human factor:
- Automation - replacing a human
- Augmentation - supporting a human
Before rushing into ML, see Heuristics:
- Hard-coded business rules
- Built on previous experiences
- Ex: take the mean value of sales across a set of days as the prediction
- Can use heuristics to build a baseline and then move into ML (see the sketch after this list)
- Compare ML performance against the baseline
- Pros
- Minimum computation cost, easy to maintain, easy to read
- Cons when compared to ML
- ML performs better
- Can be re-trained on new data
- Can solve more problems
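- A minimal sketch of comparing a heuristic baseline against a simple ML model; the file and column names ("sales.csv", "date", "sales", "day_of_week", "promo_flag") are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("sales.csv", parse_dates=["date"]).sort_values("date")
train, test = df.iloc[:-30], df.iloc[-30:]  # hold out the last 30 days

# Heuristic baseline: predict every day as the mean of the training window
baseline_pred = np.full(len(test), train["sales"].mean())

# Simple ML model on a couple of assumed features
features = ["day_of_week", "promo_flag"]
model = LinearRegression().fit(train[features], train["sales"])
ml_pred = model.predict(test[features])

# Compare ML performance against the heuristic baseline
print("baseline MAE:", mean_absolute_error(test["sales"], baseline_pred))
print("model MAE:   ", mean_absolute_error(test["sales"], ml_pred))
```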
Project Organization
- ML project challenges vs traditional software dev
- Broader team and new skill sets
- Technical risk
- Need quality data, model limitations
- Identify features
- Because of this, the projects are more difficult to plan
- Cyclical project - not easy to show progress
- How "good" of model performance is good enough?
- Need to retrain models
- Users need to adjust to the model and trust it
- https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641
- Defining the Process
- Don't jump into solution
- Don't spend resources and effort fixing a poorly defined problem
- Organize the team well
- CRISP-DM
- https://www.ibm.com/docs/zh/spss-modeler/18.0.0?topic=dm-crisp-help-overview
- https://medium.com/@avikumart_/crisp-dm-framework-a-foundational-data-mining-process-model-86fe642da18c
- Cross-industry standard process for data mining
- Developed in 1996; flexible and industry-agnostic
- Business understanding
- Define the problem
- Target users
- Write problem statement
- Why it matters
- How is it solved today
- Gaps in current state
- Define Success
- Quantify expected business impact
- Identify constraints
- Translate impact into metrics
- Define success targets for metrics
- Identify factors
- Gather domain expertise
- Identify potentially relevant factors
- Data understanding
- Gather data
- Identify data sources
- Label data
- Create features
- Validate data
- Quality control of data - time consuming
- Resolve data issues
- Explore data
- Statistical analysis and visualization (see the sketch after this block)
- Reduce dimensions
- Identify relationships and patterns
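- A minimal EDA sketch for the explore-data step; "raw_data.csv" and the column name "order_value" are placeholders.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")

print(df.describe(include="all"))       # summary statistics per column
print(df.isna().mean().sort_values())   # fraction of missing values per column
print(df.corr(numeric_only=True))       # pairwise correlations between numeric features

# Quick visual check for skew/outliers in a numeric column
df["order_value"].hist(bins=50)
```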
- Data preparation
- Split data into training and test sets (see the sketch after this block)
- Define features
- Prepare for modeling
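- A minimal sketch of the train/test split, continuing from the EDA sketch above; the feature and target column names are assumptions.

```python
from sklearn.model_selection import train_test_split

X = df[["day_of_week", "promo_flag", "order_value"]]  # assumed feature columns
y = df["churned"]                                     # assumed binary target column

# Hold out 20% as a test set; stratify to keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```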
- Modeling
- Model selection (see the sketch after this block)
- Evaluate algorithms
- Documentation, versioning
- Model tuning
- Hyperparameter optimization
- Documentation, versioning
- Model re-training
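- A minimal sketch of model selection and hyperparameter tuning with cross-validation, continuing from the split sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Evaluate candidate algorithms with cross-validation on the training set
for name, candidate in [("logreg", LogisticRegression(max_iter=1000)),
                        ("forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(name, scores.mean())

# Tune the chosen model's hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
```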
- Evaluation
- Evaluate results
- Run the model on the test set (see the sketch after this block)
- Interpret outputs and performance
- Test solution
- Software unit and integration testing
- Model unit testing
- User testing - alpha/beta
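- A minimal sketch of evaluating on the held-out test set plus a simple model unit test, continuing from the sketches above.

```python
from sklearn.metrics import classification_report

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

# Simple "model unit test": the model should at least beat a majority-class baseline
baseline_accuracy = y_test.value_counts(normalize=True).max()
assert (y_pred == y_test).mean() > baseline_accuracy, "model does not beat the baseline"
```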
- Deployment
- Deploy
- APIs (see the serving sketch after this list)
- Product integration
- Scale the infrastructure
- Security
- Software deployment
- Monitor
- Observe performance
- Re-train the model
- ITERATIVE - go through further iterations of these phases as needed
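- A minimal sketch of exposing the trained model over an API; Flask is an assumed framework choice, and the endpoint path, payload format, and "model.joblib" file are hypothetical.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # model persisted earlier with joblib.dump

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[1, 0, 42.5]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```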
Team Organization
- Business Sponsor
- Product
- Owner
- Manager
- Data Science
- Scientists
- Stat/data background
- Gets insights out of the data
- Determines ML approach
- Involved heavily early in the project
- Engineering
- Data Engineer
- Software Engineer
- ML Engineer
- CS background
- Integration of ML into product
- QA/DevOps
Agile - Iterative experiments
- Business understanding
- Mock up of solution
- Get feedback from customer
- Business understanding / Data understanding
- Collect data, feed into model
- Collect customer feedback
- Business understanding / Data understanding / Data processing
- Try real data, heuristic model
- Collect customer feedback
- Business understanding / Data understanding / Data processing / Modeling
- Try real data, simple ML model
- Collect customer feedback
- Circle again if needed
Measure Performance
- Outcome Metrics
- Business impact, $ usually
- No technical performance metrics here
- Internal - not customer facing
- Output Metrics
- Customer facing - testing together possibly
- A/B or Beta testing
- Non-performance considerations
- Explainability and interpretability
- Debug-friendliness
- Resilience
- Cost - data and compute
Data Needs
- Historical and Real time
- Training Data
- Subject matter experts
- Customers
- Temporal and geospatial characteristics
- How much data / how many features
- Start with a small number of features
- Add more and evaluate
- Training Data
- Use Labels
- How much data
- More = better
- Depends on
- Number of features
- How complex feature/target relationship is (linear or not)
- Data quality
- Target model performance
- When collecting data:
- Obtain only relevant data
- Beware of introducing bias
- Update data regularly / retrain the model
- Document data sources
- Flywheel effect - users interact with the AI, and the resulting data is fed back into the model
Governance and Access
- https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/
- Key barrier - siloed and inaccessible data
- Break down the barriers and silos first:
- Cultural Change
- Executive sponsor
- Education
- Technology
- Centralized DWH
- Query tools
- Data Access
- Responsibility
- Permissions
- Data Cleaning
- Issues - missing or incomplete data
- Missing data
- Anomalous data
- Mis-mapped data
- Types of missing data
- Missing Completely at Random
- No pattern to the missing values
- Low bias risk - not a great concern
- Missing at Random
- Missing due to another feature of the data
- High bias potential
- Missing Not at Random
- Missing due to values of the feature itself
- High bias potential
- Options for dealing with missing data (see the sketch after this list)
- Remove rows or columns
- Flag it to be treated as a special case
- Replace with mean/median
- Backfill / Forward fill
- Infer it - use a simpler model to predict the missing values
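- A minimal sketch of the missing-data options; the file and column names ("raw_data.csv", "target", "channel", "price", "sensor", "order_value") are placeholders.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("raw_data.csv")

df = df.dropna(subset=["target"])                       # remove rows missing the target
df["channel"] = df["channel"].fillna("unknown")         # flag as a special case/category
df["price"] = df["price"].fillna(df["price"].median())  # replace with the median
df["sensor"] = df["sensor"].ffill()                     # forward fill a time series

# Or infer missing numeric values from similar rows (a simple model-based approach)
imputer = KNNImputer(n_neighbors=5)
df[["price", "order_value"]] = imputer.fit_transform(df[["price", "order_value"]])
```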
- Outliers
- Can greatly influence the result
- Use visualizations and statistical methods to identify them (see the sketch below)
- Understand the root cause
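- A minimal sketch of flagging outliers with the IQR rule, continuing from the sketch above; "order_value" is a placeholder column.

```python
# Flag values far outside the interquartile range for manual review
q1, q3 = df["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["order_value"] < q1 - 1.5 * iqr) | (df["order_value"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers - investigate the root cause before dropping them")
```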
- Preparing data
- EDA - Exploratory Data Analysis
- Understanding the trends in data
- Catch issues
- Feature engineering
- Selecting the right features
- Feature selection methods (see the sketch after this list)
- Filter Methods
- Statistical tests based on data characteristics
- Used to remove irrelevant features
- Computationally inexpensive
- Wrapper Methods
- Train the model on subsets of features
- Often not feasible in the real world
- Computationally expensive
- Embedded methods
- Select features that contribute the most
- Selection happens as part of model training itself
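- A minimal sketch of the three feature-selection approaches, continuing from the split sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Filter method: rank features with a univariate statistical test (cheap)
filtered = SelectKBest(score_func=f_classif, k=2).fit(X_train, y_train)
print("filter keeps:", list(X_train.columns[filtered.get_support()]))

# Wrapper method: repeatedly re-train while dropping features (expensive)
wrapped = RFE(RandomForestClassifier(random_state=42), n_features_to_select=2).fit(X_train, y_train)
print("wrapper keeps:", list(X_train.columns[wrapped.get_support()]))

# Embedded method: use importances learned during model training itself
embedded = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(sorted(zip(embedded.feature_importances_, X_train.columns), reverse=True))
```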
- Transform data for modelling
- Ensure data is in the format required
- Reproducibility
- Ability to reproduce results
- Helps debugging
- Helps learning (team hand off)
- Best practices:
- Documentation
- Data lineage
- Tracking data from source to use in a model
- Adding visualizations to illustrate data relationships
- Helps meet compliance if required
- Proper versioning
- Version code and the model itself
- Helps rollback
- Champion/challenger tests - running different versions of the model in parallel
Technology Selection
- An ML system consists of:
- UI
- Data
- Model
- Infra
- Key decisions driving the tech selection:
- Cloud or Edge
- Cloud - needs network connectivity; allows for high throughput
- Edge - primary benefits are latency and security, since data is not sent over the network; needs sufficient compute and memory locally
- Hybrid
- Use Edge AI to trigger Cloud AI (local events captured, data sent to Cloud)
- Cache common predictions at the Edge or nearest DC
- Key questions:
- Is latency a concern?
- How critical is connectivity?
- Is security of sending data to cloud a concern?
- Offline or Online Learning
- Determine whether model re-training and prediction need to happen in real time
- Model re-training: Scheduled - Offline; Realtime - Online
- Prediction: Scheduled - Batch; Realtime - Online
- Offline - re-training is done on a scheduled basis; easier to implement and evaluate but slower to adapt to changes
- Online - re-train as new data comes in (minutes/hours); adapts in near real time but harder to implement (see the sketch after this block)
- Batch or Online Predictions
- Batch prediction - predict on batched observations on a scheduled basis; efficient, but predictions are not immediately available for new data
- Online predictions - real-time, on demand; results available immediately; latency can be an issue and the model can drift
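- A minimal sketch of online learning with scikit-learn's partial_fit; the mini-batch generator is a hypothetical stand-in for new data arriving over time.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def stream_of_minibatches(n_batches=10, batch_size=32):
    """Hypothetical stand-in for new data arriving over time."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

# Online learning: update the model incrementally as each mini-batch arrives,
# instead of re-fitting from scratch on a schedule (offline learning)
model = SGDClassifier()
for X_batch, y_batch in stream_of_minibatches():
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
```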
- Technology decisions
- Programming language
- Python
- pandas + NumPy for data manipulation, scikit-learn for modeling, Matplotlib for visualization, NLTK + spaCy for text processing
- R, C/C++
- Data processing tools
- Modeling tools
- TensorFlow (Google), PyTorch (Facebook), Keras (TensorFlow API) - libraries for deep learning
- AutoML
- Enables building models quickly
- Available from Google, Microsoft, AWS, H2O
- Consumes data, trains the model, exposes over API
- API and interface
- Factors that affect these
- Open source vs proprietary
- Learning curve
- Documentation
- Community support
- Talent availability
- Simplicity and flexibility
- Long term cost
- Factor of cost vs time to implement
Challenges of running and deploying models
- System Failures
- Over time, model performance often decreases
- Training-serving skew
- Mismatch between data trained on and data received in prod
- Ex: train on clear images, receive blurry images in prod
- Excessive latency
- Latency in generating predictions
- Due to volume of input data, pipeline, choice of model
- Data drift
- Changes in the environment cause changes in the input data (see the drift-check sketch after this list)
- Concept drift
- Patterns that the model learned no longer apply
- Ex: change in human behavior
- https://www.technologyreview.com/
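- A minimal sketch of a drift check: compare a feature's distribution in recent production data against the training data with a two-sample Kolmogorov-Smirnov test; the arrays and threshold are illustrative stand-ins.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, size=5_000)  # stand-in for the training distribution
prod_feature = rng.normal(loc=0.3, size=5_000)   # stand-in for recent production data

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible data drift (KS statistic = {stat:.3f}) - consider re-training")
```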
- System Monitoring
- Important to monitor performance and issues to prevent disruptions in service
- Monitor input data
- Quality checks
- Changes in distribution
- Changes in correlation between features and targets
- Perform manual audits
- data pipeline, model outputs, target labels
- Audit the model
- Split the data into groups - monitor the performance across groups to identify bias
- Check the impact of features to ensure the prediction results make sense (see the SHAP sketch after this list):
- LIME - Local Interpretable Model-Agnostic Explanations
- SHAP - Shapley Additive Explanations
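- A minimal sketch of a feature-impact check with SHAP, assuming the shap package is installed and reusing the tree-based best_model and X_test from the sketches above.

```python
import shap

# Explain how much each feature pushes individual predictions up or down
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```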
- Model Maintenance Cycle
- Monitor
- Retrain and Update
- Retrain - keep old data, just change weights
- can be scheduled or triggered
- Update - use new data to re-model; allows for model adjustments
- Evaluate
- Deploy
Resources: AI Canon | Andreessen Horowitz (a16z.com)