Tuesday, December 10, 2024

Application of Machine Learning

Application of ML
  • Key - identifying the right ML problems to work on
  • Most ML projects fail! (87% - VentureBeat, 2019)
  • Key questions:
    • Is there a problem?
      • Listen to users
      • Observe in context - field experiment, shadow a user, etc.
    • Can ML help this problem?
      • Is it a good fit for ML - easy / hard / impossible
      • Is data available?
    • Does someone really care if this problem is solved?
      • Business impact vs. Feasibility of providing a solution
      • Target is high business value and feasible to achieve
 
Understanding the problem
 
Validating Product Ideas - Brainstorming
  • Formulate hypothesis
  • Test the hypothesis
    • POC or Mock Up
    • Not an actual working model
  • Analyze the findings
  • Decide to continue or not
  • Refine the hypothesis and repeat
 
ML use considerations
Human factor:
  • Automation - replacing a human
  • Augmentation  - supporting a human
 
Before rushing into ML, see Heuristics:
  • Hard-coded business rules
  • Built on previous experiences
    • Ex: predict sales as the mean value across a recent set of days
  • Can use heuristics to build a baseline and then move to ML
  • Compare ML performance against the baseline (see the sketch below)
  • Pros
    • Minimum computation cost, easy to maintain, easy to read
  • Cons when compared to ML
    • ML performs better
    • Can be re-trained on new data
    • Can solve more problems
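  • A minimal sketch of the baseline-vs-ML comparison above (all numbers and the 7-day window are made up; assumes NumPy and scikit-learn are installed):

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_absolute_error

      # toy daily sales series (made-up data)
      sales = np.array([10, 12, 13, 11, 15, 16, 14, 18, 17, 19, 21, 20], dtype=float)
      train, test = sales[:9], sales[9:]

      # heuristic baseline: predict every future day as the mean of the last 7 training days
      baseline_pred = np.full_like(test, train[-7:].mean())

      # simple ML alternative: fit a linear trend over the day index
      X_train = np.arange(len(train)).reshape(-1, 1)
      X_test = np.arange(len(train), len(sales)).reshape(-1, 1)
      ml_pred = LinearRegression().fit(X_train, train).predict(X_test)

      print("baseline MAE:", mean_absolute_error(test, baseline_pred))
      print("ML MAE:", mean_absolute_error(test, ml_pred))

  • If the ML model does not clearly beat the heuristic baseline, the added complexity may not be worth it.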
 
Project Organization
  • ML project challenges vs. traditional software dev
  • Defining the Process
    • Don't jump into solution
    • Don't spend resources and effort fixing a poorly defined problem
    • Organize the team well
  • CRISP-DM
    • https://www.ibm.com/docs/zh/spss-modeler/18.0.0?topic=dm-crisp-help-overview
    • https://medium.com/@avikumart_/crisp-dm-framework-a-foundational-data-mining-process-model-86fe642da18c
    • Cross-industry standard process for data mining
    • Developed in 1996, flexible industry-agnostic
      • Business understanding
        • Define the problem
          • Target users
          • Write problem statement
          • Why it matters
          • How is it solved today
          • Gaps in current state
        • Define Success
          • Quantify expected business impact
          • Identify constraints
          • Translate impact into metrics
          • Define success targets for metrics
        • Identify factors
          • Gather domain expertise
          • Identify potentially relevant factors
      • Data understanding
        • Gather data
          • Identify data sources
          • Label data
          • Create features
        • Validate data
          • Quality control of data - time consuming
          • Resolve data issues
        • Explore data
          • Stat analysis and visualization
          • Reduce dimensions
          • Identify relationships and patterns
      • Data preparation
        • Split data - training and test sets
        • Define features
        • Prepare for modeling
      • Modeling
        • Model selection
          • Evaluate algorithms
          • Documentation, versioning
        • Model tuning
          • Hyperparameter optimization
          • Documentation, versioning
          • Model re-training
      • Evaluation
        • Evaluate results
          • Run model on test set
          • Interpret outputs and performance
        • Test solution
          • Software unit and integration testing
          • Model unit testing
          • User testing - alpha/beta
      • Deployment
        • Deploy
          • APIs
          • Product integration
          • Scale the infrastructure
          • Security
          • Software deployment
        • Monitor
          • Observe performance
          • Re-train the model
      • ITERATIVE - Go through another iteration of these
 
Team Organization
  • Business Sponsor
  • Product
    • Owner
    • Manager
  • Data Science
    • Scientists
      • Stat/data background
      • Gets insights out of the data
      • Determines ML approach
      • Involved heavily early in the project
  • Engineering
    • Data Engineer
    • Software Engineer
    • ML Engineer
      • CS background
      • Integration of ML into product
    • QA/DevOps
 
 
Agile - Iterative experiments
  • Business understanding
    • Mock up of solution
    • Get feedback from customer
  • Business understanding / Data  understanding
    • Collect data, feed into model
    • Collect customer feedback
  • Business understanding / Data  understanding / Data processing
    • Try real data, heuristic model
    • Collect customer feedback
  • Business understanding / Data  understanding / Data processing / Modeling
    • Try real data, simple ML model
    • Collect customer feedback
  • Circle again if needed
 
Measure Performance
  • Outcome Metrics
    • Business impact, $ usually
    • No technical performance metrics here
    • Internal - not customer facing
  • Output Metrics
    • Customer facing - possibly tested together with customers
    • A/B or Beta testing
  • Non-performance considerations
    • Explainability and interpretability
      • Debug-friendliness
      • Resilience
    • Cost - data and compute
 
Data Needs
  • Historical and Real time
  • Training Data
    • Subject matter experts
    • Customers
    • Temporal and geospatial characteristics
  • How many features
    • Start with a small number of features
    • Add more and evaluate
  • Training Data
    • Use Labels
  • How much data
    • More = better
    • Depends on
      • Number of features
      • How complex feature/target relationship is (linear or not)
      • Data quality
      • Target model performance
  • When collecting data beware of:
    • Obtain only relevant data
    • Introduction of bias
    • Need to update data regularly / retrain model
    • Document data sources
    • Flywheel effect - users interact with the AI; that data is fed back into the model
 
Governance and Access
  • https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/
  • Key barrier - siloed and inaccessible data
  • Break down the barriers and silos first:
    • Cultural Change
      • Executive sponsor
      • Education
    • Technology
      • Centralized DWH
      • Query tools
    • Data Access
      • Responsibility
      • Permissions
  • Data Cleaning
    • Issues - missing or incomplete data
      • Missing data
      • Anomalous data
      • Mis-mapped data
    • Types of missing data
      • Missing Completely at Random
        • No pattern in missing
        • Low bias - not great concern
      • Missing at Random
        • Missing due to another feature of the data
        • High bias potential
      • Missing Not at Random
        • Missing due to values of the feature itself
        • High bias potential
    • Options for dealing with this
      • Remove rows or columns
      • Flag it to be treated as a special case
      • Replace with mean/median
      • Backfill / Forward fill
      • Infer it - use a simpler model to predict the missing values
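    • A minimal pandas sketch of these options (column names and values are made up):

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, np.nan, 22.0],
                           "city": ["A", "A", None, "B", "B"]})

        dropped = df.dropna()                                               # remove rows with missing values
        flagged = df.assign(temp_missing=df["temp"].isna())                 # flag missing as a special case
        mean_filled = df.assign(temp=df["temp"].fillna(df["temp"].mean()))  # replace with the mean
        forward_filled = df.ffill()                                         # forward fill from the previous row
        # "infer it" would mean training a small model to predict the missing values from the other columns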
    • Outliers
      • Can greatly influence the result
      • Use visualizations and statistical methods to identify
      • Understand the root cause
    • Preparing data
      • EDA - Exploratory Data Analysis
        • Understanding the trends in data
        • Catch issues
      • Feature engineering
        • Selecting the right features
      • Feature selection methods (see the sketch after this list)
        • Filter Methods
          • Statistical tests based on data characteristics
          • Used to remove irrelevant features
          • Computationally inexpensive
        • Wrapper Methods
          • Train on subset of features
          • Often non-feasible in real world
          • Computationally expensive
        • Embedded methods
          • Select features that contribute the most
          • Selection happens as part of model training (e.g., via regularization)
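      • A small scikit-learn sketch of a filter method (synthetic data; SelectKBest scores each feature with an ANOVA F-test and keeps the top 3):

          from sklearn.datasets import make_classification
          from sklearn.feature_selection import SelectKBest, f_classif

          # synthetic data: 10 features, only 3 of them informative
          X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

          # filter method: score each feature independently of any model, keep the best 3
          selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
          X_reduced = selector.transform(X)
          print("kept feature indices:", selector.get_support(indices=True))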
      • Transform data for modelling
        • Ensure data is in the format required by the model
 
  • Reproducibility
    • Ability to reproduce results
    • Helps debugging
    • Helps learning (team hand off)
    • Best practices:
      • Documentation
      • Data lineage
        • Tracking data from source to use in a model
        • Adding visualizations to illustrate data relationships
        • Helps meet compliance if required
      • Proper versioning
        • Version code and the model itself
        • Helps rollback
        • Champion/challenger tests - running different versions of model in parallel
 
Technology Selection
  • ML System consist of:
    • UI
    • Data
    • Model
    • Infra
  • Key decisions driving the tech selection:
    • Cloud or Edge
      • Cloud - needs network connectivity; allows for high throughput
      • Edge - primary benefit is latency and security as not exposed to network; need sufficient compute and memory locally
      • Hybrid
        • Use Edge AI to trigger Cloud AI (local events captured, data sent to Cloud)
        • Cache common prediction at Edge or nearest DC
        • Key questions:
          • Is latency a concern?
          • How critical is connectivity?
          • Is security of sending data to cloud a concern?
    • Offline or Online Learning
      • Determine if Model re-training and prediction needs to happen in real-time
        • Model re-training: Scheduled - Offline; Realtime - Online
        • Prediction:  Scheduled - Batch; Realtime - Online
      • Offline - re-training is done on a scheduled basis; easier to implement and evaluate but slower to adapt to changes
      • Online - re-train as new data comes in (min/hr); adapts in near real-time but harder to implement
    • Batch or Online Predictions
      • Batch prediction - predict on batched observations on a scheduled basis; efficient, but predictions are not immediately available for new data
      • Online predictions - real-time, on demand; results available immediately; latency can be an issue, model can drift
  • Technology decisions
    • Programming language
      • Python
        • pandas + NumPy for data manipulation, scikit-learn for modeling, Matplotlib for visualization, NLTK + spaCy for text processing
      • R, C/C++
    • Data processing tools
    • Modeling tools
      • TensorFlow (Google), PyTorch (Facebook), Keras (TensorFlow API) - libraries for deep learning
      • AutoML
        • Enables building models quickly
        • Available from Google, Microsoft, AWS, H2O
        • Consumes data, trains the model, exposes over API
    • API and interface
    • Factors that affect these
      • Open source vs prop
      • Learning curve
      • Documentation
      • Community support
      • Talent availability
      • Simplicity and flexibility
      • Long term cost
      • Factor of cost vs time to implement
 
Challenges of running and deploying models
  • System Failures
    • Over time, the performance of a model often decreases
    • Training-serving skew
      • Mismatch between data trained on and data received in prod
      • Ex: train on clear images, receive blurry images in prod
    • Excessive latency
      • Latency in generating predictions
      • Due to volume of input data, pipeline, choice of model
    • Data drift
      • Changes in environment cause change in input data
    • Concept drift - the relationship between the inputs and the target changes over time
  • System Monitoring
    • Important to monitor the performance and issues to prevent disruption in service
    • Input data monitor
      • Quality checks
      • Changes in distribution
      • Changes in correlation between features and targets
      • Perform manual audits
        • data pipeline, model outputs, target labels
    • Audit the model
      • Split the data into groups - monitor the performance across groups to identify bias
      • Check the impact of features to ensure the prediction results make sense (see the sketch below):
        • LIME - Local Interpretable Model-Agnostic Explanations
        • SHAP - SHapley Additive exPlanations
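      • A minimal sketch of a feature-impact check with the SHAP library (synthetic data; assumes the shap and scikit-learn packages are installed):

          import shap
          from sklearn.datasets import make_regression
          from sklearn.ensemble import RandomForestRegressor

          X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
          model = RandomForestRegressor(random_state=0).fit(X, y)

          # Shapley values: per-sample, per-feature contributions to the prediction
          explainer = shap.TreeExplainer(model)
          shap_values = explainer.shap_values(X)
          print(shap_values.shape)  # (n_samples, n_features)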
  • Model Maintenance Cycle
    • Monitor
    • Retrain and Update
      • Retrain - keep old data, just change weights
        • can be scheduled or triggered
      • Update - use new data to re-model; allows for model adjustments
    • Evaluate
    • Deploy
 

Snowflake Deep Dive


Architecture
  • Key Concepts & Architecture | Snowflake Documentation
  • Historic
    • Shared-disk - Traditional DB approach (MySQL on a PC), shared storage multiple CPUs
    • Shared-nothing - Storage and Compute are decentralized but coupled into nodes; MPP - massively parallel processing. Challenges:
      • Need to scale the store and the compute together
      • Need to take system offline to add nodes, re-distribute data once nodes are added
  • Snowflake Data -  Multi-cluster, Shared data
    • Data is centralized, nodes can be scaled independently and sync data off the central store
    • Central data store sits as files on object-based cloud store
    • Data files are immutable - once a data file is saved it is never modified in place; changes add new files or fully replace old ones
    • Snowflake's architecture is based on a massively parallel processing (MPP) approach, but also introduces unique elements:
      • MPP (Massively Parallel Processing): Distributed processing approach that executes tasks across numerous nodes, enhancing performance and enabling scalability.
      • Shared-Nothing Architecture: Each node operates independently without sharing resources, preventing contention, although Snowflake abstracts this by separating storage and compute.
      • Storage and Compute Separation: Centralized storage accessed by independent compute clusters, allowing flexible, cost-effective scaling and resource allocation.
      • Elastic Scalability: The ability to dynamically adjust compute resources, enabling on-demand performance management and cost efficiency.
      • Services Layer: A central control plane that manages metadata, authentication, query planning, and system optimization, crucial for overall system management.
  • 1. Compute
    • Query Processing Layer
    • Virtual Warehouses - Query processing, multiple clusters (do not confuse w a data warehouse; this is compute)
    • Can have as many as needed - all access the shared data store; can have compute nodes of different size for different needs
    • Multi-node compute cluster X-Small to 4X-large
    • Suspended Warehouse costs nothing - a w/h can only be suspended once all queries running on it have completed; exception - forced suspend
    • Resume and auto-resume - will reprovision the required size nodes
    • Re-size - new size will apply only after all nodes in the new size have been provisioned; old queries will complete on the old size
    • Scale-down - can downsize only after all active queries have completed; unneeded nodes will be decommissioned; only new queries will run on the new size
    • Cache - each virtual w/h has its own cache; cache dropped on w/h suspension
 
  • 2. Storage Layer
    • Micro Partitions - the way data is stored, 50-500 MB of uncompressed data each
    • Data is stored in the order it arrived in the table
    • Stored in columnar format - compressed automatically
      • This allows for searching only within relevant columns when extracting data
    • Micro-partitions are immutable - new data and updates to existing data are added to a new partition; therefore data across partitions may overlap
    • Metadata - to efficiently map to the correct data when searching; stored for each micro-partition:
      • Range of values for each column - min/max
      • Count of distinct values for each column
      • Additional metadata used for optimization
    • Storage Cost
      • Cost is associated with storing data in:
        • Tables
        • Schemas
        • Stages
        • Time Travel
        • Fail-safe features
      • Calculated based on the amount of data stored and the duration for which it is stored
 
 
  • 3. Cloud Services Layer
    • Security and Governance - authentication/authorization
    • Data Sharing
    • Metadata
    • Query parsing and optimization
    • ACID control
    • Contains the Query Results Cache
  • Pricing
 
Editions
 
 
Web UI, Tools, Drivers
  • Connectivity
    • Web UI
      • Classic
      • New
    • Command Line
      • SnowSQL
      • SnowCD - Snowflake Connectivity Diagnostic Tool
    • Drivers / Connectors
      • ODBC Driver
      • JDBC Driver
      • Python Connector
      • Node.js Driver
      • Go Driver
      • .NET Driver
      • Spark Connector
      • Kafka Connector
      • PowerShell Module
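    • A minimal Python Connector sketch (account, user, and warehouse names are placeholders):

        import snowflake.connector  # pip install snowflake-connector-python

        conn = snowflake.connector.connect(
            account="myorg-myaccount", user="MY_USER", password="***",
            warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
        )
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_VERSION()")
        print(cur.fetchone())
        conn.close()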
 
Data Loading
  • Load Data into Snowflake | Snowflake Documentation
  • Load metadata
    • Name of every file that was loaded is saved, along with timestamps
    • This is to ensure that a loaded file is not reprocessed
    • Load metadata expires after 64 days
    • Snowflake skips over any older files for which the load status is undetermined
  • Stage
    • Storage location used to stage data files that are to be loaded into Snowflake or unloaded from Snowflake
    • User Stages - defined and managed by individual Snowflake users. A user can create a named stage object or use an unnamed stage - a temporary location to hold data for quick operations.
    • Table Stages - each table in Snowflake has an implicit stage associated with it. When files are copied into a table without specifying a stage, Snowflake uses the table's stage. This stage is tied to the table and cannot be used separately.
    • Internal Named Stages - named stages that are created within Snowflake and are not tied to any specific user or table. An internal named stage is accessible to any user with the appropriate permissions and is useful for operations that span multiple tables or databases.
    • External Stages - Snowflake allows users to stage data files in external cloud storage, such as Amazon S3, Google Cloud Storage, or Microsoft Azure. An external stage specifies the location (bucket, container) and necessary credentials for Snowflake to access the files. This is often used for bulk data loading or unloading because it leverages the cloud provider's infrastructure, which can be more scalable and cost-effective for handling large datasets.
  • Data Loading with External Stage
    • Bulk files - loaded with the COPY command, executed on a Virtual Warehouse
    • PUT is for uploading data to a stage; COPY is for loading data from a stage into a table
    • Micro partitions are added; metadata is updated and written into Cloud Services
    • Data is first loaded into a stage, then the COPY command is executed
    • Staging area is located b/w data source and the w/house - data is often sftp-ed into Staging
    • Staging can be outside of the Data W/H or be part of the Data W/H
    • Extraction is similarly done via a Stage
    • Stages:
      • External (S3, Azure Blob, Google Storage)
      • Internal - a location within Snowflake
      • Table - each table has an implicit stage; data is COPYed from it into the table itself
      • User Stage - user's personal area, sub-type of an Internal stage
  • Data Loading - File Removal
    • The action of loading the data does not automatically remove or delete the files from the stage. The files remain in the stage until they are explicitly removed
  • Internal Stages for Data Loading
    • Three types:
      • User Stages
      • Table Stages
      • Named Stages
    • Each user and table are assigned an internal stage by default - can't rename or drop
    • Can create additional stages - Named Stages
    • Data in an internal stage is counted towards SF cost
    • Data in a Snowflake internal stage is stored in an encrypted format
    • Data is encrypted prior to being transferred into a Snowflake internal stage
    • Named Stage
      • Can be created, dropped or modified
      • More flexible than a Table or a User stage
      • Can be used to load multiple files, in multiple formats, into multiple tables
      • Can be assigned security and access rights and can be shared across multiple users
    • Table Stage
      • Automatically created for each table; can't be dropped or created manually
      • Multiple users can access same Table stage - but data is loaded into one table only
      • Have the same name as the table - can be addressed using @%tablename
      • Can't set file format on a Table stage - can only specify the format at COPY command
      • No support for file transformation while loading data
    • User Stage
      • Automatically created for each user; can't be dropped or created manually
      • User-specific - a user can access only his own stage
      • Can be referenced as @~
      • Can't set file format on a User stage - can only specify the format at COPY command
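  • A sketch of loading through a named internal stage with the Python Connector (stage, table, file path, and credentials are placeholders; PUT auto-compresses the file to .gz by default):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                         password="***", warehouse="COMPUTE_WH",
                                         database="DEMO_DB", schema="PUBLIC")
      cur = conn.cursor()
      cur.execute("CREATE TABLE IF NOT EXISTS sales (sale_date DATE, amount NUMBER)")  # target table
      cur.execute("CREATE STAGE IF NOT EXISTS my_named_stage")    # named internal stage
      cur.execute("PUT file:///tmp/sales.csv @my_named_stage")    # upload the local file to the stage
      cur.execute("""
          COPY INTO sales
          FROM @my_named_stage/sales.csv.gz
          FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      """)
      conn.close()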
  • Semi-structured Data Support
    • SF has robust support for semi-structured data formats, including:
      • JSON (JavaScript Object Notation)
      • Avro
      • ORC (Optimized Row Columnar)
      • Parquet
      • XML
    • Ex: can load a JSON into a stage and then FLATTEN it
    • Can:
      • Reference the data directly in cloud storage using external tables
      • load the data into a single column of type VARIANT
      • transform and load into separate columns in a standard relational table
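    • A sketch of the VARIANT + FLATTEN approach with the Python Connector (stage name, file name, and JSON attributes are made up):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        cur.execute("CREATE OR REPLACE TABLE raw_events (v VARIANT)")   # whole JSON document in one column
        cur.execute("""
            COPY INTO raw_events
            FROM @my_named_stage/events.json
            FILE_FORMAT = (TYPE = 'JSON')
        """)
        # pull attributes out and flatten a nested array
        cur.execute("""
            SELECT v:user_id::STRING AS user_id, f.value:name::STRING AS item_name
            FROM raw_events, LATERAL FLATTEN(input => v:items) f
        """)
        print(cur.fetchall())
        conn.close()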
  • External Tables
    • Introduction to external tables | Snowflake Documentation
    • Can create tables on data stored outside of Snowflake - metadata is stored in SF, actual data is outside of SF
    • It's a schema object within Snowflake that points to data stored externally, typically in a cloud storage location like Amazon S3, Google Cloud Storage, or Azure Blob Storage. User can define the format of the files (CSV, JSON, Parquet, ORC, etc.) as well as the location(s) where the files are stored.
    • External Tables functionality allows for querying of external tables just like internal - use in joins, create views on top of the external tables
    • An external table can be joined with other tables and views may be created using them
    • External tables are read-only, DML operations CAN'T be pointed to them
    • These do not contribute to Snowflake cost; compute is used though for querying
    • External Tables point to an External Stage - less performant than querying data stored internally
    • Can add a Materialized View on top to improve performance; a materialized view over an external table refreshes only when the external table's metadata is refreshed
    • Directory Tables
      • Directory tables | Snowflake Documentation
      • Implicit object layered on a stage (not a separate database object) and is conceptually similar to an external table because it stores file-level metadata about the data files in the stage
      • A directory table is not a separate database object but is an implicit object available with a stage
      • You can enable the directory table for a stage while creating the stage or enable it afterward
      • A directory table has no grantable privileges of its own
Unloading of Data
  • Overview of data unloading | Snowflake Documentation
  • Very similar to loading - uses COPY command (ex: push data into S3)
  • Unload to multiple file formats
  • Large files are automatically split into chunks when unloading (parallelism); each exported file is 16 MB by default (MAX_FILE_SIZE)
  • The maximum allowed size per file is 5GB if you export data to cloud storage
  • If needed, can set the SINGLE parameter to true to ensure the export goes to a single file.
  • Can compress and encrypt while unloading
    • Gzip compression is automatic; encryption - need to specify
  • Can COPY INTO a Named Internal stage as well, then use GET to download to on-prem storage
    • Compression and encryption is automatically enabled when pushing into an internal stage
  • Similar with an internal Table or User Stage, followed by GET
  • Supported file formats:
    • CSV, TSV, and any other separator character
    • Parquet
    • JSON - only NDJSON (Newline Delimited JSON) is supported
  • Compression
    • Gzip by default
    • Can specify additional methods
  • Encryption
    • Automatically on when exporting to an internal stage
    • NOT automatically on when exporting to an external stage
  • Single vs Multiple files
    • Export into multiple files by default - parallelism
    • Can be changed to a single file
    • The size of each exported file can be configured
Data Transformation
  • At ingestion during COPY can:
    • Omit columns
    • Reorder columns
    • Cast columns into data specific types
    • Truncate text strings
  • NOT supported at COPY
    • Joins
    • Filters (WHERE)
    • Aggregation / flatten
Snowpipe
  • Snowpipe | Snowflake Documentation
  • Continuous load of data from files as soon as they arrive at a stage - within mins of data arrival
    • As opposed to batch loading using COPY
  • Serverless - scale up/down is managed by Snowflake
  • Virtual w/houses are not used
  • Billed separately
  • Can load data from an external stage or an internal stage
  • Loading from cloud:
    • Can use cloud service notification to trigger Snowpipe and push data into an External Stage; Snowpipe then uses COPY to load data into the target table
    • Can use Snowpipe REST API to trigger Snowpipe
  • Internal load - can use an Internal Stage; no REST API for this
  • Snowpipe Streaming | Snowflake Documentation
    • Continuous loading
    • Writes rows of data directly to Snowflake tables without the requirement of staging files
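  • A sketch of creating a pipe for continuous loading (stage, table, and file format are placeholders; with AUTO_INGEST the cloud storage event notifications still have to be wired to the pipe's notification channel):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                         password="***", database="DEMO_DB", schema="PUBLIC")
      cur = conn.cursor()
      cur.execute("""
          CREATE PIPE IF NOT EXISTS sales_pipe
          AUTO_INGEST = TRUE
          AS COPY INTO sales FROM @my_ext_stage FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      """)
      cur.execute("SHOW PIPES")   # the notification_channel column is what the cloud event is pointed at
      print(cur.fetchall())
      conn.close()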
 
Continuous Data Protection
  • Time Travel
    • Query and retrieve historical data
    • Prior to this feature - used to restore from backup
    • Time travel period - 1-90 days (90 in Enterprise Edition, Business Critical, VPS)
    • Cost is charged - need to be careful if data is updated frequently
    • Can:
      • Travel to the period before query
      • Un-drop objects
      • Undo accidental changes
      • Combine time travel with cloning - set up environment with pre-issue data
    • On objects
      • Databases
      • Schemas
      • Tables
    • How it works:
      • Data is stored in micro-partitions - these are immutable
      • Data updated - new micro partition is added and pointed to, old is marked as Deleted
      • Same when data is deleted - it is not physically deleted, but marked as Deleted
      • Deleted partitions can be read at time travel
      • The duration these mark-deleted partitions are kept for is the Time Travel duration  (1-90 days)
        • SQL Extensions: AT and BEFORE clauses; UNDROP statement
        • SELECT … FROM <table> AT(
          • TIMESTAMP => '<timestamp>', or
          • OFFSET => <seconds in the past from now, as a negative number>, or
          • STATEMENT => '<id of the statement to travel back to>' )
        • NOTE: AT is inclusive of changes made at the specified point; BEFORE is exclusive
        • The statement ID must reference a DML statement or transaction
      • UNDROP:
        • Table
        • Schema
        • Database
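    • A small worked Time Travel sketch with the Python Connector (the customers table and the offset are placeholders):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        # read the table as it looked 10 minutes ago
        cur.execute("SELECT * FROM customers AT(OFFSET => -60*10)")
        print(cur.fetchall())
        # recover an accidentally dropped table while it is still inside the Time Travel window
        cur.execute("DROP TABLE customers")
        cur.execute("UNDROP TABLE customers")
        conn.close()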
  • Failsafe
    • Understanding and viewing Fail-safe | Snowflake Documentation
    • Historical data is recoverable for 7 days - permanent tables only
    • Failsafe starts immediately after Time Travel ends
    • Data can only be extracted by Snowflake Support
    • Transient and Temporary tables don't have Failsafe - i.e. 0 days
    • No queries are allowed on data in Failsafe
    • Cost is charged, every 24 hours * number of days
    • Can not disable Failsafe
    • Cannot be used to access historical data but is used to recover from accidental data loss
  • Tables Types
    • Temporary
      • exists for the lifetime of a session, data is deleted after
      • not visible to other sessions
      • Time Travel period of 1 day max, no Failsafe
      • CREATE TEMPORARY TABLE MY_TABLE_TEMP
    • Transient
      • similar to temporary, but persist b/w sessions; can be accessed by other users and other sessions
      • hold temp data that needs to stay alive across sessions
      • Time Travel period of 1 day max, no Failsafe
      • needs to be deleted manually
      • CREATE TRANSIENT TABLE MY_TABLE_TRA
    • Permanent - default type; Time Travel up to 90 days, Failsafe 7 days
  • Streams
    • Feature designed to track changes to data
    • A way to capture and process data manipulation language (DML) changes, such as INSERT, UPDATE, and DELETE operations, that have occurred since the stream was created or since the last time it was queried.
    • Key Features:
      • Change Tracking: Captures INSERT, UPDATE, and DELETE operations on a Snowflake table, allowing identification of data modifications.
      • Real-Time and Incremental Processing: Enables processing of fresh data changes for timely updates in downstream systems.
      • Zero-Copy Cloning: Leverages metadata to reflect data changes, ensuring efficient and consistent tracking across cloned environments.
      • Concurrency and Isolation: Supports concurrent data operations while maintaining accurate change tracking.
      • Schema Evolution: Adapts automatically to changes in the table's schema, such as column additions or removals.
    • Stream cannot be used with External Tables - since external tables do not track changes in the underlying files themselves, they are not eligible for change tracking using streams
    • Streams can be used with directory tables to easily track which files have been added, removed, or changed. This is done by creating a stream on top of the stage object.
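    • A minimal stream sketch with the Python Connector (table and column names are made up; consuming the stream in a DML statement advances its offset):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        cur.execute("CREATE OR REPLACE TABLE orders (id INT, item STRING, qty INT)")
        cur.execute("CREATE OR REPLACE TABLE orders_history LIKE orders")
        cur.execute("CREATE OR REPLACE STREAM orders_stream ON TABLE orders")  # start tracking DML on ORDERS
        cur.execute("INSERT INTO orders VALUES (1, 'widget', 3)")
        cur.execute("SELECT * FROM orders_stream")   # delta rows plus METADATA$ACTION etc.
        print(cur.fetchall())
        cur.execute("INSERT INTO orders_history SELECT id, item, qty FROM orders_stream")  # consumes the stream
        conn.close()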
 
 
Cloning
  • Zero Copy Cloning
    • Can create a clone of table or schema without moving the physical data
    • Done by Cloud Services layer on micro-partition metadata
    • Metadata operation; no extra storage needed; fast
    • A table consists of micro partitions - a new table metadata is created pointing to the same micro partitions
    • Since micro partitions are immutable, adding new data to either table will not overwrite the original partitions but will add new partitions instead
    • The source and cloned items are independent - updates to the data in the source table do not automatically update the data in the cloned table
  • Cloning
    • Cloning considerations | Snowflake Documentation
    • CLONE COMMAND works on
      • Databases, schemas, tables
      • Streams
      • Stages, file formats, sequences, tasks
      • CREATE TABLE MY_CLONE CLONE MY_ORIGINAL
    • Fast but not instantaneous if a table is large
    • Permissions
      • Cloned table does not inherit permissions of the original table
      • Child objects of a cloned database or schema do inherit permission
    • Can't clone:
      • External table
      • Named Internal Stages
      • Snowpipes referring to Named Internal Stages
    • Can combine Cloning with Time Travel - clone a table off a Time Travel instance of that table
      • CREATE TABLE MY_COPY CLONE CUSTOMER BEFORE(TIMESTAMP => '<timestamp>'::timestamp_ltz);
    • Can clone a Schema - all child objects within the schema are cloned (i.e. all cloneable objects - tables, etc.)
    • Temporary table - can't clone it into a permanent table; can clone it into another temporary or into a transient table
 
Data Sharing
  • Secure Data Sharing
    • Metadata operation via Cloud Services
    • No physical data movement - shared data references the source partitions
    • Any changes to the source are visible to the consumer
    • No extra storage - no extra fees
    • Sharing across Region and/or cloud providers involves data replications
    • Each Snowflake account is hosted in a particular Snowflake region. To use Snowflake in multiple regions, a Snowflake customer needs to maintain multiple Snowflake accounts, at least one for each region
    • 1:Many: A single provider can share data with many consumers
    • Not Many:1: However, each share does not consolidate data from many providers into a single share for one consumer.
    • The same Snowflake account can share (as a data provider) and consume data (as a data consumer)
    • Terminology
      • Data Provider - a SF account that shares data; the provider creates a Share
      • Share - object that defines what's being shared and with whom; needs to be granted access to the database
        • Tables being shared
        • Schema and Database containing those tables
        • List of Consumers
      • Data Consumer - account that consumes the shared data
        • Once a Share is configured, it appears in the Consumer Account
        • Consumer can configure Read-Only database on the Share and read data
        • No cost for storage to the consumer - just the compute for querying
    • Offerings
      • Data Share
        • One account to another or to many directly
        • Can add specific table to a share only
        • Can share with non-SF customers by setting up Reader accounts for them. Cost and management of these are billed to the data provider though
      • Snowflake Marketplace - access to 3rd party data
      • Data Exchange - customer's own data hub for private exchange
      • SQL Usage Example:
        • On the provider account:
          • CREATE SHARE shr_MYDATA;
          • GRANT USAGE ON DATABASE sharing_db TO SHARE shr_MYDATA;
          • GRANT USAGE ON SCHEMA sharing_db.public TO SHARE shr_MYDATA;
          • GRANT SELECT ON TABLE sharing_db.public.MYDATA TO SHARE shr_MYDATA;
          • ALTER SHARE shr_MYDATA ADD ACCOUNTS = <consumer_account_name>;
        • On the consumer account:
          • USE ROLE ACCOUNTADMIN;
          • CREATE DATABASE Procurement_DB FROM SHARE <provider_account_name>.shr_MYDATA;
          • SELECT * FROM Procurement_DB.public.MYDATA;
      • Can set up share on SF user interface as well
        • With another SF account - Private Sharing
        • With a non-SF - Private Sharing with a New Reader Account; this will create a reader account for the consumer
        • SF Market place - 3rd party data set store, some free some are not
          • Listings can be Ready to Query or By Request and Personalized (for a fee)
        • Data Exchange - private data sharing hub. Market Place is public, but this is by-invitation
 
Performance Features and Optimization
  • Optimizing performance in Snowflake | Snowflake Documentation
  • Options are:
    • Caching
    • Scaling Up / Down / Out (multi-cluster w/h)
    • Partition pruning
    • Materialized Views
    • Search optimization
  • The following strategies may be applied to improve the performance of a virtual warehouse:
  • Query execution steps
    • Cloud Services
      • Query compilation takes place; SQL is converted into a query plan
      • Query Optimizer optimizes the query
    • Virtual Warehouse
      • VWH processes the query plan
      • Reads data from the micro-partitions relevant to the query
      • Data is brought into the VWH RAM - if it does not fit into RAM it can spill to disk
      • Warehouse Cache is created for future similar queries to use
    • Cloud Services
      • Results are returned from VWH and saved in Query Results Cache
      • If the next query can be satisfied from this cache - it is not sent to VWH
  • Query profile
    • Query execution details in graphical format
    • Designed to assist in error resolution
  • Caching
    • Cloud Services - Metadata Cache, Query Results Cache; available to all VWH
      • ALTER SESSION SET USE_CACHED_RESULT = FALSE
      • This disables cache use on the Session level
    • VWH Layer - WH Cache; available to its WH only
    • Metadata Caching
      • as new micro-partitions are created, metadata is updated in Cloud Service layer keeping track of what data resides in each partition
      • Min and max values for each column, count of distinct values: COUNT, MIN, MAX queries can be answered off metadata w/out going down to the tables. There are exceptions to this:
        • On character columns or columns that store string data
        • when MIN and MAX are used in conjunction with complex expressions or transformations rather than directly on the column
        • When additional filtering conditions are applied
    • Query Result Cache
      • Stores and re-uses results from a query
      • Results are stored for 24 hours unless reused; if reused - stored for another 24 hours
      • Can be refreshed for up to 31 days - after which it is discarded
    • Virtual Warehouse Cache
      • Data is cached locally for reading and query processing
      • If a VWH is suspended, all its compute and related temporary storage resources are released or de-provisioned and the WH Cache is removed. When a VWH is re-established the cache is rebuilt over time. Sometimes it might be worth keeping the VWH running to have the cache ready, rather than suspend and then re-build the cache
      • Cache size is proportional to the VWH size - depends on the number of nodes in the VWH
    • Partition Pruning and Clustering Keys
      • Clustering Keys & Clustered Tables | Snowflake Documentation
      • Data is added to a partition in the order it is received
      • Columns can be scattered across multiple micro partitions
      • SF optimizes the search - it does not read micro-partitions that contain no data relevant to the query. The process of eliminating partitions at query time is called Partition Pruning
      • Clustering Key
        • Co-locates similar rows in the same micro-partition
      • Automatic Clustering
        • SF re-distributes data according to the Cluster Key. This happens only if SF deems this beneficial
        • This is Serverless - VWH is not used, but Compute cost is charged
        • Additional Storage cost - as old pre-clustering partitions are kept for fail safe and time travel
      • Partition Pruning
        • A database optimization technique that improves query performance by selectively ignoring (or "pruning") data partitions that are not relevant to the query
        • Snowflake keeps track of what range of data is kept in which micro-partitions for each column. This metadata enables Snowflake to eliminate unnecessary micro-partitions when running queries, boosting overall query performance
        • SF monitors Partitions Scanned and Partitions Total in a query profile. If Scanned = Total, there may be an opportunity to improve pruning (e.g., via a clustering key)
    • Clustering Depth
      • Relevant metadata is stored:
        • Number of micro-partitions in a table
        • Number of overlapping micro partitions
        • Depth of overlap
      • Clustering depth is calculated for specific columns - starts at 1 and goes up; i.e. across how many micro partitions data for a given column is split. For example, if data resides in one partition only - depth is 1, no overlap
      • The smaller - the more well-clustered a table is
      • SYSTEM$CLUSTERING_DEPTH, SYSTEM$CLUSTERING_INFORMATION
    • VWH Optimization
      • Scale Up/Down - when query complexity goes up/down
        • Scale Up takes place only when new nodes are fully provisioned; current queries finish on the pre-upscale size; new queries take advantage of the new size
          • Scale Down takes place only when active queries have completed
      • Multi-Cluster VWH - when there is high query concurrency (large number of simultaneous queries); or if w/load fluctuations unpredictable
      • ALTER WAREHOUSE <name>
      • SET WAREHOUSE_SIZE = SMALL … x6LARGE
    • Multi-Cluster VWH
      • Multi-cluster warehouses | Snowflake Documentation
      • Used mainly when compute is insufficient and queries are being queued
      • Auto-scales up/down
      • Available with Enterprise Edition only
      • The syntax for creating one is:
        • CREATE WAREHOUSE <name> WITH WAREHOUSE_SIZE = <size>
        • MAX_CLUSTER_COUNT = <n>
        • MIN_CLUSTER_COUNT = <n>
        • AUTO_SUSPEND = <seconds> | NULL
        • AUTO_RESUME = TRUE | FALSE
        • INITIALLY_SUSPENDED = TRUE | FALSE
    • Scaling Policies
      • Multi-cluster warehouses | Snowflake Documentation
      • Use MAX and MIN_CLUSTER_COUNT
      • To continuously run at maximum capacity set both to same value
      • Auto-Scaling Policies:
        • Standard - this is the Default. Prioritizes launching new clusters over queueing up queries to save credits
          • Scale UP: spins up a new cluster as soon as queuing is detected; each additional cluster - 20 sec after the preceding one has started
          • Scale DOWN: 2-3 consecutive checks at 1 min intervals to confirm the workload can be redistributed without the extra cluster
        • Economy - aims to preserve credits; applies additional logic to determine if a new cluster is needed before launching it
          • Scale UP: launch a new cluster only if there is enough load to keep it busy for at least 6 min
          • Scale DOWN: 5-6 consecutive checks at 1 min intervals to confirm the workload can be redistributed without the extra cluster
    • Materialized View
      • Requires Enterprise Edition
      • Physically pre-store computed results based on a SELECT query
      • Kept up to date by SF - data is synced with the base table
      • SF re-routes queries to the base table if the materialized view has not synced up yet
      • Compute is charged separately
      • Good for frequent complex and costly queries that normally have consistent results
    • Search Optimization
      • Search Optimization Service | Snowflake Documentation
      • Enterprise Edition min
      • Similar to Secondary Index -  improves query performance
      • Gets enabled on a table or individual columns
      • Maintained by a dedicated managed Service - does not require a VWH; transparent to the user
      • Credit and storage costs are incurred
      • Not available on:
        • Materialized Views
        • External Tables
        • Views
        • Runtime Transformations
        • Semi-structured Data Types
        • Tables without Clustering Keys
        • DML Operations
        • Transactional Operations
        • Data Loading Processes
        • System-tables/System Objects
 
Security
  • Data
    • Encrypted at rest
    • Yearly re-keying
      • Re-encrypt data using a new key if an old key is more than a year old
      • Required at least Enterprise Edition
      • Manually enabled by account Admin
    • Tri-Secret Secure (BYOK)
      • Uses a Composite Master Key - comprised of a customer-managed key and an SF-managed key
      • If customer key is revoked - SF can no longer decrypt data
      • Minimum of Business Critical Edition
  • Authentication
    • MFA
      • Available by default for all accounts; users enroll individually
      • Authentication via Push (on App), SMS code, Phone Call
      • Available in all editions (min - Standard)
      • Best practice - at least ACCOUNTADMIN should use MFA
      • Disable for a user: ALTER USER <user> SET DISABLE_MFA = TRUE
      • Temporarily bypass: ALTER USER <user> SET MINS_TO_BYPASS_MFA = <minutes>
      • MFA supported by SF clients and connectors:
        • SF Web UI
        • SnowSQL
        • SF ODBC and JDBC drivers
        • Python and other connectors
        • Push approach is used by default for these
        • Command line passcode can be used:
          • --mfa-passcode <…> or --mfa-passcode-in-password
        • Can be supplied in connections string for drivers:
          • Passcode=<..> or passcode_in_password=True
    • Key Pair Authentication
      • Public key is assigned to the Snowflake user; the client authenticates with the matching private key
      • User can have up to two keys
      • Supported by all SF editions; SnowSQL and other connectors
    • User Provisioning via SCIM
      • Auto-provisioning of users and group membership
      • SCIM (System for Cross-domain Identity Management) is an open standard for automating the exchange of user identity information between identity providers (like Okta, Azure AD) and service providers (like Snowflake)
      • SCIM is an open standard that provides automatic user provisioning and role synchronization based on identity provider information. When a new user is created in the identity provider, the SCIM automatically provisions the user in Snowflake. Additionally, SCIM can sync groups defined in an identity provider with Snowflake roles
    • SSO through SAML 2.0 Federated Authentication
      • Authenticate with External Identity Provider
      • Native support for Okta, ADFS
      • Google G Suite, MS Azure AD, OneLogin, Ping Identity Ping One
      • Supported by all SF editions
    • Password Policies
      • Passwords at least 8 characters long
      • At least one digit, one upper case, one lower case
      • Admin can create a user with a weak initial password; best practice - force a password change on first login
  • Authorization
    • Role Base Access Control (RBAC)
      • Access to Securable Objects - db's, schemas, tables, views, WHs, etc.
      • Creator of an object has Owner Role
        • This can be transferred to another roles
        • Can grant privileges to other roles (not users) - DAC
    • Discretionary Access Control (DAC)
      • a type of access control mechanism in which access rights are granted or restricted based on the discretion of the object's owner or another authority figure
    • Column Level Security
      • Data Masking - dynamically mask data at query time depending on user role
        • A masking policy is specific to a schema; once a schema exists, a policy can be assigned to tables and views
        • Applies to a column in all locations the column appears
      • External Tokenization
        • Tokenize prior to loading into SF - replace sensitive data with an unreadable token
        • Detokenize during execution, depending on user role
        • Admin creates Access Policy: Logical Condition + Masking Function (gets executed at runtime - unmask for users with a specific role, mask for all others)
        • Once defined, can be applied to one or more columns
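      • A sketch of a dynamic masking policy (table, column, and role names are placeholders; Dynamic Data Masking requires at least Enterprise Edition):

          import snowflake.connector

          conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                             password="***", warehouse="COMPUTE_WH",
                                             database="DEMO_DB", schema="PUBLIC")
          cur = conn.cursor()
          # mask email addresses for every role except ANALYST
          cur.execute("""
              CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
                CASE WHEN CURRENT_ROLE() = 'ANALYST' THEN val ELSE '*** masked ***' END
          """)
          cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")
          conn.close()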
    • Row Level Security
      • Return specific results based on a role
      • Admin creates Access Policy: Logical Condition + Row Filtering Condition
    • Secure Views
      • Used for Data Privacy - hide underlying data from a user
      • Determined when a view is created - use the SECURE keyword
    • Secure UDFs
      • Ensures that user executing the UDF does not have access to the underlying data
    • Audit Logs
    • OAuth 2.0 authentication
  • Cloud provider adds a layer of security
  • Default Roles:
    • Overview of Access Control | Snowflake Documentation
    • ORGADMIN role performs organization-specific tasks like listing all accounts and creating new ones; able to view usage activity across organization
    • ACCOUNTADMIN - Highest-level role with administrative privileges across the entire account. Encapsulates the SYSADMIN and SECURITYADMIN system-defined roles.
      • SECURITYADMIN - Manages user authentication and authorization, incl role grants and revocations.
        • Inherits the privileges of the USERADMIN role
        • USERADMIN - Responsible for user management, including creating and modifying users:
          • Can create users and roles in the account
          • Creating and dropping user accounts
          • Managing roles (creating, dropping, assigning, and unassigning roles to users)
          • Resetting passwords for users
          • Enforcing password policies (setting and modifying password complexity, expiration, etc.)
      • SYSADMIN - Oversees database and object management, excluding user and role management.
  • NOTE: ACCOUNTADMIN can manage accounts, roles, and privileges, view query history, and monitor system usage. They do not automatically have the privilege to see the data returned by queries run by other users unless they have been explicitly granted access to the underlying databases, schemas, and tables that the queries are accessing
  • Network Security
    • TLS 1.2 by default
    • Network Policies
      • By default can connect from any IP
      • Can define Allowed and Blocked lists
        • CREATE NETWORK POLICY POLICY_1
        • ALLOWED_IP_LIST=('111.111….','…')
        • BLOCKED_IP_LIST=('111.111….','…')
          • IPs or IP ranges, IPv4 only
      • Only SECURITYADMIN or higher can create these
      • Or a role with the CREATE NETWORK POLICY privilege
      • Can be applied on the Account basis or to individual Users
      • User level takes precedence over Account
    • Support for private connectivity
      • Min of Business Critical Edition
      • AWS PrivateLink
      • Azure Private Links
      • Google Cloud Private Service Connect
    • Encryption in transit
      • TLS1.2 for all in-transit
      • HTTPS for all connectivity
 
Extending Snowflake
  • UDF - Function
    • SQL
    • Java
    • JavaScript
    • Python
  • Scalar UDF - returns one value for each input
  • Tabular UDFs - returns zero, one or many rows
  • External Function - type of a function
    • Code is stored outside of SF
    • SF calls out to an HTTP endpoint hosted in a remote service for computation or processing
    • SF only maintains the metadata
    • Can be written in any language that standard UDF does not support, can use external libraries and call external APIs
    • Only Scalar external functions are supported
  • Secure UDF - Performance vs Security
  • Stored Procedures
    • Proc code with if-else, loops etc.
      • SQL
      • JavaScript
      • Snowflake Scripting
      • When using Snowpark can use Java, Python or Scala
    • Can construct dynamic SQL statements
    • Returns a single value or tabular data
    • Sprocs vs UDF
      • UDF perform computation on its input data
      • Sproc executes one or more SQL statements
    • Understanding caller’s rights and owner’s rights stored procedures | Snowflake Documentation
    • Sproc created under Caller's Right - uses privileges of the caller; can access caller's session (ex: session variables)
    • Created under Owner's Right - default, runs mostly with the privileges of the stored procedure’s owner. "if the owner has the privileges to perform a task, then the stored procedure can perform that task even when called by a role that does not have privileges to perform that task directly."
    • The right under which a Sproc will be run is defined at creation. Can run ALTER PROCEDURE to change
  • Snowpark
    • Snowpark API | Snowflake Documentation
    • Dev environment - SQL/Python/Java/Scala code get translated into SQL
    • DataFrame API - similar to pandas
    • In-Database processing - Data transformations and computations are pushed down to Snowflake’s compute and storage layers
    • Snowpark DataFrames are evaluated lazily - transformations only build up a plan; nothing runs until an action (e.g., collect or show) is called, at which point the plan is translated into SQL and executed in Snowflake (see the sketch below)
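    • A minimal Snowpark sketch of this lazy evaluation (connection parameters and the ORDERS table are placeholders):

        from snowflake.snowpark import Session
        from snowflake.snowpark.functions import col

        session = Session.builder.configs({
            "account": "myorg-myaccount", "user": "MY_USER", "password": "***",
            "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
        }).create()

        # building the DataFrame only records the transformations - nothing runs yet
        df = session.table("ORDERS").filter(col("QTY") > 2).group_by("ITEM").count()

        df.show()        # action: the plan is translated into SQL and executed in Snowflake
        session.close()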
  • Snowflake Scripting
    • Extensions to SQL
    • Can use variable, handle exceptions, If-then-else, case, loops, etc.
    • Typically used to build SProcs
 
Account and Resource Management
  • Resource Monitors - mechanism for tracking and controlling the consumption of compute resources measured in Snowflake credits. Can be applied to:
    • Single VWH
    • A group of VWHs
    • Entire SF account
  • Based on % of credit can notify and suspend
  • A single VWH can have only one Resource Monitor assigned
    • Exception - account level monitoring
  • Resource Monitors monitor only resources created by the customer
  • Any SF-created resource cannot have a Monitor assigned. This includes:
    • Snowpipe
    • Automatic Re-clustering
    • Materialized View maintenance
    • Etc.
  • NOTE: Resource Monitors cannot manage the cost of cloud services
  • Only ACCOUNTADMIN can create Monitors
    • Can view usage data under SNOWFLAKE.INFORMATION_SCHEMA
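  • A sketch of creating a resource monitor (quota, thresholds, and warehouse name are placeholders; run as ACCOUNTADMIN):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_ADMIN",
                                         password="***", role="ACCOUNTADMIN")
      cur = conn.cursor()
      cur.execute("""
          CREATE OR REPLACE RESOURCE MONITOR monthly_limit
          WITH CREDIT_QUOTA = 100 FREQUENCY = MONTHLY START_TIMESTAMP = IMMEDIATELY
          TRIGGERS ON 80 PERCENT DO NOTIFY
                   ON 100 PERCENT DO SUSPEND
      """)
      cur.execute("ALTER WAREHOUSE COMPUTE_WH SET RESOURCE_MONITOR = monthly_limit")
      conn.close()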
  • ACCOUNT_USAGE Schema
    • Historical usage data that allows for tracking past activities and resource consumption over time
    • Not real-time - refreshes every 45 min - 3 hrs
    • Data retention - 365 days
    • Shows information on deleted objects
      • QUERY_HISTORY: Details for queries executed, including performance and users
      • WAREHOUSE_METERING_HISTORY: Credit usage for each virtual warehouse
      • STORAGE_USAGE: Daily account-wide storage usage
      • DATABASE_STORAGE_USAGE_HISTORY: Storage usage by each database over time
      • LOAD_HISTORY: Information about data load jobs, including successes and failures
      • LOGIN_HISTORY: Records of user login attempts and details
      • USER_HISTORY: Historical data about user accounts, roles, and activity
      • WAREHOUSE_HISTORY: Changes and usage statistics of virtual warehouses
  • INFORMATION_SCHEMA (aka “Data Dictionary”) - contains views and table functions that provide metadata info on objects and views
    • Snowflake Information Schema | Snowflake Documentation
    • A standardized set of read-only views in SQL databases that provide metadata information about the database objects, such as tables, columns, constraints, and more. It allows users to query system catalog information using regular SQL queries
    • Provides metadata that is updated in real-time (no latency), allowing for immediate visibility into the current state of the database objects
    • Data retention - 14 days typically, ranges from 7 days to 6 months (depending on specific views)
    • Does NOT contain info on deleted (dropped) objects
  • Account Usage vs. Information Schema
  • Snowflake Releases
    • Take place twice a week - transparent to user, no downtime
    • Full Releases - new features, enhancements, bug fixes
    • Patches - issue fixes only
    • New behaviors are released - a customer can opt in during the first month, then the behavior gets enabled automatically, but the customer can opt out
    • Release steps
      • Snowflake Releases | Snowflake Documentation
      • Day 0 - 24 Hr Early Access, VPS
      • Day 0.5 - 12 Hr Early Access, Enterprise Edition or higher only
      • Day 1 (24 hrs later) - all remaining Enterprise Edition (or higher) accounts and all Standard Edition accounts receive the upgrade
 
Compliance
 


CHEAT SHEET

  • Difference: ACCOUNT_USAGE vs. INFORMATION_SCHEMA
    • Includes dropped objects: Yes vs. No
    • Latency of data: 45 minutes to 3 hours (varies by view) vs. None
    • Retention of historical data: 365 days vs. 7 days - 6 months (varies by view/table function)
 
  • Default Roles:
    • ORGADMIN role performs organization-specific tasks like listing all accounts and creating new ones; able to view usage activity across organization
    • ACCOUNTADMIN - Highest-level role with administrative privileges across the entire account. Incl the SYSADMIN and SECURITYADMIN system-defined roles.
      • Can create/delete all objects (databases, schemas, tables, views, etc.).
      • SECURITYADMIN - Manages user authentication and authorization, incl role grants and revocations.
        • Inherits the privileges of the USERADMIN role
        • USERADMIN - Responsible for user management, including creating and modifying users:
          • Can create users and roles in the account
          • Creating and dropping user accounts
          • Managing roles (creating, dropping, assigning, and unassigning roles to users)
          • Resetting passwords for users
          • Enforcing password policies (setting and modifying password complexity, expiration, etc.)
      • SYSADMIN - Oversees database and object management, excluding user and role management.
Editions
  • Snowflake Editions | Snowflake Documentation
  • Standard Edition
  • Enterprise Edition
    • Time travel - 1-90 days
  • Business Critical Edition (also known as Enterprise for Sensitive Data)
    • Time travel - 1-90 days
  • Virtual Private Snowflake (VPS)
    • Time travel - 1-90 days
 
Authentication
  • MFA, Key Pair, User Provisioning via SCIM, SSO through SAML 2.0 Federated Authentication, Password Policies
 
Caching
  • Cloud Services: Metadata Cache, Query Results Cache
  • VWH Layer - WH Cache
  • Partition Pruning
  • Clustering Keys (Co-locates similar rows in partition)
  • Automatic Clustering - re-distributes data according to the Cluster Key
 
VWH Optimization
  • Scale Up/Down
  • Multi-Cluster VWH - when there is a large number of simultaneous queries and queries are queueing (need more compute)
  • Scaling Policies
  • Materialized View
  • Search Optimization
  • Query acceleration
 
Auto-Scale
  • Standard
    • Prevents queuing
    • Start Up New - waits 20 sec
    • Shut Down - after 2-3 consecutive checks at 1 min intervals
  • Economy
    • Conserves Credits
    • Start Up New - if est. to keep one busy for at least 6 mins
    • Shut Down - after 5-6 consecutive checks at 1 min intervals
 
Data Transform - ingestion during COPY can:
  • Omit columns
  • Reorder columns
  • Cast columns into data specific types
  • Truncate text strings