Tuesday, December 10, 2024

Application of Machine Learning

Application of ML
  • Key - identifying the right ML problems to work on
  • Most ML projects fail! (87% - VentureBeat, 2019)
  • Key questions:
    • Is there a problem?
      • Listen to users
      • Observe in context - field experiment, shadow a user, etc.
    • Can ML help this problem?
      • Is it a good fit for ML - easy / hard / impossible
      • Is data available?
    • Does someone really care if this problem is solved?
      • Business impact vs. Feasibility of providing a solution
      • Target is high business value and feasible to achieve
 
Understanding the problem
 
Validating Product Ideas - Brainstorming
  • Formulate hypothesis
  • Test the hypothesis
    • POC or Mock Up
    • Not an actual working model
  • Analyze the findings
  • Decide to continue or not
  • Refine the hypothesis and repeat
 
ML use considerations
Human factor:
  • Automation - replacing a human
  • Augmentation  - supporting a human
 
Before rushing into ML, see Heuristics:
  • Hard-coded business rules
  • Built on previous experiences
    • Ex: predict sales as the mean value across a recent set of days
  • Can use heuristics to build a baseline and then move to ML
  • Compare ML performance against the baseline (see the sketch below)
  • Pros
    • Minimum computation cost, easy to maintain, easy to read
  • Cons when compared to ML
    • ML performs better
    • Can be re-trained on new data
    • Can solve more problems
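  • A minimal sketch of the baseline-vs-ML comparison above (all numbers and the 7-day window are made up; assumes NumPy and scikit-learn are installed):

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import mean_absolute_error

      # toy daily sales series (made-up data)
      sales = np.array([10, 12, 13, 11, 15, 16, 14, 18, 17, 19, 21, 20], dtype=float)
      train, test = sales[:9], sales[9:]

      # heuristic baseline: predict every future day as the mean of the last 7 training days
      baseline_pred = np.full_like(test, train[-7:].mean())

      # simple ML alternative: fit a linear trend over the day index
      X_train = np.arange(len(train)).reshape(-1, 1)
      X_test = np.arange(len(train), len(sales)).reshape(-1, 1)
      ml_pred = LinearRegression().fit(X_train, train).predict(X_test)

      print("baseline MAE:", mean_absolute_error(test, baseline_pred))
      print("ML MAE:", mean_absolute_error(test, ml_pred))

  • If the ML model does not clearly beat the heuristic baseline, the added complexity may not be worth it.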
 
Project Organization
  • ML project challenges vs. traditional software dev
  • Defining the Process
    • Don't jump into solution
    • Don't spend resources and effort fixing a poorly defined problem
    • Organize the team well
  • CRISP-DM
    • https://www.ibm.com/docs/zh/spss-modeler/18.0.0?topic=dm-crisp-help-overview
    • https://medium.com/@avikumart_/crisp-dm-framework-a-foundational-data-mining-process-model-86fe642da18c
    • Cross-industry standard process for data mining
    • Developed in 1996, flexible industry-agnostic
      • Business understanding
        • Define the problem
          • Target users
          • Write problem statement
          • Why it matters
          • How is it solved today
          • Gaps in current state
        • Define Success
          • Quantify expected business impact
          • Identify constraints
          • Translate impact into metrics
          • Define success targets for metrics
        • Identify factors
          • Gather domain expertise
          • Identify potentially relevant factors
      • Data understanding
        • Gather data
          • Identify data sources
          • Label data
          • Create features
        • Validate data
          • Quality control of data - time consuming
          • Resolve data issues
        • Explore data
          • Stat analysis and visualization
          • Reduce dimensions
          • Identify relationships and patterns
      • Data preparation
        • Split data - training and test sets
        • Define features
        • Prepare for modeling
      • Modeling
        • Model selection
          • Evaluate algorithms
          • Documentation, versioning
        • Model tuning
          • Hyperparameter optimization
          • Documentation, versioning
          • Model re-training
      • Evaluation
        • Evaluate results
          • Run model on test set
          • Interpret outputs and performance
        • Test solution
          • Software unit and integration testing
          • Model unit testing
          • User testing - alpha/beta
      • Deployment
        • Deploy
          • APIs
          • Product integration
          • Scale the infrastructure
          • Security
          • Software deployment
        • Monitor
          • Observe performance
          • Re-train the model
      • ITERATIVE - Go through another iteration of these
 
Team Organization
  • Business Sponsor
  • Product
    • Owner
    • Manager
  • Data Science
    • Scientists
      • Stat/data background
      • Gets insights out of the data
      • Determines ML approach
      • Involved heavily early in the project
  • Engineering
    • Data Engineer
    • Software Engineer
    • ML Engineer
      • CS background
      • Integration of ML into product
    • QA/DevOps
 
 
Agile - Iterative experiments
  • Business understanding
    • Mock up of solution
    • Get feedback from customer
  • Business understanding / Data  understanding
    • Collect data, feed into model
    • Collect customer feedback
  • Business understanding / Data  understanding / Data processing
    • Try real data, heuristic model
    • Collect customer feedback
  • Business understanding / Data  understanding / Data processing / Modeling
    • Try real data, simple ML model
    • Collect customer feedback
  • Circle again if needed
 
Measure Performance
  • Outcome Metrics
    • Business impact, $ usually
    • No technical performance metrics here
    • Internal - not customer facing
  • Output Metrics
    • Customer facing - possibly tested together with customers
    • A/B or Beta testing
  • Non-performance considerations
    • Explainability and interpretability
      • Debug-friendliness
      • Resilience
    • Cost - data and compute
 
Data Needs
  • Historical and Real time
  • Training Data
    • Subject matter experts
    • Customers
    • Temporal and geospatial characteristics
  • How many features
    • Start with a small number of features
    • Add more and evaluate
  • Training Data
    • Use Labels
  • How much data
    • More = better
    • Depends on
      • Number of features
      • How complex feature/target relationship is (linear or not)
      • Data quality
      • Target model performance
  • When collecting data beware of:
    • Obtain only relevant data
    • Introduction of bias
    • Need to update data regularly / retrain model
    • Document data sources
    • Flywheel effect - users interact with the AI; that data is fed back into the model
 
Governance and Access
  • https://engineering.atspotify.com/2020/02/how-we-improved-data-discovery-for-data-scientists-at-spotify/
  • Key barrier - siloed and inaccessible data
  • Break down the barriers and silos first:
    • Cultural Change
      • Executive sponsor
      • Education
    • Technology
      • Centralized DWH
      • Query tools
    • Data Access
      • Responsibility
      • Permissions
  • Data Cleaning
    • Issues - missing or incomplete data
      • Missing data
      • Anomalous data
      • Mis-mapped data
    • Types of missing data
      • Missing Completely at Random
        • No pattern in missing
        • Low bias - not great concern
      • Missing at Random
        • Missing due to another feature of the data
        • High bias potential
      • Missing Not at Random
        • Missing due to values of the feature itself
        • High bias potential
    • Options for dealing with this
      • Remove rows or columns
      • Flag it to be treated as a special case
      • Replace with mean/median
      • Backfill / Forward fill
      • Infer it - use a simpler model to predict the missing values
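    • A minimal pandas sketch of these options (column names and values are made up):

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, np.nan, 22.0],
                           "city": ["A", "A", None, "B", "B"]})

        dropped = df.dropna()                                               # remove rows with missing values
        flagged = df.assign(temp_missing=df["temp"].isna())                 # flag missing as a special case
        mean_filled = df.assign(temp=df["temp"].fillna(df["temp"].mean()))  # replace with the mean
        forward_filled = df.ffill()                                         # forward fill from the previous row
        # "infer it" would mean training a small model to predict the missing values from the other columns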
    • Outliers
      • Can greatly influence the result
      • Use visualizations and statistical methods to identify
      • Understand the root cause
    • Preparing data
      • EDA - Exploratory Data Analysis
        • Understanding the trends in data
        • Catch issues
      • Feature engineering
        • Selecting the right features
      • Feature selection methods (see the sketch after this list)
        • Filter Methods
          • Statistical tests based on data characteristics
          • Used to remove irrelevant features
          • Computationally inexpensive
        • Wrapper Methods
          • Train on subset of features
          • Often non-feasible in real world
          • Computationally expensive
        • Embedded methods
          • Select features that contribute the most
          • Selection happens as part of model training (e.g., via regularization)
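      • A small scikit-learn sketch of a filter method (synthetic data; SelectKBest scores each feature with an ANOVA F-test and keeps the top 3):

          from sklearn.datasets import make_classification
          from sklearn.feature_selection import SelectKBest, f_classif

          # synthetic data: 10 features, only 3 of them informative
          X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

          # filter method: score each feature independently of any model, keep the best 3
          selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
          X_reduced = selector.transform(X)
          print("kept feature indices:", selector.get_support(indices=True))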
      • Transform data for modelling
        • Ensure data is in the format required by the model
 
  • Reproducibility
    • Ability to reproduce results
    • Helps debugging
    • Helps learning (team hand off)
    • Best practices:
      • Documentation
      • Data lineage
        • Tracking data from source to use in a model
        • Adding visualizations to illustrate data relationships
        • Helps meet compliance if required
      • Proper versioning
        • Version code and the model itself
        • Helps rollback
        • Champion/challenger tests - running different versions of model in parallel
 
Technology Selection
  • ML System consist of:
    • UI
    • Data
    • Model
    • Infra
  • Key decisions driving the tech selection:
    • Cloud or Edge
      • Cloud - needs network connectivity; allows for high throughput
      • Edge - primary benefit is latency and security as not exposed to network; need sufficient compute and memory locally
      • Hybrid
        • Use Edge AI to trigger Cloud AI (local events captured, data sent to Cloud)
        • Cache common prediction at Edge or nearest DC
        • Key questions:
          • Is latency a concern?
          • How critical is connectivity?
          • Is security of sending data to cloud a concern?
    • Offline or Online Learning
      • Determine if Model re-training and prediction needs to happen in real-time
        • Model re-training: Scheduled - Offline; Realtime - Online
        • Prediction:  Scheduled - Batch; Realtime - Online
      • Offline - re-training is done on a scheduled basis; easier to implement and evaluate but slower to adapt to changes
      • Online - re-train as new data comes in (min/hr); adapts in near real-time but harder to implement
    • Batch or Online Predictions
      • Batch prediction - predict on batched observations on a scheduled basis; efficient, but predictions are not immediately available for new data
      • Online predictions - real-time, on demand; results available immediately; latency can be an issue, model can drift
  • Technology decisions
    • Programming language
      • Python
        • pandas + NumPy for data manipulation, scikit-learn for modeling, Matplotlib for visualization, NLTK + spaCy for text processing
      • R, C/C++
    • Data processing tools
    • Modeling tools
      • TensorFlow (Google), PyTorch (Facebook), Keras (TensorFlow API) - libraries for deep learning
      • AutoML
        • Enables building models quickly
        • Available from Google, Microsoft, AWS, H2O
        • Consumes data, trains the model, exposes over API
    • API and interface
    • Factors that affect these
      • Open source vs prop
      • Learning curve
      • Documentation
      • Community support
      • Talent availability
      • Simplicity and flexibility
      • Long term cost
      • Factor of cost vs time to implement
 
Challenges of running and deploying models
  • System Failures
    • Over time, the performance of a model often decreases
    • Training-serving skew
      • Mismatch between data trained on and data received in prod
      • Ex: train on clear images, receive blurry images in prod
    • Excessive latency
      • Latency in generating predictions
      • Due to volume of input data, pipeline, choice of model
    • Data drift
      • Changes in environment cause change in input data
    • Concept drift - the relationship between the inputs and the target changes over time
  • System Monitoring
    • Important to monitor the performance and issues to prevent disruption in service
    • Input data monitor
      • Quality checks
      • Changes in distribution
      • Changes in correlation between features and targets
      • Perform manual audits
        • data pipeline, model outputs, target labels
    • Audit the model
      • Split the data into groups - monitor the performance across groups to identify bias
      • Check the impact of features to ensure the prediction results make sense (see the sketch below):
        • LIME - Local Interpretable Model-Agnostic Explanations
        • SHAP - SHapley Additive exPlanations
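      • A minimal sketch of a feature-impact check with the SHAP library (synthetic data; assumes the shap and scikit-learn packages are installed):

          import shap
          from sklearn.datasets import make_regression
          from sklearn.ensemble import RandomForestRegressor

          X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
          model = RandomForestRegressor(random_state=0).fit(X, y)

          # Shapley values: per-sample, per-feature contributions to the prediction
          explainer = shap.TreeExplainer(model)
          shap_values = explainer.shap_values(X)
          print(shap_values.shape)  # (n_samples, n_features)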
  • Model Maintenance Cycle
    • Monitor
    • Retrain and Update
      • Retrain - keep old data, just change weights
        • can be scheduled or triggered
      • Update - use new data to re-model; allows for model adjustments
    • Evaluate
    • Deploy
 

Snowflake Deep Dive


Architecture
  • Key Concepts & Architecture | Snowflake Documentation
  • Historic
    • Shared-disk - Traditional DB approach (MySQL on a PC), shared storage multiple CPUs
    • Shared-nothing - Storage and Compute are decentralized but coupled into nodes; MPP - massively parallel processing. Challenges:
      • Need to scale the store and the compute together
      • Need to take system offline to add nodes, re-distribute data once nodes are added
  • Snowflake Data -  Multi-cluster, Shared data
    • Data is centralized, nodes can be scaled independently and sync data off the central store
    • Central data store sits as files on object-based cloud store
    • Data files are immutable - once a data file is saved it is never modified in place; changes add new files or fully replace old ones
    • Snowflake's architecture is based on a massively parallel processing (MPP) approach, but also introduces unique elements:
      • MPP (Massively Parallel Processing): Distributed processing approach that executes tasks across numerous nodes, enhancing performance and enabling scalability.
      • Shared-Nothing Architecture: Each node operates independently without sharing resources, preventing contention, although Snowflake abstracts this by separating storage and compute.
      • Storage and Compute Separation: Centralized storage accessed by independent compute clusters, allowing flexible, cost-effective scaling and resource allocation.
      • Elastic Scalability: The ability to dynamically adjust compute resources, enabling on-demand performance management and cost efficiency.
      • Services Layer: A central control plane that manages metadata, authentication, query planning, and system optimization, crucial for overall system management.
  • 1. Compute
    • Query Processing Layer
    • Virtual Warehouses - Query processing, multiple clusters (do not confuse w a data warehouse; this is compute)
    • Can have as many as needed - all access the shared data store; can have compute nodes of different size for different needs
    • Multi-node compute cluster X-Small to 4X-large
    • Suspended Warehouse costs nothing - a w/h can only be suspended once all queries running on it have completed; exception - forced suspend
    • Resume and auto-resume - will reprovision the required size nodes
    • Re-size - new size will apply only after all nodes in the new size have been provisioned; old queries will complete on the old size
    • Scale-down - can downsize only after all active queries have completed; unneeded nodes will be decommissioned; only new queries will run on the new size
    • Cache - each virtual w/h has its own cache; cache dropped on w/h suspension
 
  • 2. Storage Layer
    • Micro Partitions - the way data is stored, 50-500 MB of uncompressed data each
    • Data is stored in the order it arrived in the table
    • Stored in columnar format - compressed automatically
      • This allows for searching only within relevant columns when extracting data
    • Micro-partitions are immutable - new data and updates to existing data are added to a new partition; therefore data across partitions may overlap
    • Metadata - to efficiently map to the correct data when searching; stored for each micro-partition:
      • Range of values for each column - min/max
      • Count of distinct values for each column
      • Additional metadata used for optimization
    • Storage Cost
      • Cost is associated with storing data in:
        • Tables
        • Schemas
        • Stages
        • Time Travel
        • Fail-safe features
      • Calculated based on the amount of data stored and the duration for which it is stored
 
 
  • 3. Cloud Services Layer
    • Security and Governance - authentication/authorization
    • Data Sharing
    • Metadata
    • Query parsing and optimization
    • ACID control
    • Contains the Query Results Cache
  • Pricing
 
Editions
 
 
Web UI, Tools, Drivers
  • Connectivity
    • Web UI
      • Classic
      • New
    • Command Line
      • SnowSQL
      • SnowCD - Snowflake Connectivity Diagnostic Tool
    • Drivers / Connectors
      • ODBC Driver
      • JDBC Driver
      • Python Connector
      • Node.js Driver
      • Go Driver
      • .NET Driver
      • Spark Connector
      • Kafka Connector
      • PowerShell Module
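    • A minimal Python Connector sketch (account, user, and warehouse names are placeholders):

        import snowflake.connector  # pip install snowflake-connector-python

        conn = snowflake.connector.connect(
            account="myorg-myaccount", user="MY_USER", password="***",
            warehouse="COMPUTE_WH", database="DEMO_DB", schema="PUBLIC",
        )
        cur = conn.cursor()
        cur.execute("SELECT CURRENT_VERSION()")
        print(cur.fetchone())
        conn.close()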
 
Data Loading
  • Load Data into Snowflake | Snowflake Documentation
  • Load metadata
    • Name of every file that was loaded is saved, along with timestamps
    • This is to ensure that a loaded file is not reprocessed
    • Load metadata expires after 64 days
    • Snowflake skips over any older files for which the load status is undetermined
  • Stage
    • Storage location used to stage data files that are to be loaded into Snowflake or unloaded from Snowflake
    • User Stages - defined and managed by individual Snowflake users. A user can create a named stage object or use an unnamed stage - a temporary location to hold data for quick operations.
    • Table Stages - each table in Snowflake has an implicit stage associated with it. When files are copied into a table without specifying a stage, Snowflake uses the table's stage. This stage is tied to the table and cannot be used separately.
    • Internal Named Stages - named stages that are created within Snowflake and are not tied to any specific user or table. An internal named stage is accessible to any user with the appropriate permissions and is useful for operations that span multiple tables or databases.
    • External Stages - Snowflake allows users to stage data files in external cloud storage, such as Amazon S3, Google Cloud Storage, or Microsoft Azure. An external stage specifies the location (bucket, container) and necessary credentials for Snowflake to access the files. This is often used for bulk data loading or unloading because it leverages the cloud provider's infrastructure, which can be more scalable and cost-effective for handling large datasets.
  • Data Loading with External Stage
    • Bulk files - loaded with the COPY command, executed on a Virtual Warehouse
    • PUT is for uploading data to a stage; COPY is for loading data from a stage into a table
    • Micro partitions are added; metadata is updated and written into Cloud Services
    • Data is first loaded into a stage, then the COPY command is executed
    • Staging area is located b/w data source and the w/house - data is often sftp-ed into Staging
    • Staging can be outside of the Data W/H or be part of the Data W/H
    • Extraction is similarly done via a Stage
    • Stages:
      • External (S3, Azure Blob, Google Storage)
      • Internal - a location within Snowflake
      • Table - each table has an implicit stage; data is COPYed from it into the table itself
      • User Stage - user's personal area, sub-type of an Internal stage
  • Data Loading - File Removal
    • The action of loading the data does not automatically remove or delete the files from the stage. The files remain in the stage until they are explicitly removed
  • Internal Stages for Data Loading
    • Three types:
      • User Stages
      • Table Stages
      • Named Stages
    • Each user and table are assigned an internal stage by default - can't rename or drop
    • Can create additional stages - Named Stages
    • Data in an internal stage is counted towards SF cost
    • Data in a Snowflake internal stage is stored in an encrypted format
    • Data is encrypted prior to being transferred into a Snowflake internal stage
    • Named Stage
      • Can be created, dropped or modified
      • More flexible than a Table or a User stage
      • Can be used to load multiple files, in multiple formats, into multiple tables
      • Can be assigned security and access rights and can be shared across multiple users
    • Table Stage
      • Automatically created for each table; can't be dropped or created manually
      • Multiple users can access same Table stage - but data is loaded into one table only
      • Have the same name as the table - can be addressed using @%tablename
      • Can't set file format on a Table stage - can only specify the format at COPY command
      • No support for file transformation while loading data
    • User Stage
      • Automatically created for each user; can't be dropped or created manually
      • User-specific - a user can access only his own stage
      • Can be referenced as @~
      • Can't set file format on a User stage - can only specify the format at COPY command
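  • A sketch of loading through a named internal stage with the Python Connector (stage, table, file path, and credentials are placeholders; PUT auto-compresses the file to .gz by default):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                         password="***", warehouse="COMPUTE_WH",
                                         database="DEMO_DB", schema="PUBLIC")
      cur = conn.cursor()
      cur.execute("CREATE TABLE IF NOT EXISTS sales (sale_date DATE, amount NUMBER)")  # target table
      cur.execute("CREATE STAGE IF NOT EXISTS my_named_stage")    # named internal stage
      cur.execute("PUT file:///tmp/sales.csv @my_named_stage")    # upload the local file to the stage
      cur.execute("""
          COPY INTO sales
          FROM @my_named_stage/sales.csv.gz
          FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      """)
      conn.close()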
  • Semi-structured Data Support
    • SF has robust support for semi-structured data formats, including:
      • JSON (JavaScript Object Notation)
      • Avro
      • ORC (Optimized Row Columnar)
      • Parquet
      • XML
    • Ex: can load a JSON into a stage and then FLATTEN it
    • Can:
      • Reference the data directly in cloud storage using external tables
      • load the data into a single column of type VARIANT
      • transform and load into separate columns in a standard relational table
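    • A sketch of the VARIANT + FLATTEN approach with the Python Connector (stage name, file name, and JSON attributes are made up):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        cur.execute("CREATE OR REPLACE TABLE raw_events (v VARIANT)")   # whole JSON document in one column
        cur.execute("""
            COPY INTO raw_events
            FROM @my_named_stage/events.json
            FILE_FORMAT = (TYPE = 'JSON')
        """)
        # pull attributes out and flatten a nested array
        cur.execute("""
            SELECT v:user_id::STRING AS user_id, f.value:name::STRING AS item_name
            FROM raw_events, LATERAL FLATTEN(input => v:items) f
        """)
        print(cur.fetchall())
        conn.close()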
  • External Tables
    • Introduction to external tables | Snowflake Documentation
    • Can create tables on data stored outside of Snowflake - metadata is stored in SF, actual data is outside of SF
    • It's a schema object within Snowflake that points to data stored externally, typically in a cloud storage location like Amazon S3, Google Cloud Storage, or Azure Blob Storage. User can define the format of the files (CSV, JSON, Parquet, ORC, etc.) as well as the location(s) where the files are stored.
    • External Tables functionality allows for querying of external tables just like internal - use in joins, create views on top of the external tables
    • An external table can be joined with other tables and views may be created using them
    • External tables are read-only, DML operations CAN'T be pointed to them
    • These do not contribute to Snowflake cost; compute is used though for querying
    • External Tables point to an External Stage - less performant than querying data stored internally
    • Can add a Materialized View on top to improve performance; a materialized view over an external table refreshes only when the external table's metadata is refreshed
    • Directory Tables
      • Directory tables | Snowflake Documentation
      • Implicit object layered on a stage (not a separate database object) and is conceptually similar to an external table because it stores file-level metadata about the data files in the stage
      • A directory table is not a separate database object but is an implicit object available with a stage
      • You can enable the directory table for a stage while creating the stage or enable it afterward
      • A directory table has no grantable privileges of its own
Unloading of Data
  • Overview of data unloading | Snowflake Documentation
  • Very similar to loading - uses COPY command (ex: push data into S3)
  • Unload to multiple file formats
  • Large files are automatically split into chunks when unloading (parallelism); each exported file is 16 MB by default (MAX_FILE_SIZE)
  • The maximum allowed size per file is 5GB if you export data to cloud storage
  • If needed, can set the SINGLE parameter to true to ensure the export goes to a single file.
  • Can compress and encrypt while unloading
    • Gzip compression is automatic; encryption - need to specify
  • Can COPY INTO a Named Internal stage as well, then use GET to download to on-prem storage
    • Compression and encryption is automatically enabled when pushing into an internal stage
  • Similar with an internal Table or User Stage, followed by GET
  • Supported file formats:
    • CSV, TSV, and any other separator character
    • Parquet
    • JSON - only NDJSON (Newline Delimited JSON) is supported
  • Compression
    • Gzip by default
    • Can specify additional methods
  • Encryption
    • Automatically on when exporting to an internal stage
    • NOT automatically on when exporting to an external stage
  • Single vs Multiple files
    • Export into multiple files by default - parallelism
    • Can be changed to a single file
    • The size of each exported file can be configured
Data Transformation
  • At ingestion during COPY can:
    • Omit columns
    • Reorder columns
    • Cast columns into data specific types
    • Truncate text strings
  • NOT supported at COPY
    • Joins
    • Filters (WHERE)
    • Aggregation / flatten
Snowpipe
  • Snowpipe | Snowflake Documentation
  • Continuous load of data from files as soon as they arrive at a stage - within mins of data arrival
    • As opposed to batch loading using COPY
  • Serverless - scale up/down is managed by Snowflake
  • Virtual w/houses are not used
  • Billed separately
  • Can load data from an external stage or an internal stage
  • Loading from cloud:
    • Can use cloud service notification to trigger Snowpipe and push data into an External Stage; Snowpipe then uses COPY to load data into the target table
    • Can use Snowpipe REST API to trigger Snowpipe
  • Internal load - can use an Internal Stage; no REST API for this
  • Snowpipe Streaming | Snowflake Documentation
    • Continuous loading
    • Writes rows of data directly to Snowflake tables without the requirement of staging files
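  • A sketch of creating a pipe for continuous loading (stage, table, and file format are placeholders; with AUTO_INGEST the cloud storage event notifications still have to be wired to the pipe's notification channel):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                         password="***", database="DEMO_DB", schema="PUBLIC")
      cur = conn.cursor()
      cur.execute("""
          CREATE PIPE IF NOT EXISTS sales_pipe
          AUTO_INGEST = TRUE
          AS COPY INTO sales FROM @my_ext_stage FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      """)
      cur.execute("SHOW PIPES")   # the notification_channel column is what the cloud event is pointed at
      print(cur.fetchall())
      conn.close()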
 
Continuous Data Protection
  • Time Travel
    • Query and retrieve historical data
    • Prior to this feature - used to restore from backup
    • Time travel period - 1-90 days (90 in Enterprise Edition, Business Critical, VPS)
    • Cost is charged - need to be careful if data is updated frequently
    • Can:
      • Travel to the period before query
      • Un-drop objects
      • Undo accidental changes
      • Combine time travel with cloning - set up environment with pre-issue data
    • On objects
      • Databases
      • Schemas
      • Tables
    • How it works:
      • Data is stored in micro-partitions - these are immutable
      • Data updated - new micro partition is added and pointed to, old is marked as Deleted
      • Same when data is deleted - it is not physically deleted, but marked as Deleted
      • Deleted partitions can be read at time travel
      • The duration these mark-deleted partitions are kept for is the Time Travel duration  (1-90 days)
        • SQL Extensions: AT and BEFORE clauses; UNDROP statement
        • SELECT … FROM <table> AT(
          • TIMESTAMP => '<timestamp>', or
          • OFFSET => <seconds in the past from now, as a negative number>, or
          • STATEMENT => '<id of the statement to travel back to>' )
        • NOTE: AT is inclusive of changes made at the specified point; BEFORE is exclusive
        • The statement ID must reference a DML statement or transaction
      • UNDROP:
        • Table
        • Schema
        • Database
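    • A small worked Time Travel sketch with the Python Connector (the customers table and the offset are placeholders):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        # read the table as it looked 10 minutes ago
        cur.execute("SELECT * FROM customers AT(OFFSET => -60*10)")
        print(cur.fetchall())
        # recover an accidentally dropped table while it is still inside the Time Travel window
        cur.execute("DROP TABLE customers")
        cur.execute("UNDROP TABLE customers")
        conn.close()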
  • Failsafe
    • Understanding and viewing Fail-safe | Snowflake Documentation
    • Historical data is recoverable for 7 days - permanent tables only
    • Failsafe starts immediately after Time Travel ends
    • Data can only be extracted by Snowflake Support
    • Transient and Temporary tables don't have Failsafe - i.e. 0 days
    • No queries are allowed on data in Failsafe
    • Cost is charged, every 24 hours * number of days
    • Can not disable Failsafe
    • Cannot be used to access historical data but is used to recover from accidental data loss
  • Tables Types
    • Temporary
      • exists for the lifetime of a session, data is deleted after
      • not visible to other sessions
      • Time Travel period of 1 day max, no Failsafe
      • CREATE TEMPORARY TABLE MY_TABLE_TEMP
    • Transient
      • similar to temporary, but persist b/w sessions; can be accessed by other users and other sessions
      • hold temp data that needs to stay alive across sessions
      • Time Travel period of 1 day max, no Failsafe
      • needs to be deleted manually
      • CREATE TRANSIENT TABLE MY_TABLE_TRA
    • Permanent - default type; Time Travel up to 90 days, Failsafe 7 days
  • Streams
    • Feature designed to track changes to data
    • A way to capture and process data manipulation language (DML) changes, such as INSERT, UPDATE, and DELETE operations, that have occurred since the stream was created or since the last time it was queried.
    • Key Features:
      • Change Tracking: Captures INSERT, UPDATE, and DELETE operations on a Snowflake table, allowing identification of data modifications.
      • Real-Time and Incremental Processing: Enables processing of fresh data changes for timely updates in downstream systems.
      • Zero-Copy Cloning: Leverages metadata to reflect data changes, ensuring efficient and consistent tracking across cloned environments.
      • Concurrency and Isolation: Supports concurrent data operations while maintaining accurate change tracking.
      • Schema Evolution: Adapts automatically to changes in the table's schema, such as column additions or removals.
    • Stream cannot be used with External Tables - since external tables do not track changes in the underlying files themselves, they are not eligible for change tracking using streams
    • Streams can be used with directory tables to easily track which files have been added, removed, or changed. This is done by creating a stream on top of the stage object.
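    • A minimal stream sketch with the Python Connector (table and column names are made up; consuming the stream in a DML statement advances its offset):

        import snowflake.connector

        conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                           password="***", warehouse="COMPUTE_WH",
                                           database="DEMO_DB", schema="PUBLIC")
        cur = conn.cursor()
        cur.execute("CREATE OR REPLACE TABLE orders (id INT, item STRING, qty INT)")
        cur.execute("CREATE OR REPLACE TABLE orders_history LIKE orders")
        cur.execute("CREATE OR REPLACE STREAM orders_stream ON TABLE orders")  # start tracking DML on ORDERS
        cur.execute("INSERT INTO orders VALUES (1, 'widget', 3)")
        cur.execute("SELECT * FROM orders_stream")   # delta rows plus METADATA$ACTION etc.
        print(cur.fetchall())
        cur.execute("INSERT INTO orders_history SELECT id, item, qty FROM orders_stream")  # consumes the stream
        conn.close()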
 
 
Cloning
  • Zero Copy Cloning
    • Can create a clone of table or schema without moving the physical data
    • Done by Cloud Services layer on micro-partition metadata
    • Metadata operation; no extra storage needed; fast
    • A table consists of micro partitions - a new table metadata is created pointing to the same micro partitions
    • Since micro partitions are immutable, adding new data to either table will not overwrite the original partitions but will add new partitions instead
    • The source and cloned items are independent - updates to the data in the source table do not automatically update the data in the cloned table
  • Cloning
    • Cloning considerations | Snowflake Documentation
    • CLONE COMMAND works on
      • Databases, schemas, tables
      • Streams
      • Stages, file formats, sequences, tasks
      • CREATE TABLE MY_CLONE CLONE MY_ORIGINAL
    • Fast but not instantaneous if a table is large
    • Permissions
      • Cloned table does not inherit permissions of the original table
      • Child objects of a cloned database or schema do inherit permission
    • Can't clone:
      • External table
      • Named Internal Stages
      • Snowpipes referring to Named Internal Stages
    • Can combine Cloning with Time Travel - clone a table off a Time Travel instance of that table
      • CREATE TABLE MY_COPY CLONE CUSTOMER BEFORE(TIMESTAMP => '<timestamp>'::timestamp_ltz);
    • Can clone a Schema - all child objects within the schema are cloned (i.e. all cloneable objects - tables, etc.)
    • Temporary table - can't clone it into a permanent table; can clone it into another temporary or into a transient table
 
Data Sharing
  • Secure Data Sharing
    • Metadata operation via Cloud Services
    • No physical data movement - shared data references the source partitions
    • Any changes to the source are visible to the consumer
    • No extra storage - no extra fees
    • Sharing across Region and/or cloud providers involves data replications
    • Each Snowflake account is hosted in a particular Snowflake region. To use Snowflake in multiple regions, a Snowflake customer needs to maintain multiple Snowflake accounts, at least one for each region
    • 1:Many: A single provider can share data with many consumers
    • Not Many:1: However, each share does not consolidate data from many providers into a single share for one consumer.
    • The same Snowflake account can share (as a data provider) and consume data (as a data consumer)
    • Terminology
      • Data Provider - a SF account that shares data; the provider creates a Share
      • Share - object that defines what's being shared and with whom; needs to be granted access to the database
        • Tables being shared
        • Schema and Database containing those tables
        • List of Consumers
      • Data Consumer - account that consumes the shared data
        • Once a Share is configured, it appears in the Consumer Account
        • Consumer can configure Read-Only database on the Share and read data
        • No cost for storage to the consumer - just the compute for querying
    • Offerings
      • Data Share
        • One account to another or to many directly
        • Can add specific table to a share only
        • Can share with non-SF customers by setting up Reader accounts for them. Cost and management of these are billed to the data provider though
      • Snowflake Marketplace - access to 3rd party data
      • Data Exchange - customer's own data hub for private exchange
      • SQL Usage Example:
        • On the provider account:
          • CREATE SHARE shr_MYDATA;
          • GRANT USAGE ON DATABASE sharing_db TO SHARE shr_MYDATA;
          • GRANT USAGE ON SCHEMA sharing_db.public TO SHARE shr_MYDATA;
          • GRANT SELECT ON TABLE sharing_db.public.MYDATA TO SHARE shr_MYDATA;
          • ALTER SHARE shr_MYDATA ADD ACCOUNTS = <consumer_account_name>;
        • On the consumer account:
          • USE ROLE ACCOUNTADMIN;
          • CREATE DATABASE Procurement_DB FROM SHARE <provider_account_name>.shr_MYDATA;
          • SELECT * FROM Procurement_DB.public.MYDATA;
      • Can set up share on SF user interface as well
        • With another SF account - Private Sharing
        • With a non-SF - Private Sharing with a New Reader Account; this will create a reader account for the consumer
        • SF Market place - 3rd party data set store, some free some are not
          • Listings can be Ready to Query or By Request and Personalized (for a fee)
        • Data Exchange - private data sharing hub. Market Place is public, but this is by-invitation
 
Performance Features and Optimization
  • Optimizing performance in Snowflake | Snowflake Documentation
  • Options are:
    • Caching
    • Scaling Up / Down / Out (multi-cluster w/h)
    • Partition pruning
    • Materialized Views
    • Search optimization
  • The following strategies may be applied to improve the performance of a virtual warehouse:
  • Query execution steps
    • Cloud Services
      • Query compilation takes place; SQL is converted into a query plan
      • Query Optimizer optimizes the query
    • Virtual Warehouse
      • VWH processes the query plan
      • Reads data from the micro-partitions relevant to the query
      • Data is brought into the VWH RAM - if it does not fit into RAM it can spill to disk
      • Warehouse Cache is created for future similar queries to use
    • Cloud Services
      • Results are returned from VWH and saved in Query Results Cache
      • If the next query can be satisfied from this cache - it is not sent to VWH
  • Query profile
    • Query execution details in graphical format
    • Designed to assist in error resolution
  • Caching
    • Cloud Services - Metadata Cache, Query Results Cache; available to all VWH
      • ALTER SESSION SET USE_CACHED_RESULT = FALSE
      • This disables cache use on the Session level
    • VWH Layer - WH Cache; available to its WH only
    • Metadata Caching
      • as new micro-partitions are created, metadata is updated in Cloud Service layer keeping track of what data resides in each partition
      • Min and max values for each column, count of distinct values: COUNT, MIN, MAX queries can be answered off metadata w/out going down to the tables. There are exceptions to this:
        • On character columns or columns that store string data
        • when MIN and MAX are used in conjunction with complex expressions or transformations rather than directly on the column
        • When additional filtering conditions are applied
    • Query Result Cache
      • Stores and re-uses results from a query
      • Results are stored for 24 hours unless reused; if reused - stored for another 24 hours
      • Can be refreshed for up to 31 days - after which it is discarded
    • Virtual Warehouse Cache
      • Data is cached locally for reading and query processing
      • If a VWH is suspended, all its compute and related temporary storage resources are released or de-provisioned and the WH Cache is removed. When a VWH is re-established the cache is rebuilt over time. Sometimes it might be worth keeping the VWH running to have the cache ready, rather than suspend and then re-build the cache
      • Cache size is proportional to the VWH size - depends on the number of nodes in the VWH
    • Partition Pruning and Clustering Keys
      • Clustering Keys & Clustered Tables | Snowflake Documentation
      • Data is added to a partition in the order it is received
      • Columns can be scattered across multiple micro partitions
      • SF optimizes the search - it does not read micro-partitions that contain no data relevant to the query. The process of eliminating partitions at query time is called Partition Pruning
      • Clustering Key
        • Co-locates similar rows in the same micro-partition
      • Automatic Clustering
        • SF re-distributes data according to the Cluster Key. This happens only if SF deems this beneficial
        • This is Serverless - VWH is not used, but Compute cost is charged
        • Additional Storage cost - as old pre-clustering partitions are kept for fail safe and time travel
      • Partition Pruning
        • A database optimization technique that improves query performance by selectively ignoring (or "pruning") data partitions that are not relevant to the query
        • Snowflake keeps track of what range of data is kept in which micro-partitions for each column. This metadata enables Snowflake to eliminate unnecessary micro-partitions when running queries, boosting overall query performance
        • SF monitors Partitions Scanned and Partitions Total in a query profile. If Scanned = Total, there may be an opportunity to improve pruning (e.g., via a clustering key)
    • Clustering Depth
      • Relevant metadata is stored:
        • Number of micro-partitions in a table
        • Number of overlapping micro partitions
        • Depth of overlap
      • Clustering depth is calculated for specific columns - starts at 1 and goes up; i.e. across how many micro partitions data for a given column is split. For example, if data resides in one partition only - depth is 1, no overlap
      • The smaller - the more well-clustered a table is
      • SYSTEM$CLUSTERING_DEPTH, SYSTEM$CLUSTERING_INFORMATION
    • VWH Optimization
      • Scale Up/Down - when query complexity goes up/down
        • Scale Up takes place only when new nodes are fully provisioned; current queries finish on the pre-upscale size; new queries take advantage of the new size
          • Scale Down takes place only when active queries have completed
      • Multi-Cluster VWH - when there is high query concurrency (large number of simultaneous queries); or if w/load fluctuations unpredictable
      • ALTER WAREHOUSE <name>
      • SET WAREHOUSE_SIZE = SMALL … x6LARGE
    • Multi-Cluster VWH
      • Multi-cluster warehouses | Snowflake Documentation
      • Used mainly when compute is insufficient and queries are being queued
      • Auto-scales up/down
      • Available with Enterprise Edition only
      • The syntax for creating one is:
        • CREATE WAREHOUSE <name> WITH WAREHOUSE_SIZE = <size>
        • MAX_CLUSTER_COUNT = <n>
        • MIN_CLUSTER_COUNT = <n>
        • AUTO_SUSPEND = <seconds> | NULL
        • AUTO_RESUME = TRUE | FALSE
        • INITIALLY_SUSPENDED = TRUE | FALSE
    • Scaling Policies
      • Multi-cluster warehouses | Snowflake Documentation
      • Use MAX and MIN_CLUSTER_COUNT
      • To continuously run at maximum capacity set both to same value
      • Auto-Scaling Policies:
        • Standard - this is the Default. Prioritizes launching new clusters over queueing up queries to save credits
          • Scale UP: spins up a new cluster as soon as queuing is detected; each additional cluster - 20 sec after the preceding one has started
          • Scale DOWN: 2-3 consecutive checks at 1 min intervals to confirm the workload can be redistributed without the extra cluster
        • Economy - aims to preserve credits; applies additional logic to determine if a new cluster is needed before launching it
          • Scale UP: launch a new cluster only if there is enough load to keep it busy for at least 6 min
          • Scale DOWN: 5-6 consecutive checks at 1 min intervals to confirm the workload can be redistributed without the extra cluster
    • Materialized View
      • Requires Enterprise Edition
      • Physically pre-store computed results based on a SELECT query
      • Kept up to date by SF - data is synced with the base table
      • SF re-routes queries to the base table if the materialized view has not synced up yet
      • Compute is charged separately
      • Good for frequent complex and costly queries that normally have consistent results
    • Search Optimization
      • Search Optimization Service | Snowflake Documentation
      • Enterprise Edition min
      • Similar to Secondary Index -  improves query performance
      • Gets enabled on a table or individual columns
      • Maintained by a dedicated managed Service - does not require a VWH; transparent to the user
      • Credit and storage costs are incurred
      • Not available on:
        • Materialized Views
        • External Tables
        • Views
        • Runtime Transformations
        • Semi-structured Data Types
        • Tables without Clustering Keys
        • DML Operations
        • Transactional Operations
        • Data Loading Processes
        • System-tables/System Objects
 
Security
  • Data
    • Encrypted at rest
    • Yearly re-keying
      • Re-encrypt data using a new key if an old key is more than a year old
      • Required at least Enterprise Edition
      • Manually enabled by account Admin
    • Tri-Secret Secure (BYOK)
      • Uses a Composite Master Key - comprised of a customer-managed key and an SF-managed key
      • If customer key is revoked - SF can no longer decrypt data
      • Minimum of Business Critical Edition
  • Authentication
    • MFA
      • Available by default for all accounts; users enroll individually
      • Authentication via Push (on App), SMS code, Phone Call
      • Available in all editions (min - Standard)
      • Best practice - at least ACCOUNTADMIN should use MFA
      • Disable for a user: ALTER USER <user> SET DISABLE_MFA = TRUE
      • Temporarily bypass: ALTER USER <user> SET MINS_TO_BYPASS_MFA = <minutes>
      • MFA supported by SF clients and connectors:
        • SF Web UI
        • SnowSQL
        • SF ODBC and JDBC drivers
        • Python and other connectors
        • Push approach is used by default for these
        • Command line passcode can be used:
          • --mfa-passcode <…> or --mfa-passcode-in-password
        • Can be supplied in connections string for drivers:
          • Passcode=<..> or passcode_in_password=True
    • Key Pair Authentication
      • Public key is assigned to the Snowflake user; the client authenticates with the matching private key
      • User can have up to two keys
      • Supported by all SF editions; SnowSQL and other connectors
    • User Provisioning via SCIM
      • Auto-provisioning of users and group membership
      • SCIM (System for Cross-domain Identity Management) is an open standard for automating the exchange of user identity information between identity providers (like Okta, Azure AD) and service providers (like Snowflake)
      • SCIM is an open standard that provides automatic user provisioning and role synchronization based on identity provider information. When a new user is created in the identity provider, the SCIM automatically provisions the user in Snowflake. Additionally, SCIM can sync groups defined in an identity provider with Snowflake roles
    • SSO through SAML 2.0 Federated Authentication
      • Authenticate with External Identity Provider
      • Native support for Okta, ADFS
      • Google G Suite, MS Azure AD, OneLogin, Ping Identity Ping One
      • Supported by all SF editions
    • Password Policies
      • Passwords at least 8 characters long
      • At least one digit, one upper case, one lower case
      • Admin can create a user with a weak initial password; best practice - force a password change on first login
  • Authorization
    • Role Base Access Control (RBAC)
      • Access to Securable Objects - db's, schemas, tables, views, WHs, etc.
      • Creator of an object has Owner Role
        • This can be transferred to another roles
        • Can grant privileges to other roles (not users) - DAC
    • Discretionary Access Control (DAC)
      • a type of access control mechanism in which access rights are granted or restricted based on the discretion of the object's owner or another authority figure
    • Column Level Security
      • Data Masking - dynamically mask data at query time depending on user role
        • A masking policy is specific to a schema; once a schema exists, a policy can be assigned to tables and views
        • Applies to a column in all locations the column appears
      • External Tokenization
        • Tokenize prior to loading into SF - replace sensitive data with an unreadable token
        • Detokenize during execution, depending on user role
        • Admin creates Access Policy: Logical Condition + Masking Function (gets executed at runtime - unmask for users with a specific role, mask for all others)
        • Once defined, can be applied to one or more columns
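      • A sketch of a dynamic masking policy (table, column, and role names are placeholders; Dynamic Data Masking requires at least Enterprise Edition):

          import snowflake.connector

          conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_USER",
                                             password="***", warehouse="COMPUTE_WH",
                                             database="DEMO_DB", schema="PUBLIC")
          cur = conn.cursor()
          # mask email addresses for every role except ANALYST
          cur.execute("""
              CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
                CASE WHEN CURRENT_ROLE() = 'ANALYST' THEN val ELSE '*** masked ***' END
          """)
          cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")
          conn.close()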
    • Row Level Security
      • Return specific results based on a role
      • Admin creates Access Policy: Logical Condition + Row Filtering Condition
    • Secure Views
      • Used for Data Privacy - hide underlying data from a user
      • Determined when a view is created - use the SECURE keyword
    • Secure UDFs
      • Ensures that user executing the UDF does not have access to the underlying data
    • Audit Logs
    • OAuth 2.0 authentication
  • Cloud provider adds a layer of security
  • Default Roles:
    • Overview of Access Control | Snowflake Documentation
    • ORGADMIN role performs organization-specific tasks like listing all accounts and creating new ones; able to view usage activity across organization
    • ACCOUNTADMIN - Highest-level role with administrative privileges across the entire account. Encapsulates the SYSADMIN and SECURITYADMIN system-defined roles.
      • SECURITYADMIN - Manages user authentication and authorization, incl role grants and revocations.
        • Inherits the privileges of the USERADMIN role
        • USERADMIN - Responsible for user management, including creating and modifying users:
          • Can create users and roles in the account
          • Creating and dropping user accounts
          • Managing roles (creating, dropping, assigning, and unassigning roles to users)
          • Resetting passwords for users
          • Enforcing password policies (setting and modifying password complexity, expiration, etc.)
      • SYSADMIN - Oversees database and object management, excluding user and role management.
  • NOTE: ACCOUNTADMIN can manage accounts, roles, and privileges, view query history, and monitor system usage. They do not automatically have the privilege to see the data returned by queries run by other users unless they have been explicitly granted access to the underlying databases, schemas, and tables that the queries are accessing
  • Network Security
    • TLS 1.2 by default
    • Network Policies
      • By default can connect from any IP
      • Can define Allowed and Blocked lists
        • CREATE NETWORK POLICY POLICY_1
        • ALLOWED_IP_LIST=('111.111….','…')
        • BLOCKED_IP_LIST=('111.111….','…')
          • IPs or IP ranges, IPv4 only
      • Only SECURITYADMIN or higher can create these
      • Or a role with the CREATE NETWORK POLICY privilege
      • Can be applied on the Account basis or to individual Users
      • User level takes precedence over Account
    • Support for private connectivity
      • Min of Business Critical Edition
      • AWS PrivateLink
      • Azure Private Links
      • Google Cloud Private Service Connect
    • Encryption in transit
      • TLS1.2 for all in-transit
      • HTTPS for all connectivity
 
Extending Snowflake
  • UDF - Function
    • SQL
    • Java
    • JavaScript
    • Python
  • Scalar UDF - returns one value for each input
  • Tabular UDFs - returns zero, one or many rows
  • External Function - type of a function
    • Code is stored outside of SF
    • SF calls out to an HTTP endpoint hosted in a remote service for computation or processing
    • SF only maintains the metadata
    • Can be written in any language that standard UDF does not support, can use external libraries and call external APIs
    • Only Scalar external functions are supported
  • Secure UDF - Performance vs Security
  • Stored Procedures
    • Proc code with if-else, loops etc.
      • SQL
      • JavaScript
      • Snowflake Scripting
      • When using Snowpark can use Java, Python or Scala
    • Can construct dynamic SQL statements
    • Returns a single value or tabular data
    • Sprocs vs UDF
      • UDF perform computation on its input data
      • Sproc executes one or more SQL statements
    • Understanding caller’s rights and owner’s rights stored procedures | Snowflake Documentation
    • Sproc created under Caller's Right - uses privileges of the caller; can access caller's session (ex: session variables)
    • Created under Owner's Right - default, runs mostly with the privileges of the stored procedure’s owner. "if the owner has the privileges to perform a task, then the stored procedure can perform that task even when called by a role that does not have privileges to perform that task directly."
    • The right under which a Sproc will be run is defined at creation. Can run ALTER PROCEDURE to change
  • Snowpark
    • Snowpark API | Snowflake Documentation
    • Dev environment - SQL/Python/Java/Scala code get translated into SQL
    • DataFrame API - similar to pandas
    • In-Database processing - Data transformations and computations are pushed down to Snowflake’s compute and storage layers
    • Snowpark DataFrames are evaluated lazily - transformations only build up a plan; nothing runs until an action (e.g., collect or show) is called, at which point the plan is translated into SQL and executed in Snowflake (see the sketch below)
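    • A minimal Snowpark sketch of this lazy evaluation (connection parameters and the ORDERS table are placeholders):

        from snowflake.snowpark import Session
        from snowflake.snowpark.functions import col

        session = Session.builder.configs({
            "account": "myorg-myaccount", "user": "MY_USER", "password": "***",
            "warehouse": "COMPUTE_WH", "database": "DEMO_DB", "schema": "PUBLIC",
        }).create()

        # building the DataFrame only records the transformations - nothing runs yet
        df = session.table("ORDERS").filter(col("QTY") > 2).group_by("ITEM").count()

        df.show()        # action: the plan is translated into SQL and executed in Snowflake
        session.close()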
  • Snowflake Scripting
    • Extensions to SQL
    • Can use variable, handle exceptions, If-then-else, case, loops, etc.
    • Typically used to build SProcs
 
Account and Resource Management
  • Resource Monitors - mechanism for tracking and controlling the consumption of compute resources measured in Snowflake credits. Can be applied to:
    • Single VWH
    • A group of VWHs
    • Entire SF account
  • Based on % of credit can notify and suspend
  • A single VWH can have only one Resource Monitor assigned
    • Exception - account level monitoring
  • Resource Monitors monitor only resources created by the customer
  • Any SF-created resource cannot have a Monitor assigned. This includes:
    • Snowpipe
    • Automatic Re-clustering
    • Materialized View maintenance
    • Etc.
  • NOTE: Resource Monitors cannot manage the cost of cloud services
  • Only ACCOUNTADMIN can create Monitors
    • Can view usage data under SNOWFLAKE.INFORMATION_SCHEMA
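  • A sketch of creating a resource monitor (quota, thresholds, and warehouse name are placeholders; run as ACCOUNTADMIN):

      import snowflake.connector

      conn = snowflake.connector.connect(account="myorg-myaccount", user="MY_ADMIN",
                                         password="***", role="ACCOUNTADMIN")
      cur = conn.cursor()
      cur.execute("""
          CREATE OR REPLACE RESOURCE MONITOR monthly_limit
          WITH CREDIT_QUOTA = 100 FREQUENCY = MONTHLY START_TIMESTAMP = IMMEDIATELY
          TRIGGERS ON 80 PERCENT DO NOTIFY
                   ON 100 PERCENT DO SUSPEND
      """)
      cur.execute("ALTER WAREHOUSE COMPUTE_WH SET RESOURCE_MONITOR = monthly_limit")
      conn.close()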
  • ACCOUNT_USAGE Schema
    • Historical usage data that allows for tracking past activities and resource consumption over time
    • Not real-time - refreshes every 45 min - 3 hrs
    • Data retention - 365 days
    • Shows information on deleted objects
      • QUERY_HISTORY: Details for queries executed, including performance and users
      • WAREHOUSE_METERING_HISTORY: Credit usage for each virtual warehouse
      • STORAGE_USAGE: Daily account-wide storage usage
      • DATABASE_STORAGE_USAGE_HISTORY: Storage usage by each database over time
      • LOAD_HISTORY: Information about data load jobs, including successes and failures
      • LOGIN_HISTORY: Records of user login attempts and details
      • USER_HISTORY: Historical data about user accounts, roles, and activity
      • WAREHOUSE_HISTORY: Changes and usage statistics of virtual warehouses
  • INFORMATION_SCHEMA (aka “Data Dictionary”) - contains views and table functions that provide metadata info on objects and views
    • Snowflake Information Schema | Snowflake Documentation
    • A standardized set of read-only views in SQL databases that provide metadata information about the database objects, such as tables, columns, constraints, and more. It allows users to query system catalog information using regular SQL queries
    • Provides metadata that is updated in real-time (no latency), allowing for immediate visibility into the current state of the database objects
    • Data retention - 14 days typically, ranges from 7 days to 6 months (depending on specific views)
    • Does NOT contain info on deleted (dropped) objects
  • Account Usage vs. Information Schema
  • Snowflake Releases
    • Take place twice a week - transparent to user, no downtime
    • Full Releases - new features, enhancements, bug fixes
    • Patches - issue fixes only
    • New behaviors are released - a customer can opt in during the first month, then the behavior gets enabled automatically, but the customer can opt out
    • Release steps
      • Snowflake Releases | Snowflake Documentation
      • Day 0 - 24 Hr Early Access, VPS
      • Day 0.5 - 12 Hr Early Access, Enterprise Edition or higher only
      • Day 1 (24 hrs later) - all remaining Enterprise Edition (or higher) accounts and all Standard Edition accounts receive the upgrade
 
Compliance
 


CHEAT SHEET

  • Difference: ACCOUNT_USAGE vs. INFORMATION_SCHEMA
    • Includes dropped objects: Yes vs. No
    • Latency of data: 45 minutes to 3 hours (varies by view) vs. None
    • Retention of historical data: 365 days vs. 7 days - 6 months (varies by view/table function)
 
  • Default Roles:
    • ORGADMIN role performs organization-specific tasks like listing all accounts and creating new ones; able to view usage activity across organization
    • ACCOUNTADMIN - Highest-level role with administrative privileges across the entire account. Incl the SYSADMIN and SECURITYADMIN system-defined roles.
      • Can create/delete all objects (databases, schemas, tables, views, etc.).
      • SECURITYADMIN - Manages user authentication and authorization, incl role grants and revocations.
        • Inherits the privileges of the USERADMIN role
        • USERADMIN - Responsible for user management, including creating and modifying users:
          • Can create users and roles in the account
          • Creating and dropping user accounts
          • Managing roles (creating, dropping, assigning, and unassigning roles to users)
          • Resetting passwords for users
          • Enforcing password policies (setting and modifying password complexity, expiration, etc.)
      • SYSADMIN - Oversees database and object management, excluding user and role management.
Editions
  • Snowflake Editions | Snowflake Documentation
  • Standard Edition
  • Enterprise Edition
    • Time travel - 1-90 days
  • Business Critical Edition (also known as Enterprise for Sensitive Data)
    • Time travel - 1-90 days
  • Virtual Private Snowflake (VPS)
    • Time travel - 1-90 days
 
Authentication
  • MFA, Key Pair, User Provisioning via SCIM, SSO through SAML 2.0 Federated Authentication, Password Policies
 
Caching
  • Cloud Services: Metadata Cache, Query Results Cache
  • VWH Layer - WH Cache
  • Partition Pruning
  • Clustering Keys (Co-locates similar rows in partition)
  • Automatic Clustering - re-distributes data according to the Cluster Key
 
VWH Optimization
  • Scale Up/Down
  • Multi-Cluster VWH - when there is a large number of simultaneous queries and queries are queueing (need more compute)
  • Scaling Policies
  • Materialized View
  • Search Optimization
  • Query acceleration
 
Auto-Scale
  • Standard
    • Prevents queuing
    • Start Up New - waits 20 sec
    • Shut Down - after 2-3 consecutive checks at 1 min intervals
  • Economy
    • Conserves Credits
    • Start Up New - if est. to keep one busy for at least 6 mins
    • Shut Down - after 5-6 consecutive checks at 1 min intervals
 
Data Transform - ingestion during COPY can:
  • Omit columns
  • Reorder columns
  • Cast columns into data specific types
  • Truncate text strings