Aleks on AWS: Redshift

Redshift
- https://docs.aws.amazon.com/redshift/index.html
- Warehouse – relational database; used for analysis, not for transaction processing
- OLAP – online analytical processing; low transactional volumes

- OLTP (online transaction processing) – use RDS instead

- A Redshift data warehouse is a collection of nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases

- Redshift –

- Up to 10 times faster than a traditional SQL RDBMS for analytic queries (per AWS)
- Fully managed, scales to petabytes (1 PB = 10^15 bytes, roughly 2 to the power of 50)
- Distributed;
      §  Backup retention up to 35 days (chargeable)
- Auto-recovery
- Not meant for consuming real time data from multiple sources (unlike Kinesis)
- Encryption at rest – AES-256
      §  If Redshift manages the keys, encryption is on by default
      §  Can elect to use HSM or KMS
- Encryption in transit – SSL (HTTPS) between applications and the Redshift cluster
- No root (OS-level) access to Redshift nodes; applications connect to the cluster through its endpoint rather than to the nodes directly
- Performance achieved using:
      §  https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html

§ Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a

      §  Columnar storage (not rows) – fewer I/O operations, data stored sequentially, great for analytics
      §  Advanced compression – enabled for sequential columnar storage
      §  Massively Parallel Processing (MPP) – data and queries are distributed across multiple nodes
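The columnar-storage and compression points above can be illustrated with a minimal sketch. It is a toy model in plain Python, not how Redshift stores data internally: the sample table, the function names, and the choice of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical sample table: (order_id, region, amount)
rows = [
    (1, "eu", 10.0),
    (2, "eu", 12.5),
    (3, "eu", 7.5),
    (4, "us", 20.0),
    (5, "us", 5.0),
]

def sum_amount_row_store(table):
    """Row layout: every field of every record is read to sum one column."""
    values_touched = 0
    total = 0.0
    for record in table:
        values_touched += len(record)  # the whole record is read
        total += record[2]
    return total, values_touched

# Column layout: each column is stored contiguously.
columns = {
    "order_id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def sum_amount_column_store(cols):
    """Column layout: only the 'amount' column is read."""
    amounts = cols["amount"]
    return sum(amounts), len(amounts)

def run_length_encode(values):
    """Compress runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

row_total, row_touched = sum_amount_row_store(rows)
col_total, col_touched = sum_amount_column_store(columns)
region_rle = run_length_encode(columns["region"])
```

Both layouts return the same total, but the row layout reads three times as many values for this three-column table. The `region` column also compresses well because equal values sit next to each other.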

- Cluster
      §  Min size
      §  Leader node – receives queries, distributes work, manages connections – doesn’t store data
      §  Compute node – runs queries and does computations; can have up to 128 compute nodes
      §  Automatic backups – on by default; stored in S3; retention 0 – 35 days; 1 day (24 hours) by default
      §  Can copy backup into another region
      §  Delete a cluster – can elect to save a
      §  Single AZ
      §  Standard and custom metrics are available via CloudWatch
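The division of labour between the leader node and the compute nodes can be sketched as a scatter-gather aggregation. This is a simplified model, not Redshift's actual planner: the modulo-based distribution, the function names, and the sample data are assumptions, and the "nodes" here run one after another rather than in parallel.

```python
from collections import defaultdict

def distribute(rows, node_count, key_index=0):
    """Assign each row to a compute node using its integer distribution key."""
    slices = [[] for _ in range(node_count)]
    for row in rows:
        slices[row[key_index] % node_count].append(row)
    return slices

def compute_node_partial_sum(node_rows):
    """Work done on one compute node: aggregate only its local rows."""
    partial = defaultdict(float)
    for customer_id, amount in node_rows:
        partial[customer_id] += amount
    return dict(partial)

def leader_merge(partials):
    """Work done on the leader node: combine the partial results."""
    merged = defaultdict(float)
    for partial in partials:
        for customer_id, subtotal in partial.items():
            merged[customer_id] += subtotal
    return dict(merged)

# Hypothetical (customer_id, amount) rows
sales = [(101, 5.0), (102, 3.0), (101, 2.5), (103, 4.0), (102, 1.0)]
node_slices = distribute(sales, node_count=2)
totals = leader_merge(compute_node_partial_sum(s) for s in node_slices)
```

Because all rows for a given key land on the same node, each node can finish its part of the aggregation locally and the leader only has to merge small partial results.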

- Resilience and Durability
      §  All data is
      §  Original data
      §  Copy of all data within the actual cluster
      §  Copy of all data into S3
      §  Single node failure
      §  Redshift replaces the failed one
      §  the cluster is unavailable while the replacement is being brought up (minutes)
      §  most frequently used data is copied into the replacement node first to make it available ASAP
      §  single node clusters do not support replication – restore from the snapshot in S3 in case of failure

- Scaling
      §  Horizontal – new nodes get added
      §  Vertical – a whole new more powerful cluster is created; data is moved to the new cluster making the older one unavailable temporarily

- Billing
      §  Billed per hour per compute node; leader node is free
      §  S3 for backup storage
      §  Data transfer – charged for transfers in/out of Redshift across regions only; transfers within a single region are free
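The compute part of the billing rule above is simple arithmetic, sketched here. The hourly rate is a made-up placeholder, not a real AWS price, and the 730-hour month is an averaging assumption. Backup storage and cross-region transfer are billed separately and are not modelled.

```python
def estimate_monthly_compute_cost(compute_nodes, hourly_rate_per_node, hours=730):
    """Estimate compute charges: only compute nodes are billed, per node-hour.

    The leader node is not counted because it is free.
    """
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    return compute_nodes * hourly_rate_per_node * hours

# Hypothetical rate of 0.25 USD per node-hour for a 4-node cluster
monthly_cost = estimate_monthly_compute_cost(compute_nodes=4, hourly_rate_per_node=0.25)
```

With these placeholder numbers the estimate is 4 x 0.25 x 730 = 730.0 USD per month.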

Amazon Redshift Spectrum

- https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html

- Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets

Integration

- Redshift and DynamoDB
      §  Can copy data from DynamoDB into Redshift to perform data analysis
      §  Charges apply – the copy operation counts against the DynamoDB table's provisioned read capacity units
      §  DynamoDB is NoSQL, Redshift is SQL
- DynamoDB and EMR
      §  DynamoDB integrates with Apache Hive, a Redshift-like warehouse application by Apache that runs on EMR
      §  Hive can query DynamoDB tables using HiveQL
      §  EMR can copy data out from DynamoDB into S3 or HDFS
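The DynamoDB-to-Redshift copy mentioned above is done with Redshift's COPY command, whose READRATIO option sets the percentage of the DynamoDB table's provisioned read throughput the load may use. Here is a minimal sketch of a helper that builds such a statement. The helper, the table names, and the role ARN are hypothetical, and the helper does no escaping, so it should not be fed untrusted input.

```python
def build_dynamodb_copy(target_table, dynamodb_table, iam_role_arn, read_ratio):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    read_ratio is the percentage of the DynamoDB table's provisioned
    read throughput that the load is allowed to consume (1 to 200).
    """
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {target_table} FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"READRATIO {read_ratio};"
    )

# Hypothetical names, for illustration only
statement = build_dynamodb_copy(
    target_table="favorite_movies",
    dynamodb_table="Movies",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    read_ratio=50,
)
```

A lower READRATIO leaves more read capacity for the application that uses the DynamoDB table, at the cost of a slower load.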

Sunday, July 19, 2020
