Sunday, July 19, 2020

EMR – Elastic Map Reduce

-         https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html
-         Amazon EMR is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses Hadoop processing combined with several AWS products to do such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing.
-         In a nutshell – Hadoop on EC2 with S3; big data processing
-         Integrates with:
          §  EC2
          §  S3
          §  VPC
          §  CloudWatch
          §  CloudTrail
          §  IAM
          §  Data Pipeline
-         The cluster can be scaled up or down at any time (e.g., by changing the number of EC2 instances)
-         HDFS – Hadoop Distributed File System; open source, written in Java
-         EMRFS – AWS proprietary file system; can be used on EMR in place of HDFS
          §  The advantage is that EMRFS is integrated with S3. When a cluster that uses HDFS is shut down, the data in HDFS is lost. EMRFS lets the data be stored in S3 instead, so it outlives the cluster.
-         MapReduce – technique of splitting a data pool into smaller chunks and processing the chunks in parallel
-         Apache Spark – compute engine for Hadoop data
-         HBase – distributed Hadoop database
-         Apache Hive – open source Hadoop data warehouse infrastructure. Supports HiveQL – SQL-like querying
-         Pig – open source Hadoop analytics package. Supports Pig Latin – a SQL-like scripting language
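
The MapReduce bullet above can be illustrated with a small word count. This is only a sketch of the idea, not how Hadoop or EMR implements it: the function names are made up for this example, and local threads stand in for the separate cluster nodes that would process each chunk.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split the data pool into smaller chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(lines):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    # Threads are a stand-in here; on a real cluster each chunk
    # would be handled by a task running on a different node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"]))
```

The map step runs independently per chunk, which is what makes it easy to spread across many machines; only the shuffle and reduce steps need to see results from all chunks.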

EMR Cluster
-         An EMR cluster is AZ specific – for performance reasons. By default AWS picks the AZ, but you can specify one manually
-         Master node – cluster manager; manages task distribution and monitors node health. A cluster can consist of a single master node and no other nodes
-         Core node – runs tasks, stores data in HDFS
-         Task node – runs tasks only, does NOT store data in HDFS. Sends processed data to Core nodes to write into HDFS. Optional. Can use Spot instances for this, since the loss of an instance wouldn't cause data loss here.
-         EMRFS – AWS proprietary file system; can be used on EMR in place of HDFS
          §  The advantage is that EMRFS is integrated with S3. When a cluster that uses HDFS is shut down, the data in HDFS is lost. EMRFS lets the data be stored in S3 instead, so it outlives the cluster
          §  Compute and storage can be scaled up and down separately, since EMRFS decouples the two by keeping the data in S3
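
As a rough sketch of how the three node roles above might be expressed in code, the helper below builds a plain dictionary in the shape that, to my understanding, boto3's EMR `run_job_flow` call accepts for its `Instances` parameter. The helper name, the `m5.xlarge` instance type, and the counts are hypothetical choices for illustration, and nothing here contacts AWS.

```python
def build_instances_config(core_count=0, task_count=0,
                           instance_type="m5.xlarge",
                           availability_zone=None):
    """Describe an EMR cluster layout: one master, optional core and task nodes."""
    # A cluster always has a master node; it can be the only node.
    groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_count > 0:
        # Core nodes run tasks and hold HDFS data, so keep them On-Demand.
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing a Spot instance loses no data.
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    config = {"InstanceGroups": groups}
    if availability_zone is not None:
        # Omit Placement to let AWS pick the AZ (the default behaviour).
        config["Placement"] = {"AvailabilityZone": availability_zone}
    return config


if __name__ == "__main__":
    print(build_instances_config(core_count=2, task_count=4,
                                 availability_zone="us-east-1a"))
```

In practice the returned dictionary would be passed as `Instances=` to `run_job_flow` together with a cluster name, release label, and IAM roles, all of which are left out of this sketch.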
