- https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
- AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.
- Cleans, enriches and categorizes data
- Automatic crawling – populates the central data repository, Glue Data Catalog
- Job Authoring – engine that automatically generates Python or Scala code to transform source data into the target schema. Jobs run on a managed Apache Spark environment (a minimal job script sketch follows this list).
- Job Scheduler – flexible scheduler; jobs can run on a schedule, on demand, or be triggered by the completion of other jobs
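The sketch below shows roughly what a Glue-authored PySpark job looks like: read a source table from the Data Catalog, map it to the target schema, and write the result to S3. The database, table, column, and bucket names are hypothetical placeholders.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap: resolve job arguments and initialize the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table as a DynamicFrame using Data Catalog metadata.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Map source columns onto the target schema (source name, source type,
# target name, target type).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")],
)

# Write the transformed data to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)

job.commit()
```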
- Glue Data Catalog
§ Contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Similar to Apache Hive Metastore
§ Out of the box integration with
§ Athena
§ EMR
§ Redshift
§ Redshift Spectrum
§ S3
§ Any application compatible with the Apache Hive Metastore
§ An index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store. Typically, you run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata tables into your Data Catalog.
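Since the Data Catalog is just metadata, it can be browsed programmatically. A minimal sketch with boto3, listing the tables in a catalog database and printing one table's schema; the database and table names ("sales_db", "raw_orders") are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# List tables registered in one Data Catalog database.
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"])

# Inspect the columns (name and type) of a single table.
table = glue.get_table(DatabaseName="sales_db", Name="raw_orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```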
- Crawler
§ data cataloging program; automatically scans data, identifies format, derives schema
§ requires access to the data source (e.g. an IAM role or connection credentials)
§ periodically scans data, detects new data / changes to existing data
§ automatically adds new tables, new partitions, and new table definitions (see the sketch below)
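A minimal sketch of defining and starting a crawler with boto3; the crawler name, IAM role ARN, database, S3 path, and cron schedule are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# The crawler needs an IAM role with read access to the S3 data source.
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
    # Run nightly so new partitions and schema changes are picked up.
    Schedule="cron(0 2 * * ? *)",
)

# Kick off an immediate run instead of waiting for the schedule.
glue.start_crawler(Name="orders-crawler")
```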