Introduction to AWS Glue
Simple, Flexible, Cost-Effective ETL
Debanjan Saha, General Manager
Amazon Aurora, Amazon RDS MySQL, AWS Glue
August 14, 2017
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Today’s agenda
> Why did we build AWS Glue?
> Main components of AWS Glue
> What did we announce today?
Why would AWS get into the ETL space?
We have lots of ETL partners
[Screenshot: Amazon Redshift Partner Page for Data Integration, showing partner logos such as Fivetran]
The problem is
70% of ETL jobs are hand-coded,
with no use of ETL tools.
Actually…
It’s over 90% in the cloud
Why do we see so much hand-coding?
• Code is flexible
• Code is powerful
• You can unit test
• You can deploy with other code
• You know your dev tools
Hand-coding involves a lot of
undifferentiated heavy lifting…
Brittle. Error-prone. Laborious.
► As data formats change
► As you add sources
► As target schemas change
► As data volume grows
AWS Glue automates
the undifferentiated heavy lifting of ETL
Discover: Automatically discover and categorize your data, making it immediately searchable and queryable across data sources.
Develop: Generate code to clean, enrich, and reliably move data between various data sources; you can also use your favorite tools to build ETL jobs.
Deploy: Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to provision or manage.
AWS Glue: Components
Data Catalog
• Hive Metastore compatible, with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena and Amazon Redshift Spectrum
Job Authoring
• Auto-generates ETL code
• Built on open frameworks: Python and Spark
• Developer-centric: editing, debugging, sharing
Job Execution
• Runs jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring, and alerting
Common use cases for AWS Glue
Understand your data assets
Instantly query your data lake on Amazon S3
ETL data into your data warehouse
Build event-driven ETL pipelines
Main components of AWS Glue
AWS Glue Data Catalog
Discover and organize your data
Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, and Spark.
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
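Because the catalog exposes a programmatic API, you can browse it from code as well as from the console. A minimal sketch using boto3's Glue client; the database name "sales_db" is a hypothetical placeholder, not a value from the talk:

```python
# Sketch: list the tables registered in a Data Catalog database with boto3.
# The database name "sales_db" is a hypothetical example.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Page through every table in the database and print its columns.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="sales_db"):
    for table in page["TableList"]:
        cols = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
        print(table["Name"], cols)
```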
Data Catalog: Crawlers
Crawlers automatically build your Data Catalog and keep it in sync
• Automatically discovers new data and extracts schema definitions
• Detects schema changes and versions tables
• Detects Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, so you only pay while a crawler runs
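A sketch of defining and running a crawler over an S3 path with boto3; the crawler name, role ARN, database, and S3 path below are illustrative placeholders:

```python
# Sketch: create a crawler over an S3 path and start it (names are placeholders).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="flights-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="flights_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/flights/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run daily at 02:00 UTC
)

glue.start_crawler(Name="flights-crawler")
```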
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable.
Data Catalog: Table details
[Console screenshot: table details showing table properties, table schema, nested fields, and data statistics]
Data Catalog: Version control
[Console screenshots: compare schema versions side by side; list of table versions]
Data Catalog: Detecting partitions
[Diagram: an S3 bucket hierarchy (month=Nov → date=10 … date=15 → file 1 … file N) mapped to a table definition with columns month (str), date (str), col 1 (int), col 2 (float); schema similarity scores (sim=.93, sim=.99, sim=.95) are computed at each level of the hierarchy]
Glue estimates schema similarity among files at each level to handle semi-structured logs, schema evolution, and more.
Data Catalog: Automatic partition detection
[Console screenshot: automatically registered table partitions]
Automatically registers available partitions.
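Once registered, partitions are queryable through the catalog API. A sketch with boto3; "flights_db" and "flights" are hypothetical names:

```python
# Sketch: list the partitions a crawler registered for a table.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="flights_db", TableName="flights"):
    for partition in page["Partitions"]:
        # Values line up with the table's partition keys, e.g. ["Nov", "10"].
        print(partition["Values"], partition["StorageDescriptor"]["Location"])
```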
Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Bring existing code into AWS Glue
Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
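For reference, a generated script typically follows a skeleton roughly like this sketch; the database, table, mappings, and output path are illustrative, not values from the talk:

```python
# Rough shape of a Glue-generated PySpark script (names are illustrative).
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: a table discovered by a crawler.
src = glueContext.create_dynamic_frame.from_catalog(
    database="flights_db", table_name="flights")

# The customized mappings: (source col, source type, target col, target type).
mapped = ApplyMapping.apply(frame=src, mappings=[
    ("carrier", "string", "carrier", "string"),
    ("dep_delay", "string", "dep_delay", "int"),
])

# Target: write the result to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped, connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet")
job.commit()
```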
Job authoring: ETL code
Human-readable, editable, and portable PySpark code
Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
Collaborative: share code snippets via GitHub, reuse code across jobs
Job Authoring: Glue Dynamic Frames
Like Spark's DataFrames, but better for:
• Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent or changing data types with choice types, e.g. integer or string
• Automatically marks and separates error records
[Diagram: a dynamic frame schema with fields A, B1, B2, C, D[] and nested fields X, Y]
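A minimal sketch of loading a DynamicFrame and inspecting the schema it inferred on the fly; it assumes a GlueContext as in the script skeleton above, and the database/table names are illustrative:

```python
# Sketch: load a DynamicFrame and inspect its inferred schema.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db", table_name="apache_logs")

dyf.printSchema()   # schema was inferred on the fly; no upfront definition
print(dyf.count())  # record count, including inconsistently typed records
```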
Job Authoring: Glue transforms
Adaptive and flexible:
• ResolveChoice() resolves a choice type by projecting it, casting it, or separating it into columns
• ApplyMapping() maps fields to a target schema
[Diagram: ResolveChoice() applied to a choice-typed column B via project, cast, or separate-into-cols; ApplyMapping() remapping fields A, X, Y]
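A sketch of the two transforms, continuing from the DynamicFrame above; the field names are illustrative:

```python
# Sketch: resolve a choice type, then remap fields (names are illustrative).
from awsglue.transforms import ApplyMapping, ResolveChoice

# Column "b" arrived as both int and string across records (a choice type):
# cast every value to long; alternatives include "project:..." and "make_cols".
resolved = ResolveChoice.apply(frame=dyf, specs=[("b", "cast:long")])

# Rename/retype fields: (source name, source type, target name, target type).
mapped = ApplyMapping.apply(frame=resolved, mappings=[
    ("a", "string", "a", "string"),
    ("b", "long", "b_value", "long"),
])
```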
Job authoring: Relationalize() transform
[Diagram: Relationalize() turns a semi-structured schema (A, B, B, C{X, Y}, D[]) into a relational schema with columns A, B, B, C.X, C.Y, FK, plus a child table with PK, Offset, Value]
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
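A sketch of applying the transform; the staging path is a placeholder, and the table names in the comments are illustrative:

```python
# Sketch: flatten nested, semi-structured data into relational tables.
from awsglue.transforms import Relationalize

# Returns a collection of DynamicFrames: a root table plus one per array field.
tables = Relationalize.apply(
    frame=dyf, staging_path="s3://example-bucket/tmp/", name="root")

for table_name in tables.keys():
    print(table_name)          # e.g. "root", plus child tables for arrays
root = tables.select("root")   # pick an individual flattened table
```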
Job authoring: Glue transformations
• Prebuilt transformations: click and add to your job with simple configuration
• Spigot writes sample data from a DynamicFrame to S3 in JSON format
• Expanding: more transformations to come
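A sketch of Spigot in code; the S3 path and options are illustrative:

```python
# Sketch: Spigot writes a sample of a DynamicFrame to S3 as JSON and
# passes the frame through unchanged, which makes it handy for debugging.
from awsglue.transforms import Spigot

sampled = Spigot.apply(
    frame=dyf,
    path="s3://example-bucket/spigot/",
    options={"topk": 10})  # write the first 10 records for inspection
```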
Job authoring: Write your own scripts
• Convert to a Spark DataFrame for complex SQL-based ETL
• Convert back to a Glue DynamicFrame for semi-structured processing and AWS Glue connectors
• Import custom libraries required by your code
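A sketch of hopping between the two representations; it assumes the GlueContext and DynamicFrame from the earlier sketches, and the table/column names are illustrative:

```python
# Sketch: DynamicFrame -> Spark DataFrame for SQL, then back again.
from awsglue.dynamicframe import DynamicFrame

spark = glueContext.spark_session

df = dyf.toDF()                       # DynamicFrame -> Spark DataFrame
df.createOrReplaceTempView("events")
top = spark.sql(
    "SELECT carrier, count(*) AS n FROM events GROUP BY carrier")

# Spark DataFrame -> DynamicFrame, to continue with Glue transforms/writers.
dyf2 = DynamicFrame.fromDF(top, glueContext, "top_carriers")
```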
Job authoring: Developer endpoints
[Diagram: a remote interpreter in your notebook or IDE connects to an interpreter server in the Glue Spark environment]
An environment to iteratively develop and test ETL code.
Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
When you are satisfied with the results, you can create an ETL job that runs your code.
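A hypothetical sketch of provisioning a development endpoint with boto3 so a notebook or IDE can attach to it; the endpoint name, role ARN, and key path are placeholders:

```python
# Sketch: provision a development endpoint (names/paths are placeholders).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_dev_endpoint(
    EndpointName="glue-dev",
    RoleArn="arn:aws:iam::123456789012:role/GlueDevEndpointRole",
    PublicKey=open("/home/me/.ssh/id_rsa.pub").read(),  # for SSH access
    NumberOfNodes=5,  # development endpoints are provisioned with DPUs
)
```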
Job Authoring: Leveraging the community
No need to start from scratch.
Use Glue samples stored in GitHub to share, reuse, and
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive Metastore data
into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
Glue’s Python ETL library
Download Glue’s Python ETL library to start developing code
in your IDE: https://github.com/awslabs/aws-glue-libs
Orchestration and resource management
Fully managed, serverless job execution
Job execution: Scheduling and monitoring
• Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
• Multiple triggering mechanisms:
  ► Schedule-based: e.g., time of day
  ► Event-based: e.g., job completion
  ► On-demand: e.g., AWS Lambda
• More coming soon: Data Catalog based events, S3 notifications, and Amazon CloudWatch events
• Logs and alerts are available in Amazon CloudWatch
[Diagram: a weekly schedule triggers "Sales: Revenue by customer segment" and "Marketing: Ad-spend by customer segment" jobs; an event-based Lambda trigger feeds sales data into a "Central: ROI by customer segment" job]
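A sketch of the two trigger types with boto3; the trigger names, job names, and cron expression are illustrative, not the jobs from the diagram:

```python
# Sketch: schedule-based and event-based triggers (names are illustrative).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Schedule-based: run the sales job weekly.
glue.create_trigger(
    Name="weekly-sales",
    Type="SCHEDULED",
    Schedule="cron(0 4 ? * MON *)",
    Actions=[{"JobName": "sales-revenue-by-segment"}],
)

# Event-based: run the central ROI job when the upstream job succeeds.
glue.create_trigger(
    Name="roi-after-upstream",
    Type="CONDITIONAL",
    Predicate={"Conditions": [
        {"JobName": "sales-revenue-by-segment", "State": "SUCCEEDED",
         "LogicalOperator": "EQUALS"},
    ]},
    Actions=[{"JobName": "central-roi-by-segment"}],
)
```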
Job execution: Job bookmarks
Glue keeps track of data that has already been processed by a previous run of an ETL job. This persisted state information is called a bookmark.

Option    Behavior
Enable    Pick up from where you left off
Disable   Ignore and process the entire dataset every time
Pause     Temporarily disable advancing the bookmark

For example, you get new files every day in your S3 bucket. By default, AWS Glue keeps track of which files have been successfully processed by the job to prevent data duplication.
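The bookmark behavior is controlled per run through the --job-bookmark-option job argument (valid values: job-bookmark-enable, job-bookmark-disable, job-bookmark-pause). A sketch; the job name is illustrative:

```python
# Sketch: start a run with bookmarks enabled (job name is illustrative).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.start_job_run(
    JobName="ad-spend-by-segment",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```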
Job execution: Serverless
• There is no need to provision, configure, or manage servers
• Auto-configures VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• Automatically scales resources (on the post-GA roadmap)
• You pay only for the resources you consume, while you consume them
[Diagram: compute instances running inside customer VPCs]
AWS Glue pricing examples
AWS Glue pricing
ETL jobs, development endpoints, and crawlers
Compute-based usage: $0.44 per DPU-Hour, billed in 1-minute increments with a 10-minute minimum
A single DPU = 4 vCPUs and 16 GB of memory
Data Catalog Storage:
Free for the first million objects stored
$1 per 100,000 objects per month, stored above 1M
Data Catalog Requests:
Free for the first million requests per month
$1 per million requests above 1M
Glue ETL pricing example
Consider an ETL job that ran for 10 minutes on a 6 DPU environment.
The price of 1 DPU-Hour in US East (N. Virginia) is $0.44.
The cost for this job run = 6 DPUs * (10/60) hour * $0.44 per DPU-Hour or $0.44.
Now consider you provision a development endpoint to debug the code for this job
and keep the development endpoint active for 24 min.
Each development endpoint is provisioned with 5 DPUs.
The cost to use the development endpoint = 5 DPUs * (24/60) hour * $0.44 per DPU-Hour, or $0.88.
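The same arithmetic, as a small sketch that also applies the 1-minute increments and 10-minute minimum from the pricing slide:

```python
# Sketch: reproduce the DPU-hour arithmetic from the pricing examples.
from math import ceil

DPU_HOUR_USD = 0.44

def job_cost(dpus: int, minutes: float) -> float:
    """Cost of a run: billed per minute with a 10-minute minimum."""
    billed_minutes = max(10, ceil(minutes))
    return dpus * (billed_minutes / 60) * DPU_HOUR_USD

print(job_cost(6, 10))   # ETL job:      6 DPUs * 10 min -> $0.44
print(job_cost(5, 24))   # dev endpoint: 5 DPUs * 24 min -> $0.88
print(job_cost(2, 30))   # crawler:      2 DPUs * 30 min -> $0.44
```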
Glue Data Catalog pricing example
Let’s consider that you store 1 million tables in your Data Catalog in a given
month and make 1 million requests to access these tables.
You pay $0 for using the Data Catalog; you are covered under the Data Catalog free tier.
Now consider your requests double to 2 million requests.
You will only be paying for the one million requests above the free tier, which is $1.
Now suppose you use crawlers to find new tables, and they run for 30 minutes and use 2 DPUs.
You will pay for 2 DPUs * (30/60) hour * $0.44 per DPU-Hour or $0.44.
Your total monthly bill = $0 + $1 + $0.44 or $1.44
AWS Glue regional availability
AWS Glue regional availability plan
Planned schedule   Regions
At launch          US East (N. Virginia)
Q3 2017            US East (Ohio), US West (Oregon)
Q4 2017            EU (Ireland), Asia Pacific (Tokyo), Asia Pacific (Sydney)
2018               Rest of the public regions
Q&A
Thank you!