AWS Glue 101
Introduction
AWS Glue is a serverless data integration service that lets you perform extract, transform, and load (ETL) tasks at scale. It provides a solution for data processing, data cataloging, and data quality monitoring, and it is especially useful for Machine Learning (ML) workloads. With AWS Glue, you can quickly build an end-to-end ETL pipeline that is triggered by events, runs on a schedule, or is started manually, and you can flexibly scale resources up or down on demand.
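To give a quick, hypothetical taste of how a pipeline can be kicked off, the sketch below uses boto3 to start a Glue job run on demand and to attach a nightly schedule to the same job. The job and trigger names are made up for illustration.

```python
import boto3

glue = boto3.client("glue")

# Start a job run manually, on demand (the job name is hypothetical).
glue.start_job_run(JobName="nightly-sales-etl")

# Attach a schedule so the same job also runs every night at 02:00 UTC.
glue.create_trigger(
    Name="nightly-sales-schedule",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "nightly-sales-etl"}],
    StartOnCreation=True,
)
```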
In this article, we will cover the core concepts and the various pieces of AWS Glue so that you can get familiar with the service.
Why should you use AWS Glue?
- It’s serverless: AWS provisions and manages all of the infrastructure, configuration, and scaling of resources for you.
- Pay for only what you use.
- Run memory-heavy workloads with ease on a distributed computing architecture.
- Built-in integrations with various data sources and sinks such as S3, DynamoDB, and Athena, plus connectors for external services such as GCP BigQuery. If your data already lives in the AWS cloud, processing it with AWS Glue is a wise choice for data colocation and for reducing data transfer costs.
- Code generation with the Visual Editor for trivial workloads.
- Run existing Spark code with minimal changes.
How do I run my data pipeline on AWS Glue?
AWS Glue offers a variety of data processing engines to choose from. The most popular of them is Apache Spark, though Ray has been gaining prominence recently. In addition, AWS Glue supports Jupyter notebooks, Python shell jobs, and a visual editor that handles trivial workloads without you ever writing any code.
Let’s learn more about the engines in depth.
Apache Spark
Apache Spark is an analytical processing engine for the distributed processing of big data. It can process data both in batches and as real-time streams, which makes it suitable for running analytics, data science, and Machine Learning workloads at scale.
It is based on the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel. The Apache Spark project was originally written in Scala, a JVM-based programming language, but it also provides a Python interface to the Spark engine called PySpark. PySpark lets you build the transformation steps in Python and submits them to the Spark runtime for execution. It also offers a Pandas-like API for those coming from Pandas. AWS Glue provides its own set of proprietary abstractions built on top of the PySpark API, such as the DynamicFrame, which are better optimized for accessing other AWS data services.
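To make this concrete, here is a minimal sketch of what a Glue PySpark script typically looks like: it reads a table from the Glue Data Catalog into a DynamicFrame and converts it into a plain Spark DataFrame. The database and table names are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job bootstrap: resolve job arguments and build the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalog table into a Glue DynamicFrame (database/table are placeholders).
orders_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Drop down to a regular PySpark DataFrame when you need the full Spark API.
orders_df = orders_dyf.toDF()
orders_df.printSchema()

job.commit()
```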
At its core, the Spark engine is driven by two fundamental operations on RDDs:
- Transformations, which create a new dataset from an existing one,
- Actions, which return a value to the driver program after running a computation on the dataset
Transformations can be thought of as query-builder methods that build up a DAG (Directed Acyclic Graph), whereas actions are the terminal points at which the Spark engine actually executes that DAG. The engine is smart enough to optimize the DAG before execution, so optimal performance is achieved.
To illustrate this, I am quoting an example directly from the Spark docs:
To illustrate RDD basics, consider the simple program below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: `lines` is merely a pointer to the file. The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths` is not immediately computed, due to laziness. Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
If we also wanted to use `lineLengths` again later, we could add `lineLengths.persist()` before the `reduce`, which would cause `lineLengths` to be saved in memory after the first time it is computed.
The Spark engine consists of these primary components:
- The driver node
- A set of Worker nodes forming a cluster
The driver node is responsible for orchestrating the worker nodes in the cluster, executing the DAG, and moving data in and out of the cluster as well as between the worker nodes (aka shuffling). This distribution of tasks across multiple worker nodes is what enables effortless distributed computing.
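As a small illustration, the following sketch (with made-up sales data) runs a grouped aggregation: the groupBy forces rows with the same key onto the same worker, a shuffle that the driver coordinates, before the per-group sums are computed in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical sales records, spread across the cluster's worker nodes.
df = spark.createDataFrame(
    [("US", 120.0), ("DE", 80.5), ("US", 42.0), ("IN", 99.9)],
    ["country", "amount"],
)

# groupBy triggers a shuffle: rows with the same country are moved to the
# same worker before the aggregation is computed in parallel.
totals = df.groupBy("country").sum("amount")

# show() is an action, so only now does the driver execute the plan.
totals.show()
```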
If you want to try hands-on with deploying and running a PySpark job on AWS Glue, check out our tutorial at AWS Glue Tutorial.
Ray
Ray is a new addition to the list of supported data engines in AWS Glue. It is a unified compute framework that makes it easy to scale AI and Python workloads. Ray comes with a rich set of libraries and integrations, which makes it a suitable candidate for running data pipelines effortlessly.
Ray also comes with a set of Pandas-compatible APIs. An advantage of Ray over Spark is that Ray is Python-native: there are no heavy serialization costs between the Python and JVM runtimes, as there are with PySpark, which often translates into better performance for Python workloads. It also makes debugging easier during development, since the logs do not contain stack traces from a JVM runtime, unlike PySpark.
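To give a flavor of the programming model, here is a minimal, generic Ray sketch (not Glue-specific) that fans a Python function out across a cluster; on AWS Glue for Ray, the cluster itself is provisioned for you.

```python
import ray

# Locally, ray.init() starts a small cluster on your machine; on AWS Glue
# for Ray, the service provisions the cluster for the job.
ray.init()

@ray.remote
def squared(n: int) -> int:
    return n * n

# Fan the work out across the cluster and gather the results on the driver.
futures = [squared.remote(i) for i in range(10)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```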
Jupyter Notebooks, Python Shell jobs, and the Visual Editor
AWS Glue also lets you run jobs as notebooks for interactive development sessions, as Python shell jobs for legacy or custom code, or through the Visual Editor, AWS’ no-code solution for building ETL pipelines. Behind the scenes, the Visual Editor generates a Python script that uses AWS’ proprietary abstractions over DynamicFrames.
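For a sense of scale, a Python shell job is just a single-node Python script. The hypothetical sketch below pulls a small CSV from S3 with boto3 and summarizes it with Pandas, the kind of lightweight work that does not need a Spark cluster (the bucket, key, and column names are made up).

```python
from io import BytesIO

import boto3
import pandas as pd

# Fetch a small file from S3 (bucket and key are hypothetical).
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-raw-data-bucket", Key="reports/2024-01.csv")

# Light, single-node transformation with Pandas; no Spark cluster required.
df = pd.read_csv(BytesIO(obj["Body"].read()))
print(df.groupby("region")["revenue"].sum())
```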
What do I do while I wait for my jobs to complete?
While you wait for your jobs to complete, you can watch an endless stream of logs and graphs related to their execution. Logging and monitoring are crucial for any enterprise. AWS Glue provides job metrics that we can monitor, and it integrates with CloudWatch to provide near real-time logging. This helps us debug, analyze the root cause of failures, and track the impact on performance over time. It is especially useful when jobs are automated and you want to make sure that a failure doesn’t blow up exponentially and cost a fortune.
The AWS Glue console provides key metrics such as the success rate of jobs over time and the number of DPU-hours consumed so far. It also shows the execution time of each job and the time spent starting up clusters, along with various performance-related metrics such as ETL data movement, data shuffle across executors, memory profiles of the drivers and executors, CPU load, and other job execution metrics.
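If you prefer to poll this information programmatically rather than in the console, a sketch using boto3 might look like the following. The job name is hypothetical, and some fields (such as DPU seconds) are only reported for certain job types.

```python
import boto3

glue = boto3.client("glue")

# Fetch the most recent runs of a job (the job name is hypothetical).
response = glue.get_job_runs(JobName="nightly-sales-etl", MaxResults=5)

for run in response["JobRuns"]:
    print(
        run["Id"],
        run["JobRunState"],        # e.g. SUCCEEDED, FAILED, RUNNING
        run.get("ExecutionTime"),  # wall-clock seconds, once finished
        run.get("DPUSeconds"),     # reported for some job types only
    )
```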
AWS Glue sounds too good to be true. Does it cost a fortune?
No, while there are many pieces involved here, the pricing model of AWS Glue for ETL jobs is rather simple. The service tracks the number of DPUs (Data Processing Units) used by each job, where a DPU is simply a unit associated with the different machine configurations that AWS Glue provides. It then factors in the time each job takes, expressing the cost in DPU-hours. This means you pay only for what you use and for how long the ETL job runs. To calculate the absolute price of each job, you multiply the DPU-hours by the price per DPU-hour.
Quoting an example from AWS Glue pricing page:
ETL job: Consider an AWS Glue Apache Spark job that runs for 15 minutes and uses 6 DPU. The price of 1 DPU-Hour is $0.44. Since your job ran for 1/4th of an hour and used 6 DPUs, AWS will bill you 6 DPU * 1/4 hour * $0.44, or $0.66.
Note that this example calculation does not account for the additional charges associated with transferring data in and out. Be mindful of how much data you bring in to process!
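For convenience, the same arithmetic can be wrapped in a small helper. The $0.44 rate comes from the example above; the actual price per DPU-hour varies by region and job type.

```python
def glue_job_cost(dpus: int, runtime_minutes: float, price_per_dpu_hour: float = 0.44) -> float:
    """Estimate the ETL cost of one Glue job run (data transfer charges not included)."""
    return dpus * (runtime_minutes / 60) * price_per_dpu_hour

# The quoted example: 6 DPUs for 15 minutes at $0.44 per DPU-hour.
print(round(glue_job_cost(6, 15), 2))  # 0.66
```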
Conclusion
AWS Glue is a useful service to keep in mind when processing big data on the AWS cloud. It offers handy tools and services, along with monitoring and logging capabilities, that aid the development process. PySpark provides rich APIs for developing a scalable pipeline effortlessly, with code that makes the intent clear just from reading it, unlike other tools that require you to shift your mindset toward distributed computing. Additionally, with the ever-growing PySpark ecosystem, you aren’t locked in to a single vendor and can liberally move between different clouds. Moreover, utilizing a serverless service with a pay-for-what-you-use pricing model can help cut costs in time, development, and resources.