Data Engineering with AWS: Acquire the skills to design and build AWS-based data transformation pipelines like a pro, 2nd Edition, by Gareth Eagar – Ebook PDF, instant download/delivery. ISBN: 1804613139, 9781804613139
The full ebook of Data Engineering with AWS, 2nd Edition, is delivered for download immediately after payment.
Product details:
ISBN-10 : 1804613139
ISBN-13 : 9781804613139
Author : Gareth Eagar
This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms that covers implementing a data mesh approach, adopting open table formats (such as Apache Iceberg), and using DataOps for automation and observability.

You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You'll learn how to ensure strong data governance, how to populate data marts and data warehouses, and how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad hoc SQL queries and creating visualizations. You'll then explore how machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS. By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!
Data Engineering with AWS: Acquire the skills to design and build AWS-based data transformation pipelines like a pro, 2nd Edition – Table of contents:
Section 1: AWS Data Engineering Concepts and Trends
An Introduction to Data Engineering
Technical requirements
The rise of big data as a corporate asset
The challenges of ever-growing datasets
The role of the data engineer as a big data enabler
Understanding the role of the data engineer
Understanding the role of the data scientist
Understanding the role of the data analyst
Understanding other common data-related roles
The benefits of the cloud when building big data analytic solutions
Hands-on – creating and accessing your AWS account
Creating a new AWS account
Accessing your AWS account
Summary
Data Management Architectures for Analytics
Technical requirements
The evolution of data management for analytics
Databases and data warehouses
Dealing with big, unstructured data
Cloud-based solutions for big data analytics
A deeper dive into data warehouse concepts and architecture
Dimensional modeling in data warehouses
Understanding the role of data marts
Distributed storage and massively parallel processing
Columnar data storage and efficient data compression
Feeding data into the warehouse – ETL and ELT pipelines
An overview of data lake architecture and concepts
Data lake logical architecture
The storage layer and storage zones
Catalog and search layers
Ingestion layer
The processing layer
The consumption layer
Data lake architecture summary
Bringing together the best of data warehouses and data lakes
The data lakehouse approach
New data lake table formats
Federated queries across database engines
Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets
Accessing the AWS CLI
Using AWS CloudShell to access the CLI
Creating new Amazon S3 buckets
Summary
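To give a flavor of the "Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets" exercise listed above, here is a minimal sketch of the equivalent operation in Python with boto3. The chapter itself uses the AWS CLI from CloudShell, and the bucket name and Region below are hypothetical placeholders rather than values from the book:

import boto3

# Hypothetical bucket name and Region for illustration only
BUCKET_NAME = "my-example-landing-zone-bucket"
REGION = "us-east-2"

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, the bucket's Region must be passed explicitly
s3.create_bucket(
    Bucket=BUCKET_NAME,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
print(f"Created bucket: {BUCKET_NAME}")

Note that S3 bucket names are globally unique, so a real run needs a name that is not already taken.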
The AWS Data Engineer’s Toolkit
Technical requirements
An overview of AWS services for ingesting data
AWS Database Migration Service (DMS)
Amazon Kinesis for streaming data ingestion
Amazon Kinesis Agent
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Data Analytics
Amazon Kinesis Video Streams
Amazon MSK for streaming data ingestion
Amazon AppFlow for ingesting data from SaaS services
AWS Transfer Family for ingestion using FTP/SFTP protocols
AWS DataSync for ingesting from on-premises and multi-cloud storage services
The AWS Snow family of devices for large data transfers
AWS Glue for data ingestion
An overview of AWS services for transforming data
AWS Lambda for light transformations
AWS Glue for serverless data processing
Serverless ETL processing
AWS Glue DataBrew
AWS Glue Data Catalog
AWS Glue crawlers
Amazon EMR for Hadoop ecosystem processing
An overview of AWS services for orchestrating big data pipelines
AWS Glue workflows for orchestrating Glue components
AWS Step Functions for complex workflows
Amazon Managed Workflows for Apache Airflow (MWAA)
An overview of AWS services for consuming data
Amazon Athena for SQL queries in the data lake
Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
Overview of Amazon QuickSight for visualizing data
Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
Creating a Lambda layer containing the AWS SDK for pandas library
Creating an IAM policy and role for your Lambda function
Creating a Lambda function
Configuring our Lambda function to be triggered by an S3 upload
Summary
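As a rough illustration of the "Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket" exercise listed above, the sketch below shows the general shape of a Python Lambda handler that reads the bucket and object key from an S3 event notification. This is an illustrative assumption, not the book's code; the book's version also uses the AWS SDK for pandas layer to transform the file:

import json
import urllib.parse

def lambda_handler(event, context):
    # An S3 event notification can carry one or more records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+', and so on)
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New object uploaded: s3://{bucket}/{key}")
        # Transformation of the new object (for example, CSV to Parquet) would go here
    return {"statusCode": 200, "body": json.dumps("S3 event processed")}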
Data Governance, Security, and Cataloging
Technical requirements
The many different aspects of data governance
Data security, access, and privacy
Common data regulatory requirements
Core data protection concepts
Personally identifiable information (PII)
Personal data
Encryption
Anonymized data
Pseudonymized data/tokenization
Authentication
Authorization
Putting these concepts together
Data quality, data profiling, and data lineage
Data quality
Data profiling
Data lineage
Business and technical data catalogs
Implementing a data catalog to avoid creating a data swamp
Business data catalogs
Technical data catalogs
AWS services that help with data governance
The AWS Glue/Lake Formation technical data catalog
AWS Glue DataBrew for profiling datasets
AWS Glue Data Quality
AWS Key Management Service (KMS) for data encryption
Amazon Macie for detecting PII data in Amazon S3 objects
The AWS Glue Studio Detect PII transform for detecting PII data in datasets
Amazon GuardDuty for detecting threats in an AWS account
AWS Identity and Access Management (IAM) service
Using AWS Lake Formation to manage data lake access
Permissions management before Lake Formation
Permissions management using AWS Lake Formation
Hands-on – configuring Lake Formation permissions
Creating a new user with IAM permissions
Transitioning to managing fine-grained permissions with AWS Lake Formation
Activating Lake Formation permissions for a database and table
Granting Lake Formation permissions
Summary
Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations
Architecting Data Engineering Pipelines
Technical requirements
Approaching the data pipeline architecture
Architecting houses and pipelines
Whiteboarding as an information-gathering tool
Conducting a whiteboarding session
Identifying data consumers and understanding their requirements
Identifying data sources and ingesting data
Identifying data transformations and optimizations
File format optimizations
Data standardization
Data quality checks
Data partitioning
Data denormalization
Data cataloging
Whiteboarding data transformation
Loading data into data marts
Wrapping up the whiteboarding session
Hands-on – architecting a sample pipeline
Detailed notes from the project “Bright Light” whiteboarding meeting of GP Widgets, Inc
Meeting notes
Summary
Ingesting Batch and Streaming Data
Technical requirements
Understanding data sources
Data variety
Structured data
Semi-structured data
Unstructured data
Data volume
Data velocity
Data veracity
Data value
Questions to ask
Ingesting data from a relational database
AWS DMS
AWS Glue
Full one-off loads from one or more tables
Initial full loads from a table, and subsequent loads of new records
Creating AWS Glue jobs with AWS Lake Formation
Other ways to ingest data from a database
Deciding on the best approach to ingesting from a database
The size of the database
Database load
Data ingestion frequency
Technical requirements and compatibility
Ingesting streaming data
Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
Serverless services versus managed services
Open-source flexibility versus proprietary software with strong AWS integration
At-least-once messaging versus exactly-once messaging
A single processing engine versus niche tools
Deciding on a streaming ingestion tool
Hands-on – ingesting data with AWS DMS
Deploying MySQL and an EC2 data loader via CloudFormation
Creating an IAM policy and role for DMS
Configuring DMS settings and performing a full load from MySQL to S3
Querying data with Amazon Athena
Hands-on – ingesting streaming data
Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
Configuring Amazon Kinesis Data Generator (KDG)
Adding newly ingested data to the Glue Data Catalog
Querying the data with Amazon Athena
Summary
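For the "Querying data with Amazon Athena" steps in the hands-on sections listed above, a minimal Python/boto3 sketch of running an Athena query is shown below. The database, table, and results bucket names are hypothetical placeholders, and the console-based flow in the chapter does not require any code:

import time
import boto3

athena = boto3.client("athena")

# Hypothetical database, table, and query-results location
response = athena.start_query_execution(
    QueryString="SELECT * FROM streaming_events LIMIT 10",
    QueryExecutionContext={"Database": "my_ingestion_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"},
)
query_id = response["QueryExecutionId"]

# Athena queries run asynchronously, so poll until the query completes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Query returned {len(rows) - 1} data rows")  # the first row is the header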
Transforming Data to Optimize for Analytics
Technical requirements
Overview of how transformations can create value
Cooking, baking, and data transformations
Transformations as part of a pipeline
Types of data transformation tools
Apache Spark
Hadoop and MapReduce
SQL
GUI-based tools
Common data preparation transformations
Protecting PII data
Optimizing the file format
Optimizing with data partitioning
Data cleansing
Common business use case transformations
Data denormalization
Enriching data
Pre-aggregating data
Extracting metadata from unstructured data
Working with Change Data Capture (CDC) data
Traditional approaches – data upserts and SQL views
Modern approaches – Open Table Formats (OTFs)
Apache Iceberg
Apache Hudi
Databricks Delta Lake
Hands-on – joining datasets with AWS Glue Studio
Creating a new data lake zone – the curated zone
Creating a new IAM role for the Glue job
Configuring a denormalization transform using AWS Glue Studio
Finalizing the denormalization transform job to write to S3
Creating a transform job to join streaming and film data using AWS Glue Studio
Summary
Identifying and Enabling Data Consumers
Technical requirements
Understanding the impact of data democratization
A growing variety of data consumers
How a data mesh helps data consumers
Meeting the needs of business users with data visualization
AWS tools for business users
A quick overview of Amazon QuickSight
Meeting the needs of data analysts with structured reporting
AWS tools for data analysts
Amazon Athena
AWS Glue DataBrew
Running Python or R in AWS
Meeting the needs of data scientists and ML models
AWS tools used by data scientists to work with data
SageMaker Ground Truth
SageMaker Data Wrangler
SageMaker Clarify
Hands-on – creating data transformations with AWS Glue DataBrew
Configuring new datasets for AWS Glue DataBrew
Creating a new Glue DataBrew project
Building your Glue DataBrew recipe
Creating a Glue DataBrew job
Summary
A Deeper Dive into Data Marts and Amazon Redshift
Technical requirements
Extending analytics with data warehouses/data marts
Cold and warm data
Cold data
Warm data
Amazon S3 storage classes
Hot data
What not to do – anti-patterns for a data warehouse
Using a data warehouse as a transactional datastore
Using a data warehouse as a data lake
Storing unstructured data
Redshift architecture review and storage deep dive
Data distribution across slices
Redshift Zone Maps and sorting data
Designing a high-performance data warehouse
Provisioned versus Redshift Serverless clusters
Selecting the optimal Redshift node type for provisioned clusters
Selecting the optimal table distribution style and sort key
Selecting the right data type for columns
Character types
Numeric types
Datetime types
Boolean type
HLLSKETCH type
SUPER type
Selecting the optimal table type
Local Redshift tables
External tables for querying data in Amazon S3 with Redshift Spectrum
Temporary staging tables for loading data into Redshift
Data caching using Redshift materialized views
Moving data between a data lake and Redshift
Optimizing data ingestion in Redshift
Automating data loads from Amazon S3 into Redshift
Exporting data from Redshift to the data lake
Exploring advanced Redshift features
Data sharing between Redshift clusters
Machine learning capabilities in Amazon Redshift
Running Redshift clusters across multiple Availability Zones
Redshift Dynamic Data Masking
Zero-ETL between Amazon Aurora and Amazon Redshift
Resizing a Redshift cluster
Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries
Uploading our sample data to Amazon S3
IAM roles for Redshift
Creating a Redshift cluster
Querying data in the sample database
Using Redshift Spectrum to directly query data in the data lake
Summary
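As a hedged sketch of how the "Querying data in the sample database" and Redshift Spectrum steps listed above could be driven programmatically, the example below uses the Redshift Data API from Python. The workgroup, database, schema, and table names are hypothetical; the chapter itself uses the Redshift query editor in the console:

import time
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical Redshift Serverless workgroup and external (Spectrum) table
response = redshift_data.execute_statement(
    WorkgroupName="my-serverless-workgroup",
    Database="dev",
    Sql="SELECT COUNT(*) FROM spectrum_schema.sales;",
)
statement_id = response["Id"]

# The Data API is asynchronous - poll for completion, then fetch results
while True:
    description = redshift_data.describe_statement(Id=statement_id)
    if description["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if description["Status"] == "FINISHED":
    result = redshift_data.get_statement_result(Id=statement_id)
    print(result["Records"])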
Orchestrating the Data Pipeline
Technical requirements
Understanding the core concepts for pipeline orchestration
What is a data pipeline, and how do you orchestrate it?
What is a directed acyclic graph?
How do you trigger a data pipeline to run?
Using manifest files as pipeline triggers
How do you handle the failures of a step in your pipeline?
Common reasons for failure in data pipelines
Pipeline failure retry strategies
Examining the options for orchestrating pipelines in AWS
AWS Data Pipeline (now in maintenance mode)
AWS Glue workflows to orchestrate Glue resources
Monitoring and error handling
Triggering Glue workflows
Apache Airflow as an open-source orchestration solution
Core concepts for creating Apache Airflow pipelines
AWS Step Functions for a serverless orchestration solution
A sample Step Functions state machine
Deciding on which data pipeline orchestration tool to use
Hands-on – orchestrating a data pipeline using AWS Step Functions
Creating new Lambda functions
Using a Lambda function to determine the file extension
Using Lambda to randomly generate failures
Creating an SNS topic and subscribing to an email address
Creating a new Step Functions state machine
Configuring our S3 bucket to send events to EventBridge
Creating an EventBridge rule for triggering our Step Functions state machine
Testing our event-driven data orchestration pipeline
Summary
Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
Ad Hoc Queries with Amazon Athena
Technical requirements
An introduction to Amazon Athena
Tips and tricks to optimize Amazon Athena queries
Common file format and layout optimizations
Transforming raw source files to optimized file formats
Partitioning the dataset
Other file-based optimizations
Writing optimized SQL queries
Selecting only the specific columns that you need
Using approximate aggregate functions
Reusing Athena query results
Exploring advanced Athena functionality
Querying external data sources using Athena Federated Query
Pre-built connectors and custom connectors
Using Apache Spark in Amazon Athena
Working with open table formats in Amazon Athena
Provisioning capacity for queries
Managing groups of users with Amazon Athena workgroups
Managing Athena costs with Athena workgroups
Per query data usage control
Athena workgroup data usage controls
Implementing governance controls with Athena workgroups
Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
Hands-on – switching workgroups and running queries
Summary
Visualizing Data with Amazon QuickSight
Technical requirements
Representing data visually for maximum impact
Benefits of data visualization
Popular uses of data visualizations
Trends over time
Data over a geographic area
Heat maps to represent the intersection of data
Understanding Amazon QuickSight’s core concepts
Standard versus Enterprise edition
SPICE – the in-memory storage and computation engine for QuickSight
Managing SPICE capacity
Ingesting and preparing data from a variety of sources
Preparing datasets in QuickSight versus performing ETL outside of QuickSight
Creating and sharing visuals with QuickSight analyses and dashboards
Visual types in Amazon QuickSight
AutoGraph for automatic graphing
Line, geospatial, and heat maps
Bar charts
Key performance indicators
Tables as visuals
Custom visual types
Other visual types
Understanding QuickSight’s advanced features
Amazon QuickSight ML Insights
Amazon QuickSight autonarratives
ML-powered anomaly detection
ML-powered forecasting
Amazon QuickSight Q for natural language queries
Generative BI dashboarding authoring capabilities
QuickSight Q Topics
Fine-tuning your QuickSight Q Topics
Amazon QuickSight embedded dashboards
Embedding for registered QuickSight users
Embedding for unauthenticated users
Generating multi-page formatted reports
Hands-on – creating a simple QuickSight visualization
Setting up a new QuickSight account and loading a dataset
Creating a new analysis
Publishing our visual as a dashboard
Summary
Enabling Artificial Intelligence and Machine Learning
Technical requirements
Understanding the value of AI and ML for organizations
Specialized AI projects
Medical clinical decision support platform
Early detection of diseases
Making sports safer
Everyday use cases for AI and ML
Forecasting
Personalization
Natural language processing
Image recognition
Exploring AWS services for ML
AWS ML services
SageMaker in the ML preparation phase
SageMaker in the ML build phase
SageMaker in the ML training and tuning phase
SageMaker in the ML deployment and management phase
Exploring AWS services for AI
AI for unstructured speech and text
Amazon Transcribe for converting speech into text
Amazon Textract for extracting text from documents
Amazon Comprehend for extracting insights from text
AI for extracting metadata from images and video
Amazon Rekognition
AI for ML-powered forecasts
Amazon Forecast
AI for fraud detection and personalization
Amazon Fraud Detector
Amazon Personalize
Building generative AI solutions on AWS
Understanding the foundations of generative AI technology
Building on foundational models using Amazon SageMaker JumpStart
Building on foundational models using Amazon Bedrock
Common use cases for LLMs
Hands-on – reviewing reviews with Amazon Comprehend
Setting up a new Amazon SQS message queue
Creating a Lambda function for calling Amazon Comprehend
Adding Comprehend permissions for our IAM role
Adding a Lambda function as a trigger for our SQS message queue
Testing the solution with Amazon Comprehend
Summary
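The core call behind the "Hands-on – reviewing reviews with Amazon Comprehend" exercise listed above is sentiment detection. A minimal Python/boto3 sketch is shown below, with a made-up review string; the book wires this call into Lambda and SQS rather than invoking it directly:

import boto3

comprehend = boto3.client("comprehend")

# Hypothetical review text for illustration
review_text = "The widget arrived quickly and works exactly as described."

response = comprehend.detect_sentiment(Text=review_text, LanguageCode="en")
print(response["Sentiment"])       # e.g. POSITIVE
print(response["SentimentScore"])  # per-class confidence scores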
Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World
Building Transactional Data Lakes
Technical requirements
What does it mean for a data lake to be transactional?
Limitations of Hive-based data lakes
High-level benefits of open table formats
ACID transactions
Record level updates
Schema evolution
Time travel
Overview of how open table formats work
Approaches used by table formats for updating tables
COW approach to table updates
MOR approach to table updates
Choosing between COW and MOR
An overview of Delta Lake, Apache Hudi, and Apache Iceberg
Deep dive into Delta Lake
Advanced features available in Delta Lake
Deep dive into Apache Hudi
Hudi Primary Keys
File groups
Compaction
Record level index
Deep dive into Apache Iceberg
Iceberg Metadata file
The manifest list file
The manifest file
Putting it together
Maintenance tasks for Iceberg tables
AWS service integrations for building transactional data lakes
Open table format support in AWS Glue
AWS Glue crawler support
AWS Glue ETL engine support
Open table support in AWS Lake Formation
Open table support in Amazon EMR
Open table support in Amazon Redshift
Open table support in Amazon Athena
Hands-on – Working with Apache Iceberg tables in AWS
Creating an Apache Iceberg table using Amazon Athena
Adding data to our Iceberg table and running queries
Modifying data in our Iceberg table and running queries
Iceberg table maintenance tasks
Optimizing the table layout
Reducing disk space by deleting snapshots
Summary
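To illustrate the "Creating an Apache Iceberg table using Amazon Athena" step listed above, the sketch below submits Iceberg DDL to Athena from Python. The database, table, columns, and S3 locations are hypothetical placeholders rather than the book's values:

import boto3

athena = boto3.client("athena")

# Hypothetical Iceberg table definition; 'table_type' = 'ICEBERG' is what
# tells Athena to create an Iceberg table rather than a Hive-style table
ddl = """
CREATE TABLE curated_db.orders_iceberg (
    order_id    string,
    order_total double,
    order_date  date
)
LOCATION 's3://my-curated-zone-bucket/orders_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"},
)

For the maintenance tasks listed above, Athena also supports OPTIMIZE ... REWRITE DATA USING BIN_PACK to compact small files and VACUUM to expire old snapshots on Iceberg tables.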
Implementing a Data Mesh Strategy
Technical requirements
What is a data mesh?
Domain-oriented, decentralized data ownership
Data as a product
Self-service data infrastructure as a platform
Federated computational governance
Data producers and consumers
Challenges that a data mesh approach attempts to resolve
Bottlenecks with a centralized data team
The “Analytics is not my problem” problem
No organization-wide visibility into datasets that are available
The organizational and technical challenges of building a data mesh
Changing the way that an organization approaches analytical data
Changes for the centralized data & analytics team
Changes for line of business teams
Technical challenges for building a data mesh
Integrating existing analytical tools
Centralizing dataset metadata in a single catalog and building automation
Compromising on integrations
AWS services that help enable a data mesh approach
Querying data across AWS accounts
Sharing data with AWS Lake Formation
Amazon DataZone, a business data catalog with data mesh functionality
DataZone concepts
DataZone components
A sample architecture for a data mesh on AWS
Architecture for a data mesh using AWS-native services
Architecture for a data mesh using non-AWS analytic services
Automating the sharing of data in Snowflake
Using query federation instead of data sharing
Hands-on – Setting up Amazon DataZone
Setting up AWS IAM Identity Center
Enabling and configuring Amazon DataZone
Adding a data source to our DataZone project
Adding business metadata
Creating a project for data analysis
Searching the data catalog and subscribing to data
Approving the subscription request
Summary
Building a Modern Data Platform on AWS
Technical requirements
Goals of a modern data platform
A flexible and agile platform
A scalable platform
A well-governed platform
A secure platform
An easy-to-use, self-serve platform
Deciding whether to build or buy a data platform
Choosing to buy a data platform
When to buy a data platform
Choosing to build a data platform
When to build a data platform
A third way – implementing an open-source data platform
The Serverless Data Lake Framework (SDLF)
Core SDLF concepts
DataOps as an approach to building data platforms
Automation and observability as a key for DataOps
Automating infrastructure and code deployment
Automating observability
AWS services for implementing a DataOps approach
AWS services for infrastructure deployment
AWS code management and deployment services
Hands-on – automated deployment of data platform components and data transformation code
Setting up a Cloud9 IDE environment
Setting up our AWS CodeCommit repository