
Mission100 Azure Data Engineer Course By Deepak Goyal

https://adeus.azurelib.com
Email at: admin@azurelib.com
Ask Queries here: https://www.linkedin.com/in/deepak-goyal-93805a17/

● What is Cloud Computing?

Cloud computing is the delivery of computing services over the internet ("the cloud"), allowing users to access
and use resources such as servers, storage, databases, networking, software, and analytics without needing to
manage the underlying hardware and software infrastructure.

Cloud service providers offer many services such as:

• Storage
• Computing
• Networking
• Servers
• Databases
• Developer tools
• Security
• Analytics, etc.

Advantages of Cloud Computing

Scalability:
Cloud computing provides the ability to easily increase or decrease IT resources based on demand or
requirements. For example, when companies like Flipkart and Amazon start their Big Billion Day sale or when
there's an India-Pakistan cricket match on Hotstar or a football World Cup final, the user traffic on these
platforms increases rapidly. To manage this surge in traffic, they can rent the required resources from a cloud
service provider for those specific days. This is known as scaling up.

Increasing resources is called scaling up, while decreasing resources is called scaling down.

There are two types of scaling:

Vertical Scaling:
Vertical scaling involves increasing or decreasing resources within the same server. This is done by adding more
RAM, hard disk space, or CPU power to the existing server without purchasing a new one.

Horizontal Scaling:
Horizontal scaling involves adding or removing servers to handle increased traffic or improve performance.
This is achieved by adding additional virtual machines (VMs), containers, or servers to the existing
infrastructure.

Geo-Distribution:
Cloud computing helps mitigate network latency issues by providing services based on geographical location.
Organizations can rent cloud servers in any region to provide services to users with minimal latency.
Agility:

Cloud computing enables organizations to access services very quickly, allowing them to adapt to changes or
new demands efficiently.

Cost-Effectiveness:
Cloud computing offers two cost models: CapEx and OpEx.

CapEx (Capital Expenditure):


This model involves large upfront investments, where an organization needs to buy everything required for its
business, such as IT infrastructure, servers, routers, switches, software, network cabling, and employees to
manage these resources.

OpEx (Operational Expenditure):


In this model, organizations do not need to make large investments. They pay for cloud services on a pay-as-
you-go basis, renting servers and services as needed for short periods.

Challenges
Security:
Security can be a concern, especially when using public cloud services or services from third-party providers.
The highest level of security is usually achieved when an organization uses its own private cloud, often referred
to as "on-prem" or "on-premises" cloud.

Types of Cloud
Private Cloud:
A private cloud is when an organization uses its own cloud infrastructure, managed either internally or by a
third party. This cloud environment is dedicated to a single organization, providing greater control, security,
and customization. It is often used by large enterprises that need to handle sensitive data and require stringent
compliance.

Public Cloud:
A public cloud is when an organization utilizes cloud services provided by a third-party cloud provider, such as
Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP). In this model, the cloud
infrastructure is shared among multiple users, also known as tenants, making it a cost-effective option for
businesses. Public clouds offer scalability, flexibility, and ease of access to a wide range of services.

Hybrid Cloud:
A hybrid cloud is a combination of both private and public clouds, designed to allow data and applications to
be shared between them. This setup enables organizations to take advantage of the scalability and cost-
efficiency of public cloud services while keeping sensitive operations in a private cloud environment. Hybrid
clouds are ideal for businesses that want to optimize their existing infrastructure and meet specific regulatory
or business requirements.

Types of Cloud Services


Cloud computing provides three primary types of services: IaaS, PaaS, and SaaS.

IaaS (Infrastructure as a Service):

IaaS offers on-demand access to fundamental computing resources such as virtual machines, storage, and
networking. Organizations can rent these resources and scale them up or down based on their needs, without
having to manage the physical hardware. Examples include AWS EC2, Azure Virtual Machines, and Google
Compute Engine. IaaS is ideal for businesses that want to retain control over their infrastructure while
benefiting from the flexibility and cost savings of the cloud.

PaaS (Platform as a Service):


PaaS provides on-demand access to a complete development and deployment environment in the cloud. It
includes everything needed to build, test, deploy, manage, and update applications. With PaaS, developers can
focus on writing code without worrying about managing the underlying infrastructure. Examples include Azure
App Service, AWS Lambda, and Google App Engine. PaaS is perfect for organizations that need a ready-to-use
platform for developing and managing applications efficiently.

SaaS (Software as a Service):


SaaS delivers ready-to-use application software over the internet on a subscription basis. Users can access
these applications from any device with an internet connection, without needing to manage or install software.
Examples include Google Workspace, Microsoft Office 365, and Salesforce. SaaS is ideal for businesses that
want to use software applications without the overhead of managing infrastructure or software updates.

What is Network Latency?


In simple terms, network latency is the delay or time it takes for data to travel from one point to
another over a network. Imagine you're sending a message to a friend, but instead of it being delivered
instantly, there's a slight pause before your friend receives it. That pause is like network latency.
For example, when you click on a website link, your computer sends a request to a server (where the
website is hosted), and the server sends back the information to load the page. If there’s high network
latency, it means there’s a noticeable delay between your click and the page fully loading.
How Can Cloud Computing Help Reduce Network Latency?
Cloud computing can help reduce network latency in a few key ways:
1. Geographical Distribution: Cloud providers like AWS, Azure, and Google Cloud have data
centers all over the world. When you use cloud services, your data and applications can be
hosted in a data center that’s physically closer to your location. The closer the server is to you,
the faster the data can travel, which reduces latency.
For instance, if you’re in India and accessing a website hosted on a server in the US, it might
take a bit longer for the data to travel back and forth. But if the website is hosted on a cloud
server in India, the data has a much shorter distance to travel, so the website loads faster.

2. Edge Computing: Cloud providers often use something called edge computing, which means
they place smaller data centers (called edge locations) closer to users. These edge locations can
handle certain tasks or store frequently accessed data locally, reducing the need to send data all
the way back to the main data center. This also helps in reducing latency.
3. Content Delivery Networks (CDNs): Many cloud providers offer CDNs, which are networks
of servers distributed around the world. CDNs store copies of your website or content on
multiple servers in different locations. When someone tries to access your content, the CDN
delivers it from the nearest server, reducing the distance the data needs to travel and,
consequently, the latency.

What are the different ways to connect to an Azure account?

1. Azure Portal

● Description: The Azure Portal is a web-based interface where you can manage and monitor your Azure
services.
● Usage: Simply go to https://portal.azure.com, sign in with your Azure account credentials, and you'll
have access to all your Azure resources.
● Best For: Managing resources, creating services, monitoring activity, and performing administrative
tasks via a graphical interface.

2. Azure CLI (Command-Line Interface)

● Description: Azure CLI is a cross-platform command-line tool that allows you to manage your Azure
resources directly from your terminal or command prompt.
● Usage: After installing Azure CLI, you can connect by running the command az login, which will open
a web browser for you to sign in.
● Best For: Automating tasks, scripting, and managing Azure resources via commands. It's popular
among developers and DevOps engineers.

3. Azure PowerShell

● Description: Azure PowerShell is a set of modules that allow you to manage Azure resources using
PowerShell scripts and commands.
● Usage: After installing Azure PowerShell, you can connect by running Connect-AzAccount, which
will prompt you to sign in through a web browser.
● Best For: Scripting and automating Azure resource management tasks, especially for those familiar
with PowerShell.

4. Azure Mobile App

● Description: The Azure mobile app lets you monitor and manage your Azure resources on the go.
● Usage: Download the Azure app from the App Store or Google Play, sign in with your Azure credentials,
and you can monitor resources, check alerts, and perform basic management tasks.
● Best For: Quick checks and light management tasks from a mobile device.

5. Azure Cloud Shell

● Description: Azure Cloud Shell is an online, browser-based shell that provides you with a command-
line experience in the Azure portal.
● Usage: Available directly in the Azure Portal, you can choose either Bash or PowerShell environments
to manage Azure resources without installing anything locally.
● Best For: Quick command-line tasks without needing to install anything on your local machine.

6. SDKs and APIs

● Description: Azure provides Software Development Kits (SDKs) and REST APIs that allow you to
connect to and manage Azure services programmatically.
● Usage: By using Azure SDKs available for various programming languages (like Python, .NET, Java, etc.),
you can connect to Azure directly within your application code.
● Best For: Developers building applications that need to interact with Azure services programmatically.
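
To illustrate the SDK route, here is a minimal Python sketch, assuming the azure-identity and azure-storage-blob packages are installed and using a placeholder storage account URL; it signs in with DefaultAzureCredential and lists the containers in a storage account.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential picks up Azure CLI, environment, or managed identity credentials.
credential = DefaultAzureCredential()
service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",  # placeholder
    credential=credential,
)

# List the containers in the storage account to confirm the connection works.
for container in service.list_containers():
    print(container.name)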

Azure Regions:
An Azure region is a specific geographical area where Microsoft has data centers to host cloud
services and resources. Each region consists of multiple data centers located within a defined perimeter
and connected through a dedicated, low-latency network.

● Purpose: Regions allow you to deploy your resources closer to your users to reduce latency,
meet data residency requirements, and ensure compliance with local regulations.
● Examples of Azure Regions:
o East US (Virginia)
o West Europe (Netherlands)
o Southeast Asia (Singapore)
o Australia East (New South Wales)
o Central India (Pune)

Azure offers over 60 regions globally, making it one of the most extensive cloud networks in the
world.

Availability Zones:

An Availability Zone is a physically separate location within an Azure region. Each Availability Zone
consists of one or more data centers equipped with independent power, cooling, and networking. By
deploying resources across multiple Availability Zones, you can ensure high availability and fault
tolerance for your applications.

● Purpose: Availability Zones are designed to protect your applications and data from data
center failures within a region. If one zone goes down, the other zones in the region continue to
operate, minimizing downtime.
● Structure: Typically, an Azure region will have three or more Availability Zones. These zones
are interconnected with high-speed, private fiber-optic networks.
● Examples of Services Using Availability Zones:
o Virtual Machines: You can deploy VMs across multiple zones to ensure that if one
zone fails, your application remains available in the other zones.
o Managed Disks: Zone-redundant storage (ZRS) replicates your data across multiple
zones to ensure durability.
o Load Balancers: Azure Load Balancer can distribute traffic across VMs in different
zones, providing high availability.

What are Resource Groups in Azure?


A Resource Group in Azure is a logical container that holds related Azure resources. These resources
might include virtual machines, storage accounts, virtual networks, databases, and more. Resource
groups help you manage and organize your Azure resources effectively.

Key Features of Resource Groups:


1. Logical Organization:
o Resource groups allow you to group together resources that share the same lifecycle or
purpose. For example, all resources related to a specific application or project can be placed in
the same resource group, making it easier to manage, monitor, and deploy.
2. Resource Management:
o You can perform management tasks on the resource group as a whole, such as applying
policies, setting access controls, or deleting the entire group. This simplifies operations,
especially when dealing with multiple resources.
3. Access Control:
o Azure Role-Based Access Control (RBAC) can be applied at the resource group level, allowing
you to define who has permission to manage the resources within the group. This provides a
centralized way to manage permissions.
4. Cost Management:
o Since all resources in a resource group can be tracked together, it becomes easier to monitor
and manage costs associated with a specific application or project.
5. Lifecycle Management:
o Resources within a resource group typically share the same lifecycle, which means you can
deploy, update, or delete all the resources together. For instance, when a project ends, you
can delete the resource group, and all its resources will be deleted as well.
6. Dependency Management:
o When deploying resources via templates (like Azure Resource Manager (ARM) templates), the
resources in a resource group can be deployed in a specific order, with dependencies defined
between them.
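
To make the management and lifecycle points concrete, the following is a minimal Python sketch, assuming the azure-identity and azure-mgmt-resource packages, a placeholder subscription ID, and a hypothetical resource group name (exact method names can vary slightly between SDK versions); it creates a resource group and later deletes it together with everything inside it.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder subscription ID; requires rights to manage resource groups.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create (or update) a resource group for a project.
client.resource_groups.create_or_update(
    "RG-Dev-ProjectX", {"location": "centralindia"}
)

# When the project ends, deleting the group removes every resource inside it.
client.resource_groups.begin_delete("RG-Dev-ProjectX").wait()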

Important Considerations:
● Location: The resource group itself has a location (region), which determines where its
metadata is stored. However, the resources within the group can be in different regions. It's
generally recommended to keep resources in the same region for performance reasons.
● Naming: Resource groups should be named in a way that reflects their purpose, such as RG-
Production-ApplicationName or RG-Dev-ProjectX, to make them easily identifiable.
● Scope: A resource can only belong to one resource group at a time, but you can move resources
between groups if needed.
● Limits: While there are limits on the number of resource groups and resources per group, these
limits are generally high and sufficient for most use cases.
Example Use Cases:
● Application Lifecycle: You might create a resource group for a web application that includes
resources like a web server, database, and storage account. When you update the application,
you can update all related resources together.
● Development and Testing: For a development environment, you could create a resource group
that contains all the necessary resources (VMs, databases, etc.) for testing purposes. Once
testing is complete, you can delete the resource group to clean up all associated resources.

● Cost Management: By grouping all resources for a specific department or project, you can
easily track the costs associated with that group and manage budgets accordingly.
In summary, resource groups in Azure are a fundamental feature for organizing, managing, and
controlling access to your cloud resources. They help streamline operations and provide a structured
way to manage your Azure environment.

What is a Storage Account in Azure?


An Azure Storage Account is a service provided by Microsoft Azure that allows you to store and
manage various types of data in the cloud. It acts as a container that provides a namespace for Azure
Storage services and gives you access to Azure's scalable, secure, and durable storage options.

Key Features of Azure Storage Accounts:


1. Data Services: An Azure Storage Account can hold several types of data services, including:
o Blob Storage: For storing unstructured data such as images, videos, and documents. It’s
particularly useful for storing files that are accessed over the internet.
o File Storage: Provides fully managed file shares in the cloud that are accessible via the Server
Message Block (SMB) protocol. It’s like having a network file share in the cloud.
o Queue Storage: For storing large numbers of messages that can be accessed from anywhere
via authenticated calls. It’s often used for task scheduling and asynchronous processing.
o Table Storage: A NoSQL key-value store for structured data. It’s designed to handle large
amounts of data with simple query operations.
o Disk Storage: Provides persistent, durable, and high-performance storage for Azure Virtual
Machines (VMs) in the form of managed disks.
2. Scalability: Azure Storage Accounts are designed to scale automatically to meet your needs.
Whether you need to store a few gigabytes or several petabytes, Azure Storage can handle the
load.
3. Durability and Redundancy: Azure provides multiple options for ensuring that your data is
highly available and durable:
o Locally-Redundant Storage (LRS): Copies your data three times within a single data center in a
region.
o Zone-Redundant Storage (ZRS): Copies your data synchronously across three Azure availability
zones in a region.
o Geo-Redundant Storage (GRS): Copies your data to a secondary region, hundreds of miles
away from the primary location, for disaster recovery.
o Read-Access Geo-Redundant Storage (RA-GRS): Provides read access to the data in the
secondary region, in addition to the features of GRS.
4. Security:
o Encryption: All data stored in Azure Storage is encrypted by default. Azure uses both server-
side and client-side encryption to protect your data.
o Access Control: You can control access to your storage account using Azure Active Directory
(AAD) and shared access signatures (SAS), allowing you to define who can access your data and
for how long.
o Firewall and Virtual Network Integration: You can restrict access to your storage account to
specific networks or IP ranges for additional security.
5. Cost-Effectiveness: Azure Storage offers different pricing tiers based on the type of storage,
redundancy option, and access patterns (hot, cool, or archive tiers). This flexibility allows you
to optimize costs based on your storage needs.
6. Access Methods: Data in an Azure Storage Account can be accessed via several methods:
o Azure Portal: A graphical user interface for managing your storage account and its contents.
o Azure CLI/PowerShell: Command-line tools for automating and managing your storage
resources.
o Azure SDKs: Software Development Kits (SDKs) for various programming languages (like .NET,
Java, Python) that allow you to interact with Azure Storage programmatically.
o REST API: Provides direct access to Azure Storage services for custom integration.
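
As a small illustration of the SDK access method above, the Python sketch below (assuming the azure-storage-blob package, a placeholder connection string, and a hypothetical container named documents) uploads a local file as a block blob.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string taken from the storage account's Access keys blade.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("documents")

# Upload a local file as a block blob; overwrite any existing blob with the same name.
with open("report.pdf", "rb") as data:
    container.upload_blob(name="reports/report.pdf", data=data, overwrite=True)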

Types of Storage Accounts:

● General-purpose v2 (GPv2): Supports all the latest features and is the recommended type for most
scenarios. It provides access to all Azure Storage services.
● Blob Storage Account: Specifically optimized for blob storage, with tiering options (hot, cool, and
archive) to optimize cost based on access frequency.
● File Storage Account: Optimized for Azure Files, supporting premium file shares.

● BlockBlobStorage Account: Designed for workloads with high transaction rates or that require
consistent, low-latency data access.

Use Cases:

● Storing Files and Documents: Store and access files, images, videos, and other unstructured data.

● Backup and Restore: Use Azure Storage as a backup destination for on-premises or cloud-based
systems.
● Disaster Recovery: Ensure data availability with geo-redundant storage options.
● Big Data Analytics: Store large datasets for analysis with tools like Azure Data Lake Analytics or Azure
Synapse Analytics.
● Web and Mobile Applications: Host and serve content such as web pages, videos, or static files
directly from Azure Storage.

How to Create a Storage Account:

1. Sign in to the Azure Portal.


2. Navigate to 'Storage Accounts' and click on 'Create'.
3. Configure your storage account settings, such as choosing the resource group, name, region, and
performance tier.
4. Select the desired redundancy option (LRS, ZRS, GRS, or RA-GRS).
5. Review and create the storage account.

Once created, you can start using your storage account to store and manage data in the cloud.
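
The same steps can be scripted. Below is a minimal Python sketch, assuming the azure-identity and azure-mgmt-storage packages, a placeholder subscription ID, and placeholder resource group and account names; older SDK releases expose create instead of begin_create.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create a general-purpose v2 account with locally-redundant storage.
poller = client.storage_accounts.begin_create(
    "RG-Dev-ProjectX",        # resource group (placeholder)
    "mystorageacct123",       # globally unique account name (placeholder)
    {
        "location": "centralindia",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
account = poller.result()
print(account.primary_endpoints.blob)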

To retain recently deleted data for a recovery period, you can enable (or disable) the soft delete feature on the storage account.
Encryption in Azure Storage Account
Azure Storage provides robust encryption features to protect your data at rest. This ensures that all
your data is automatically encrypted before it is stored and decrypted before it is retrieved, without
requiring any additional configuration or management from you.

● Microsoft-managed keys: By default, Azure manages the encryption keys for you, which simplifies key management and ensures that your data is protected using Microsoft-managed keys.
● Customer-managed keys (CMK): If you prefer more control, you can manage your own encryption keys using Azure Key Vault. This gives you full control over the key lifecycle, including rotation and revocation. Customer-managed keys can also be used for auditing purposes, as you have visibility into the key usage.

Access Tiers in Azure Storage Account


Azure Storage offers different access tiers to help you optimize the cost of storing data based on how
frequently the data is accessed. These access tiers allow you to store your data in a way that aligns with
its usage patterns, enabling you to balance cost and performance.
Here’s a breakdown of the Azure Storage access tiers:
1. Hot Tier:

● Description: The Hot tier is designed for data that is accessed frequently. It offers the lowest access
latency and the highest throughput, making it ideal for data that needs to be accessed and processed
regularly.
● Use Cases:
o Active datasets, such as files and databases that are accessed frequently.
o Content that is frequently updated or queried, like transaction logs.
o Data for applications that require low latency access, such as web and mobile apps.
● Cost: Higher storage costs compared to the Cool and Archive tiers, but lower access costs.

2. Cool Tier:

● Description: The Cool tier is optimized for data that is infrequently accessed but needs to be stored for
at least 30 days. It offers lower storage costs than the Hot tier but higher access costs.
● Use Cases:
o Data that is not accessed frequently but still needs to be available for occasional access, like
backups, archived data, or media content that is accessed seasonally.
o Data that is stored for compliance or business continuity purposes.
● Cost: Lower storage costs than the Hot tier, but higher costs for data access and retrieval.

3. Archive Tier:

● Description: The Archive tier is intended for data that is rarely accessed and can tolerate higher
retrieval times. This tier offers the lowest storage costs but the highest costs and latency for data
retrieval.
● Use Cases:
o Long-term archival data, such as compliance records, legal documents, or historical data that
may be accessed once in a while.
o Data that needs to be kept for extended periods but is not likely to be needed frequently.
● Cost: The lowest storage cost among all tiers, but the highest cost and latency for data access. Data in
the Archive tier must be rehydrated (moved to the Hot or Cool tier) before it can be accessed.
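
Access tiers can be changed per blob after upload. The Python sketch below (assuming the azure-storage-blob package and placeholder connection string, container, and blob names) moves a blob to the Cool tier and reads back its tier and rehydration status.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
blob = service.get_blob_client(container="backups", blob="2023/archive.zip")

# Move an infrequently used blob to the Cool tier; "Hot" and "Archive" are also valid values.
blob.set_standard_blob_tier("Cool")

# An archived blob must be rehydrated before it can be read; the properties show the status.
props = blob.get_blob_properties()
print(props.blob_tier, props.archive_status)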

Access Keys, SAS Tokens, and Why Azure Provides Two Keys (Key1 & Key2)
1. Access Keys:
● What Are They? Access keys are a pair of 512-bit keys generated by Azure for your storage
account. These keys are used to authenticate and authorize access to your storage account's data
services, including Blob Storage, Queue Storage, Table Storage, and File Storage.
● Purpose: Access keys allow full access to your storage account, including read, write, and
delete operations across all data services. They are the primary means of programmatic access
to Azure Storage services.

● Usage:
o SDKs and APIs: When developing applications that need to interact with Azure Storage, you
can use these keys to authenticate requests. The keys are included in the connection strings
used by the Azure Storage SDKs or directly in API calls.
o Administrative Tools: Tools like Azure Storage Explorer or custom scripts often use access keys
to connect to and manage storage resources.

2. Shared Access Signature (SAS) Tokens:


● What Are They? A Shared Access Signature (SAS) token is a URI that grants restricted access
rights to Azure Storage resources. Unlike access keys, which provide full access to a storage
account, SAS tokens allow you to delegate access to specific resources with fine-grained
control over permissions, duration, and IP restrictions.
● Purpose: SAS tokens are used to provide secure, temporary access to storage resources without
sharing the full access keys. They are ideal for scenarios where you need to provide access to
third parties or applications that should only have limited permissions.
● Usage:
o Time-Limited Access: SAS tokens can be configured to expire after a certain period, ensuring
that access is only available for a limited time.
o Permission Control: You can specify which operations (read, write, delete, list, etc.) are
allowed with the SAS token.
o IP Restrictions: SAS tokens can be configured to only allow access from specific IP addresses or
ranges, adding an extra layer of security.
o Resource-Specific Access: SAS tokens can be scoped to a specific resource, such as a single
blob or container, rather than the entire storage account.
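
The Python sketch below shows one way to issue a time-limited, read-only SAS for a single blob, assuming the azure-storage-blob package and placeholder account, key, container, and blob names.

from datetime import datetime, timedelta
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

# Generate a read-only SAS token for a single blob that expires in one hour.
sas_token = generate_blob_sas(
    account_name="<storage-account-name>",   # placeholder
    container_name="documents",              # placeholder
    blob_name="reports/report.pdf",          # placeholder
    account_key="<account-key>",             # placeholder
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

# Anyone holding this URL can read the blob until the token expires.
url = (
    "https://<storage-account-name>.blob.core.windows.net/"
    "documents/reports/report.pdf?" + sas_token
)
print(url)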

3. Why Two Keys? (Key1 & Key2):


● Key Rotation: Azure provides two access keys (Key1 and Key2) for each storage account to
facilitate key rotation, which is a best practice for security. Key rotation involves periodically
changing your access keys to reduce the risk of compromise.
● Purpose: Having two keys allows you to rotate your keys without downtime. You can switch
your applications from using one key to the other while you regenerate the first key. This
ensures that your applications continue to function seamlessly during the key rotation process.
● How It Works:
1. Use Key1 for Authentication: Initially, your application might use Key1 for accessing the
storage account.
2. Regenerate Key2: When you need to rotate keys, you can regenerate Key2.
3. Switch to Key2: Update your application to start using Key2 for authentication.
4. Regenerate Key1: Once the application is using Key2, you can safely regenerate Key1.
5. Switch Back if Needed: You can continue this process to periodically rotate the keys.
● Security Best Practice: Regularly rotating access keys helps to minimize the risk if a key is
accidentally exposed or compromised. By using the two-key system, you can ensure that your
storage account remains secure while maintaining continuous access.
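
As a rough sketch of how rotation looks in code, the following Python snippet (assuming the azure-identity and azure-mgmt-storage packages and placeholder subscription, resource group, and account names) reads both keys and builds a connection string from Key2; regenerating Key1 would then be done in the portal or through the management API once traffic has moved over.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fetch both keys; they are typically returned as key1 followed by key2.
keys = client.storage_accounts.list_keys("RG-Dev-ProjectX", "mystorageacct123").keys
key2 = keys[1].value

# Point the application at Key2 before regenerating Key1 (in the portal or via the
# management API), so access is never interrupted during rotation.
connection_string = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=mystorageacct123;"
    f"AccountKey={key2};"
    "EndpointSuffix=core.windows.net"
)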

ADLS vs Blob Storage

Feature-by-feature comparison of Azure Data Lake Storage (ADLS) and Azure Blob Storage:

● Purpose: ADLS is optimized for big data analytics workloads; Blob Storage is general-purpose object storage for unstructured data.
● Data Hierarchy: ADLS supports a hierarchical namespace (folders and subfolders); Blob Storage uses a flat namespace with no native support for directories.
● Performance: ADLS is designed for high throughput and optimized performance in big data scenarios; Blob Storage offers high performance but is not specifically optimized for big data.
● Security: ADLS provides granular access control with Azure Active Directory (AAD) integration; Blob Storage relies on role-based access control (RBAC) and shared access signatures (SAS).
● Access Protocols: ADLS supports the Hadoop Distributed File System (HDFS) interface and POSIX-compliant ACLs; Blob Storage is accessed through standard REST APIs, SDKs, and tools like Azure Storage Explorer.
● Analytics Integration: ADLS has built-in support for analytics engines like Azure HDInsight, Azure Databricks, and Azure Synapse; Blob Storage has limited integration with big data analytics engines.
● Data Types: ADLS is optimized for structured, semi-structured, and unstructured data; Blob Storage is best for unstructured data (images, videos, documents, etc.).
● Cost Structure: ADLS has a higher cost, especially for hierarchical namespace features; Blob Storage is generally lower cost and suitable for simple storage needs.
● Use Cases: ADLS suits big data analytics, complex data processing, and enterprise-level data lakes; Blob Storage suits backup, archiving, web applications, and media storage.

Azure File Share Service


Azure File Share is a fully managed cloud file storage service provided by Azure, which allows you to
create and manage file shares that can be accessed via the Server Message Block (SMB) protocol or
the Network File System (NFS) protocol. It is designed to replace or complement traditional on-
premises file servers, providing scalable, secure, and accessible storage in the cloud.
Key Features of Azure File Share Service:
1. Fully Managed Service:
o Azure File Share is a fully managed service, meaning Azure handles the underlying
infrastructure, including hardware, networking, and software, allowing you to focus on your
data and applications.
2. SMB and NFS Protocols:
o Supports SMB protocol versions 2.1, 3.0, and 3.1.1, making it compatible with a wide range of
Windows, Linux, and macOS operating systems.
o NFS 4.1 protocol support is also available (on premium file shares) for Linux and Unix-based systems, which allows
seamless integration with existing NFS-based workloads.
3. Access from Anywhere:
o You can mount Azure file shares on your on-premises devices or in the cloud, providing
seamless access to files from any location with an internet connection.
o The service supports Azure AD integration for secure access and management.
4. Scalability:
o Azure File Shares can scale up to hundreds of terabytes, accommodating large volumes of
data. You can create multiple file shares within a single storage account, each with its own
capacity limit.
5. Pricing Tiers:
o Standard: Designed for general-purpose file shares with cost-effective storage, accessed over the SMB protocol.
o Premium: Provides high-performance file shares optimized for low latency and high throughput, ideal for I/O-intensive workloads; NFS file shares are available only in this tier.

Use Cases for Azure File Share:

● File Server Migration: Migrate on-premises file servers to Azure to reduce infrastructure management
overhead and improve accessibility.
● Hybrid Cloud Solutions: Use Azure File Sync to maintain a synchronized copy of your data on-premises
and in Azure.
● Application Storage: Store configuration files, logs, and other application-related data that need to be
shared across multiple instances or VMs.
● Lift and Shift Applications: For legacy applications that rely on SMB or NFS protocols, Azure File Shares
offer a cloud-native solution without requiring application changes.
● Persistent Storage for Containers: Use Azure File Shares with Azure Kubernetes Service (AKS) to
provide persistent storage for containerized applications.
How to Create and Use Azure File Shares:
1. Create a Storage Account:
o In the Azure Portal, navigate to "Storage accounts" and create a new storage account.
2. Create a File Share:
o Inside the storage account, go to "File shares" and click "Add" to create a new file share.
Specify the name and quota for the file share.
3. Mount the File Share:
o Use the connection string or script provided by Azure to mount the file share on your local
machine or VM. This can be done using SMB or NFS protocols.
4. Manage Files:
o Once mounted, you can manage files just like you would with a local file system, including
creating, deleting, copying, and moving files.
5. Use Azure File Sync (Optional):
o Install the Azure File Sync agent on a Windows Server and configure it to sync with your Azure
File Share, enabling local caching and multi-site synchronization.
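
Besides mounting over SMB, a share can be used through the SDK. The Python sketch below assumes the azure-storage-file-share package, a placeholder connection string, and hypothetical share, directory, and file names.

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string("<connection-string>", share_name="teamfiles")
share.create_share()

# Create a directory and upload a file into it over the REST API;
# the same share can also be mounted via SMB as described above.
directory = share.get_directory_client("configs")
directory.create_directory()
file_client = directory.get_file_client("app.settings")
with open("app.settings", "rb") as data:
    file_client.upload_file(data)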

What is Azure Queue Storage?


Azure Queue Storage is a service provided by Microsoft Azure that allows you to store large numbers
of messages that can be accessed from anywhere via authenticated calls. It is designed to help decouple
components of a cloud application by enabling asynchronous communication between them. Azure
Queues are particularly useful for handling tasks that do not require immediate processing or for
distributing workloads across multiple processors.
Key Features of Azure Queue Storage:
1. Message Storage:
o Each message in a queue can be up to 64 KB in size.
o A queue can contain millions of messages, making it suitable for large-scale applications.
2. Asynchronous Communication:
o Azure Queue Storage allows different parts of an application to communicate asynchronously.
For example, a web application can place tasks in a queue, and background processes can
process them later.
o This decoupling enables better scalability and fault tolerance.
3. Visibility Timeout:
o When a message is retrieved from a queue, it becomes invisible to other clients for a specified
period (known as the visibility timeout). If the message is processed within this time, it can be
deleted. If not, it becomes visible again for another process to handle.
o This feature ensures that messages are processed at least once, even if a worker fails after
retrieving a message.
4. Poison Messages Handling:
o If a message repeatedly fails to process and its dequeue count exceeds a chosen threshold, the application typically moves it to a "poison message" queue for further investigation. This prevents a single
problematic message from blocking the processing of other messages.
5. Time-to-Live (TTL):
o Messages in a queue can have a time-to-live (TTL) setting. If a message is not processed and deleted within
the TTL, it is automatically removed. The default TTL is seven days; a TTL of -1 keeps a message in the queue until it is
explicitly deleted.
6. Scalability:
o Azure Queue Storage is highly scalable, able to handle vast numbers of messages across
distributed components of an application.
7. Integration with Other Azure Services:
o Azure Queue Storage can integrate seamlessly with other Azure services like Azure Functions,
Azure Logic Apps, and Azure WebJobs. For instance, an Azure Function can be triggered to
execute whenever a new message is added to a queue.
o This integration makes it easy to automate processing workflows based on queue messages.
8. Cost-Effective:
o Queue Storage is a cost-effective solution for building scalable, decoupled, and distributed
applications. You pay for the storage used by the messages and the operations performed on
the queue (such as adding and retrieving messages).

Use Cases for Azure Queue Storage:


1. Task Scheduling:
o Queue Storage is often used for scheduling tasks that do not need to be executed immediately.
For example, an e-commerce site might use a queue to process orders asynchronously after a
customer completes a purchase.
2. Workload Distribution:
o Azure Queues can distribute tasks across multiple processing nodes. For example, a queue can
be used to distribute image processing tasks across multiple virtual machines, ensuring that
each VM processes a part of the workload.
3. Event-Driven Architectures:
o In an event-driven architecture, different parts of the system might generate events that need
to be processed asynchronously. Queue Storage can be used to manage these events, ensuring
that they are processed in the correct order.
4. Decoupling Components:
o By using queues, different parts of an application can be decoupled, making the system more
modular and easier to scale. For example, a front-end web application might place user
requests in a queue, which are then processed by a separate backend service.

How to Use Azure Queue Storage:


1. Create a Queue:
o In the Azure Portal, navigate to your storage account, and under "Queues," you can create a
new queue by providing a unique name.
2. Add Messages to the Queue:
o You can add messages to the queue using the Azure Portal, Azure SDKs, or REST API. Each
message can contain text data up to 64 KB in size.
3. Retrieve and Process Messages:
o Your application can retrieve messages from the queue, process them, and then delete them
from the queue. If the processing fails, the message will become visible again after the visibility
timeout.
4. Monitor and Manage the Queue:
o Azure provides tools to monitor the number of messages in a queue, the rate at which
messages are being added and processed, and other key metrics.
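
A minimal Python sketch of this enqueue/dequeue cycle, assuming the azure-storage-queue package, a placeholder connection string, and a hypothetical queue named orders:

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", queue_name="orders")
queue.create_queue()

# A web front end might enqueue work like this...
queue.send_message("process order 12345")

# ...and a background worker retrieves, processes, and deletes messages.
# A retrieved message stays invisible for the visibility timeout while it is handled.
for message in queue.receive_messages(visibility_timeout=60):
    print("processing:", message.content)
    queue.delete_message(message)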

What is a NoSQL Database?


A NoSQL database is a type of database that provides a mechanism for storage and retrieval of data
that is modeled in means other than the tabular relations used in traditional relational databases (SQL
databases). NoSQL databases are designed to handle large volumes of unstructured, semi-structured, or
structured data, and are optimized for performance, flexibility, and scalability.
Key Characteristics of NoSQL Databases:
1. Non-Relational Data Model:
o Unlike SQL databases, which use tables with rows and columns to store data, NoSQL databases
use various data models, such as document, key-value, wide-column, and graph models.
2. Scalability:
o NoSQL databases are designed to scale out horizontally, meaning you can add more servers or
nodes to handle increased load, making them suitable for large-scale applications and big data
scenarios.
3. Flexible Schema:
o NoSQL databases often have a flexible schema, allowing you to store data without a
predefined structure. This means you can easily add new fields or attributes to your data
without having to alter the entire database schema.
4. Distributed Architecture:
o Many NoSQL databases are distributed, meaning data is stored across multiple servers or
nodes, which improves availability, fault tolerance, and scalability.
5. High Performance:
o NoSQL databases are optimized for high read and write performance, making them suitable for
real-time applications, such as social media platforms, online gaming, and IoT systems.

Types of NoSQL Databases:


1. Document Databases:
o Description: Store data as documents, usually in JSON or BSON format. Each document is a
self-contained unit of data that can have a different structure from other documents in the
same collection.
o Examples: MongoDB, CouchDB, RavenDB.
o Use Cases: Content management systems, catalogs, user profiles, and real-time analytics.
2. Key-Value Stores:
o Description: Store data as key-value pairs, where each key is unique and maps to a value. This
is the simplest form of NoSQL database.
o Examples: Redis, DynamoDB, Riak, Azure Table Storage.
o Use Cases: Caching, session management, and real-time data processing.
3. Wide-Column Stores:
o Description: Store data in tables, rows, and dynamic columns, where each row can have a
different set of columns. This model is suitable for handling large datasets with high write
throughput.
o Examples: Cassandra, HBase, ScyllaDB.
o Use Cases: Time-series data, sensor data, and large-scale analytical workloads.
4. Graph Databases:
o Description: Store data in nodes and edges, where nodes represent entities, and edges
represent the relationships between them. This model is highly effective for querying complex
relationships in data.
o Examples: Neo4j, Amazon Neptune, ArangoDB.
o Use Cases: Social networks, recommendation engines, fraud detection, and network analysis.
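
As a small key-value example using Azure Table Storage (listed above), the Python sketch below assumes the azure-data-tables package, a placeholder connection string, and a hypothetical UserProfiles table; note how each entity is addressed purely by PartitionKey and RowKey and can carry its own set of properties.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")  # placeholder
table = service.create_table_if_not_exists("UserProfiles")

# Entities are schemaless: no table-wide schema has to be defined up front.
table.create_entity({
    "PartitionKey": "india",
    "RowKey": "user-001",
    "DisplayName": "Asha Rao",
    "PlanTier": "premium",
})

profile = table.get_entity(partition_key="india", row_key="user-001")
print(profile["DisplayName"])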

Advantages of NoSQL Databases:

● Scalability: Easily scale horizontally to accommodate large volumes of data and high transaction rates.

● Flexibility: Store unstructured, semi-structured, or structured data without requiring a rigid schema.

● Performance: Optimized for fast reads and writes, making them ideal for real-time applications.

● Distributed and Fault-Tolerant: Often designed to run on distributed systems, ensuring high
availability and fault tolerance.

Disadvantages of NoSQL Databases:

● Lack of ACID Transactions: Many NoSQL databases do not provide strong consistency guarantees
(ACID transactions), which can be a disadvantage for applications requiring strict data consistency.
● Complexity: NoSQL databases may require more complex data modeling and querying, especially for
developers accustomed to SQL databases.
● Limited Support for Complex Queries: While NoSQL databases excel in performance and scalability,
they may not offer the advanced query capabilities found in SQL databases.

Use Cases for NoSQL Databases:

● Big Data and Analytics: Handling large volumes of diverse data for analytics, such as in IoT applications
or social media platforms.
● Content Management: Storing and retrieving unstructured data, such as documents, images, and
videos.
● Real-Time Applications: Supporting high-performance, real-time applications like online gaming, chat
applications, and real-time analytics.
● Graph-Based Queries: Managing data with complex relationships, such as social networks,
recommendation systems, and fraud detection.

ETL Tools

1. Informatica PowerCenter
2. Talend
3. Apache NiFi
4. Microsoft SQL Server Integration Services (SSIS)
5. AWS Glue

What is Azure Data Factory and Its Key Components?


Azure Data Factory is a cloud-based data integration service provided by Microsoft Azure. It allows
users to create data-driven workflows for orchestrating and automating data movement and
transformation between various data sources.
Key Components of Azure Data Factory:
1. Linked Services:
o Linked Services define the connection information needed for Azure Data Factory to connect
to external resources. They specify how Data Factory should authenticate and access these
resources, such as by using a username and password, or key-based authentication.
2. Datasets:
o Datasets represent the data structures within the data stores. An input dataset represents the
data used as input for an activity in the pipeline, while an output dataset represents the data
produced by an activity. For example, an Azure Blob dataset might specify the blob container
and folder in Azure Blob Storage from which the pipeline should read data, while an Azure SQL
Table dataset specifies the table to which the activity writes its output.
3. Activities:
o Activities define the actions that are performed on your data within a pipeline. In Azure Data
Factory, activities can be used for data movement, data transformation, or control operations.
Examples include copying data from one source to another, transforming data using Azure
Data Lake Analytics, or executing stored procedures in a SQL database.

Step 1: Create an Azure Data Factory (ADF) Account


1. Sign in to the Azure Portal:
o Go to https://portal.azure.com and sign in with your Azure credentials.
2. Create a New Data Factory:
o Navigate to the "Create a resource" page by clicking on the + Create a resource button.
o In the search bar, type Data Factory and select Data Factory from the list.
o Click Create.
3. Configure Data Factory:
o Subscription: Select your Azure subscription.
o Resource Group: Either select an existing resource group or create a new one.
o Region: Choose the region where you want to deploy the Data Factory.
o Name: Provide a unique name for your Data Factory instance.
o Version: Select V2 (the latest version).
o Click Review + create and then Create after validation.
4. Wait for Deployment:
o The deployment process may take a few minutes. Once complete, navigate to your newly
created Data Factory.

Step 2: Create an Azure SQL Database


1. Create a New SQL Database:
o In the Azure Portal, click on + Create a resource.
o Search for SQL Database and select it.
o Click Create.
2. Configure SQL Database:
o Subscription: Select your Azure subscription.
o Resource Group: Use the same resource group as your Data Factory or create a new one.
o Database Name: Provide a name for your database, e.g., CustomerDB.
o Server: Either create a new SQL server or use an existing one. If creating a new one, specify the
server name, admin login, password, and region.
o Compute + Storage: Choose the appropriate pricing tier based on your requirements.
o Networking, Additional Settings, Tags: Configure these based on your environment or accept
the defaults.
3. Review and Create:
o Click Review + create and then Create after validation. The database creation may take a few
minutes.
Step 3: Create a Pipeline to Pull Data from the Customer Table to ADLS
1. Create Linked Services:
o Azure SQL Database Linked Service:
1. In your Data Factory, go to Manage > Linked services.
2. Click on + New and search for Azure SQL Database.
3. Configure the connection by selecting your Azure subscription, SQL server name,
database name, and authentication type (SQL Authentication or Managed Identity).
4. Test the connection and then click Create.
o Azure Data Lake Storage Gen2 Linked Service:
1. Still in Linked services, click on + New and search for Azure Data Lake Storage Gen2.
2. Configure the connection by selecting your subscription, storage account name, and
authentication method.
3. Test the connection and then click Create.

2. Create Datasets:
o SQL Dataset:
1. Go to Author > Datasets and click + New Dataset.
2. Select Azure SQL Database and configure it to point to the Customer table.
3. Provide a meaningful name, e.g., CustomerDataset.
o ADLS Dataset:
1. Create another dataset, this time selecting Azure Data Lake Storage Gen2.
2. Choose the file format (e.g., DelimitedText for CSV files).
3. Specify the ADLS path where the data should be stored, e.g.,
adls-container/customer-data/.
4. Name it appropriately, e.g., CustomerADLSDataset.

3. Create a Pipeline:
o In the Author section, click on + New Pipeline.
o Name the Pipeline (e.g., CustomerToADLSPipeline).
4. Add a Copy Activity:
o Drag the Copy data activity from the Activities pane into the pipeline canvas.
o Source: Configure the source to use the CustomerDataset.
o Sink: Configure the sink to use the CustomerADLSDataset.
5. Configure Pipeline Settings:
o Mapping: Optionally map columns from the SQL table to the ADLS output.
o Settings: Configure additional settings like performance tuning or logging if necessary.
6. Validate and Debug the Pipeline:
o Click on Validate to ensure there are no errors in the pipeline.
o Use the Debug feature to test the pipeline and ensure that data is correctly copied from the
SQL database to ADLS.
7. Publish and Trigger the Pipeline:
o Once validated, click Publish All to save your pipeline.
o You can trigger the pipeline manually using the Trigger Now option or schedule it using a
trigger (e.g., time-based or event-based).
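
The published pipeline can also be triggered and monitored from code. The Python sketch below assumes the azure-identity and azure-mgmt-datafactory packages plus placeholder subscription, resource group, and factory names.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off the pipeline created above and check its status.
run = adf.pipelines.create_run(
    "RG-Dev-ProjectX",         # resource group (placeholder)
    "my-data-factory",         # data factory name (placeholder)
    "CustomerToADLSPipeline",
)
status = adf.pipeline_runs.get("RG-Dev-ProjectX", "my-data-factory", run.run_id)
print(status.status)   # e.g. InProgress, Succeeded, Failed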

Summary:

● Azure Data Factory: Set up a cloud-based data integration service.


● Azure SQL Database: Create a database and set up a table with customer data.

● ADF Pipeline: Create a pipeline that pulls data from the SQL database and stores it in Azure Data Lake
Storage.

Do We Need to Create a Linked Service Every Time?


In Azure Data Factory (ADF), you don't need to create a linked service repeatedly because a linked
service defines the connection information to a data source. Once a linked service is created, it can be
reused across multiple activities, datasets, and pipelines within the data factory. This reuse helps to
streamline your workflow and maintain consistency in how your data sources are accessed.
Do We Need to Create Multiple Datasets Every Time in ADF?
In Azure Data Factory (ADF), you don't always need to create multiple datasets for every operation.
The need to create multiple datasets depends on your specific use case and the characteristics of the
data sources or destinations involved. Here are some key points to consider:
When You Need Multiple Datasets:
1. Different Data Sources or Destinations:
o If you're working with different types of data sources or destinations (e.g., Azure Blob Storage,
Azure SQL Database, on-premises SQL Server), you'll need separate datasets for each since
each type has its unique connection and structure requirements.
2. Different Schemas:
o If the data structures (schemas) differ significantly, you'll need separate datasets to accurately
define the schema of each data source or destination. This ensures that the data is correctly
interpreted and handled by the pipeline activities.
3. Parameterization Limitations:
o While parameterized datasets can handle variations, there are cases where the differences
between datasets are significant enough that it makes more sense to create separate datasets.
This might be due to entirely different data models, specific formatting needs, or distinct
processing requirements.

When You Can Use the Same Dataset:


1. Same Data Source Type and Schema:
o If the data source type and schema are consistent, you can reuse the same dataset by
parameterizing it to handle different instances or locations of the data. This approach reduces
redundancy and simplifies the management of your data pipelines.
2. Parameterized Datasets:
o By using parameterized datasets, you can create a flexible dataset definition that can be
reused across different activities. For instance, you can parameterize aspects like table names,
file paths, or storage locations, enabling you to handle variations without creating new
datasets for each slight change.

How to Create Parameterized Datasets:


Let's consider a scenario where you want to pull data from the Customer table in an Azure SQL
Database and store it in Azure Data Lake Storage (ADLS) in CSV format. If you already have a linked
service for the Azure SQL Database, you don't need to create it again. Instead, you can use
parameterization to adapt your dataset to different tables or file names.
Steps to Create a Parameterized Dataset:
1. Create a New Dataset:
o In Azure Data Factory Studio, go to the Author section and click on + New Dataset.
2. Select Data Source:
o Search for SQL and select Azure SQL Database as the dataset type.
3. Configure the Dataset:
o Provide a name for your dataset.
o Under Linked Service, select the previously created linked service for your Azure SQL
Database.
4. Parameterize the Dataset:
o Click on the Parameters tab in the dataset editor.
o Add a parameter for the table name, e.g., TableName.
o In the Connection tab, under the Table option, click the Add dynamic content button next to
the table name field.
o Insert the parameter you created, so the table name is now dynamic and can be set when the
pipeline runs.
5. Use the Parameterized Dataset in a Pipeline:
o When configuring the pipeline activity (e.g., Copy Data), set the value for the TableName
parameter. This allows the dataset to pull data from different tables as needed without
creating new datasets for each table.
6. Output to ADLS:
o Similarly, create a parameterized dataset for the output in ADLS, specifying dynamic content
for the file path or name based on the dataset parameters.

By following these steps, you can create a highly reusable and flexible dataset configuration in Azure
Data Factory, minimizing the need to create multiple datasets while efficiently handling different data
sources and destinations.
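For reference, a parameterized dataset of this kind looks roughly like the following JSON sketch. The names DS_AzureSql_Generic and LS_AzureSqlDb are placeholders, and the exact typeProperties can differ slightly between dataset types and ADF versions, so treat this as an illustration rather than a template:

{
    "name": "DS_AzureSql_Generic",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "LS_AzureSqlDb",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "TableName": { "type": "string" }
        },
        "typeProperties": {
            "tableName": {
                "value": "@dataset().TableName",
                "type": "Expression"
            }
        }
    }
}

Inside the dataset, the parameter is referenced with @dataset().TableName; the Copy activity then supplies a concrete value (for example, Customer) for TableName each time the dataset is used.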

What is the ForEach Activity in Azure Data Factory (ADF)?


The ForEach activity in Azure Data Factory (ADF) is a control flow activity that allows you to iterate
over a collection of items and execute a set of activities for each item in the collection. It's similar to a
loop in traditional programming, where you can repeat a certain set of operations for each element in a
list or array.
Key Features of the ForEach Activity:
1. Iteration Over Collections:
o The ForEach activity is designed to process collections, such as arrays or lists, where each item
in the collection can be used as an input to a subsequent activity within the loop.
2. Parallel Execution:
o You can configure the ForEach activity to execute its inner activities either sequentially (one at
a time) or in parallel (multiple items at once). This parallelism can significantly improve
performance, especially when processing large datasets.
3. Integration with Other Activities:
o Inside a ForEach activity, you can include various other activities, such as Copy Data, Execute
Pipeline, Lookup, Stored Procedure, and more. These activities will be executed for each item
in the collection.
4. Dynamic Content and Expressions:
o The ForEach activity allows the use of dynamic content and expressions to tailor the processing
logic for each item in the collection. For example, you can dynamically pass different
parameters or modify the behavior of inner activities based on the current item being
processed.
5. Error Handling and Logging:
o The ForEach activity supports error handling through the use of Success, Failure, and
Completion paths, enabling you to manage errors effectively and log results or retry failed
operations.

Common Use Cases:


1. Batch Processing:
o When you need to process multiple files stored in a data lake or blob storage, the ForEach
activity can iterate over the list of files and perform actions like copying, transforming, or
moving the files.
2. Data Transformation:
o If you have a list of tables in a database and need to apply a transformation to each one, you
can use a ForEach activity to iterate over the list of tables and execute a transformation
activity for each table.
3. Parameter Passing:
o You might have a scenario where different pipeline runs require different parameters. The
ForEach activity can loop through a list of parameter sets and execute a pipeline for each set of
parameters, enabling flexible and dynamic pipeline execution.
4. Multi-Stage Data Processing:
o In complex ETL (Extract, Transform, Load) processes, the ForEach activity can help break down
the task into smaller, manageable parts by processing each stage of the ETL pipeline for every
item in a collection.

How to Configure a ForEach Activity:


1. Create a Pipeline:
o In Azure Data Factory Studio, go to the Author tab and create a new pipeline.
2. Add ForEach Activity:
o Drag the ForEach activity from the Activities pane (under Iterate & conditionals) onto the
pipeline canvas.
3. Define Items to Iterate:
o In the Settings tab of the ForEach activity, specify the collection or array you want to iterate
over. This could be an array of file names, a list of table names, or a dynamic output from a
previous activity like a Lookup.
4. Configure Inner Activities:
o Inside the ForEach activity, click on the Activities box. This opens a new canvas where you can
add and configure the activities that should be executed for each item in the collection.
5. Set Execution Mode:
o In the Settings tab, choose whether the activities should be executed sequentially or in parallel
by setting the Batch Count (maximum number of parallel executions).
6. Publish and Run the Pipeline:
o After configuring the ForEach activity and its inner activities, publish the pipeline and trigger it
to see the ForEach loop in action.

Example Scenario:
Suppose you want to copy multiple files from one blob storage container to another. You can use a
ForEach activity to loop over a list of file names (retrieved using a Get Metadata or Lookup activity)
and execute a Copy Data activity for each file.

● Input Array: List of file names.

● ForEach Activity: Loops over the file names.

● Copy Data Activity: Executes for each file, copying it to the destination container.
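A minimal sketch of this pattern in pipeline JSON is shown below. The activity and dataset names (GetFileList, CopyEachFile, DS_SourceBlob, DS_SinkBlob) are placeholders, and only the properties relevant to the loop are included:

{
    "name": "ForEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('GetFileList').output.childItems",
            "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 10,
        "activities": [
            {
                "name": "CopyEachFile",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "DS_SourceBlob",
                        "type": "DatasetReference",
                        "parameters": { "FileName": "@item().name" }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "DS_SinkBlob",
                        "type": "DatasetReference",
                        "parameters": { "FileName": "@item().name" }
                    }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "DelimitedTextSink" }
                }
            }
        ]
    }
}

Inside the loop, @item() refers to the current element of the collection, and batchCount together with isSequential controls how many iterations run in parallel.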

Here are some common challenges associated with using the ForEach activity in ADF:
1. Performance and Scalability:

● Limited Parallelism: Although the ForEach activity supports parallel execution, there is a limit to the
number of activities that can run in parallel. The default parallelism is often limited by the Batch Count
setting, and setting this too high can overwhelm the underlying resources, leading to performance
bottlenecks.
● Resource Constraints: Running multiple activities in parallel can consume significant resources (CPU,
memory, etc.), especially when working with large datasets or complex operations. This may lead to
slower execution times or even failures due to resource exhaustion.

2. Execution Time and Cost:

● Extended Pipeline Duration: If the ForEach activity processes a large number of items, especially with
sequential execution, the overall pipeline execution time can become very long. This may lead to
higher costs, particularly if your ADF instance is based on a consumption model.
● Cost Implications: Running multiple activities in parallel or over an extended period can increase the
cost of pipeline execution. Each activity within the ForEach loop consumes resources, and if the loop
contains many activities, this can add up quickly.
3. Complex Debugging and Monitoring:

● Difficulty in Troubleshooting: When the ForEach activity runs many activities in parallel, it can be
challenging to pinpoint which iteration or specific activity caused a failure. This complexity increases as
the number of iterations grows.
● Log Management: Monitoring the execution of each iteration can generate a large volume of logs,
making it difficult to manage and analyze these logs effectively. Identifying patterns or issues across
many iterations requires careful log management.

4. Error Handling:

● Handling Failures: If one iteration of the ForEach loop fails, it can be tricky to decide how to handle the
error. Should the entire loop stop, or should the pipeline continue with the remaining items? This
decision depends on the business logic but can complicate error handling.
● Retries and Idempotency: Retrying failed activities in a ForEach loop requires careful consideration.
Some operations may not be idempotent (i.e., they cannot be safely retried without causing side
effects), which can lead to data inconsistencies or unintended consequences.

5. Parameter Management:

● Complex Parameterization: When dealing with dynamic content and passing parameters into the
ForEach activity, the logic can become complex, especially if multiple parameters or nested loops are
involved. Managing these parameters effectively requires careful planning and testing.

6. Data Dependencies:

● Sequential Dependencies: If the activities inside the ForEach loop have dependencies on each other,
you may be forced to run the loop sequentially, which can significantly slow down the pipeline.
Balancing the need for parallelism with dependency management is often a challenge.
● Data Integrity: Ensuring data integrity across iterations, especially in parallel execution scenarios, can
be complex. Care must be taken to avoid data conflicts, especially when multiple iterations are
modifying the same data.

What is lookup activity?


In Azure Data Factory (ADF), the Lookup activity is used to retrieve data from an external source
(like a database or a file) and store the result in a form that can be used within your data pipeline. It
typically fetches a single row or value but can also handle more complex datasets if needed.
Key Features of Lookup Activity:

1. Data Source Connectivity: It supports a variety of data sources, such as SQL databases, Azure Blob
Storage, Azure Data Lake, and more.
2. Single Row Retrieval: By default, it retrieves a single row, which is often useful for configurations or
control flows.
3. Multi-row Support: It can fetch multiple rows as an array if needed, which is useful when iterating
over a set of values using a ForEach activity.
4. Integration with Other Activities: The output of the Lookup activity can be used as dynamic input for
subsequent activities, such as running a stored procedure, copying data, or making decisions based on
conditions.

Common Use Cases:

● Parameter Passing: Fetching a value from a configuration table that can be passed to other activities.

● Conditional Logic: Using the retrieved data in a If Condition or Switch activity to control the flow of the
pipeline.
● Looping through Data: When combined with a ForEach activity, it can loop through a dataset and
perform actions for each row.

Limitation:

1. Data Size Limitation:

● Single Row Mode: When set to return a single row, the Lookup activity works without issues for most
scenarios.
● Multi-row Mode: If the Lookup activity is configured to return multiple rows, the output size is limited
to 4 MB or 5000 rows. If the result exceeds this limit, it will fail.
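As a sketch, this is roughly how a Lookup over a hypothetical watermark/configuration table is defined and consumed (the names LookupConfig, DS_ConfigTable, and WatermarkValue are illustrative):

{
    "name": "LookupConfig",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TOP 1 WatermarkValue FROM dbo.ConfigTable"
        },
        "dataset": {
            "referenceName": "DS_ConfigTable",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}

With firstRowOnly set to true, the value is referenced downstream as @activity('LookupConfig').output.firstRow.WatermarkValue; with firstRowOnly set to false, the rows come back as an array under @activity('LookupConfig').output.value, which is what you would typically feed into a ForEach activity.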
Dynamic Date Format in Azure Data Factory (ADF)
In Azure Data Factory (ADF), you can dynamically format dates using the built-in expression
language, which allows you to manipulate date and time values flexibly. This is particularly useful
when you need to create dynamic file paths, file names, or parameter values based on the current date,
time, or other date-related values.
Use Cases:

● Automatically create file names or directories: Dynamically generate file names or directories based
on the current date or time for organizing data files.
● Partitioning large datasets by date: This helps improve query performance and manageability by
organizing data into date-based partitions.
● Move or copy data to specific locations: Dynamically route data to specific folders or locations based
on the date.
● Filter data dynamically: Process specific subsets of data by filtering based on the current date or time
range.

Example:
Let’s see how this works in practice.

1. Create a new pipeline in Azure Data Factory.


2. Add a Copy activity to this pipeline.
3. Select the source dataset.
4. Select the sink dataset (the sink dataset should be parameterized). If you haven't already, create a
parameterized dataset.
5. Now, in the Folder Name parameter field, click on Add Dynamic Content.

Using the Expression Builder:

● In the Expression Builder, write the folder name as customer.

● Go to the Functions tab and search for the formatDateTime() function.

● Use string interpolation to format the date dynamically.

Here’s an example of an expression that creates a folder path with the current year, month, and day:
Customer/@{formatDateTime(utcnow(), 'yyyy')}/@{formatDateTime(utcnow(),
'MM')}/@{formatDateTime(utcnow(), 'dd')}

This will dynamically create folder paths like Customer/2024/09/06 based on the current date.

What is If activity?
The If activity in Azure Data Factory (ADF) is a control flow activity that allows you to implement
conditional logic within your data pipelines. It evaluates a condition and executes one of two sets of
activities based on whether the condition evaluates to true or false. This activity is particularly useful
when you need to create dynamic workflows based on different conditions or criteria.
Key Features of the If Activity:
1. Condition Evaluation: The If activity takes a condition in the form of an expression. It checks
if the condition evaluates to true or false and then proceeds accordingly.
2. Two Branches:
o True branch: If the condition is true, the set of activities in the "If True" branch will be
executed.
o False branch: If the condition is false, the set of activities in the "If False" branch will be
executed.
3. Dynamic Expressions: You can use ADF’s expression language to create complex conditions
that evaluate different parameters, variables, or input data.
4. Integration with Other Activities: The If activity works in combination with other activities
such as Lookup, Get Metadata, Copy, and more, allowing you to create highly flexible and
dynamic data workflows.
Use Cases:

● Conditional Data Processing: Depending on a value or state, you can decide whether to perform
certain transformations or data movements.
● Branching Logic: Dynamically determine the path your pipeline should take based on input
parameters, status flags, or metadata.
● File Handling: For instance, process files only if they exist or meet certain criteria (e.g., if a file is larger
than a certain size).
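A minimal sketch of an If Condition activity, assuming a preceding Get Metadata activity named CheckFile that returns an exists field (all names are illustrative and the inner activities are abbreviated):

{
    "name": "IfFileExists",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@activity('CheckFile').output.exists",
            "type": "Expression"
        },
        "ifTrueActivities": [
            { "name": "CopyTheFile", "type": "Copy" }
        ],
        "ifFalseActivities": [
            {
                "name": "SkipProcessing",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}

The expression must evaluate to a boolean; you can build more complex conditions with functions such as equals(), greater(), and and() in the expression builder.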

What is the Get Metadata activity in ADF?

The Get Metadata activity in Azure Data Factory (ADF) is a control activity that allows you to retrieve
metadata information from a variety of data sources. This metadata can include properties like file size, last
modified date, column names, and data types, among others. It's useful for making decisions based on the
characteristics of your data before further processing it in your data pipeline.

Create a Get Metadata activity with UI


To use a Get Metadata activity in a pipeline, complete the following steps:
1. Search for Get Metadata in the pipeline Activities pane, and drag a Get Metadata activity to
the pipeline canvas.
2. Select the new Get Metadata activity on the canvas if it is not already selected, and
its Settings tab, to edit its details.
3. Choose a dataset, or create a new one with the New button. Then you can specify
filter options and add columns from the available metadata for the dataset.

4. Use the output of the activity as an input to another activity, like a Switch activity in
this example. You can reference the output of the Metadata Activity anywhere
dynamic content is supported in the other activity.

5. In the dynamic content editor, select the Get Metadata activity output to reference
it in the other activity.
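For reference, a Get Metadata activity that lists the contents of a folder looks roughly like this in pipeline JSON (the activity and dataset names are placeholders):

{
    "name": "GetFileList",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "DS_InputFolder",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems", "lastModified", "itemName" ]
    }
}

Each requested field appears on the activity output, for example @activity('GetFileList').output.childItems for the list of files or @activity('GetFileList').output.lastModified for the modification timestamp; these are the values you reference through dynamic content in a downstream Switch, If Condition, or ForEach activity.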

Execute Pipeline Activity

The Execute Pipeline activity in Azure Data Factory (ADF) is used to trigger or run another pipeline from within
a pipeline. This activity helps in organizing complex workflows by allowing you to break them down into
smaller, reusable pipelines, promoting modularity and better management.

Use Cases:

● Breaking Down Complex Workflows: When you have a large, complex pipeline, you can split it into
smaller pipelines and use the Execute Pipeline activity to call them in sequence or based on conditions.
● Reusability: If you have logic that is commonly used across multiple pipelines, you can create a
reusable pipeline and trigger it from multiple parent pipelines.
● Parameterization: Triggering the same pipeline multiple times with different parameters (e.g., for
different datasets or environments).

Type properties
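A minimal sketch of the Execute Pipeline activity and its typical type properties, assuming a child pipeline named PL_CopyCustomer that accepts a TableName parameter (both names are illustrative):

{
    "name": "RunCopyCustomer",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "PL_CopyCustomer",
            "type": "PipelineReference"
        },
        "parameters": {
            "TableName": "Customer"
        },
        "waitOnCompletion": true
    }
}

Setting waitOnCompletion to true makes the parent pipeline wait for the child pipeline to finish before moving on; setting it to false triggers the child and continues immediately.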

Remember: nested If and nested ForEach activities are not allowed in ADF at the moment, so the Execute Pipeline activity can be used as a workaround.

Parameterize linked services in Azure Data Factory

You can now parameterize a linked service and pass dynamic values at run time. For example, if you
want to connect to different databases on the same logical SQL server, you can now parameterize the
database name in the linked service definition. This prevents you from having to create a linked service
for each database on the logical SQL server. You can parameterize other properties in the linked service
definition as well - for example, User name.
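As a sketch of the pattern, a parameterized Azure SQL Database linked service can look roughly like this, with the database name supplied at run time (the server name and parameter name are placeholders, and authentication details are omitted or kept in Key Vault):

{
    "name": "LS_AzureSqlDb_Param",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "DBName": { "type": "string" }
        },
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};Encrypt=True;"
        }
    }
}

Inside the linked service definition, parameters are referenced with the linkedService() scope, just as datasets use dataset() and pipelines use pipeline().parameters.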

Incremental Pipeline in Azure Data Factory (ADF)


In Azure Data Factory (ADF), an incremental pipeline is designed to process and transfer only the
new or changed data since the last time the pipeline was run. This approach is useful for managing
large datasets while improving performance, reducing processing time, and lowering costs.
Key Concepts:
1. Incremental Data Load:
o The main goal of an incremental pipeline is to load only the new or updated data from the
source to the destination. Unlike a full load, where all data is processed every time the pipeline
runs, an incremental load processes only the data that has changed since the last execution.
2. Source Data Identification:
o To identify new or changed data, the pipeline typically uses a mechanism like a timestamp or a
"Last Modified" column. These fields help determine which records have been added or
modified since the previous pipeline run, ensuring that only relevant data is processed.
Benefits of Incremental Pipelines:

● Improved Performance: Processing only the new or updated data significantly reduces the amount of
data to be handled, resulting in faster execution times.
● Cost-Effective: By reducing the amount of data processed, incremental pipelines help save on compute
and storage costs.
● Efficient Data Management: Incremental processing makes it easier to manage and process large
datasets without overloading the system.

High WaterMark Overview

Here are the important steps to create this solution:

1. Select the watermark column. Select one column in the source data store, which can be
used to slice the new or updated records for every run. Normally, the data in this selected
column (for example, last_modify_time or ID) keeps increasing when rows are created or
updated. The maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value. In this tutorial, you store the watermark
value in a SQL database.
3. Create a pipeline with the following workflow:

The pipeline in this solution has the following activities:

o Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark
value. Use the second Lookup activity to retrieve the new watermark value. These
watermark values are passed to the Copy activity.
o Create a Copy activity that copies rows from the source data store with the value of the
watermark column greater than the old watermark value and less than the new watermark
value. Then, it copies the delta data from the source data store to Blob storage as a new file.
o Create a StoredProcedure/Lookup activity that updates the watermark value for the pipeline
that runs next time.
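The heart of this pattern is the source query of the Copy activity, which filters on the two watermark values returned by the Lookup activities. A sketch, assuming the lookups are named LookupOldWatermark and LookupNewWatermark and the watermark column is LastModifyTime (all names are illustrative):

{
    "source": {
        "type": "AzureSqlSource",
        "sqlReaderQuery": "SELECT * FROM dbo.SourceTable WHERE LastModifyTime > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}' AND LastModifyTime <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'"
    }
}

After the copy succeeds, the stored procedure (or an additional Lookup running an UPDATE statement) writes the new watermark value back to the watermark table so the next run picks up from where this one finished.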

What is an API?
API stands for Application Programming Interface. It is a mechanism that enables two software
components to communicate with each other using a set of definitions and protocols. APIs are a way to
extract and share data within and across organizations.
What are REST APIs?
REST (Representational State Transfer) defines a set of functions like GET, POST, PUT, and DELETE that
clients can use to access server data. REST APIs use the HTTP protocol for communication between
clients and servers.
One of the main features of REST APIs is statelessness, meaning the server does not store any
information about client requests between requests. Each request from the client contains all the
necessary information to process that request, similar to how you visit a URL in a browser and receive
a response (typically in plain data like JSON, not a graphical web page).
Key Concepts of REST APIs:

● Resources/Endpoints: In REST, data and functionalities are treated as resources, each identified by a
unique URL. For example, the public REST Countries API at https://restcountries.com exposes endpoints
that return country data in JSON format.
● HTTP Methods:
o GET: Retrieve information about a resource.
o POST: Create a new resource.
o PUT: Update an existing resource.
o DELETE: Remove a resource.
● Stateless: Each request contains all the information the server needs to process it. The server doesn't
maintain any session or context between requests.
● Representation: Resources can be represented in various formats such as JSON (most common), XML,
HTML, or plain text.
● Uniform Interface: REST APIs adhere to a uniform interface, simplifying the architecture and making it
more scalable. This includes using standard methods, resource URIs, and response codes.
● Client-Server Architecture: REST separates the client (which requests resources) from the server
(which provides resources). This decoupling enhances flexibility and scalability.

By following these principles, REST APIs allow for the creation of scalable, maintainable, and flexible web
services that can be consumed by various clients, including web browsers, mobile apps, and other
servers.
What are Microservices?
Microservices are an architectural and organizational approach to software development where
software is composed of small, independent services that communicate over well-defined APIs. These
services are managed by small, autonomous teams.
Microservices architectures make applications easier to scale and faster to develop, allowing for
quicker innovation and faster delivery of new features.
How do REST APIs and Microservices Relate?
In a microservices architecture, REST APIs are commonly used as the communication mechanism
between different services. Each microservice exposes its functionality as a RESTful API, allowing
other microservices or external systems to interact with them.

Create a REST linked service using UI


Use the following steps to create a REST linked service in the Azure portal UI.
1. Browse to the Manage tab in your Azure Data Factory or Synapse workspace and
select Linked Services, then select New (the steps are the same in Azure Data Factory and Azure Synapse).

2. Search for REST and select the REST connector.



3. Configure the service details, test the connection, and create the new linked
service.
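In JSON terms, a basic REST linked service with anonymous authentication looks roughly like this (the base URL is just an example; authentication types such as Basic, Service Principal, or Managed Identity add their own properties):

{
    "name": "LS_Rest_Countries",
    "properties": {
        "type": "RestService",
        "typeProperties": {
            "url": "https://restcountries.com/v3.1/",
            "enableServerCertificateValidation": true,
            "authenticationType": "Anonymous"
        }
    }
}

A REST dataset built on this linked service then supplies the relative URL of the specific endpoint, and a Copy activity can land the JSON response in Blob Storage or ADLS.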

What is logic app?

In simple terms, a Logic App in Azure is like a digital "flowchart" that helps automate tasks and
processes without needing to write code. It connects different apps, services, and systems, letting them
work together automatically.

For example, you can use a Logic App to:

● Send an email whenever a new file is uploaded to cloud storage.


● Trigger a message in a chat app when a form is submitted on a website.
● Move data between different apps like databases, Excel, or social media accounts.

It’s a tool for building workflows that take care of repetitive tasks, so you don’t have to do them
manually. You set up steps (called triggers and actions) using a simple interface, and the Logic App
takes care of making everything work together.

In short, it’s a way to automate tasks across apps and services, saving time and effort!

What is web activity?

The Web Activity in Azure Data Factory (ADF) is used to make HTTP requests to a web service or
API endpoint within a data pipeline. It allows your ADF pipeline to interact with external systems by
calling REST APIs or web services, retrieve data, trigger processes, or send information to external
applications.
Key Features of Web Activity:

1. HTTP Methods: Supports HTTP methods such as GET, POST, PUT, and DELETE, which allow you to
perform different actions depending on the API you're calling.
2. Headers and Body: You can customize the request by adding headers (e.g., for authorization) and body
content (e.g., sending data in JSON format).
3. Response Handling: The Web activity can capture responses from the API, and the output can be used
in downstream activities within the pipeline.

Common Use Cases:

● Calling APIs: You can use the Web activity to call external REST APIs, such as to retrieve data from an
external system or send data to an application.
● Triggering External Processes: For example, triggering a web service to kick off a workflow in another
system after certain pipeline actions are complete.
● Interacting with Cloud Services: Communicating with other Azure services or third-party cloud services
via their API.

Example:

● If you want to notify an external system after data processing in ADF, you can use the Web activity to
make a POST request to that system's API and pass the necessary information.
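A minimal sketch of a Web activity that POSTs a notification to a hypothetical endpoint (the URL and payload are placeholders; in practice the URL is often a Logic App HTTP trigger or another system's API):

{
    "name": "NotifyExternalSystem",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://example.com/api/notify",
        "method": "POST",
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "pipelineName": "@pipeline().Pipeline",
            "status": "Data processing completed"
        }
    }
}

The response body of the call is available on the activity output, so downstream activities can branch on it or log it.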

Data Factory alert rules

To create and manage alerts, select Alerts under Monitoring in the left navigation of your Data
Factory page in the Azure portal.

You can set the Email/SMS notification at the entire account level using this.

What is Data Flow in Azure Data Factory (ADF)?

A Data Flow in Azure Data Factory (ADF) is a powerful feature that allows you to perform
transformations on data at scale, without writing code. It provides a visual interface where you can
define complex data transformation logic, and ADF takes care of executing it in a scalable and efficient
manner using Azure's underlying infrastructure.
Key Components of Data Flow:
1. Source: Defines the input data for your data flow. This could be from various data stores such
as Azure Blob Storage, Azure Data Lake, SQL databases, etc.
2. Transformations: The core of the data flow, where you define various transformation rules.
ADF offers several types of transformations:
o Filter: Filter rows based on a condition.
o Aggregate: Perform aggregations like sum, count, average, etc.
o Join: Join two datasets based on matching conditions.
o Sort: Sort data by specified columns.
o Derived Column: Create new columns or modify existing ones using expressions.
o Lookup: Perform lookups from external sources.
o Conditional Split: Split data into different streams based on conditions.
o Union: Combine multiple datasets into a single output.
3. Sink: Defines where the transformed data will be written to. You can output data to a variety of
destinations such as Azure SQL Database, Data Lake, Blob Storage, and more.
4. Data Flow Debug: ADF provides a debug mode where you can preview and test your
transformations using sample data, helping you refine your logic before scaling it.
Key Features of Data Flow in ADF:

● No-Code Environment: You can design complex data transformations without needing to write any
code. Everything is done through a drag-and-drop interface.
● Scalability: ADF Data Flows are executed using Azure's Spark-based infrastructure, which means
transformations can scale to handle large datasets.
● Data Flow Parameters: You can define parameters to make your data flows dynamic, allowing you to
reuse the same data flow for different datasets or transformation rules.
● Mapping and Wrangling Data Flows:
o Mapping Data Flows: These are used for data transformation processes where you map data
from source to destination.
o Wrangling Data Flows: These are designed for data preparation and are more focused on
interactive data wrangling using Power Query.

How Data Flows Work in ADF:

1. Design the Data Flow: Use the visual interface to add a source, apply transformations, and configure
the sink.
2. Integrate into a Pipeline: Once the data flow is designed, you can integrate it into an ADF pipeline. The
pipeline can schedule the data flow to run at specific times or trigger it based on events.
3. Execution: When the pipeline triggers the data flow, ADF converts the flow logic into a Spark execution
plan and runs it on Azure Databricks or other Azure compute services.
4. Monitor and Optimize: Use the Monitor section in ADF to track the performance of your data flows,
troubleshoot issues, and optimize performance by scaling up compute resources if necessary.

Common Use Cases for Data Flow:

● ETL (Extract, Transform, Load): Transform raw data into a format that is ready for analysis or storage
in a data warehouse.
● Data Cleansing: Remove or correct errors in the data before loading it into a target system.

● Data Aggregation: Summarize data (e.g., sales data by region) before reporting or analysis.

● Data Enrichment: Combine data from multiple sources, adding additional information to datasets.

Example:

1. Source: Load customer data from an Azure SQL Database.


2. Transformation: Filter customers based on purchase history, add a derived column for customer
segment, and join it with another dataset that contains demographic data.
3. Sink: Store the transformed data into Azure Data Lake for further analysis.

Benefits of Data Flow in ADF:

● Ease of Use: You don’t need to be a developer or data engineer to define transformations; everything
is visual.
● Cost-Effective: You can choose the compute resources you need, only paying for what you use when
the data flow runs.
● Flexible: Support for a wide range of data sources and destinations, along with numerous
transformations, makes it versatile for many data scenarios.

Data Flow vs Databricks



Data Flow Transformations

● Aggregate (Schema modifier): Define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
● Alter row (Row modifier): Set insert, delete, update, and upsert policies on rows.
● Assert (Row modifier): Set assert rules for each row.
● Cast (Schema modifier): Change column data types with type checking.
● Conditional split (Multiple inputs/outputs): Route rows of data to different streams based on matching conditions.
● Derived column (Schema modifier): Generate new columns or modify existing fields using the data flow expression language.
● External call (Schema modifier): Call external endpoints inline row-by-row.
● Exists (Multiple inputs/outputs): Check whether your data exists in another source or stream.
● Filter (Row modifier): Filter a row based upon a condition.
● Flatten (Formatters): Take array values inside hierarchical structures such as JSON and unroll them into individual rows.
● Flowlet (Flowlets): Build and include custom re-usable transformation logic.
● Join (Multiple inputs/outputs): Combine data from two sources or streams.
● Lookup (Multiple inputs/outputs): Reference data from another source.
● New branch (Multiple inputs/outputs): Apply multiple sets of operations and transformations against the same data stream.
● Parse (Formatters): Parse text columns in your data stream that are strings of JSON, delimited text, or XML formatted text.
● Pivot (Schema modifier): An aggregation where one or more grouping columns has its distinct row values transformed into individual columns.
● Rank (Schema modifier): Generate an ordered ranking based upon sort conditions.
● Select (Schema modifier): Alias columns and stream names, and drop or reorder columns.
● Sink: A final destination for your data.
● Sort (Row modifier): Sort incoming rows on the current data stream.
● Source: A data source for the data flow.
● Stringify (Formatters): Turn complex types into plain strings.
● Surrogate key (Schema modifier): Add an incrementing non-business arbitrary key value.
● Union (Multiple inputs/outputs): Combine multiple data streams vertically.
● Unpivot (Schema modifier): Pivot columns into row values.
● Window (Schema modifier): Define window-based aggregations of columns in your data streams.

SCD Types:

● SCD Type 0 (Fixed Dimension): Data is never updated after the initial insertion. This is used when historical data should never change. Example: a company records the date of birth of an employee, and this data never changes. Data handling: no updates; data remains unchanged.
● SCD Type 1 (Overwrite the Existing Data): When changes occur, the old data is overwritten with the new data. No history is maintained. Example: if a customer's address changes, the old address is replaced with the new address, and the old data is lost. Data handling: overwrite the existing record.
● SCD Type 2 (Create a New Record for Changes): A new row is added to the table for every change, maintaining the complete history of changes. Each row includes start and end dates (or a flag) for tracking the current version. Example: when a customer changes their address, a new record is inserted with the new address and a new effective date, while the old record remains to preserve the history. Data handling: insert a new row for each change with timestamps or versioning.
● SCD Type 3 (Track Partial History with Extra Columns): Only the current and previous versions are stored, usually with additional columns for the old data. Limited historical data is maintained. Example: if a customer changes their address, a new column (e.g., "previous address") is used to store the old value, while the current column holds the latest address. Data handling: add extra columns to track previous values.
● SCD Type 4 (Historical Table): A separate history table is created to store all changes, while the main table holds only the current data. Example: the main customer table holds the latest address, but another table tracks all previous addresses and changes over time. Data handling: maintain a separate history table.
OLTP vs OLAP

● Purpose: OLTP manages day-to-day transaction processing; OLAP is used for complex queries, reporting, and data analysis.
● Data Type: OLTP stores operational, real-time data (e.g., sales, inventory, customer transactions); OLAP stores historical and aggregated data for business intelligence and decision making.
● Data Volume: OLTP typically handles large volumes of short transactions; OLAP handles large volumes of historical data, often aggregated.
● Operations: OLTP focuses on CRUD operations (Create, Read, Update, Delete); OLAP focuses on complex read operations (e.g., multi-dimensional analysis, reporting).
● Query Complexity: OLTP runs simple queries that are fast and executed frequently; OLAP runs complex queries involving aggregations, joins, and data modeling (slower and more intensive).
● Schema Design: OLTP schemas are highly normalized (3NF) to reduce data redundancy and ensure data integrity; OLAP schemas are typically de-normalized (star or snowflake schema) to optimize for read-heavy operations.
● Performance Requirements: OLTP needs high speed for insert, update, and delete operations; OLAP needs high performance for complex queries and analytical computations.
● Concurrency: OLTP supports high concurrency with many users performing transactions simultaneously; OLAP has lower concurrency, as fewer users run complex analytical queries.
● Use Case: OLTP powers applications like banking, e-commerce, order entry, and inventory systems; OLAP powers applications like business intelligence, data mining, and reporting systems.
● Data Refresh: OLTP data is updated in real time as transactions occur; OLAP data is refreshed periodically, often through batch processing.
● Storage Type: OLTP uses relational databases such as SQL Server, MySQL, and Oracle; OLAP uses data warehouses and data marts (e.g., Azure Synapse, Google BigQuery, Amazon Redshift).
● Users: OLTP is used by operational staff and end-users who need to manage transactions; OLAP is used by data analysts, business users, and decision-makers.
● Example: OLTP: processing a bank transaction or placing an order on an e-commerce website; OLAP: generating a quarterly sales report or analyzing customer behavior patterns.

Triggers in Azure Data Factory (ADF)


In Azure Data Factory (ADF), triggers are used to automatically initiate the execution of pipelines
based on specific criteria or schedules. Triggers enable you to schedule and automate data pipelines,
track success and failure rates, and ensure workflows run as expected.
A trigger in ADF determines when to start or invoke the end-to-end execution of a pipeline without
manual intervention. Triggers can be set based on a time interval, events, or custom criteria.
Types of Triggers:
1. Schedule Trigger:
o Allows pipelines to run at predefined intervals, such as hourly, daily, or weekly.
o Ideal for recurring data processing tasks.
2. Tumbling Window Trigger:
o Processes data in contiguous, non-overlapping time windows.
o Ensures each time window is processed exactly once, making it useful for time-based data
aggregation or processing.
3. Event-based Trigger:
o Initiates a pipeline in response to specific events, such as:
▪ Azure Blob Storage events (e.g., when a file is created or deleted).
▪ Azure Data Lake Storage Gen2 events.
o Suitable for triggering pipelines based on changes in data storage.
4. Custom Event Trigger:
o Triggers pipelines based on custom events published to Azure Event Grid.
o Offers flexibility for handling unique event-based scenarios.
5. On-demand Trigger (Manual Trigger):
o Manually triggers a pipeline as needed, providing full control over when a pipeline runs.
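For example, a schedule trigger that runs a pipeline once a day at 06:00 UTC is defined roughly like this (the trigger name, pipeline name, and start time are illustrative):

{
    "name": "TRG_Daily_6AM",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-09-01T06:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "PL_Daily_Load",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}

The time of day comes from the startTime, and the frequency/interval pair controls how often the trigger fires; a single trigger can also be attached to multiple pipelines.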

Azure Key Vault


Azure Key Vault is a cloud service used to securely store and manage secrets, such as API keys,
passwords, connection strings, and certificates. In Azure Data Factory (ADF), Key Vault can be
integrated to securely manage and retrieve sensitive information for your pipelines, ensuring security
and compliance without hard-coding sensitive data into the pipeline.
Key Features of Azure Key Vault in ADF:
1. Secure Storage of Secrets:
o Key Vault allows you to store sensitive information such as database connection strings,
storage account keys, and API credentials securely.
2. Seamless Integration:
o ADF can directly connect to Azure Key Vault to retrieve secrets during pipeline execution
without exposing them in the code.
3. Simplified Secret Management:
o If credentials or sensitive data change, you only need to update the Key Vault. ADF pipelines
will continue to retrieve the updated values, reducing the need for manual updates in pipeline
code.
4. Centralized Management:
o Key Vault provides a central location to manage secrets across different Azure services, making
it easier to control access and monitor usage.
5. Enhanced Security:
o Secrets stored in Azure Key Vault are encrypted using industry-standard algorithms. ADF
accesses the secrets at runtime, without storing them in the pipeline itself.

How to Use Azure Key Vault in ADF:

1. Create a Linked Service:


o In ADF, create a Linked Service that points to the Azure Key Vault. This allows ADF to retrieve
secrets from the Key Vault when needed.
2. Reference Secrets in Pipelines:
o When configuring activities (e.g., copy activity or data transformation), you can reference the
secrets stored in Key Vault by using dynamic content in your ADF pipeline.
3. Access Control:
o ADF requires appropriate permissions to access Key Vault. You can assign Managed Identity to
the Data Factory and grant it access to Key Vault, ensuring only authorized entities can retrieve
the secrets.

Example Use Case:


Suppose you need to connect to an Azure SQL Database in a pipeline. Instead of hardcoding the
database connection string in your pipeline, you store it in Azure Key Vault. In your pipeline, you
configure a Linked Service to Azure SQL Database, and in the Connection String field, you reference
the secret from Azure Key Vault. This ensures that the connection string is securely retrieved at
runtime.
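Concretely, the linked service references the secret instead of embedding it. A sketch, assuming a Key Vault linked service named LS_KeyVault and a secret named SqlDbConnectionString (both names are placeholders):

{
    "name": "LS_AzureSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_KeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "SqlDbConnectionString"
            }
        }
    }
}

At run time, ADF's managed identity fetches the secret from Key Vault, so the actual connection string never appears in the pipeline or linked service definition.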
Benefits of Using Key Vault with ADF:

1. Security: Secrets are never exposed in pipeline code, reducing the risk of data breaches.
2. Simplified Management: Centralized secret management means that if credentials change, you only
need to update them in Key Vault.
3. Compliance: Using Key Vault helps you comply with security and privacy standards by protecting
sensitive data.
4. Automatic Access: You can use ADF’s Managed Identity to control access to Key Vault, providing a
secure, seamless experience without manually handling credentials.

In summary, integrating Azure Key Vault with ADF provides a secure and efficient way to manage
sensitive information in your data pipelines, improving both security and ease of management.

Git Integration in Azure Data Factory


Azure Data Factory supports integration with Git repositories to enable collaboration, version control,
and CI/CD processes. You can connect your ADF environment to a Git repository like Azure DevOps
Git or GitHub to manage the development of pipelines, datasets, and other resources.
Steps for Git Integration:
1. Connect to Git Repository:
o In the ADF portal, go to Manage > Git Configuration.
o Select Azure DevOps Git or GitHub as the repository type.
o Provide the necessary credentials (e.g., Azure DevOps project, repository name, branch).
o Specify the collaboration branch, where all changes will be made and pushed before being
merged into the main branch.
2. Branching Strategy:
o Main Branch: The primary branch used for deploying to production.
o Feature Branches: Developers create feature branches to work on individual components (e.g.,
a new pipeline). Once the feature is complete, it’s merged into the collaboration branch via
pull requests.
o Collaboration Branch: Developers collaborate on the same branch (e.g., develop or master)
and merge feature branches before deploying to the production environment.
3. Version Control:
o Every change made to the ADF resources (pipelines, datasets, linked services) is saved in the
Git repository, enabling version tracking.
o You can revert to a previous version of a pipeline or resource if needed by reviewing commit
history.
4. Commit and Push Changes:
o After making changes to pipelines or other resources, commit the changes to your feature
branch.
o Push the changes to the Git repository, which will trigger the next step in the CI/CD process if
set up.

CI/CD in Azure Data Factory using Azure DevOps


Continuous Integration (CI) and Continuous Deployment (CD) are practices that help automate the
deployment of pipelines and data factory changes across environments (development, testing,
production).
CI/CD Pipeline Overview:
1. Continuous Integration (CI):
o Developers push their changes to the Git repository.
o When changes are merged into the collaboration branch (e.g., main or develop), the CI
pipeline builds the changes and runs tests to validate them.
o Artifacts (ARM templates) are generated as part of the build process and stored for
deployment.
2. Continuous Deployment (CD):
o Once the changes are validated, the CD pipeline automatically deploys the new version of the
data factory to different environments (e.g., test, staging, production).

Integration Runtime

Integration Runtime (IR) in Azure Data Factory (ADF) is the compute infrastructure that ADF uses to provide
data integration across different network environments. It is responsible for the movement of data and the
execution of data transformation activities. There are different types of Integration Runtimes to handle
different scenarios, such as connecting to cloud, on-premises, or hybrid data sources.

● Azure Integration Runtime: Best for cloud-native data processing and movement between Azure resources.
● Self-hosted Integration Runtime: Ideal for hybrid or on-premises scenarios where you need to connect ADF with on-premises data sources.
● Azure-SSIS Integration Runtime: Used for running SSIS packages in the cloud, typically for customers migrating their ETL processes from on-premises to Azure.

1. Pipeline Parameters:
Pipeline Parameters are values that you define at the pipeline level, which can be passed in when you
trigger or invoke the pipeline. Parameters allow you to make your pipeline dynamic, enabling you to
reuse the same pipeline for different data sources or configurations by passing in different parameter
values.
Key Features of Pipeline Parameters:

● Definition: Parameters are defined at the pipeline level and can be passed to activities within the
pipeline.
● Static during Execution: Once a pipeline starts executing, the parameter values remain constant
throughout the execution.
● Scope: They are scoped to the pipeline and cannot be changed dynamically during pipeline execution.

● Usage: Parameters are typically used for things like file paths, table names, or filtering conditions.

You can then reference a pipeline parameter (for example, filePath) in activities within the pipeline using:
@pipeline().parameters.filePath
Use Cases:

● Dynamic file paths, table names, or query filters.

● Passing connection strings or authentication details when the pipeline is triggered.
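Defining a parameter and passing it into a dataset looks roughly like this in pipeline JSON (the parameter name, default value, and dataset names are illustrative):

{
    "name": "PL_Load_File",
    "properties": {
        "parameters": {
            "filePath": {
                "type": "string",
                "defaultValue": "input/customer.csv"
            }
        },
        "activities": [
            {
                "name": "CopyFromParameterizedPath",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "DS_SourceFile",
                        "type": "DatasetReference",
                        "parameters": { "FileName": "@pipeline().parameters.filePath" }
                    }
                ],
                "outputs": [
                    { "referenceName": "DS_SinkFile", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "DelimitedTextSink" }
                }
            }
        ]
    }
}

When the pipeline is triggered, the caller (a trigger, an Execute Pipeline activity, or a manual run) can override the default value of filePath.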

2. Pipeline Variables:
Variables in ADF are used to store temporary values that can be changed during the execution of the
pipeline. They allow dynamic behavior within the pipeline, as their values can be updated or modified
as the pipeline progresses.
Key Features of Pipeline Variables:

● Definition: Variables are defined within the pipeline and can be assigned values during execution using
activities like the Set Variable or Append Variable activities.
● Dynamic during Execution: Unlike parameters, variables can change during pipeline execution.

● Scope: Variables are scoped to the pipeline, and their values can only be used and changed within the
pipeline.
● Data Types: Variables can be of types such as String, Array, or Boolean.

Example:
You can update its value during execution using the Set Variable activity or within expressions:
@variables('counter')
Use Cases:

● Tracking the number of iterations in a loop.

● Storing intermediate results within the pipeline.

● Dynamic decision-making during pipeline execution (e.g., based on a condition).
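As a sketch, this is an excerpt of a pipeline definition that declares a String variable and sets it from a Copy activity's output (the activity and variable names are illustrative):

{
    "variables": {
        "loadStatus": { "type": "String" }
    },
    "activities": [
        {
            "name": "SetStatus",
            "type": "SetVariable",
            "typeProperties": {
                "variableName": "loadStatus",
                "value": "@concat('Copied rows: ', string(activity('CopyCustomer').output.rowsCopied))"
            }
        }
    ]
}

Note that a Set Variable expression cannot reference the variable it is setting, so counters are usually incremented through a second helper variable rather than in place.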

3. Global Parameters:
Global Parameters are parameters defined at the Data Factory level, making them available to all
pipelines across the Data Factory. They provide a convenient way to define common values that need
to be reused across multiple pipelines.
Key Features of Global Parameters:

● Scope: Available globally within the Data Factory and can be accessed from any pipeline.

● Definition: Defined in the Manage section of ADF.

● Static during Execution: Once a pipeline using a global parameter is triggered, the global parameter’s
value remains constant.
● Usage: You can reference global parameters in pipelines or activities the same way you reference
pipeline parameters.

Example:
If you have an environment-specific value such as a storage account name that is common across
pipelines, you can define it as a global parameter in the Manage section. In any pipeline, you can then
reference it using:
@pipeline().globalParameters.storageAccountName
Use Cases:

● Environment-specific values like connection strings, storage account names, or file paths used across
multiple pipelines.
● Centralized management of values that are used across different pipelines (e.g., URLs, database
names).

Optimizing Copy Activity for Cloud-to-Cloud Migration in Azure Data Factory


1. Data Integration Units (DIUs):
o If using the Azure Integration Runtime, consider adjusting the DIUs (Data Integration Units) to
allocate more processing power for increased throughput, especially for large datasets. Start
with the default values and adjust as necessary to find the optimal setting. (Note: DIUs are not
applicable to Self-hosted Integration Runtime).
2. Enable Compression:
o For transferring large amounts of text or log data, enable data compression during the copy
activity. This reduces the amount of data transferred over the network, which is particularly
beneficial for inter-region migrations within the cloud.
3. Data Consistency Verification:
o Consider disabling data consistency verification if it's not required, as this can reduce the
overall time taken by the copy process.
4. Parallel Copy and Batching:
o Enable Parallel Copy: Configure parallel copy to utilize multiple threads for concurrent reading
and writing from source and destination cloud data stores. This can significantly improve data
transfer speeds.
o Batching Requests: Set up batch transfers within the ADF copy activity to reduce the number
of individual requests between cloud services, improving efficiency and minimizing network
overhead.
5. Fault Tolerance Settings:
o Skip Incompatible Rows: In the copy activity settings, you can enable fault tolerance by
skipping rows that cause errors during the data copy. This ensures that the rest of the data is
copied even if some rows have issues.
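Several of these settings map directly onto Copy activity properties. A sketch of the relevant fragment (the values are illustrative starting points, not recommendations, and some property names vary by connector):

{
    "name": "CopyLargeDataset",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "DelimitedTextSink" },
        "dataIntegrationUnits": 16,
        "parallelCopies": 8,
        "enableSkipIncompatibleRow": true,
        "validateDataConsistency": false
    }
}

Start with the defaults, measure throughput in the copy activity's monitoring output, and only then adjust DIUs and parallel copies upward.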

What is Big Data?


Big Data refers to an enormous amount of data that is too large and complex for traditional data
processing tools to handle. Imagine the data like a massive library filled with books, but this library is
so gigantic that conventional methods like manually searching for a book or using a small computer
can't manage it efficiently.
Everyday Example of Big Data:

● Social Media: Platforms like Facebook, Twitter, and Instagram generate vast amounts of data every
second. Every post, like, comment, and share contributes to a massive collection of data.
● Online Shopping: Websites like Amazon track millions of transactions, customer preferences, and
product searches every day.
● Smart Devices: Smartphones, smartwatches, and IoT devices collect and transmit data continuously,
from tracking steps to monitoring home security.

Why is Big Data Important?


Big Data helps organizations make better decisions by analyzing these vast amounts of information.
For instance, analyzing shopping patterns can help online stores recommend products you might like,
or studying social media trends can help companies understand what people think about their products.
What is Hadoop?
Hadoop is like a giant, super-fast librarian for the Big Data library. When you have an enormous
amount of data (like our library filled with millions of books), you need a special system to store and
organize this data so that you can find and use it quickly.
How Hadoop Works (Simple Explanation):

1. Storage: Hadoop divides the big library into many smaller sections (called "blocks") and stores them
across multiple shelves (computers) in a way that it's easy to find the information you need.
2. Processing: When you want to find information (like searching for a book), Hadoop sends many
librarians (workers) to different shelves to fetch the data at the same time. This makes the search
process much faster.
Everyday Analogy:

● Hadoop as a Pizza Delivery Service: Imagine you want to order 100 pizzas for a party. If you rely on
just one delivery person, it will take a long time to deliver all the pizzas. But what if you have 100
delivery people, each delivering one pizza to the party at the same time? The delivery will be much
faster! Hadoop works similarly by using many computers (delivery people) to process and deliver parts
of the data quickly.

Why Use Hadoop?

● Handles Huge Amounts of Data: Hadoop can store and process large amounts of data efficiently, much
more than a regular computer.
● Cost-Effective: It uses a network of simple, inexpensive computers to handle data instead of relying on
a single, super-expensive machine.
● Fault Tolerance: If one computer (librarian) fails, Hadoop can still find the data using other computers,
ensuring the system doesn't crash.

2. HDFS (Hadoop Distributed File System)

● What is HDFS?
o Simple Explanation: HDFS is like a giant digital storage system designed to store huge amounts
of data across multiple computers.
o How It Works:
▪ Breaks large files into smaller pieces.
▪ Stores these pieces across different computers to manage storage efficiently.
▪ Keeps multiple copies of data to ensure safety and reliability.
● Why Use HDFS?
o Scalability: Can handle massive datasets by distributing them across many computers.
o Reliability: If one computer fails, HDFS retrieves the data from another copy.

3. MapReduce

● What is MapReduce?
o Simple Explanation: MapReduce is a method to process big data using many helpers
(computers). It splits a large task into smaller tasks, processes them in parallel, and then
combines the results.
o How It Works:
▪ Map Phase: Divides the big task into smaller parts and processes each part on a
separate computer.
▪ Reduce Phase: Collects and combines the results from all computers to produce the
final output.
● Why Use MapReduce?
o Parallel Processing: Processes large datasets quickly by working on multiple parts at the same
time.
o Efficiency: Handles complex tasks more efficiently by dividing the work.

Blocks and Replication Factor in HDFS


1. What are Blocks in HDFS?

● Blocks are the fundamental units of data storage in HDFS (Hadoop Distributed File System). When you
store a large file in HDFS, it splits the file into smaller, fixed-size chunks called blocks.
● Why Use Blocks?:
o Manageability: Breaking a large file into smaller blocks makes it easier to manage, store, and
process the file across multiple computers.
o Parallel Processing: By dividing a file into blocks, HDFS can distribute these blocks across
different computers (nodes) in the cluster, allowing for parallel processing of the data.
● Block Size:
o The default block size in HDFS is typically 128 MB or 256 MB. This size can be configured
depending on your needs.
● Example:
o If you have a 600 MB file and the block size is set to 128 MB, HDFS will split this file into 5
blocks:
▪ 4 blocks of 128 MB each
▪ 1 block of the remaining 88 MB
2. What is Replication Factor in HDFS?

● Replication Factor determines how many copies of each block HDFS will create and store across the
cluster.
● Why Replication Matters:
o Data Reliability: If one computer (node) holding a block fails, other copies of the block are still
available on other nodes. This ensures that the data is not lost and can be accessed even if a
part of the system fails.
o High Availability: By having multiple copies of the same block, HDFS can provide high
availability, making sure data is accessible whenever needed.
● Default Replication Factor:
o The default replication factor in HDFS is 3. This means HDFS creates 3 copies of each block and
distributes them across different nodes.
● Example:
o Continuing with the earlier example of a 600 MB file split into 5 blocks:
▪ If the replication factor is 3, HDFS will create 3 copies of each of the 5 blocks.
▪ In total, HDFS will store 15 blocks (5 blocks × 3 copies) across the cluster.
● How it Works:
o When a file is saved in HDFS, it is split into blocks. Each block is then copied and stored on
different nodes (computers) in the cluster. The system ensures that these copies are placed on
different nodes to avoid data loss in case one node fails.
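A quick way to sanity-check these numbers is the small Python sketch below (plain arithmetic, not an HDFS command):

import math

file_size_mb = 600
block_size_mb = 128
replication_factor = 3

num_blocks = math.ceil(file_size_mb / block_size_mb)              # 5 blocks
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb   # 88 MB in the final block
total_block_copies = num_blocks * replication_factor              # 15 block copies stored in the cluster

print(num_blocks, last_block_mb, total_block_copies)  # 5 88 15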

Different Types of Data in Data Engineering


In data engineering, it's crucial to understand the various types of data you'll encounter. These types
help determine how data is stored, processed, and analyzed. Here’s a breakdown of the main types of
data:
1. Structured Data

● Description: Structured data is organized in a defined format, often in tables with rows and columns. It
follows a specific schema (structure) that makes it easy to search, query, and analyze using traditional
tools like SQL databases.
● Examples:
o Relational Databases: Tables in databases like SQL Server, MySQL, or Oracle.
o Spreadsheets: Excel files with data arranged in rows and columns.
● Characteristics:
o Follows a fixed format (e.g., numbers, strings).
o Easy to query using SQL.
o Examples include customer records, transaction data, and sales data.
2. Unstructured Data

● Description: Unstructured data does not have a predefined structure or schema. It can be text-heavy
or contain various types of information that don’t fit neatly into a table.
● Examples:
o Text Files: Logs, emails, Word documents.
o Multimedia Files: Images, videos, audio files.
o Social Media: Posts, tweets, comments.
● Characteristics:
o Does not follow a fixed format.
o More complex to search, analyze, and process.
o Requires specialized tools and techniques like Natural Language Processing (NLP) or machine
learning for analysis.

3. Semi-Structured Data

● Description: Semi-structured data has some organizational properties but does not adhere strictly to a
fixed schema like structured data. It is more flexible and can contain varying types of information.
● Examples:
o JSON: JavaScript Object Notation used in APIs and NoSQL databases.
o XML: Extensible Markup Language used for data exchange between systems.
o CSV Files: Comma-separated values that contain structured data but can have variability in
format.
● Characteristics:
o Contains markers or tags (like in XML or JSON) to identify different elements.
o More flexible than structured data but still provides some level of organization.
o Often used for data exchange between different systems.
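As a small illustration of how these data types are commonly read in PySpark, the sketch below loads a CSV file (structured) and a JSON file (semi-structured); the file paths are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataTypesExample").getOrCreate()

# Structured data: a CSV with a header row and a fixed set of typed columns
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/sales.csv")

# Semi-structured data: JSON records whose fields can vary from record to record
json_df = spark.read.json("/path/to/events.json")

csv_df.printSchema()
json_df.printSchema()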

Data Locality in Hadoop


Data Locality in Hadoop refers to the concept of moving the computation to where the data is located
rather than moving the data to the computation. This approach significantly improves performance and
efficiency in processing large datasets, especially in distributed systems like Hadoop.
Why Data Locality Matters

● Large Data Volumes: Hadoop deals with vast amounts of data (big data). Transferring this data across
the network to the processing nodes can be time-consuming and resource-intensive.
● Network Bottlenecks: Moving large datasets over the network can cause congestion and slow down
processing.
● Efficiency: By processing data on the same node where it is stored, Hadoop minimizes data transfer,
resulting in faster data processing and reduced network overhead.

How Data Locality Works in Hadoop

1. Data Distribution: In Hadoop, data is divided into smaller pieces called blocks (usually 128 MB or 256
MB) and distributed across different nodes (computers) in the cluster.
2. Processing: When a job is submitted to process the data (like a MapReduce job), Hadoop tries to
schedule the task on the node where the data block is stored. This approach ensures that data
processing happens locally on the node.
3. Types of Data Locality:
o Data-Local: The task is executed on the same node where the data block is stored. This is the
ideal scenario for maximum efficiency.
o Rack-Local: If a data-local node is not available, the task is executed on a node within the same
rack (group of nodes). It involves minimal network transfer within the rack.
o Off-Rack (Remote): If both data-local and rack-local nodes are unavailable, the task is executed
on a different rack. This involves more network transfer and is the least efficient option.

Example Scenario
Imagine a Hadoop cluster with 10 nodes, and a large file is stored across these nodes in 100 blocks.
When you run a data processing job on this file:

● Data Locality: Hadoop will try to run the processing tasks on the nodes where the data blocks are
stored. If Block 1 is on Node A, Hadoop will schedule the task for Block 1 to run on Node A.
● Efficiency: By doing this, the job doesn't need to move the data block across the network to another
node for processing, which speeds up the job.

Benefits of Data Locality

● Faster Data Processing: By processing data where it resides, Hadoop reduces the time taken for data
transfer, leading to quicker job completion.
● Reduced Network Load: Minimizing data movement over the network decreases the risk of network
congestion and reduces the strain on network resources.
● Scalability: Data locality allows Hadoop to scale efficiently, even as the volume of data grows.

Hadoop vs Spark vs PySpark vs Databricks



● Description
o Hadoop: Open-source framework for distributed storage and processing.
o Spark: Open-source, fast, general-purpose cluster-computing framework.
o PySpark: Python API for Apache Spark.
o Databricks: Unified analytics platform based on Apache Spark.
● Core Components
o Hadoop: HDFS (storage), MapReduce (processing).
o Spark: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX.
o PySpark: Spark with Python API.
o Databricks: Managed Spark environment with additional features.
● Processing Model
o Hadoop: Batch processing (MapReduce).
o Spark: In-memory processing.
o PySpark: In-memory processing (via Python).
o Databricks: In-memory processing with enhanced management.
● Languages Supported
o Hadoop: Java, Python, Scala, R.
o Spark: Scala, Java, Python, R.
o PySpark: Python.
o Databricks: Scala, Python, SQL, R.
● Data Handling
o Hadoop: Disk-based (high disk I/O).
o Spark: In-memory (low disk I/O).
o PySpark: In-memory (low disk I/O).
o Databricks: In-memory (with options for Delta Lake storage).
● Speed
o Hadoop: Slower due to disk I/O (MapReduce).
o Spark: Fast (up to 100x faster than Hadoop for some tasks).
o PySpark: Fast (same speed as Spark, with Python simplicity).
o Databricks: Very fast, with optimizations like the Delta Engine.
● Scalability
o Hadoop: Highly scalable (horizontal scaling).
o Spark: Highly scalable.
o PySpark: Highly scalable.
o Databricks: Highly scalable and managed by cloud resources.
● Ease of Use
o Hadoop: Complex (requires writing MapReduce jobs).
o Spark: Easier with high-level APIs.
o PySpark: Easiest for Python users (familiar syntax).
o Databricks: User-friendly UI with notebooks and SQL support.
● Fault Tolerance
o Hadoop: High (via data replication in HDFS).
o Spark: High (via data replication and lineage).
o PySpark: High (inherits Spark's fault tolerance).
o Databricks: High (built-in cluster management and autoscaling).
● Real-time Processing
o Hadoop: Not suited for real-time (batch-oriented).
o Spark: Supports real-time (via Spark Streaming).
o PySpark: Supports real-time (via Spark Streaming).
o Databricks: Supports real-time (integrated with Spark Streaming).
● Common Use Cases
o Hadoop: Batch processing, ETL, data warehousing.
o Spark: Batch and real-time processing, ML, graph processing.
o PySpark: Data processing, ML, data analysis with Python.
o Databricks: Data engineering, ML, analytics with managed Spark.
● Memory Requirements
o Hadoop: Lower (uses disk storage).
o Spark: Higher (in-memory processing).
o PySpark: Higher (in-memory processing).
o Databricks: Higher (in-memory plus managed cloud resources).
● Best For
o Hadoop: Batch processing of large datasets.
o Spark: Fast, large-scale processing and real-time analytics.
o PySpark: Data analysis and processing with Python.
o Databricks: Unified analytics with a managed Spark environment.
● Ease of Deployment
o Hadoop: Complex setup (manual cluster management).
o Spark: Easier with standalone/YARN/Mesos.
o PySpark: Easier when used with Spark standalone/YARN.
o Databricks: Simplest (cloud-managed service, easy setup).
● Cost
o Hadoop: Cost-effective with commodity hardware.
o Spark: Can be expensive (requires more memory).
o PySpark: Can be expensive (due to Spark's memory usage).
o Databricks: Varies (managed service with cloud costs).
● Tool/Platform Ecosystem
o Hadoop: Hadoop ecosystem (HDFS, Hive, HBase).
o Spark: Apache Spark.
o PySpark: Apache Spark with Python.
o Databricks: Cloud services (AWS, Azure, GCP).

Reasons for Choosing Spark over Hadoop

 Speed: Spark is significantly faster than Hadoop, especially for iterative and real-time processing
tasks.
 Ease of Use: Spark's high-level APIs and built-in libraries make it easier and quicker to write
complex data processing applications.
 Versatility: Spark's ability to handle batch processing, streaming, machine learning, and graph
processing makes it a one-stop solution for various data processing needs.
 Real-Time Processing: Spark's support for real-time analytics makes it suitable for use cases that
require immediate insights.

 Efficiency: Spark's in-memory processing model reduces the need for disk I/O, making it more
efficient for large-scale data processing.

Spark Common Function List

Function | Example Syntax | Outcome
select | df.select("name") | Selects the "name" column.
filter/where | df.filter(df.age > 18) | Filters rows where age is greater than 18.
groupBy | df.groupBy("department") | Groups rows by the "department" column.
agg | df.groupBy("department").agg({"salary": "avg"}) | Aggregates data, calculating the average salary per department.
withColumn | df.withColumn("new_col", df.age + 1) | Adds a new column "new_col" by adding 1 to the "age" column.
drop | df.drop("age") | Drops the "age" column.
alias | df.select(df.name.alias("employee_name")) | Renames the "name" column to "employee_name".
distinct | df.select("department").distinct() | Returns distinct values from the "department" column.
orderBy/sort | df.orderBy("age") | Sorts rows by the "age" column in ascending order.
limit | df.limit(10) | Returns the first 10 rows.
join | df1.join(df2, "id") | Joins df1 and df2 on the "id" column.
union | df1.union(df2) | Unions two DataFrames vertically.
count | df.count() | Counts the number of rows in the DataFrame.
show | df.show(5) | Displays the first 5 rows of the DataFrame.
collect | df.collect() | Collects all rows as a list.
first | df.first() | Returns the first row of the DataFrame.
head | df.head(3) | Returns the first 3 rows as a list.
take | df.take(5) | Returns the first 5 rows as a list.
sample | df.sample(0.1) | Returns a random sample (10%) of the DataFrame.
dropDuplicates | df.dropDuplicates(["name"]) | Removes duplicate rows based on the "name" column.
withColumnRenamed | df.withColumnRenamed("oldName", "newName") | Renames a column from "oldName" to "newName".
isNull | df.filter(df.name.isNull()) | Filters rows where the "name" column is null.
isNotNull | df.filter(df.name.isNotNull()) | Filters rows where the "name" column is not null.
dropna | df.dropna() | Removes rows with any null values.
fillna | df.fillna({"age": 0}) | Replaces null values in the "age" column with 0.
replace | df.replace("unknown", "N/A") | Replaces "unknown" values with "N/A".
describe | df.describe() | Provides summary statistics of numeric columns.
selectExpr | df.selectExpr("name as employee_name") | Selects a column using a SQL expression and aliases it.
cache | df.cache() | Caches the DataFrame in memory.
persist | df.persist(StorageLevel.DISK_ONLY) | Persists the DataFrame with a specific storage level.
unpersist | df.unpersist() | Removes the DataFrame from memory/disk.
createOrReplaceTempView | df.createOrReplaceTempView("view_name") | Creates a temporary view for SQL querying.
sql | spark.sql("SELECT * FROM view_name") | Executes a SQL query on a temporary view.
crossJoin | df1.crossJoin(df2) | Returns the Cartesian product of two DataFrames.
explode | df.select(explode(df.array_col)) | Explodes an array column into multiple rows.
pivot | df.groupBy("name").pivot("year").sum("sales") | Pivots the DataFrame for "year" column aggregation.
rdd | df.rdd | Converts the DataFrame to an RDD.
toDF | rdd.toDF("col1", "col2") | Converts an RDD to a DataFrame with column names.
toPandas | df.toPandas() | Converts the DataFrame to a Pandas DataFrame.
toJSON | df.toJSON() | Converts the DataFrame to a JSON string RDD.
repartition | df.repartition(10) | Repartitions the DataFrame into 10 partitions.
coalesce | df.coalesce(1) | Reduces the number of partitions to 1.
write | df.write.csv("path/to/file") | Writes the DataFrame to a CSV file.
saveAsTable | df.write.saveAsTable("table_name") | Saves the DataFrame as a table.
filter | df.filter(df.age > 30) | Filters rows where age is greater than 30.
sampleBy | df.sampleBy("gender", fractions={"M": 0.5}) | Returns a stratified sample by specified fractions.
corr | df.stat.corr("col1", "col2") | Computes the correlation between two columns.
approxQuantile | df.approxQuantile("col", [0.25, 0.75], 0.05) | Calculates approximate quantiles for a column.
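The following short, self-contained sketch exercises a few of the functions listed above on a tiny in-memory DataFrame (the sample data and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("CommonFunctions").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", "HR", 50000), (2, "Bob", "IT", 60000), (3, "Cara", "IT", 70000)],
    ["id", "name", "department", "salary"],
)

# select, filter and withColumn
df.select("name", "salary").filter(df.salary > 55000).show()
df.withColumn("bonus", df.salary * 0.1).show()

# groupBy + agg
df.groupBy("department").agg(avg("salary").alias("avg_salary")).show()

# Register as a temporary view and query it with SQL
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department").show()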

Common String Manipulation Functions in Spark

Function | Example Syntax | Outcome
concat | df.select(concat(df.firstName, df.lastName)) | Concatenates "firstName" and "lastName" columns into one string.
concat_ws | df.select(concat_ws(", ", df.firstName, df.age)) | Concatenates "firstName" and "age" columns with ", " as a separator.
substr | df.select(df.name.substr(1, 3)) | Extracts a substring from "name" starting at position 1 with a length of 3.
length | df.select(length(df.name)) | Returns the length of the string in the "name" column.
trim | df.select(trim(df.name)) | Removes leading and trailing spaces from the "name" column.
ltrim | df.select(ltrim(df.name)) | Removes leading spaces from the "name" column.
rtrim | df.select(rtrim(df.name)) | Removes trailing spaces from the "name" column.
lower | df.select(lower(df.name)) | Converts all characters in the "name" column to lowercase.
upper | df.select(upper(df.name)) | Converts all characters in the "name" column to uppercase.
initcap | df.select(initcap(df.name)) | Capitalizes the first letter of each word in the "name" column.
reverse | df.select(reverse(df.name)) | Reverses the characters in the "name" column.
regexp_extract | df.select(regexp_extract(df.email, r'@(.+)', 1)) | Extracts the domain from the "email" column using a regular expression.
regexp_replace | df.select(regexp_replace(df.name, ' ', '_')) | Replaces all spaces in the "name" column with underscores.
translate | df.select(translate(df.name, 'aeiou', '12345')) | Replaces characters in "name" according to the specified mapping.
split | df.select(split(df.name, ' ')) | Splits the "name" column into an array of words using space as a separator.
contains | df.filter(df.name.contains('John')) | Filters rows where the "name" column contains the substring 'John'.
startswith | df.filter(df.name.startswith('A')) | Filters rows where the "name" column starts with 'A'.
endswith | df.filter(df.name.endswith('s')) | Filters rows where the "name" column ends with 's'.
format_string | df.select(format_string("Hello %s", df.name)) | Formats the "name" column into a string template.
instr | df.select(instr(df.name, 'Smith')) | Returns the position of the substring 'Smith' in the "name" column.
overlay | df.select(overlay(df.name, 'ABC', 1, 2)) | Replaces the substring of "name" starting at position 1 with 'ABC'.
replace | df.select(df.name.replace('John', 'Mike')) | Replaces occurrences of 'John' with 'Mike' in the "name" column.
char_length | df.select(char_length(df.name)) | Returns the length of the string in the "name" column.
ascii | df.select(ascii(df.name)) | Returns the ASCII value of the first character in the "name" column.
base64 | df.select(base64(df.name)) | Encodes the "name" column into a base64 string.
unbase64 | df.select(unbase64(df.encoded_name)) | Decodes a base64-encoded string in the "encoded_name" column.
overlay | df.select(overlay(df.name, 'XYZ', 1, 0)) | Inserts 'XYZ' into the "name" column at position 1.
lpad | df.select(lpad(df.name, 10, '0')) | Pads the "name" column with '0' on the left to a length of 10.
rpad | df.select(rpad(df.name, 10, '0')) | Pads the "name" column with '0' on the right to a length of 10.
repeat | df.select(repeat(df.name, 3)) | Repeats the string in the "name" column 3 times.
find_in_set | df.select(find_in_set('abc', df.csv_col)) | Returns the position of the string 'abc' in a comma-separated list.
soundex | df.select(soundex(df.name)) | Returns the Soundex phonetic representation of the "name" column.
levenshtein | df.select(levenshtein(df.name, df.name2)) | Computes the Levenshtein distance between two strings.
hex | df.select(hex(df.name)) | Converts the "name" column to a hexadecimal string.
unhex | df.select(unhex(df.hex_name)) | Converts a hexadecimal string to binary (decodes).
md5 | df.select(md5(df.name)) | Computes the MD5 hash of the "name" column.
sha1 | df.select(sha1(df.name)) | Computes the SHA-1 hash of the "name" column.
sha2 | df.select(sha2(df.name, 256)) | Computes the SHA-2 hash (256-bit) of the "name" column.
hash | df.select(hash(df.name)) | Computes a hash code of the "name" column.
crc32 | df.select(crc32(df.name)) | Computes the CRC32 checksum of the "name" column.
monotonically_increasing_id | df.withColumn("id", monotonically_increasing_id()) | Adds a column with a unique, monotonically increasing ID.
uuid | df.withColumn("uuid", expr("uuid()")) | Adds a column with a unique UUID.
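A brief sketch combining several of the string functions above (the sample names are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, upper, regexp_replace, length

spark = SparkSession.builder.appName("StringFunctions").getOrCreate()

df = spark.createDataFrame([("John", "Smith"), ("Jane", "Doe")], ["firstName", "lastName"])

result = (
    df.withColumn("full_name", concat_ws(" ", "firstName", "lastName"))
      .withColumn("full_name_upper", upper("full_name"))
      .withColumn("full_name_snake", regexp_replace("full_name", " ", "_"))
      .withColumn("name_length", length("full_name"))
)
result.show(truncate=False)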

Date and Time Manipulation Functions in PySpark

Function | Example Syntax | Outcome
current_date | df.select(current_date()) | Returns the current date.
current_timestamp | df.select(current_timestamp()) | Returns the current timestamp.
to_date | df.select(to_date(df.date_string, 'yyyy-MM-dd')) | Converts a string column to a date.
to_date | df.select(to_date(df.timestamp)) | Extracts the date from a timestamp.
to_timestamp | df.select(to_timestamp(df.timestamp_string)) | Converts a string to a timestamp.
date_format | df.select(date_format(df.date, 'MM/dd/yyyy')) | Formats a date column to the specified format.
date_format | df.select(date_format(df.date, 'MM-dd-yyyy')) | Formats a date to 'MM-dd-yyyy'.
year | df.select(year(df.date)) | Extracts the year from a date column.
month | df.select(month(df.date)) | Extracts the month from a date column.
dayofmonth | df.select(dayofmonth(df.date)) | Extracts the day of the month from a date column.
dayofweek | df.select(dayofweek(df.date)) | Returns the day of the week (1 = Sunday, 2 = Monday, etc.).
dayofyear | df.select(dayofyear(df.date)) | Extracts the day of the year from a date column.
weekofyear | df.select(weekofyear(df.date)) | Returns the week of the year.
quarter | df.select(quarter(df.date)) | Returns the quarter of the year (1 to 4).
hour | df.select(hour(df.timestamp)) | Extracts the hour from a timestamp.
minute | df.select(minute(df.timestamp)) | Extracts the minute from a timestamp.
second | df.select(second(df.timestamp)) | Extracts the second from a timestamp.
date_add | df.select(date_add(df.date, 7)) | Adds 7 days to the date column.
date_sub | df.select(date_sub(df.date, 7)) | Subtracts 7 days from the date column.
datediff | df.select(datediff(df.end_date, df.start_date)) | Returns the difference in days between two dates.
months_between | df.select(months_between(df.end_date, df.start_date)) | Returns the number of months between two dates.
add_months | df.select(add_months(df.date, 2)) | Adds 2 months to the date column.
next_day | df.select(next_day(df.date, 'Monday')) | Returns the next Monday after the given date.
last_day | df.select(last_day(df.date)) | Returns the last day of the month for the given date.
trunc | df.select(trunc(df.date, 'MM')) | Truncates a date to the first day of the month.
date_trunc | df.select(date_trunc('month', df.timestamp)) | Truncates a timestamp to the start of the month.
from_unixtime | df.select(from_unixtime(df.unix_time)) | Converts a Unix timestamp to a timestamp.
unix_timestamp | df.select(unix_timestamp(df.date, 'yyyy-MM-dd')) | Converts a date to a Unix timestamp.
from_utc_timestamp | df.select(from_utc_timestamp(df.timestamp, 'PST')) | Converts a UTC timestamp to a specified time zone.
to_utc_timestamp | df.select(to_utc_timestamp(df.timestamp, 'PST')) | Converts a timestamp to UTC.
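A brief sketch combining several of the date and time functions above (the sample dates are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, year, month, datediff, date_add, current_date

spark = SparkSession.builder.appName("DateFunctions").getOrCreate()

df = spark.createDataFrame([("2024-01-15",), ("2024-03-01",)], ["date_string"])

result = (
    df.withColumn("order_date", to_date("date_string", "yyyy-MM-dd"))
      .withColumn("order_year", year("order_date"))
      .withColumn("order_month", month("order_date"))
      .withColumn("days_since_order", datediff(current_date(), "order_date"))
      .withColumn("due_date", date_add("order_date", 30))
)
result.show()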

Writing Various File Formats in PySpark Using DataFrame.write


In PySpark, the DataFrame.write API allows you to write data to various file formats like CSV,
JSON, Parquet, Avro, ORC, and more. You can also specify options for each format, such as the
delimiter for CSV files or the compression type for Parquet files.
Here’s an explanation of how to use DataFrame.write to write different file formats, along with
examples:
Common DataFrame.write Options

● format: Specifies the file format (e.g., "csv", "json", "parquet").

● mode: Determines the behavior if the output file already exists. Options include:
o overwrite: Overwrites existing files.
o append: Appends to the existing file.
o ignore: Ignores the write operation if the file exists.
o error or errorifexists: Throws an error if the file exists.
● option: Additional options for the specific file format.

● partitionBy: Writes the data in partitioned form.


1. Writing a CSV File

● Syntax:

df.write.format("csv").option("header",
"true").mode("overwrite").save("/path/to/output/csv")

● Explanation:
o format("csv"): Specifies CSV as the file format.
o option("header", "true"): Includes the header in the CSV file.
o mode("overwrite"): Overwrites existing files at the destination.
o save("/path/to/output/csv"): Specifies the output path.

2. Writing a JSON File

● Syntax:

df.write.format("json").mode("overwrite").save("/path/to/output/json")

● Explanation:
o format("json"): Specifies JSON as the file format.
o JSON files do not have an option for headers as each line is a valid JSON object.
o save("/path/to/output/json"): Writes the DataFrame to the specified path in JSON
format.

3. Writing a Parquet File

● Syntax:

df.write.format("parquet").mode("overwrite").save("/path/to/output/parquet")

● Explanation:
o format("parquet"): Specifies Parquet as the file format, which is a columnar storage
format.
o save("/path/to/output/parquet"): Writes the DataFrame in Parquet format, preserving
the schema and data types.

4. Writing an ORC File

● Syntax:

df.write.format("orc").mode("overwrite").save("/path/to/output/orc")
● Explanation:
o format("orc"): Specifies ORC as the file format, which is efficient for read-heavy operations.
o save("/path/to/output/orc"): Writes the DataFrame in ORC format.

5. Writing an Avro File

● Syntax (Avro support might require additional packages):

df.write.format("avro").mode("overwrite").save("/path/to/output/avro")

● Explanation:
o format("avro"): Specifies Avro as the file format, which is a compact and fast binary
format.
o save("/path/to/output/avro"): Writes the DataFrame to Avro format.

6. Writing to a Hive Table

● Syntax:

df.write.mode("overwrite").saveAsTable("database_name.table_name")

● Explanation:
o saveAsTable("database_name.table_name"): Writes the DataFrame to a Hive table.
o Hive must be enabled and configured in the Spark session for this to work.

7. Writing to a Specific File with coalesce


To write the DataFrame as a single file, you can use coalesce(1) to reduce the number of output files
to one:

● Syntax:

df.coalesce(1).write.format("csv").option("header",
"true").save("/path/to/output/single_csv_file")

● Explanation:
o coalesce(1): Reduces the DataFrame to a single partition, resulting in a single output file.

8. Specifying Compression

● Syntax:
df.write.format("parquet").option("compression",
"snappy").save("/path/to/output/compressed_parquet")

● Explanation:
o option("compression", "snappy"): Specifies the compression type (e.g., "snappy",
"gzip") for the output file.

9. Partitioning the Output


Partitioning allows you to split the output data into directories based on column values.

● Syntax:

df.write.partitionBy("year",
"month").format("parquet").save("/path/to/output/partitioned_parquet")

● Explanation:
o partitionBy("year", "month"): Partitions the data by "year" and "month" columns,
creating subdirectories for each unique combination of these column values.
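Putting several of these options together, here is an end-to-end sketch (the paths, schema, and partition columns are illustrative assumptions, not fixed requirements):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteExample").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 2024, 1), (2, "Bob", 2024, 2)],
    ["id", "name", "year", "month"],
)

# Partitioned, snappy-compressed Parquet output; overwrites any previous run
(
    df.write
      .format("parquet")
      .mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("year", "month")
      .save("/path/to/output/partitioned_parquet")
)

# Single CSV file with a header, using coalesce(1)
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save("/path/to/output/single_csv")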
1. Spark SQL

● What is Spark SQL?


Spark SQL is a module in Apache Spark that allows you to interact with structured data using SQL
queries. It provides a unified interface for reading, processing, and writing data using SQL syntax,
DataFrames, and Datasets. Spark SQL integrates seamlessly with the Spark core, allowing you to mix
SQL queries with complex data transformations.
● Key Features:
o Unified Data Access: Supports reading data from various sources like Hive, Parquet, JSON,
JDBC, and more.
o Performance Optimization: Uses the Catalyst optimizer and Tungsten execution engine for
efficient query optimization and execution.
o Seamless Integration: Allows mixing SQL queries with Spark's DataFrame and Dataset APIs.
● Example:

# Using SQL to query a DataFrame
df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table WHERE age > 30").show()
2. Metastore

● What is a Metastore?
The Metastore in Spark SQL and Hive is a central repository that stores metadata information about
the data. This metadata includes information about tables, databases, columns, data types, and the
location of data files. In the context of Spark, the Metastore can be either the default embedded
metastore (using Derby) or an external metastore like Apache Hive.
● Key Features:
o Metadata Management: Stores information about table schemas, partitions, and storage
formats.
o Interoperability: Allows different Spark sessions and other tools like Hive to access and
manage data consistently.
● Types:
o Embedded Metastore: Usually for single-user or development use cases (default in Spark using
Derby).
o External Metastore (Hive Metastore): A more scalable solution using external databases like
MySQL, Postgres, etc., suitable for production environments.
● Example: When you create a table in Spark SQL using CREATE TABLE, the metastore records the
table's schema and storage information.
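One simple way to see what the metastore tracks is Spark's catalog API. Below is a minimal sketch (it assumes Hive support is available; the database name "default" is just the usual starting point):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MetastoreExample")
    .enableHiveSupport()   # connect to the configured Hive metastore, if any
    .getOrCreate()
)

# Databases and tables registered in the metastore
for db in spark.catalog.listDatabases():
    print(db.name, db.locationUri)

for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)   # e.g. MANAGED or EXTERNAL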

3. Hive Warehouse

● What is the Hive Warehouse?


The Hive Warehouse, often referred to as the Hive Warehouse Directory, is the location in HDFS or a
similar distributed file system where Hive and Spark store the actual data files for Hive tables. It is the
physical layer of data storage that contains the data in formats like ORC, Parquet, or text files.
● Default Location:
o By default, the Hive Warehouse is located at /user/hive/warehouse in HDFS.
● How it Works:
o When a table is created in Hive or Spark with Hive support, the data is stored in the Hive
Warehouse directory by default.
o Managed tables created in Hive/Spark store both metadata in the metastore and data in the
Hive Warehouse.
● Example:
o Suppose you create a table in Spark SQL using:

CREATE TABLE employee (id INT, name STRING);

o The table's data will be stored in a subdirectory within the Hive Warehouse directory
(/user/hive/warehouse/employee).

4. Default Location in Spark

● Default Locations in Spark:


o Warehouse Directory: By default, Spark uses the Hive Warehouse directory
(/user/hive/warehouse) to store the data for managed tables. This location can be
configured using the spark.sql.warehouse.dir property.
o Temporary Data: Temporary tables and views do not have a physical storage location since
they exist only in memory.
● Changing the Default Location:
o You can change the default warehouse location by setting the spark.sql.warehouse.dir
configuration property.
o Example:

# Setting the warehouse directory in a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Example") \
    .config("spark.sql.warehouse.dir", "/path/to/custom/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
How They Work Together

1. Spark SQL allows you to run SQL queries on structured data stored in various formats and locations.
When you create or query tables, Spark interacts with the Metastore to understand the structure and
location of the data.
2. The Metastore acts as a catalog, storing metadata about tables, databases, and columns. When you
query a table, Spark uses the metastore to locate the data.
3. The Hive Warehouse is the physical storage location where table data is stored. When you create a
managed table in Spark SQL with Hive support, the data is typically stored in the Hive Warehouse
directory.
4. The Default Location (/user/hive/warehouse) is where the managed table data is stored by default
in HDFS. You can change this location by setting the appropriate configuration in Spark.

Example Workflow

1. You create a table in Spark SQL:

CREATE TABLE employee (id INT, name STRING);

2. Spark records the table schema and location in the Metastore.


3. Spark stores the table data in the Hive Warehouse directory (/user/hive/warehouse/employee).
4. When you query the table, Spark SQL uses the Metastore to locate and retrieve the data from the Hive
Warehouse.

RDD in Spark: Explained in Detail


RDD stands for Resilient Distributed Dataset, which is the fundamental data structure of Apache
Spark. It is an immutable distributed collection of objects that can be processed in parallel across a
cluster. RDDs are designed to handle big data and enable fault-tolerant computations on large datasets.
Key Concepts of RDD
1. Resilient

● Fault Tolerance: RDDs are inherently fault-tolerant. If a part of the RDD is lost due to a node failure,
Spark can automatically recompute the lost partitions using the lineage information. This lineage is
essentially a logical execution plan that tracks the sequence of transformations applied to the dataset.
● Recomputing Data: RDDs can be recomputed efficiently using lineage, avoiding the need to save
intermediate data to disk after every transformation.

2. Distributed

● Partitioning: RDDs are distributed across multiple nodes in a cluster. They are divided into partitions,
which can be processed in parallel. Each partition represents a portion of the overall dataset.
● Parallel Processing: The distributed nature of RDDs allows Spark to perform parallel processing,
improving the speed of data processing on large datasets.

3. Dataset

● Immutable Collection: An RDD is an immutable collection of objects, meaning once an RDD is created,
it cannot be changed. However, you can create new RDDs by applying transformations to existing ones.
● Typed: RDDs are typed in languages like Scala and Java. For example, an RDD of integers will have a
type RDD[Int] in Scala.

Characteristics of RDDs

1. Immutable: Once created, RDDs cannot be altered. Any operation on an RDD returns a new RDD.
2. Lazily Evaluated: Transformations on RDDs are lazily evaluated. This means Spark will not execute the
transformations until an action (like collect or count) is called. This allows Spark to optimize the
execution plan for better performance.
3. Fault Tolerant: Spark keeps track of the lineage of each RDD, meaning it knows how to recompute
RDDs from their original data if a partition is lost.
4. In-Memory Processing: RDDs are primarily designed for in-memory processing, which can significantly
speed up data processing tasks compared to disk-based processing.
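A minimal PySpark sketch of these characteristics: transformations build new RDDs lazily, the original RDD is never modified, and nothing executes until an action is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local collection, split into 4 partitions
numbers = sc.parallelize(range(1, 11), 4)

# Transformations are lazy: nothing runs yet, and 'numbers' itself is unchanged (immutability)
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation across the partitions
print(even_squares.collect())   # [4, 16, 36, 64, 100]
print(even_squares.count())     # 5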

Advantages of RDDs

● Resilience: RDDs are fault-tolerant with the ability to recover lost data using lineage.

● Parallel Processing: RDDs enable distributed and parallel processing of large datasets.

● In-Memory Computing: RDDs can be cached in memory, reducing the time for iterative and interactive
operations.

Disadvantages of RDDs

● Low-Level API: RDDs provide a low-level API for distributed data processing, which can be complex and
verbose compared to higher-level abstractions like DataFrames and Datasets.
● Lack of Optimization: Unlike DataFrames, RDDs do not benefit from Spark's Catalyst optimizer, making
them less efficient for many operations.

Managed Tables and External Tables in Spark


In Spark SQL, tables can be categorized into two types: Managed Tables and External Tables. The
distinction between these table types primarily revolves around how Spark handles the data and
metadata associated with these tables.
1. Managed Tables
Definition:

● A Managed Table in Spark SQL is a table where Spark manages both the metadata and the data. When
you create a managed table, Spark stores the data in a default location (usually in the warehouse
directory specified by spark.sql.warehouse.dir), and Spark is responsible for managing the
lifecycle of the table data.

Key Characteristics:

● Data Location: Data for managed tables is stored in the default warehouse directory (e.g.,
/user/hive/warehouse on HDFS) unless a different location is specified during table creation.
● Data Lifecycle: Spark takes full responsibility for the data. If you drop a managed table, Spark deletes
both the table's metadata and the underlying data files.
● Automatic Data Management: When you insert, update, or delete data in a managed table, Spark
handles the data storage automatically.

Example of Creating a Managed Table:

-- Creating a managed table in Spark SQL


CREATE TABLE employee (
id INT,
name STRING,
age INT
);

-- Inserting data into the managed table


INSERT INTO employee VALUES (1, 'Alice', 30), (2, 'Bob', 25);
When to Use Managed Tables:

● Full Control: When you want Spark to handle the data management, including storage and deletion.

● Temporary Data: When the data is transient or temporary, and you want it to be cleaned up
automatically when no longer needed.

Behavior on Drop:

● When you execute DROP TABLE employee, Spark deletes both the table schema and the underlying
data files from the storage.
2. External Tables
Definition:

● An External Table in Spark SQL is a table where Spark manages only the metadata, not the actual data.
The data for an external table resides outside of Spark's control in a location specified by the user (e.g.,
an HDFS directory, Amazon S3, Azure Blob Storage).

Key Characteristics:

● Data Location: The data location is specified by the user using the LOCATION clause when creating the
table. Spark only stores metadata about this table, like its schema and location.
● Data Lifecycle: Spark does not manage the lifecycle of the data. Dropping an external table only
removes the table's metadata from the metastore; the underlying data files remain intact.
● User-Controlled Data Management: Users must manage the data's storage and lifecycle
independently.

Example of Creating an External Table:

-- Creating an external table with data stored at a specific location


CREATE EXTERNAL TABLE external_employee (
id INT,
name STRING,
age INT
)
LOCATION '/path/to/external/data/employee';

-- Inserting data into the external table (assuming the data is managed externally)
INSERT INTO external_employee VALUES (3, 'Charlie', 35);
When to Use External Tables:

● Data Reusability: When the data is shared between different applications or systems and needs to be
reused outside Spark.
● Existing Data: When you have existing data files that you want to query using Spark without Spark
taking ownership of these files.
● Data Persistence: When you want the data to persist beyond the lifecycle of the Spark job or
application.

Behavior on Drop:

● When you execute DROP TABLE external_employee, Spark deletes the table's metadata from the
metastore, but the data files at /path/to/external/data/employee remain untouched.
Summary Table
Aspect | Managed Table | External Table
Data Location | Default warehouse directory (/user/hive/warehouse) or the location specified during creation. | User-specified location (e.g., HDFS, S3, Blob Storage).
Data Management | Managed by Spark; Spark takes care of data storage, deletion, etc. | Managed by the user; Spark only manages metadata.
Data Deletion | Data is deleted when the table is dropped. | Data is not deleted when the table is dropped.
Use Cases | Temporary data, data fully managed by Spark. | Shared or existing data managed outside Spark.
Lifecycle Management | Spark handles the entire lifecycle. | Users must manage the data lifecycle externally.
Working with Managed and External Tables in Spark
1. Creating a Managed Table

● Managed tables are created without specifying the LOCATION clause.

CREATE TABLE managed_employee (
id INT,
name STRING,
age INT
);

● Data is stored in the default warehouse location.

2. Creating an External Table

● External tables require the LOCATION clause to specify the data's location.

CREATE EXTERNAL TABLE external_employee (


id INT,
name STRING,
age INT
)
LOCATION '/path/to/external/employee_data';

● Data remains at the specified location and is not moved or deleted by Spark.
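The same distinction can also be seen from the DataFrame API. As a rough sketch (table names and the path are illustrative), saveAsTable without a path typically creates a managed table, while supplying a path option creates a table whose data lives at that external location:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TablesExample").enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "Alice", 30), (2, "Bob", 25)], ["id", "name", "age"])

# Managed table: data is written under the warehouse directory and removed on DROP TABLE
df.write.mode("overwrite").saveAsTable("managed_employee")

# External (unmanaged) table: data stays at the user-supplied path even if the table is dropped
df.write.mode("overwrite").option("path", "/path/to/external/employee_data").saveAsTable("external_employee")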
