Ade Notes
https://adeus.azurelib.com
 Email at: admin@azurelib.com
 Ask Queries here: https://www.linkedin.com/in/deepak-goyal-93805a17/
Cloud computing is the delivery of computing services over the internet ("the cloud"), allowing users to access
and use resources such as servers, storage, databases, networking, software, and analytics without needing to
manage the underlying hardware and software infrastructure.
• Storage
• Computing
• Networking
• Servers
• Databases
• Developer tools
• Security
• Analytics, etc.
Scalability:
Cloud computing provides the ability to easily increase or decrease IT resources based on demand or
requirements. For example, when companies like Flipkart and Amazon start their Big Billion Day sale or when
there's an India-Pakistan cricket match on Hotstar or a football World Cup final, the user traffic on these
platforms increases rapidly. To manage this surge, they can rent the required resources from a cloud
service provider for those specific days and release them once the event is over.
Increasing resources is called scaling up, while decreasing resources is called scaling down.
Vertical Scaling:
Vertical scaling involves increasing or decreasing resources within the same server. This is done by adding more
RAM, hard disk space, or CPU power to the existing server without purchasing a new one.
Horizontal Scaling:
Horizontal scaling involves adding or removing servers to handle increased traffic or improve performance.
This is achieved by adding additional virtual machines (VMs), containers, or servers to the existing
infrastructure.
Geo-Distribution:
Cloud computing helps mitigate network latency issues by providing services based on geographical location.
Organizations can rent cloud servers in any region to provide services to users with minimal latency.
Agility:
Cloud computing enables organizations to access services very quickly, allowing them to adapt to changes or
new demands efficiently.
Cost-Effectiveness:
Cloud computing offers two cost models: CapEx and OpEx. CapEx (capital expenditure) is a large upfront
investment in physical hardware and data centers, while OpEx (operational expenditure) is a pay-as-you-go
model in which you pay only for the services you actually consume. Cloud services are typically billed as OpEx,
which removes the need for heavy upfront investment.
Challenges
Security:
Security can be a concern, especially when using public cloud services or services from third-party providers.
The highest level of control over security is usually achieved when an organization runs its own private cloud
in its own data centers, often referred to as an "on-premises" (or "on-prem") environment.
Types of Cloud
Private Cloud:
A private cloud is when an organization uses its own cloud infrastructure, managed either internally or by a
third party. This cloud environment is dedicated to a single organization, providing greater control, security,
and customization. It is often used by large enterprises that need to handle sensitive data and require stringent
compliance.
Public Cloud:
A public cloud is when an organization utilizes cloud services provided by a third-party cloud provider, such as
Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP). In this model, the cloud
infrastructure is shared among multiple users, also known as tenants, making it a cost-effective option for
businesses. Public clouds offer scalability, flexibility, and ease of access to a wide range of services.
Hybrid Cloud:
A hybrid cloud is a combination of both private and public clouds, designed to allow data and applications to
be shared between them. This setup enables organizations to take advantage of the scalability and cost-
efficiency of public cloud services while keeping sensitive operations in a private cloud environment. Hybrid
clouds are ideal for businesses that want to optimize their existing infrastructure and meet specific regulatory
or business requirements.
IaaS (Infrastructure as a Service):
IaaS offers on-demand access to fundamental computing resources such as virtual machines, storage, and
networking. Organizations can rent these resources and scale them up or down based on their needs, without
having to manage the physical hardware. Examples include AWS EC2, Azure Virtual Machines, and Google
Compute Engine. IaaS is ideal for businesses that want to retain control over their infrastructure while
benefiting from the flexibility and cost savings of the cloud.
    2. Edge Computing: Cloud providers often use something called edge computing, which means
        they place smaller data centers (called edge locations) closer to users. These edge locations can
        handle certain tasks or store frequently accessed data locally, reducing the need to send data all
        the way back to the main data center. This also helps in reducing latency.
   3. Content Delivery Networks (CDNs): Many cloud providers offer CDNs, which are networks
       of servers distributed around the world. CDNs store copies of your website or content on
       multiple servers in different locations. When someone tries to access your content, the CDN
       delivers it from the nearest server, reducing the distance the data needs to travel and,
       consequently, the latency.
Ways to Connect to and Manage Azure:
1. Azure Portal
   ● Description: The Azure Portal is a web-based interface where you can manage and monitor your Azure
       services.
   ● Usage: Simply go to https://portal.azure.com, sign in with your Azure account credentials, and you'll
       have access to all your Azure resources.
   ● Best For: Managing resources, creating services, monitoring activity, and performing administrative
       tasks via a graphical interface.
2. Azure CLI
   ● Description: Azure CLI is a cross-platform command-line tool that allows you to manage your Azure
       resources directly from your terminal or command prompt.
   ● Usage: After installing Azure CLI, you can connect by running the command az login, which will open
       a web browser for you to sign in.
   ● Best For: Automating tasks, scripting, and managing Azure resources via commands. It's popular
       among developers and DevOps engineers.
3. Azure PowerShell
   ● Description: Azure PowerShell is a set of modules that allow you to manage Azure resources using
       PowerShell scripts and commands.
   ● Usage: After installing Azure PowerShell, you can connect by running Connect-AzAccount, which
       will prompt you to sign in through a web browser.
   ● Best For: Scripting and automating Azure resource management tasks, especially for those familiar
       with PowerShell.
4. Azure Mobile App
   ● Description: The Azure mobile app lets you monitor and manage your Azure resources on the go.
   ● Usage: Download the Azure app from the App Store or Google Play, sign in with your Azure credentials,
       and you can monitor resources, check alerts, and perform basic management tasks.
   ● Best For: Quick checks and light management tasks from a mobile device.
5. Azure Cloud Shell
   ● Description: Azure Cloud Shell is an online, browser-based shell that provides you with a command-
       line experience in the Azure portal.
   ● Usage: Available directly in the Azure Portal, you can choose either Bash or PowerShell environments
       to manage Azure resources without installing anything locally.
   ● Best For: Quick command-line tasks without needing to install anything on your local machine.
6. Azure SDKs and REST APIs
   ● Description: Azure provides Software Development Kits (SDKs) and REST APIs that allow you to
       connect to and manage Azure services programmatically.
   ● Usage: By using Azure SDKs available for various programming languages (like Python, .NET, Java, etc.),
       you can connect to Azure directly within your application code.
   ● Best For: Developers building applications that need to interact with Azure services programmatically.
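As a rough illustration of the SDK route described above, here is a minimal Python sketch. It assumes the azure-identity and azure-mgmt-resource packages are installed and that you have already signed in (for example via az login) or configured environment credentials; the subscription ID is a placeholder.

# pip install azure-identity azure-mgmt-resource   (assumed prerequisites)
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# DefaultAzureCredential picks up an az login session, environment variables, or a managed identity.
credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"          # placeholder

# Management client scoped to one subscription.
client = ResourceManagementClient(credential, subscription_id)

# List the resource groups the signed-in identity can see.
for rg in client.resource_groups.list():
    print(rg.name, rg.location)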
Azure Regions:
An Azure region is a specific geographical area where Microsoft has data centers to host cloud
services and resources. Each region consists of multiple data centers located within a defined perimeter
and connected through a dedicated, low-latency network.
   ● Purpose: Regions allow you to deploy your resources closer to your users to reduce latency,
     meet data residency requirements, and ensure compliance with local regulations.
   ● Examples of Azure Regions:
        o East US (Virginia)
        o West Europe (Netherlands)
        o Southeast Asia (Singapore)
        o Australia East (New South Wales)
        o Central India (Pune)
Azure offers over 60 regions globally, making it one of the most extensive cloud networks in the
world.
Availability Zones:
An Availability Zone is a physically separate location within an Azure region. Each Availability Zone
consists of one or more data centers equipped with independent power, cooling, and networking. By
deploying resources across multiple Availability Zones, you can ensure high availability and fault
tolerance for your applications.
   ● Purpose: Availability Zones are designed to protect your applications and data from data
     center failures within a region. If one zone goes down, the other zones in the region continue to
     operate, minimizing downtime.
   ● Structure: Typically, an Azure region will have three or more Availability Zones. These zones
     are interconnected with high-speed, private fiber-optic networks.
   ● Examples of Services Using Availability Zones:
         o Virtual Machines: You can deploy VMs across multiple zones to ensure that if one
             zone fails, your application remains available in the other zones.
         o Managed Disks: Zone-redundant storage (ZRS) replicates your data across multiple
             zones to ensure durability.
         o Load Balancers: Azure Load Balancer can distribute traffic across VMs in different
             zones, providing high availability.
Resource Groups:
A resource group in Azure is a logical container that holds related resources so they can be deployed, managed,
and deleted together.
Important Considerations:
   ● Location: The resource group itself has a location (region), which determines where its
     metadata is stored. However, the resources within the group can be in different regions. It's
     generally recommended to keep resources in the same region for performance reasons.
   ● Naming: Resource groups should be named in a way that reflects their purpose, such as RG-
     Production-ApplicationName or RG-Dev-ProjectX, to make them easily identifiable.
   ● Scope: A resource can only belong to one resource group at a time, but you can move resources
     between groups if needed.
   ● Limits: While there are limits on the number of resource groups and resources per group, these
     limits are generally high and sufficient for most use cases.
Example Use Cases:
   ● Application Lifecycle: You might create a resource group for a web application that includes
     resources like a web server, database, and storage account. When you update the application,
     you can update all related resources together.
   ● Development and Testing: For a development environment, you could create a resource group
     that contains all the necessary resources (VMs, databases, etc.) for testing purposes. Once
     testing is complete, you can delete the resource group to clean up all associated resources.
   ●    Cost Management: By grouping all resources for a specific department or project, you can
        easily track the costs associated with that group and manage budgets accordingly.
In summary, resource groups in Azure are a fundamental feature for organizing, managing, and
controlling access to your cloud resources. They help streamline operations and provide a structured
way to manage your Azure environment.
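To make the lifecycle described above concrete (create a group for a dev/test environment, then delete it to clean up everything inside), here is a hedged Python sketch using the azure-mgmt-resource package; the subscription ID, group name, and region are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Create (or update) a resource group for a dev/test environment.
rg = client.resource_groups.create_or_update(
    "RG-Dev-ProjectX",   # name follows the convention suggested above
    {"location": "centralindia", "tags": {"environment": "dev", "project": "ProjectX"}},
)
print("Created:", rg.name, rg.location)

# When testing is complete, deleting the group removes every resource it contains.
delete_poller = client.resource_groups.begin_delete("RG-Dev-ProjectX")
delete_poller.wait()   # block until the deletion completes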
Types of Storage Accounts:
   ● General-purpose v2 (GPv2): Supports all the latest features and is the recommended type for most
       scenarios. It provides access to all Azure Storage services.
   ● Blob Storage Account: Specifically optimized for blob storage, with tiering options (hot, cool, and
       archive) to optimize cost based on access frequency.
   ● File Storage Account: Optimized for Azure Files, supporting premium file shares.
   ● BlockBlobStorage Account: Designed for workloads with high transaction rates or that require
       consistent, low-latency data access.
Use Cases:
● Storing Files and Documents: Store and access files, images, videos, and other unstructured data.
   ● Backup and Restore: Use Azure Storage as a backup destination for on-premises or cloud-based
       systems.
   ● Disaster Recovery: Ensure data availability with geo-redundant storage options.
   ● Big Data Analytics: Store large datasets for analysis with tools like Azure Data Lake Analytics or Azure
       Synapse Analytics.
   ● Web and Mobile Applications: Host and serve content such as web pages, videos, or static files
       directly from Azure Storage.
Once created, you can start using your storage account to store and manage data in the cloud.
To retain a history of deleted or overwritten data, you can enable (or disable) the soft delete feature.
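As a minimal sketch of using a storage account once it is created, the following Python example uploads and downloads a blob. It assumes the azure-storage-blob package is installed, the account name is a placeholder, and the signed-in identity has a suitable data role (for example Storage Blob Data Contributor); the container, blob, and file names are illustrative only.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Account URL is a placeholder; replace <account> with your storage account name.
service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Create a container if it does not exist yet.
container = service.get_container_client("documents")
if not container.exists():
    container.create_container()

# Upload a local file as a blob, then read it back.
with open("report.pdf", "rb") as data:
    container.upload_blob(name="reports/report.pdf", data=data, overwrite=True)

downloaded = container.download_blob("reports/report.pdf").readall()
print(len(downloaded), "bytes downloaded")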
Encryption in Azure Storage Account
Azure Storage provides robust encryption features to protect your data at rest. This ensures that all
your data is automatically encrypted before it is stored and decrypted before it is retrieved, without
requiring any additional configuration or management from you.
   ● Microsoft-managed keys: By default, Azure manages the encryption keys for you, which simplifies
key management and ensures that your data is protected using Microsoft-managed keys.
   ● Customer-managed keys (CMK): If you prefer more control, you can manage your own
encryption keys using Azure Key Vault. This gives you full control over the key lifecycle, including
rotation and revocation. Customer-managed keys can also be used for auditing purposes, as you have
visibility into the key usage.
Blob Storage Access Tiers
1. Hot Tier:
    ● Description: The Hot tier is designed for data that is accessed frequently. It offers the lowest access
        latency and the highest throughput, making it ideal for data that needs to be accessed and processed
        regularly.
    ● Use Cases:
               o   Active datasets, such as files and databases that are accessed frequently.
            o Content that is frequently updated or queried, like transaction logs.
            o Data for applications that require low latency access, such as web and mobile apps.
    ● Cost: Higher storage costs compared to the Cool and Archive tiers, but lower access costs.
2. Cool Tier:
    ● Description: The Cool tier is optimized for data that is infrequently accessed but needs to be stored for
        at least 30 days. It offers lower storage costs than the Hot tier but higher access costs.
    ● Use Cases:
            o      Data that is not accessed frequently but still needs to be available for occasional access, like
                   backups, archived data, or media content that is accessed seasonally.
            o      Data that is stored for compliance or business continuity purposes.
    ● Cost: Lower storage costs than the Hot tier, but higher costs for data access and retrieval.
3. Archive Tier:
    ● Description: The Archive tier is intended for data that is rarely accessed and can tolerate higher
        retrieval times. This tier offers the lowest storage costs but the highest costs and latency for data
        retrieval.
    ● Use Cases:
            o      Long-term archival data, such as compliance records, legal documents, or historical data that
                   may be accessed once in a while.
            o      Data that needs to be kept for extended periods but is not likely to be needed frequently.
    ● Cost: The lowest storage cost among all tiers, but the highest cost and latency for data access. Data in
        the Archive tier must be rehydrated (moved to the Hot or Cool tier) before it can be accessed.
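A small Python sketch of changing a blob's access tier, assuming the azure-storage-blob package and placeholder account/container/blob names; note that once a blob is moved to Archive it must be rehydrated before it can be read, as described above.

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
blob = service.get_blob_client(container="documents", blob="reports/report.pdf")

# Move an infrequently used blob to the Cool tier to lower storage cost.
blob.set_standard_blob_tier("Cool")

# Long-term archival: cheapest storage, but the blob must be rehydrated before access.
blob.set_standard_blob_tier("Archive")

props = blob.get_blob_properties()
print("Current tier:", props.blob_tier)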
Access Keys, SAS Tokens, and Why Azure Provides Two Keys (Key1 & Key2)
1. Access Keys:
    ● What Are They? Access keys are a pair of 512-bit keys generated by Azure for your storage
      account. These keys are used to authenticate and authorize access to your storage account's data
      services, including Blob Storage, Queue Storage, Table Storage, and File Storage.
    ● Purpose: Access keys allow full access to your storage account, including read, write, and
      delete operations across all data services. They are the primary means of programmatic access
      to Azure Storage services.
    ●   Usage:
            o      SDKs and APIs: When developing applications that need to interact with Azure Storage, you
                   can use these keys to authenticate requests. The keys are included in the connection strings
                   used by the Azure Storage SDKs or directly in API calls.
            o Administrative Tools: Tools like Azure Storage Explorer or custom scripts often use access keys
               to connect to and manage storage resources.
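Azure issues two keys (key1 and key2) so that one can be regenerated (rotated) while applications temporarily use the other, avoiding downtime. The sketch below shows the typical connection-string usage with one of the keys; the account name and key are placeholders, and in practice the key should come from a secure store such as Key Vault rather than source code.

from azure.storage.blob import BlobServiceClient

# A connection string embeds the account name and one of the two access keys.
# (Placeholder values shown; never commit real keys to source control.)
conn_str = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<account>;"
    "AccountKey=<key1-or-key2>;"
    "EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(conn_str)
for container in service.list_containers():
    print(container.name)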
Azure File Shares
Azure Files provides fully managed file shares in the cloud that can be accessed over the SMB or NFS protocols.
Use Cases:
    ● File Server Migration: Migrate on-premises file servers to Azure to reduce infrastructure management
        overhead and improve accessibility.
    ● Hybrid Cloud Solutions: Use Azure File Sync to maintain a synchronized copy of your data on-premises
        and in Azure.
    ● Application Storage: Store configuration files, logs, and other application-related data that need to be
        shared across multiple instances or VMs.
    ● Lift and Shift Applications: For legacy applications that rely on SMB or NFS protocols, Azure File Shares
        offer a cloud-native solution without requiring application changes.
    ● Persistent Storage for Containers: Use Azure File Shares with Azure Kubernetes Service (AKS) to
        provide persistent storage for containerized applications.
How to Create and Use Azure File Shares:
  1. Create a Storage Account:
         o In the Azure Portal, navigate to "Storage accounts" and create a new storage account.
   2. Create a File Share:
         o Inside the storage account, go to "File shares" and click "Add" to create a new file share.
            Specify the name and quota for the file share.
   3. Mount the File Share:
         o Use the connection string or script provided by Azure to mount the file share on your local
            machine or VM. This can be done using SMB or NFS protocols.
   4. Manage Files:
         o Once mounted, you can manage files just like you would with a local file system, including
            creating, deleting, copying, and moving files.
   5. Use Azure File Sync (Optional):
         o Install the Azure File Sync agent on a Windows Server and configure it to sync with your Azure
            File Share, enabling local caching and multi-site synchronization.
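Besides mounting the share over SMB/NFS as described in the steps above, files can also be managed programmatically. Here is a minimal sketch using the azure-storage-file-share package; the connection string, share name, directory, and file names are placeholders.

from azure.core.exceptions import ResourceExistsError
from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    conn_str="<storage-account-connection-string>",   # placeholder
    share_name="teamshare",
)
try:
    share.create_share()            # quota and tier can also be set here
except ResourceExistsError:
    pass                            # the share already exists

directory = share.get_directory_client("configs")
try:
    directory.create_directory()
except ResourceExistsError:
    pass

# Upload a local file into the share, then list the directory contents.
file_client = directory.get_file_client("app.settings.json")
with open("app.settings.json", "rb") as data:
    file_client.upload_file(data)

for item in directory.list_directories_and_files():
    print(item["name"])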
NoSQL Databases
Advantages:
● Scalability: Easily scale horizontally to accommodate large volumes of data and high transaction rates.
● Flexibility: Store unstructured, semi-structured, or structured data without requiring a rigid schema.
● Performance: Optimized for fast reads and writes, making them ideal for real-time applications.
   ● Distributed and Fault-Tolerant: Often designed to run on distributed systems, ensuring high
       availability and fault tolerance.
Disadvantages:
   ● Lack of ACID Transactions: Many NoSQL databases do not provide strong consistency guarantees
       (ACID transactions), which can be a disadvantage for applications requiring strict data consistency.
   ● Complexity: NoSQL databases may require more complex data modeling and querying, especially for
       developers accustomed to SQL databases.
   ● Limited Support for Complex Queries: While NoSQL databases excel in performance and scalability,
       they may not offer the advanced query capabilities found in SQL databases.
Use Cases:
   ● Big Data and Analytics: Handling large volumes of diverse data for analytics, such as in IoT applications
        or social media platforms.
   ● Content Management: Storing and retrieving unstructured data, such as documents, images, and
        videos.
   ● Real-Time Applications: Supporting high-performance, real-time applications like online gaming, chat
        applications, and real-time analytics.
   ● Graph-Based Queries: Managing data with complex relationships, such as social networks,
        recommendation systems, and fraud detection.
ETL Tools
   1.   Informatica PowerCenter
   2.   Talend
   3.   Apache NiFi
   4.   Microsoft SQL Server Integration Services (SSIS)
   5.   AWS Glue
Example: Copying a Customer table from Azure SQL Database to ADLS Gen2
   1. Create Linked Services:
          o Create one linked service pointing to the Azure SQL Database and another pointing to Azure
             Data Lake Storage Gen2, supplying the connection details and credentials for each.
   2. Create Datasets:
          o   SQL Dataset:
                 1. Go to Author > Datasets and click + New Dataset.
                 2. Select Azure SQL Database and configure it to point to the Customer table.
                 3. Provide a meaningful name, e.g., CustomerDataset.
          o   ADLS Dataset:
                 1. Create another dataset, this time selecting Azure Data Lake Storage Gen2.
                 2. Choose the file format (e.g., DelimitedText for CSV files).
                 3. Specify the ADLS path where the data should be stored, e.g.,
                    adls-container/customer-data/.
                 4. Name it appropriately, e.g., CustomerADLSDataset.
   3. Create a Pipeline:
         o In the Author section, click on + New Pipeline.
         o Name the Pipeline (e.g., CustomerToADLSPipeline).
   4. Add a Copy Activity:
         o Drag the Copy data activity from the Activities pane into the pipeline canvas.
         o Source: Configure the source to use the CustomerDataset.
         o Sink: Configure the sink to use the CustomerADLSDataset.
   5. Configure Pipeline Settings:
         o Mapping: Optionally map columns from the SQL table to the ADLS output.
         o Settings: Configure additional settings like performance tuning or logging if necessary.
   6. Validate and Debug the Pipeline:
         o Click on Validate to ensure there are no errors in the pipeline.
         o Use the Debug feature to test the pipeline and ensure that data is correctly copied from the
             SQL database to ADLS.
   7. Publish and Trigger the Pipeline:
         o Once validated, click Publish All to save your pipeline.
         o You can trigger the pipeline manually using the Trigger Now option or schedule it using a
             trigger (e.g., time-based or event-based).
Summary:
   ● ADF Pipeline: Create a pipeline that pulls data from the SQL database and stores it in Azure Data Lake
       Storage.
By following these steps, you can create a highly reusable and flexible dataset configuration in Azure
Data Factory, minimizing the need to create multiple datasets while efficiently handling different data
sources and destinations.
Example Scenario:
Suppose you want to copy multiple files from one blob storage container to another. You can use a
ForEach activity to loop over a list of file names (retrieved using a Get Metadata or Lookup activity)
and execute a Copy Data activity for each file.
● Get Metadata / Lookup Activity: Retrieves the list of file names from the source container.
● ForEach Activity: Iterates over the returned list of file names.
● Copy Data Activity: Executes for each file, copying it to the destination container.
Here are some common challenges associated with using the ForEach activity in ADF:
1. Performance and Scalability:
   ● Limited Parallelism: Although the ForEach activity supports parallel execution, there is a limit to the
       number of activities that can run in parallel. The default parallelism is often limited by the Batch Count
       setting, and setting this too high can overwhelm the underlying resources, leading to performance
       bottlenecks.
   ● Resource Constraints: Running multiple activities in parallel can consume significant resources (CPU,
       memory, etc.), especially when working with large datasets or complex operations. This may lead to
       slower execution times or even failures due to resource exhaustion.
2. Execution Time and Cost:
   ● Extended Pipeline Duration: If the ForEach activity processes a large number of items, especially with
       sequential execution, the overall pipeline execution time can become very long. This may lead to
       higher costs, particularly if your ADF instance is based on a consumption model.
   ● Cost Implications: Running multiple activities in parallel or over an extended period can increase the
       cost of pipeline execution. Each activity within the ForEach loop consumes resources, and if the loop
       contains many activities, this can add up quickly.
3. Complex Debugging and Monitoring:
    ● Difficulty in Troubleshooting: When the ForEach activity runs many activities in parallel, it can be
        challenging to pinpoint which iteration or specific activity caused a failure. This complexity increases as
        the number of iterations grows.
    ● Log Management: Monitoring the execution of each iteration can generate a large volume of logs,
        making it difficult to manage and analyze these logs effectively. Identifying patterns or issues across
        many iterations requires careful log management.
4. Error Handling:
    ● Handling Failures: If one iteration of the ForEach loop fails, it can be tricky to decide how to handle the
        error. Should the entire loop stop, or should the pipeline continue with the remaining items? This
        decision depends on the business logic but can complicate error handling.
    ● Retries and Idempotency: Retrying failed activities in a ForEach loop requires careful consideration.
        Some operations may not be idempotent (i.e., they cannot be safely retried without causing side
        effects), which can lead to data inconsistencies or unintended consequences.
5. Parameter Management:
    ● Complex Parameterization: When dealing with dynamic content and passing parameters into the
        ForEach activity, the logic can become complex, especially if multiple parameters or nested loops are
        involved. Managing these parameters effectively requires careful planning and testing.
6. Data Dependencies:
    ● Sequential Dependencies: If the activities inside the ForEach loop have dependencies on each other,
        you may be forced to run the loop sequentially, which can significantly slow down the pipeline.
        Balancing the need for parallelism with dependency management is often a challenge.
    ● Data Integrity: Ensuring data integrity across iterations, especially in parallel execution scenarios, can
        be complex. Care must be taken to avoid data conflicts, especially when multiple iterations are
        modifying the same data.
Lookup Activity
The Lookup activity in Azure Data Factory (ADF) retrieves a value or a set of rows from a data source so that it
can be used by subsequent activities in the pipeline. Key features:
    1. Data Source Connectivity: It supports a variety of data sources, such as SQL databases, Azure Blob
       Storage, Azure Data Lake, and more.
   2. Single Row Retrieval: By default, it retrieves a single row, which is often useful for configurations or
       control flows.
   3. Multi-row Support: It can fetch multiple rows as an array if needed, which is useful when iterating
       over a set of values using a ForEach activity.
   4. Integration with Other Activities: The output of the Lookup activity can be used as dynamic input for
       subsequent activities, such as running a stored procedure, copying data, or making decisions based on
       conditions.
Use Cases:
● Parameter Passing: Fetching a value from a configuration table that can be passed to other activities.
    ● Conditional Logic: Using the retrieved data in an If Condition or Switch activity to control the flow of the
        pipeline.
    ● Looping through Data: When combined with a ForEach activity, it can loop through a dataset and
        perform actions for each row.
Limitation:
    ● Single Row Mode: When set to return a single row, the Lookup activity works without issues for most
        scenarios.
    ● Multi-row Mode: If the Lookup activity is configured to return multiple rows, the output size is limited
        to 4 MB or 5000 rows. If the result exceeds this limit, it will fail.
Dynamic Date Format in Azure Data Factory (ADF)
In Azure Data Factory (ADF), you can dynamically format dates using the built-in expression
language, which allows you to manipulate date and time values flexibly. This is particularly useful
when you need to create dynamic file paths, file names, or parameter values based on the current date,
time, or other date-related values.
Use Cases:
   ● Automatically create file names or directories: Dynamically generate file names or directories based
        on the current date or time for organizing data files.
   ● Partitioning large datasets by date: This helps improve query performance and manageability by
        organizing data into date-based partitions.
   ● Move or copy data to specific locations: Dynamically route data to specific folders or locations based
        on the date.
   ● Filter data dynamically: Process specific subsets of data by filtering based on the current date or time
        range.
Example:
Let’s see how this works in practice.
Here’s an example of an expression that creates a folder path with the current year, month, and day:
Customer/@{formatDateTime(utcnow(), 'yyyy')}/@{formatDateTime(utcnow(), 'MM')}/@{formatDateTime(utcnow(), 'dd')}
This will dynamically create folder paths like Customer/2024/09/06 based on the current date.
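To make the resolution of that ADF expression easier to picture, here is a small Python illustration (not ADF code) that builds the same yyyy/MM/dd path from the current UTC time.

from datetime import datetime, timezone

# Mirror of the ADF expression above: yyyy / MM / dd segments from the current UTC time.
now = datetime.now(timezone.utc)
folder_path = f"Customer/{now:%Y}/{now:%m}/{now:%d}"

print(folder_path)   # e.g. Customer/2024/09/06 when run on 6 Sep 2024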
What is If activity?
The If activity in Azure Data Factory (ADF) is a control flow activity that allows you to implement
conditional logic within your data pipelines. It evaluates a condition and executes one of two sets of
activities based on whether the condition evaluates to true or false. This activity is particularly useful
when you need to create dynamic workflows based on different conditions or criteria.
Key Features of the If Activity:
    1. Condition Evaluation: The If activity takes a condition in the form of an expression. It checks
         if the condition evaluates to true or false and then proceeds accordingly.
    2. Two Branches:
             o   True branch: If the condition is true, the set of activities in the "If True" branch will be
                 executed.
             o   False branch: If the condition is false, the set of activities in the "If False" branch will be
                 executed.
    3. Dynamic Expressions: You can use ADF’s expression language to create complex conditions
      that evaluate different parameters, variables, or input data.
   4. Integration with Other Activities: The If activity works in combination with other activities
      such as Lookup, Get Metadata, Copy, and more, allowing you to create highly flexible and
      dynamic data workflows.
Use Cases:
    ● Conditional Data Processing: Depending on a value or state, you can decide whether to perform
        certain transformations or data movements.
    ● Branching Logic: Dynamically determine the path your pipeline should take based on input
        parameters, status flags, or metadata.
    ● File Handling: For instance, process files only if they exist or meet certain criteria (e.g., if a file is larger
        than a certain size).
What is the Get Metadata activity?
The Get Metadata activity in Azure Data Factory (ADF) is a control activity that allows you to retrieve
metadata information from a variety of data sources. This metadata can include properties like file size, last
modified date, column names, and data types, among others. It's useful for making decisions based on the
characteristics of your data before further processing it in your data pipeline.
          4. Use the output of the activity as an input to another activity, like a Switch activity in
             this example. You can reference the output of the Metadata Activity anywhere
             dynamic content is supported in the other activity.
        5. In the dynamic content editor, select the Get Metadata activity output to reference
           it in the other activity.
The Execute Pipeline activity in Azure Data Factory (ADF) is used to trigger or run another pipeline from within
a pipeline. This activity helps in organizing complex workflows by allowing you to break them down into
smaller, reusable pipelines, promoting modularity and better management.
Use Cases:
    ● Breaking Down Complex Workflows: When you have a large, complex pipeline, you can split it into
        smaller pipelines and use the Execute Pipeline activity to call them in sequence or based on conditions.
    ● Reusability: If you have logic that is commonly used across multiple pipelines, you can create a
        reusable pipeline and trigger it from multiple parent pipelines.
    ● Parameterization: Triggering the same pipeline multiple times with different parameters (e.g., for
        different datasets or environments).
Remember that nested If and nested ForEach activities are not allowed in ADF at the moment, so the Execute
Pipeline activity can be used as a workaround.
Parameterizing Linked Services
You can now parameterize a linked service and pass dynamic values at run time. For example, if you
want to connect to different databases on the same logical SQL server, you can now parameterize the
database name in the linked service definition. This prevents you from having to create a linked service
for each database on the logical SQL server. You can parameterize other properties in the linked service
definition as well - for example, User name.
Incremental (Delta) Data Loading
Benefits:
    ● Improved Performance: Processing only the new or updated data significantly reduces the amount of
       data to be handled, resulting in faster execution times.
    ● Cost-Effective: By reducing the amount of data processed, incremental pipelines help save on compute
       and storage costs.
    ● Efficient Data Management: Incremental processing makes it easier to manage and process large
       datasets without overloading the system.
A common pattern for incremental loading uses a watermark column:
1. Select the watermark column. Select one column in the source data store, which can be
   used to slice the new or updated records for every run. Normally, the data in this selected
   column (for example, last_modify_time or ID) keeps increasing when rows are created or
   updated. The maximum value in this column is used as a watermark.
2. Prepare a data store to store the watermark value. In this tutorial, you store the watermark
   value in a SQL database.
3. Create a pipeline with the following workflow:
o   Create two Lookup activities. Use the first Lookup activity to retrieve the last watermark
    value. Use the second Lookup activity to retrieve the new watermark value. These
    watermark values are passed to the Copy activity.
o     Create a Copy activity that copies rows from the source data store with the value of the
      watermark column greater than the old watermark value and less than the new watermark
      value. Then, it copies the delta data from the source data store to Blob storage as a new file.
o     Create a StoredProcedure/Lookup activity that updates the watermark value for the pipeline
      that runs next time.
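As a minimal sketch of the delta-extraction logic the Copy activity performs, here is a Python snippet that builds the watermark-bounded query. The table and column names are illustrative, the placeholders follow pyodbc-style parameter markers, and an inclusive upper bound is used so rows stamped exactly at the new watermark are not skipped on the next run.

from datetime import datetime

def build_delta_query(table: str, watermark_col: str) -> str:
    """Select only rows changed after the old watermark, up to the new one."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > ? AND {watermark_col} <= ?"
    )

# These values normally come from the two Lookup activities described above.
old_watermark = datetime(2024, 9, 5, 0, 0, 0)    # last stored watermark
new_watermark = datetime(2024, 9, 6, 0, 0, 0)    # e.g. MAX(last_modify_time) in the source

query = build_delta_query("dbo.Customer", "last_modify_time")
print(query, (old_watermark, new_watermark))
# After the copy succeeds, the stored watermark is updated to new_watermark,
# so the next run only picks up rows modified after it.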
What is an API?
API stands for Application Programming Interface. It is a mechanism that enables two software
components to communicate with each other using a set of definitions and protocols. APIs are a way to
extract and share data within and across organizations.
What are REST APIs?
REST (Representational State Transfer) defines a set of functions like GET, POST, PUT, and DELETE that
clients can use to access server data. REST APIs use the HTTP protocol for communication between
clients and servers.
One of the main features of REST APIs is statelessness, meaning the server does not store any
information about client requests between requests. Each request from the client contains all the
necessary information to process that request, similar to how you visit a URL in a browser and receive
a response (typically in plain data like JSON, not a graphical web page).
Key Concepts of REST APIs:
   ● Resources/Endpoint: In REST, data and functionalities are treated as resources, each identified by a
        unique URL. For example, the public REST Countries API (documented at
        https://restcountries.com/#endpoints-all) exposes endpoints that return country data in JSON format.
   ● HTTP Methods:
           o   GET: Retrieve information about a resource.
           o   POST: Create a new resource.
           o   PUT: Update an existing resource.
           o   DELETE: Remove a resource.
   ● Stateless: Each request contains all the information the server needs to process it. The server doesn't
       maintain any session or context between requests.
   ● Representation: Resources can be represented in various formats such as JSON (most common), XML,
       HTML, or plain text.
   ● Uniform Interface: REST APIs adhere to a uniform interface, simplifying the architecture and making it
       more scalable. This includes using standard methods, resource URIs, and response codes.
   ● Client-Server Architecture: REST separates the client (which requests resources) from the server
       (which provides resources). This decoupling enhances flexibility and scalability.
By following these principles, REST APIs allow for the creation of scalable, maintainable, and flexible web
services that can be consumed by various clients, including web browsers, mobile apps, and other
servers.
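To tie the concepts above together, here is a minimal Python sketch of a GET request using the requests package against the public REST Countries API mentioned earlier; the v3.1 endpoint shown is an assumption based on that service's public documentation and may change over time.

import requests

# GET a resource: country data for India from the public REST Countries API.
response = requests.get("https://restcountries.com/v3.1/name/india", timeout=30)
response.raise_for_status()                 # raise if the server returned an error code

countries = response.json()                 # the representation is JSON
print(response.status_code)                 # e.g. 200
print(countries[0]["name"]["official"])     # e.g. "Republic of India"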
What are Microservices?
Microservices are an architectural and organizational approach to software development where
software is composed of small, independent services that communicate over well-defined APIs. These
services are managed by small, autonomous teams.
Microservices architectures make applications easier to scale and faster to develop, allowing for
quicker innovation and faster delivery of new features.
How do REST APIs and Microservices Relate?
In a microservices architecture, REST APIs are commonly used as the communication mechanism
between different services. Each microservice exposes its functionality as a RESTful API, allowing
other microservices or external systems to interact with them.
To call a REST API from ADF, you first create a REST linked service:
    1. In ADF Studio, go to Manage > Linked services and select + New.
    2. Search for REST and select the REST connector.
    3. Configure the service details, test the connection, and create the new linked
       service.
What is a Logic App?
In simple terms, a Logic App in Azure is like a digital "flowchart" that helps automate tasks and
processes without needing to write code. It connects different apps, services, and systems, letting them
work together automatically.
It’s a tool for building workflows that take care of repetitive tasks, so you don’t have to do them
manually. You set up steps (called triggers and actions) using a simple interface, and the Logic App
takes care of making everything work together.
In short, it’s a way to automate tasks across apps and services, saving time and effort!
What is the Web Activity?
The Web Activity in Azure Data Factory (ADF) is used to make HTTP requests to a web service or
API endpoint within a data pipeline. It allows your ADF pipeline to interact with external systems by
calling REST APIs or web services, retrieve data, trigger processes, or send information to external
applications.
Key Features of Web Activity:
   1. HTTP Methods: Supports HTTP methods such as GET, POST, PUT, and DELETE, which allow you to
      perform different actions depending on the API you're calling.
   2. Headers and Body: You can customize the request by adding headers (e.g., for authorization) and body
      content (e.g., sending data in JSON format).
   3. Response Handling: The Web activity can capture responses from the API, and the output can be used
      in downstream activities within the pipeline.
Use Cases:
   ● Calling APIs: You can use the Web activity to call external REST APIs, such as to retrieve data from an
       external system or send data to an application.
   ● Triggering External Processes: For example, triggering a web service to kick off a workflow in another
       system after certain pipeline actions are complete.
   ● Interacting with Cloud Services: Communicating with other Azure services or third-party cloud services
       via their API.
Example:
   ● If you want to notify an external system after data processing in ADF, you can use the Web activity to
       make a POST request to that system's API and pass the necessary information.
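The HTTP call such a Web activity would make can be pictured with the following Python sketch using the requests package; the URL, header values, and payload fields are placeholders for whatever external system you are notifying.

import requests

# Equivalent of an ADF Web activity configured with method POST,
# an Authorization header, and a JSON body.
url = "https://example.com/api/notifications"          # placeholder endpoint
headers = {
    "Authorization": "Bearer <token>",                 # placeholder credential
    "Content-Type": "application/json",
}
payload = {
    "pipeline": "CustomerToADLSPipeline",
    "status": "Succeeded",
    "rowsCopied": 12500,
}

response = requests.post(url, headers=headers, json=payload, timeout=30)
print(response.status_code, response.text)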
Alerts in ADF
To create and manage alerts, select Alerts under Monitoring in the left navigation of your Data
Factory page in the Azure portal.
You can set up email/SMS notifications at the entire Data Factory level using these alerts.
What is a Data Flow?
A Data Flow in Azure Data Factory (ADF) is a powerful feature that allows you to perform
transformations on data at scale, without writing code. It provides a visual interface where you can
define complex data transformation logic, and ADF takes care of executing it in a scalable and efficient
manner using Azure's underlying infrastructure.
Key Components of Data Flow:
    1. Source: Defines the input data for your data flow. This could be from various data stores such
        as Azure Blob Storage, Azure Data Lake, SQL databases, etc.
   2. Transformations: The core of the data flow, where you define various transformation rules.
       ADF offers several types of transformations:
           o   Filter: Filter rows based on a condition.
           o   Aggregate: Perform aggregations like sum, count, average, etc.
           o   Join: Join two datasets based on matching conditions.
           o   Sort: Sort data by specified columns.
           o   Derived Column: Create new columns or modify existing ones using expressions.
           o   Lookup: Perform lookups from external sources.
           o   Conditional Split: Split data into different streams based on conditions.
           o   Union: Combine multiple datasets into a single output.
   3. Sink: Defines where the transformed data will be written to. You can output data to a variety of
       destinations such as Azure SQL Database, Data Lake, Blob Storage, and more.
   4. Data Flow Debug: ADF provides a debug mode where you can preview and test your
      transformations using sample data, helping you refine your logic before scaling it.
Key Features of Data Flow in ADF:
   ● No-Code Environment: You can design complex data transformations without needing to write any
       code. Everything is done through a drag-and-drop interface.
   ● Scalability: ADF Data Flows are executed using Azure's Spark-based infrastructure, which means
       transformations can scale to handle large datasets.
   ● Data Flow Parameters: You can define parameters to make your data flows dynamic, allowing you to
       reuse the same data flow for different datasets or transformation rules.
   ● Mapping and Wrangling Data Flows:
           o   Mapping Data Flows: These are used for data transformation processes where you map data
               from source to destination.
           o   Wrangling Data Flows: These are designed for data preparation and are more focused on
               interactive data wrangling using Power Query.
How a Data Flow Runs:
   1. Design the Data Flow: Use the visual interface to add a source, apply transformations, and configure
      the sink.
   2. Integrate into a Pipeline: Once the data flow is designed, you can integrate it into an ADF pipeline. The
      pipeline can schedule the data flow to run at specific times or trigger it based on events.
   3. Execution: When the pipeline triggers the data flow, ADF converts the flow logic into a Spark execution
       plan and runs it on a managed Spark cluster provided by the Azure Integration Runtime.
   4. Monitor and Optimize: Use the Monitor section in ADF to track the performance of your data flows,
      troubleshoot issues, and optimize performance by scaling up compute resources if necessary.
Use Cases:
   ● ETL (Extract, Transform, Load): Transform raw data into a format that is ready for analysis or storage
       in a data warehouse.
   ● Data Cleansing: Remove or correct errors in the data before loading it into a target system.
● Data Aggregation: Summarize data (e.g., sales data by region) before reporting or analysis.
● Data Enrichment: Combine data from multiple sources, adding additional information to datasets.
Benefits of Data Flow:
   ● Ease of Use: You don’t need to be a developer or data engineer to define transformations; everything
       is visual.
   ● Cost-Effective: You can choose the compute resources you need, only paying for what you use when
       the data flow runs.
   ● Flexible: Support for a wide range of data sources and destinations, along with numerous
       transformations, makes it versatile for many data scenarios.
SCD Types:
OLTP vs OLAP
Why Use Azure Key Vault with ADF?
   1. Security: Secrets are never exposed in pipeline code, reducing the risk of data breaches.
   2. Simplified Management: Centralized secret management means that if credentials change, you only
      need to update them in Key Vault.
   3. Compliance: Using Key Vault helps you comply with security and privacy standards by protecting
      sensitive data.
   4. Automatic Access: You can use ADF’s Managed Identity to control access to Key Vault, providing a
      secure, seamless experience without manually handling credentials.
In summary, integrating Azure Key Vault with ADF provides a secure and efficient way to manage
sensitive information in your data pipelines, improving both security and ease of management.
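Inside ADF you would normally reference a Key Vault secret from a linked service rather than write code, but the underlying lookup looks roughly like the Python sketch below; it assumes the azure-identity and azure-keyvault-secrets packages, and the vault URL and secret name are placeholders.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The managed identity (or your az login session) must have permission
# to read secrets in this vault.
client = SecretClient(
    vault_url="https://<your-vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

# Retrieve a database password without ever hard-coding it in pipeline code.
secret = client.get_secret("sql-connection-password")     # placeholder secret name
connection_password = secret.value
print("Secret retrieved, length:", len(connection_password))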
Integration Runtime
Integration Runtime (IR) in Azure Data Factory (ADF) is the compute infrastructure that ADF uses to provide
data integration across different network environments. It is responsible for the movement of data and the
execution of data transformation activities. There are different types of Integration Runtimes to handle
different scenarios, such as connecting to cloud, on-premises, or hybrid data sources.
   ● Azure Integration Runtime: Best for cloud-native data processing and movement between Azure
resources.
   ● Self-hosted Integration Runtime: Ideal for hybrid or on-premises scenarios where you need to
connect ADF with on-premises data sources.
   ● Azure-SSIS Integration Runtime: Used for running SSIS packages in the cloud, typically for
customers migrating their ETL processes from on-premises to Azure.
1. Pipeline Parameters:
Pipeline Parameters are values that you define at the pipeline level, which can be passed in when you
trigger or invoke the pipeline. Parameters allow you to make your pipeline dynamic, enabling you to
reuse the same pipeline for different data sources or configurations by passing in different parameter
values.
Key Features of Pipeline Parameters:
    ● Definition: Parameters are defined at the pipeline level and can be passed to activities within the
        pipeline.
    ● Static during Execution: Once a pipeline starts executing, the parameter values remain constant
        throughout the execution.
    ● Scope: They are scoped to the pipeline and cannot be changed dynamically during pipeline execution.
● Usage: Parameters are typically used for things like file paths, table names, or filtering conditions.
For example, if you define a pipeline parameter named filePath, you can reference it in activities within the
pipeline using:
@pipeline().parameters.filePath
Use Cases:
    ● Passing file paths, table names, or filter values into the pipeline when it is triggered.
    ● Reusing the same pipeline for different data sources, environments, or configurations.
2. Pipeline Variables:
Variables in ADF are used to store temporary values that can be changed during the execution of the
pipeline. They allow dynamic behavior within the pipeline, as their values can be updated or modified
as the pipeline progresses.
Key Features of Pipeline Variables:
   ● Definition: Variables are defined within the pipeline and can be assigned values during execution using
       activities like the Set Variable or Append Variable activities.
   ● Dynamic during Execution: Unlike parameters, variables can change during pipeline execution.
   ● Scope: Variables are scoped to the pipeline, and their values can only be used and changed within the
       pipeline.
   ● Data Types: Variables can be of types such as String, Array, or Boolean.
Example:
If you define a variable named counter, you can update its value during execution using the Set Variable
activity and reference it within expressions:
@variables('counter')
Use Cases:
    ● Maintaining a counter or accumulating values (for example, with the Append Variable activity) while
        iterating in a ForEach loop.
    ● Storing an intermediate result from one activity so it can be reused by later activities in the same run.
3. Global Parameters:
Global Parameters are parameters defined at the Data Factory level, making them available to all
pipelines across the Data Factory. They provide a convenient way to define common values that need
to be reused across multiple pipelines.
Key Features of Global Parameters:
● Scope: Available globally within the Data Factory and can be accessed from any pipeline.
   ● Static during Execution: Once a pipeline using a global parameter is triggered, the global parameter’s
       value remains constant.
   ● Usage: You can reference global parameters in pipelines or activities the same way you reference
       pipeline parameters.
Example:
If you have an environment-specific value such as a storage account name that is common across
pipelines, you can define it as a global parameter (for example, storageAccountName) under Manage > Global
parameters in ADF Studio. In any pipeline, you can reference this global parameter using:
@globalParameters('storageAccountName')
Use Cases:
   ● Environment-specific values like connection strings, storage account names, or file paths used across
       multiple pipelines.
   ● Centralized management of values that are used across different pipelines (e.g., URLs, database
       names).
Examples of Big Data in Everyday Life:
   ● Social Media: Platforms like Facebook, Twitter, and Instagram generate vast amounts of data every
       second. Every post, like, comment, and share contributes to a massive collection of data.
   ● Online Shopping: Websites like Amazon track millions of transactions, customer preferences, and
       product searches every day.
   ● Smart Devices: Smartphones, smartwatches, and IoT devices collect and transmit data continuously,
       from tracking steps to monitoring home security.
How Hadoop Works:
   1. Storage: Hadoop divides the data (think of it as a big library) into many smaller sections (called
      "blocks") and stores them across multiple shelves (computers) in a way that it's easy to find the
      information you need.
   2. Processing: When you want to find information (like searching for a book), Hadoop sends many
      librarians (workers) to different shelves to fetch the data at the same time. This makes the search
      process much faster.
Everyday Analogy:
   ● Hadoop as a Pizza Delivery Service: Imagine you want to order 100 pizzas for a party. If you rely on
       just one delivery person, it will take a long time to deliver all the pizzas. But what if you have 100
       delivery people, each delivering one pizza to the party at the same time? The delivery will be much
       faster! Hadoop works similarly by using many computers (delivery people) to process and deliver parts
       of the data quickly.
Why Use Hadoop?
   ● Handles Huge Amounts of Data: Hadoop can store and process large amounts of data efficiently, much
       more than a regular computer.
   ● Cost-Effective: It uses a network of simple, inexpensive computers to handle data instead of relying on
       a single, super-expensive machine.
   ● Fault Tolerance: If one computer (librarian) fails, Hadoop can still find the data using other computers,
       ensuring the system doesn't crash.
2. HDFS (Hadoop Distributed File System)
   ● What is HDFS?
           o   Simple Explanation: HDFS is like a giant digital storage system designed to store huge amounts
               of data across multiple computers.
           o   How It Works:
                   ▪   Breaks large files into smaller pieces.
                   ▪   Stores these pieces across different computers to manage storage efficiently.
                   ▪   Keeps multiple copies of data to ensure safety and reliability.
   ● Why Use HDFS?
           o   Scalability: Can handle massive datasets by distributing them across many computers.
           o   Reliability: If one computer fails, HDFS retrieves the data from another copy.
3. MapReduce
   ● What is MapReduce?
           o   Simple Explanation: MapReduce is a method to process big data using many helpers
               (computers). It splits a large task into smaller tasks, processes them in parallel, and then
               combines the results.
           o   How It Works:
                   ▪   Map Phase: Divides the big task into smaller parts and processes each part on a
                       separate computer.
                   ▪ Reduce Phase: Collects and combines the results from all computers to produce the
                       final output.
   ● Why Use MapReduce?
           o   Parallel Processing: Processes large datasets quickly by working on multiple parts at the same
               time.
           o   Efficiency: Handles complex tasks more efficiently by dividing the work.
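To make the two phases concrete, here is a minimal word-count sketch. It uses PySpark's RDD API rather than Hadoop's Java MapReduce engine, but the Map and Reduce steps correspond directly; the input path is a placeholder assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
sc = spark.sparkContext

# Map phase: split each line into words and emit (word, 1) pairs.
lines = sc.textFile("/path/to/input/text")   # placeholder input path
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Reduce phase: combine the counts for each word across all partitions.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.take(10))   # action: triggers the job and prints a sample of results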
1. What are Blocks in HDFS?
   ● Blocks are the fundamental units of data storage in HDFS (Hadoop Distributed File System). When you store a large file in HDFS, it splits the file into smaller, fixed-size chunks called blocks.
   ● Why Use Blocks?:
           o   Manageability: Breaking a large file into smaller blocks makes it easier to manage, store, and
               process the file across multiple computers.
           o   Parallel Processing: By dividing a file into blocks, HDFS can distribute these blocks across
               different computers (nodes) in the cluster, allowing for parallel processing of the data.
   ● Block Size:
           o   The default block size in HDFS is typically 128 MB or 256 MB. This size can be configured
               depending on your needs.
   ● Example:
           o   If you have a 600 MB file and the block size is set to 128 MB, HDFS will split this file into 5
               blocks:
                    ▪   4 blocks of 128 MB each
                    ▪   1 block of the remaining 88 MB
2. What is Replication Factor in HDFS?
   ● Replication Factor determines how many copies of each block HDFS will create and store across the
       cluster.
   ● Why Replication Matters:
           o      Data Reliability: If one computer (node) holding a block fails, other copies of the block are still
                  available on other nodes. This ensures that the data is not lost and can be accessed even if a
                  part of the system fails.
           o      High Availability: By having multiple copies of the same block, HDFS can provide high
                  availability, making sure data is accessible whenever needed.
   ● Default Replication Factor:
           o      The default replication factor in HDFS is 3. This means HDFS creates 3 copies of each block and
                  distributes them across different nodes.
   ● Example:
           o      Continuing with the earlier example of a 600 MB file split into 5 blocks:
                      ▪   If the replication factor is 3, HDFS will create 3 copies of each of the 5 blocks.
                      ▪   In total, HDFS will store 15 blocks (5 blocks × 3 copies) across the cluster.
   ● How it Works:
           o      When a file is saved in HDFS, it is split into blocks. Each block is then copied and stored on
                  different nodes (computers) in the cluster. The system ensures that these copies are placed on
                  different nodes to avoid data loss in case one node fails.
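A quick back-of-the-envelope check of the example above, as a plain Python sketch (all numbers are the illustrative values used in this section):

import math

file_size_mb = 600        # example file size
block_size_mb = 128       # HDFS block size
replication_factor = 3    # default HDFS replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)    # 5 blocks (4 full + 1 partial)
total_block_copies = num_blocks * replication_factor    # 15 block copies across the cluster
raw_storage_mb = file_size_mb * replication_factor      # about 1800 MB of raw storage used

print(num_blocks, total_block_copies, raw_storage_mb)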
1. Structured Data
   ● Description: Structured data is organized in a defined format, often in tables with rows and columns. It follows a specific schema (structure) that makes it easy to search, query, and analyze using traditional tools like SQL databases.
   ● Examples:
           o      Relational Databases: Tables in databases like SQL Server, MySQL, or Oracle.
           o      Spreadsheets: Excel files with data arranged in rows and columns.
   ● Characteristics:
           o      Follows a fixed format (e.g., numbers, strings).
           o      Easy to query using SQL.
           o      Examples include customer records, transaction data, and sales data.
2. Unstructured Data
   ● Description: Unstructured data does not have a predefined structure or schema. It can be text-heavy
       or contain various types of information that don’t fit neatly into a table.
   ● Examples:
           o   Text Files: Logs, emails, Word documents.
           o   Multimedia Files: Images, videos, audio files.
           o   Social Media: Posts, tweets, comments.
   ● Characteristics:
           o   Does not follow a fixed format.
           o   More complex to search, analyze, and process.
           o   Requires specialized tools and techniques like Natural Language Processing (NLP) or machine
               learning for analysis.
3. Semi-Structured Data
   ● Description: Semi-structured data has some organizational properties but does not adhere strictly to a
       fixed schema like structured data. It is more flexible and can contain varying types of information.
   ● Examples:
           o   JSON: JavaScript Object Notation used in APIs and NoSQL databases.
           o   XML: Extensible Markup Language used for data exchange between systems.
           o   CSV Files: Comma-separated values that contain structured data but can have variability in
               format.
   ● Characteristics:
           o   Contains markers or tags (like in XML or JSON) to identify different elements.
           o   More flexible than structured data but still provides some level of organization.
           o   Often used for data exchange between different systems.
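To see the difference in practice, here is a small Python sketch (standard library only; the employee record and field names are made-up illustrative values) that holds the same information once as a structured CSV row and once as a semi-structured JSON document:

import csv
import io
import json

# Structured: a fixed schema, every row has the same columns in the same order.
csv_text = "id,name,age\n1,Alice,30\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])    # easy to query because the structure is known up front

# Semi-structured: keys/tags describe the data, and fields can vary or be nested.
json_text = '{"id": 1, "name": "Alice", "skills": ["SQL", "Spark"], "address": {"city": "Pune"}}'
record = json.loads(json_text)
print(record["skills"])   # nested lists and objects are allowed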
Data Locality in Hadoop
Why Data Locality Matters:
   ● Large Data Volumes: Hadoop deals with vast amounts of data (big data). Transferring this data across the network to the processing nodes can be time-consuming and resource-intensive.
   ● Network Bottlenecks: Moving large datasets over the network can cause congestion and slow down processing.
   ● Efficiency: By processing data on the same node where it is stored, Hadoop minimizes data transfer, resulting in faster data processing and reduced network overhead.
How Data Locality Works:
   1. Data Distribution: In Hadoop, data is divided into smaller pieces called blocks (usually 128 MB or 256
      MB) and distributed across different nodes (computers) in the cluster.
   2. Processing: When a job is submitted to process the data (like a MapReduce job), Hadoop tries to
      schedule the task on the node where the data block is stored. This approach ensures that data
      processing happens locally on the node.
   3. Types of Data Locality:
            o Data-Local: The task is executed on the same node where the data block is stored. This is the
               ideal scenario for maximum efficiency.
            o Rack-Local: If a data-local node is not available, the task is executed on a node within the same
               rack (group of nodes). It involves minimal network transfer within the rack.
            o Off-Rack (Remote): If both data-local and rack-local nodes are unavailable, the task is executed
               on a different rack. This involves more network transfer and is the least efficient option.
Example Scenario
Imagine a Hadoop cluster with 10 nodes, and a large file is stored across these nodes in 100 blocks.
When you run a data processing job on this file:
   ● Data Locality: Hadoop will try to run the processing tasks on the nodes where the data blocks are
       stored. If Block 1 is on Node A, Hadoop will schedule the task for Block 1 to run on Node A.
   ● Efficiency: By doing this, the job doesn't need to move the data block across the network to another
       node for processing, which speeds up the job.
Benefits of Data Locality:
   ● Faster Data Processing: By processing data where it resides, Hadoop reduces the time taken for data transfer, leading to quicker job completion.
   ● Reduced Network Load: Minimizing data movement over the network decreases the risk of network congestion and reduces the strain on network resources.
   ● Scalability: Data locality allows Hadoop to scale efficiently, even as the volume of data grows.
Why Apache Spark? Key Advantages
   ● Speed: Spark is significantly faster than Hadoop MapReduce, especially for iterative and real-time processing tasks.
   ● Ease of Use: Spark's high-level APIs and built-in libraries make it easier and quicker to write complex data processing applications.
   ● Versatility: Spark's ability to handle batch processing, streaming, machine learning, and graph processing makes it a one-stop solution for various data processing needs.
   ● Real-Time Processing: Spark's support for real-time analytics makes it suitable for use cases that require immediate insights.
   ● Efficiency: Spark's in-memory processing model reduces the need for disk I/O, making it more efficient for large-scale data processing.
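A minimal PySpark sketch of these points, assuming an input CSV at a placeholder path with region, product, and amount columns: the high-level DataFrame API keeps the code short, and cache() keeps the data in memory so the second aggregation does not re-read the file.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkAdvantagesSketch").getOrCreate()

# One line to read a CSV with a header and an inferred schema.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/path/to/sales.csv")

df.cache()   # keep the data in memory for the repeated queries below

# Two aggregations over the same cached data; no second read from disk is needed.
df.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()
df.groupBy("product").count().show()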
Writing Data with df.write: Common Options
   ● mode: Determines the behavior if the output path already exists. Options include:
           o overwrite: Overwrites existing files.
           o append: Appends the new data to the existing data.
           o ignore: Skips the write operation if the output already exists.
           o error or errorifexists (the default): Throws an error if the output already exists.
   ● option: Sets additional options for the specific file format (e.g., header, compression).
1. Writing as a CSV File
   ● Syntax:
        df.write.format("csv").option("header", "true").mode("overwrite").save("/path/to/output/csv")
   ● Explanation:
          o    format("csv"): Specifies CSV as the file format.
          o    option("header", "true"): Includes the header in the CSV file.
          o    mode("overwrite"): Overwrites existing files at the destination.
          o    save("/path/to/output/csv"): Specifies the output path.
2. Writing as a JSON File
   ● Syntax:
        df.write.format("json").mode("overwrite").save("/path/to/output/json")
   ● Explanation:
          o    format("json"): Specifies JSON as the file format.
          o    JSON files do not have an option for headers as each line is a valid JSON object.
          o    save("/path/to/output/json"): Writes the DataFrame to the specified path in JSON
               format.
3. Writing as a Parquet File
   ● Syntax:
        df.write.format("parquet").mode("overwrite").save("/path/to/output/parquet")
   ● Explanation:
          o    format("parquet"): Specifies Parquet as the file format, which is a columnar storage
               format.
          o    save("/path/to/output/parquet"): Writes the DataFrame in Parquet format, preserving
               the schema and data types.
4. Writing as an ORC File
   ● Syntax:
        df.write.format("orc").mode("overwrite").save("/path/to/output/orc")
   ● Explanation:
            o format("orc"): Specifies ORC as the file format, which is efficient for read-heavy operations.
            o save("/path/to/output/orc"): Writes the DataFrame in ORC format.
df.write.format("avro").mode("overwrite").save("/path/to/output/avro")
   ● Explanation:
           o   format("avro"): Specifies Avro as the file format, which is a compact and fast binary
               format.
           o   save("/path/to/output/avro"): Writes the DataFrame to Avro format.
6. Saving as a Table
   ● Syntax:
        df.write.mode("overwrite").saveAsTable("database_name.table_name")
   ● Explanation:
           o   saveAsTable("database_name.table_name"): Writes the DataFrame to a Hive table.
           o   Hive must be enabled and configured in the Spark session for this to work.
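For example, the session can be built with Hive support before calling saveAsTable. This is a minimal sketch, and the database and table names (demo_db.employee) are placeholder assumptions.

from pyspark.sql import SparkSession

# enableHiveSupport() lets saveAsTable register the table in the Hive metastore.
spark = SparkSession.builder.appName("SaveAsTableSketch").enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
df.write.mode("overwrite").saveAsTable("demo_db.employee")   # placeholder names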
7. Writing to a Single Output File
   ● Syntax:
        df.coalesce(1).write.format("csv").option("header", "true").save("/path/to/output/single_csv_file")
   ● Explanation:
           o   coalesce(1): Reduces the DataFrame to a single partition, resulting in a single output file.
8. Specifying Compression
   ● Syntax:
       df.write.format("parquet").option("compression",
       "snappy").save("/path/to/output/compressed_parquet")
   ● Explanation:
           o   option("compression", "snappy"): Specifies the compression type (e.g., "snappy",
               "gzip") for the output file.
9. Partitioning the Output
   ● Syntax:
        df.write.partitionBy("year", "month").format("parquet").save("/path/to/output/partitioned_parquet")
   ● Explanation:
           o   partitionBy("year", "month"): Partitions the data by "year" and "month" columns,
               creating subdirectories for each unique combination of these column values.
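Once the data is written this way, filtering on the partition columns lets Spark read only the matching subdirectories (partition pruning). A short sketch, assuming an existing SparkSession named spark and the output path used above:

# Read the partitioned Parquet data back and filter on the partition columns.
# Spark then scans only the year=2024/month=1 subdirectory instead of the full dataset.
df = spark.read.parquet("/path/to/output/partitioned_parquet")
jan_2024 = df.filter((df.year == 2024) & (df.month == 1))
jan_2024.show()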
1. Spark SQL
   ● What is a Metastore?
       The Metastore in Spark SQL and Hive is a central repository that stores metadata information about
       the data. This metadata includes information about tables, databases, columns, data types, and the
       location of data files. In the context of Spark, the Metastore can be either the default embedded
       metastore (using Derby) or an external metastore like Apache Hive.
   ● Key Features:
           o   Metadata Management: Stores information about table schemas, partitions, and storage
               formats.
           o   Interoperability: Allows different Spark sessions and other tools like Hive to access and
               manage data consistently.
   ● Types:
           o   Embedded Metastore: Usually for single-user or development use cases (default in Spark using
               Derby).
           o   External Metastore (Hive Metastore): A more scalable solution using external databases like
               MySQL, Postgres, etc., suitable for production environments.
   ● Example: When you create a table in Spark SQL using CREATE TABLE, the metastore records the
       table's schema and storage information.
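A short sketch of this, assuming a SparkSession named spark with Hive support; the database and table names are placeholder assumptions. After CREATE TABLE, the catalog (backed by the metastore) can list the table and its columns:

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.employee (
        id INT,
        name STRING,
        age INT
    )
""")

# The metastore now holds the schema and storage details of demo_db.employee.
print(spark.catalog.listTables("demo_db"))
print(spark.catalog.listColumns("employee", "demo_db"))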
3. Hive Warehouse
   ● The Hive Warehouse is the directory where the data files of managed tables are stored.
            o   For example, if you create a managed table named employee, its data will be stored in a subdirectory within the Hive Warehouse directory (/user/hive/warehouse/employee).
How Spark SQL, the Metastore, and the Hive Warehouse Fit Together:
   1. Spark SQL allows you to run SQL queries on structured data stored in various formats and locations. When you create or query tables, Spark interacts with the Metastore to understand the structure and location of the data.
   2. The Metastore acts as a catalog, storing metadata about tables, databases, and columns. When you
      query a table, Spark uses the metastore to locate the data.
   3. The Hive Warehouse is the physical storage location where table data is stored. When you create a
       managed table in Spark SQL with Hive support, the data is typically stored in the Hive Warehouse
       directory.
   4. The Default Location (/user/hive/warehouse) is where managed table data is stored by default in HDFS. You can change this location by setting the spark.sql.warehouse.dir configuration in Spark.
Example Workflow
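A minimal end-to-end sketch of such a workflow, assuming Hive support; the warehouse path, database name, and sample row are placeholder values:

from pyspark.sql import SparkSession

# 1. Point Spark at a warehouse directory and enable Hive support,
#    so that managed table data lands under this path.
spark = (SparkSession.builder
         .appName("MetastoreWorkflowSketch")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

# 2. Create a database and a managed table; the metastore records the schema.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.employee (id INT, name STRING, age INT)")

# 3. Insert and query data; Spark uses the metastore to locate the data files,
#    which are stored under the warehouse directory for this table.
spark.sql("INSERT INTO demo_db.employee VALUES (1, 'Alice', 30)")
spark.sql("SELECT * FROM demo_db.employee").show()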
RDD (Resilient Distributed Dataset)
1. Resilient
    ● Fault Tolerance: RDDs are inherently fault-tolerant. If a part of the RDD is lost due to a node failure, Spark can automatically recompute the lost partitions using the lineage information. This lineage is essentially a logical execution plan that tracks the sequence of transformations applied to the dataset.
    ● Recomputing Data: RDDs can be recomputed efficiently using lineage, avoiding the need to save intermediate data to disk after every transformation.
2. Distributed
    ● Partitioning: RDDs are distributed across multiple nodes in a cluster. They are divided into partitions,
        which can be processed in parallel. Each partition represents a portion of the overall dataset.
    ● Parallel Processing: The distributed nature of RDDs allows Spark to perform parallel processing,
        improving the speed of data processing on large datasets.
3. Dataset
    ● Immutable Collection: An RDD is an immutable collection of objects, meaning once an RDD is created,
        it cannot be changed. However, you can create new RDDs by applying transformations to existing ones.
    ● Typed: RDDs are typed in languages like Scala and Java. For example, an RDD of integers will have a
        type RDD[Int] in Scala.
Characteristics of RDDs
    1. Immutable: Once created, RDDs cannot be altered. Any operation on an RDD returns a new RDD.
    2. Lazily Evaluated: Transformations on RDDs are lazily evaluated. This means Spark will not execute the
       transformations until an action (like collect or count) is called. This allows Spark to optimize the
       execution plan for better performance.
    3. Fault Tolerant: Spark keeps track of the lineage of each RDD, meaning it knows how to recompute
       RDDs from their original data if a partition is lost.
    4. In-Memory Processing: RDDs are primarily designed for in-memory processing, which can significantly
       speed up data processing tasks compared to disk-based processing.
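A small PySpark sketch of these characteristics: the transformations below only build up lineage (lazy evaluation), and nothing executes until the actions at the end; the numbers are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddSketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11), numSlices=4)   # distributed across 4 partitions

# Transformations: recorded lazily in the lineage, nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

evens.cache()   # keep the result in memory for reuse (in-memory processing)

# Actions: trigger execution of the whole lineage.
print(evens.count())     # 5
print(evens.collect())   # [4, 16, 36, 64, 100]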
Advantages of RDDs
● Resilience: RDDs are fault-tolerant with the ability to recover lost data using lineage.
● Parallel Processing: RDDs enable distributed and parallel processing of large datasets.
   ● In-Memory Computing: RDDs can be cached in memory, reducing the time for iterative and interactive
       operations.
Disadvantages of RDDs
   ● Low-Level API: RDDs provide a low-level API for distributed data processing, which can be complex and
       verbose compared to higher-level abstractions like DataFrames and Datasets.
   ● Lack of Optimization: Unlike DataFrames, RDDs do not benefit from Spark's Catalyst optimizer, making
       them less efficient for many operations.
1. Managed Tables
Definition:
     ● A Managed Table in Spark SQL is a table where Spark manages both the metadata and the data. When you create a managed table, Spark stores the data in a default location (usually in the warehouse directory specified by spark.sql.warehouse.dir), and Spark is responsible for managing the lifecycle of the table data.
Key Characteristics:
    ● Data Location: Data for managed tables is stored in the default warehouse directory (e.g.,
        /user/hive/warehouse on HDFS) unless a different location is specified during table creation.
    ● Data Lifecycle: Spark takes full responsibility for the data. If you drop a managed table, Spark deletes
        both the table's metadata and the underlying data files.
    ● Automatic Data Management: When you insert, update, or delete data in a managed table, Spark
        handles the data storage automatically.
When to Use Managed Tables:
     ● Full Control: When you want Spark to handle the data management, including storage and deletion.
    ● Temporary Data: When the data is transient or temporary, and you want it to be cleaned up
        automatically when no longer needed.
Behavior on Drop:
    ● When you execute DROP TABLE employee, Spark deletes both the table schema and the underlying
        data files from the storage.
2. External Tables
Definition:
    ● An External Table in Spark SQL is a table where Spark manages only the metadata, not the actual data.
        The data for an external table resides outside of Spark's control in a location specified by the user (e.g.,
        an HDFS directory, Amazon S3, Azure Blob Storage).
Key Characteristics:
    ● Data Location: The data location is specified by the user using the LOCATION clause when creating the
        table. Spark only stores metadata about this table, like its schema and location.
    ● Data Lifecycle: Spark does not manage the lifecycle of the data. Dropping an external table only
        removes the table's metadata from the metastore; the underlying data files remain intact.
    ● User-Controlled Data Management: Users must manage the data's storage and lifecycle
        independently.
-- Inserting data into the external table (assuming the data is managed externally)
INSERT INTO external_employee VALUES (3, 'Charlie', 35);
When to Use External Tables:
    ● Data Reusability: When the data is shared between different applications or systems and needs to be
        reused outside Spark.
    ● Existing Data: When you have existing data files that you want to query using Spark without Spark
        taking ownership of these files.
    ● Data Persistence: When you want the data to persist beyond the lifecycle of the Spark job or
        application.
Behavior on Drop:
    ● When you execute DROP TABLE external_employee, Spark deletes the table's metadata from the
        metastore, but the data files at /path/to/external/data/employee remain untouched.
Summary Table
 Aspect               | Managed Table                                                                                 | External Table
 Data Location        | Default warehouse directory (/user/hive/warehouse) or a location specified during creation.  | User-specified location (e.g., HDFS, S3, Blob Storage).
 Data Management      | Managed by Spark. Spark takes care of data storage, deletion, etc.                           | Managed by the user. Spark only manages metadata.
 Data Deletion        | Data is deleted when the table is dropped.                                                   | Data is not deleted when the table is dropped.
 Use Cases            | Temporary data, data fully managed by Spark.                                                 | Shared or existing data managed outside Spark.
 Lifecycle Management | Spark handles the entire lifecycle.                                                          | Users must manage the data lifecycle externally.
Working with Managed and External Tables in Spark
1. Creating a Managed Table
CREATE TABLE managed_employee (
   id INT,
   name STRING,
   age INT
);
2. Creating an External Table
● External tables require the LOCATION clause to specify where the data lives.
● Data remains at the specified location and is not moved or deleted by Spark.
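A hedged sketch of creating such an external table from PySpark (the Parquet format, the location path, and the session variable spark are assumptions; the data files at that location are expected to exist or be written by another process):

# Metadata only: Spark records the schema and the LOCATION in the metastore,
# while the files under the path stay under the user's control.
spark.sql("""
    CREATE TABLE IF NOT EXISTS external_employee (
        id INT,
        name STRING,
        age INT
    )
    USING PARQUET
    LOCATION '/path/to/external/data/employee'
""")

spark.sql("SELECT * FROM external_employee").show()

# Dropping the table removes only the metadata; the files at the LOCATION remain.
spark.sql("DROP TABLE IF EXISTS external_employee")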