CCD 4,5,6
A data pipeline typically includes a series of steps for extracting data from a source, transforming
and cleaning it, and loading it into a destination system, such as a database or a data warehouse.
Data pipelines can be used for a variety of purposes, including data integration, data
warehousing, automated data migration, and analytics.
Before a pipeline is built, engineers need to understand the source data's characteristics, such as
data formats, data structures, data schemas and data definitions -- information that's needed to plan
and build the pipeline.
Building an effective data pipeline depends on having the right people and skills:
1) Many data pipelines are built by data engineers or big data engineers.
2) To create effective pipelines, it's critical that they develop their soft skills -- meaning their
interpersonal and communication skills.
3) This will help them collaborate with data scientists, other analysts and business
stakeholders to identify user requirements and the data needed to meet them before
launching a data pipeline development project.
4) Such skills are also necessary for ongoing conversations to prioritize new development plans
and manage existing data pipelines.
The following best practices help in developing and managing data pipelines:
1) Manage the development of a data pipeline as a project, with defined goals and delivery dates.
2) Document data lineage information so the history, technical attributes and business meaning
of data can be understood.
3) Ensure that the proper context of data is maintained as it's transformed in a pipeline.
4) Create reusable processes or templates for data pipeline steps to streamline development.
5) Avoid scope creep that can complicate pipeline projects and create unrealistic expectations
among users.
Once it's in place, the data pipeline typically involves the following steps:
Data ingestion: Raw data from one or more source systems is ingested into the data pipeline. Depending
on the data set, data ingestion can be done in batch or real-time mode.
Data integration: If multiple data sets are being pulled into the pipeline for use in analytics or
operational applications, they need to be combined through data integration processes.
Data cleansing: For most applications, data quality management measures are applied to the raw data
in the pipeline to ensure that it's clean, accurate and consistent.
Data filtering: Data sets are commonly filtered to remove data that isn’t needed for the particular
applications the pipeline was built to support.
Data transformation: The data is modified as needed for the planned applications.
Examples of data transformation methods include aggregation, generalization, reduction and smoothing.
Data enrichment: In some cases, data sets are augmented and enriched as part of the pipeline through
the addition of more data elements required for applications.
Data validation: The finalized data is checked to confirm that it is valid and fully meets the application
requirements.
Data loading: For BI and analytics applications, the data is loaded into a data store so it can be accessed
by users. Typically, that's a data warehouse, a data lake or a data lakehouse, which combines elements
of the other two platforms.
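A minimal Python sketch of these steps, assuming a hypothetical orders.csv source file and a SQLite table standing in for the destination warehouse (file, table and column names are illustrative only):

```python
import csv
import sqlite3

def ingest(path):
    """Data ingestion: read raw records from a hypothetical CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_and_filter(rows):
    """Data cleansing and filtering: drop incomplete rows, keep only needed fields."""
    cleaned = []
    for row in rows:
        if row.get("order_id") and row.get("amount"):
            cleaned.append({"order_id": row["order_id"],
                            "amount": float(row["amount"]),
                            "region": row.get("region", "unknown").strip().lower()})
    return cleaned

def transform(rows):
    """Data transformation: aggregate order amounts by region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return [{"region": r, "total_amount": t} for r, t in totals.items()]

def validate(rows):
    """Data validation: confirm the finalized data meets basic expectations."""
    assert all(r["total_amount"] >= 0 for r in rows), "negative totals found"
    return rows

def load(rows, db_path="warehouse.db"):
    """Data loading: write the final data set to a SQLite table standing in for a warehouse."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total_amount REAL)")
    con.executemany("INSERT INTO sales_by_region VALUES (:region, :total_amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(validate(transform(clean_and_filter(ingest("orders.csv")))))
```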
Many data pipelines also apply machine learning and neural network algorithms to create more
advanced data transformations and enrichments. This includes segmentation, regression analysis,
clustering and the creation of advanced indices and propensity scores.
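As one hedged illustration of such an enrichment, the sketch below adds a customer segment label with k-means clustering, assuming scikit-learn is available; the customer records and feature names are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

def enrich_with_segments(records, n_segments=3):
    """Add a segment label to each record using k-means clustering over numeric features."""
    features = np.array([[r["annual_spend"], r["visits_per_month"]] for r in records])
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(features)
    for record, label in zip(records, labels):
        record["segment"] = int(label)
    return records

customers = [
    {"id": 1, "annual_spend": 120.0, "visits_per_month": 2},
    {"id": 2, "annual_spend": 950.0, "visits_per_month": 9},
    {"id": 3, "annual_spend": 880.0, "visits_per_month": 8},
    {"id": 4, "annual_spend": 150.0, "visits_per_month": 1},
    {"id": 5, "annual_spend": 500.0, "visits_per_month": 5},
]
print(enrich_with_segments(customers))
```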
In addition, logic and algorithms can be built into a data pipeline to add intelligence.
As machine learning -- and, especially, automated machine learning (AutoML) -- processes become
more prevalent, data pipelines likely will become increasingly intelligent. With these processes,
intelligent data pipelines could continuously learn and adapt based on the characteristics of source
systems, required data transformations and enrichments, and evolving business and application
requirements.
Data pipeline architectures vary in their characteristics and use cases. Some of the most
common types include:
Batch Processing: Data is processed in batches at set intervals, such as daily or weekly.
Lambda Architecture: A combination of batch and real-time processing, where a batch layer computes
views over historical data while a speed layer processes newly arriving data in real time, and the two
sets of results are merged when queried (see the sketch after this list).
Kappa Architecture: Similar to Lambda architecture but with a single processing path: all data is
ingested in real time and processed only once, as a stream.
ETL (Extract, Transform, Load) Architecture: Data is extracted from various sources, transformed
to fit the target system, and loaded into the target system.
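A toy sketch of the Lambda pattern, with in-memory lists standing in for the batch store and the incoming stream (no particular framework is implied):

```python
from collections import Counter

def batch_layer(historical_events):
    """Batch view: counts precomputed over all historical data (recomputed on a schedule)."""
    return Counter(e["page"] for e in historical_events)

def speed_layer(recent_events):
    """Speed view: incremental counts over events that arrived since the last batch run."""
    return Counter(e["page"] for e in recent_events)

def serving_layer(batch_view, speed_view):
    """Merge the two views at query time to answer with complete, up-to-date results."""
    return batch_view + speed_view

historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
recent = [{"page": "/home"}, {"page": "/docs"}]
print(serving_layer(batch_layer(historical), speed_layer(recent)))
# Counter({'/home': 3, '/pricing': 1, '/docs': 1})
```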
A data pipeline architecture is essential for several reasons:
Scalability: Data pipeline architecture should allow for the efficient processing of large amounts of data,
enabling organizations to scale their data processing capabilities as their data volume increases.
Reliability: A well-designed data pipeline architecture ensures that data is processed accurately and
reliably. This reduces the risk of errors and inaccuracies in the data.
Efficiency: Data pipeline architecture streamlines the data processing workflow, making it more
efficient and reducing the time and resources required to process data.
Flexibility: It allows for the integration of different data sources and the ability to adapt to changing
business requirements.
Security: Data pipeline architecture enables organizations to implement security measures, such as
encryption and access controls, to protect sensitive data.
Data Governance: Data pipeline architecture allows organizations to implement data governance
practices such as data lineage, data quality, and data cataloguing that help maintain data accuracy,
completeness, and reliability.
Data pipelines can be compared to the plumbing system in the real world. Both are crucial channels that
meet basic needs, whether it’s moving data or water. Both systems can malfunction and require
maintenance.
In many companies, a team of data engineers will design and maintain data pipelines.
Data pipelines should be automated as much as possible to reduce the need for manual supervision.
However, even with data automation, businesses may still face challenges with their data pipelines:
1. Complexity: In large companies, there could be a large number of data pipelines in operation.
Managing and understanding all these pipelines at scale can be difficult, such as identifying which
pipelines are currently in use, how current they are, and what dashboards or reports rely on them.
In an environment with multiple data pipelines, tasks such as complying with regulations and
migrating to the cloud can become more complicated.
2. Cost: Building data pipelines at a large scale can be costly. Advancements in technology, migration
to the cloud, and demands for more data analysis may all require data engineers and developers to
create new pipelines. Managing multiple data pipelines may lead to increased operational expenses
as time goes by.
3. Efficiency: Data pipelines may lead to slow query performance depending on how data is replicated
and transferred within an organization. When there are many simultaneous requests or large
amounts of data, pipelines can become slow, particularly in situations that involve multiple data
replicas or use data virtualization techniques.
Data pipeline design patterns are templates used as a foundation for creating data pipelines. The choice
of design pattern depends on various factors, such as how data is received, the business use cases, and
the data volume. Some common design patterns include:
1) Raw Data Load: This pattern involves moving and loading raw data from one location to
another, such as between databases or from an on-premise data center to the cloud. However, this
pattern only focuses on the extraction and loading process and can be slow and time-consuming
with large data volumes. It works well for one-time operations but is not suitable for recurring
situations.
2) Extract, Transform, Load (ETL): This is a widely used pattern for loading data into data
warehouses, lakes, and operational data stores. It involves the extraction, transformation, and
loading of data from one location to another. However, most ETL processes use batch processing
which can introduce latency to operations.
3) Streaming ETL: Similar to the standard ETL pattern but with data streams as the origin, this
pattern uses tools like Apache Kafka or StreamSets Data Collector Engine for the complex ETL
processes.
4) Extract, Load, Transform (ELT): This pattern is similar to ETL, but the transformation
process happens after the data is loaded into the target destination, which can reduce
latency. However, this design can affect data quality and violate data privacy rules.
5) Change Data Capture (CDC): This pattern adds freshness to data processed with the
batch ETL pattern by detecting changes that occur in the source data and sending them
to message queues for downstream processing (as sketched after this list).
6) Data Stream Processing: This pattern is suitable for feeding real-time data to high-
performance applications such as IoT and financial applications.
Data is continuously received from devices, parsed and filtered, processed, and sent to various
destinations like dashboards for real-time applications.
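A minimal sketch of the CDC idea: diff the current source snapshot against the previously seen one and push only the changes onto a queue for downstream processing. The tables and queue here are illustrative; production CDC tools usually read database transaction logs rather than comparing snapshots:

```python
from queue import Queue

def capture_changes(previous, current, queue):
    """Detect inserts, updates and deletes by diffing two snapshots keyed by primary key."""
    for key, row in current.items():
        if key not in previous:
            queue.put({"op": "insert", "key": key, "row": row})
        elif row != previous[key]:
            queue.put({"op": "update", "key": key, "row": row})
    for key in previous.keys() - current.keys():
        queue.put({"op": "delete", "key": key})

changes = Queue()
previous = {1: {"name": "Ana", "city": "Pune"}, 2: {"name": "Ben", "city": "Delhi"}}
current = {1: {"name": "Ana", "city": "Mumbai"}, 3: {"name": "Chen", "city": "Goa"}}
capture_changes(previous, current, changes)

while not changes.empty():
    print(changes.get())  # downstream consumers would read from the queue instead
```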
1) Both data pipelines and ETL pipelines are responsible for transferring data between sources and storage
solutions, but they do so in different ways.
2) An ETL pipeline refers to a set of integration-related batch processes that run on a scheduled
basis. ETL jobs extract data from one or more systems, do basic data transformations and load
the data into a repository for analytics or operational uses.
3) A data pipeline, on the other hand, involves a more advanced set of data processing activities for
filtering, transforming and enriching data to meet user needs.
4) As mentioned above, a data pipeline can handle batch processing but also run in real-time mode,
either with streaming data or triggered by a predetermined rule or set of conditions. As a result,
an ETL pipeline can be seen as one form of a data pipeline.
Difference between ETL and ELT
Q) What is an ETL pipeline?
The purpose of an ETL pipeline is to prepare data for analytics and business intelligence. To provide
valuable insights, source data from various systems (CRMs, social media platforms, web
reporting, etc.) needs to be moved, consolidated and altered to fit the parameters and
functions of the destination database. An ETL pipeline is helpful for:
Centralizing and standardizing data, making it readily available to analysts and
decision-makers
Freeing up developers from technical implementation tasks for data movement and
maintenance, allowing them to focus on more purposeful work.
Data migration from legacy systems to a data warehouse
Deeper analytics after exhausting the insights provided by basic transformation
Characteristics of an ETL Pipeline
An ETL pipeline should:
Provide continuous data processing
Be elastic and agile
Use isolated, independent processing resources
Increase data access
Be easy to set up and maintain
An ETL pipeline is a set of processes that moves data from one or more sources into a database,
such as a data warehouse. ETL stands for "extract, transform, load," which are the three main
steps in the data integration process:
Extract: Pull data from the source
Transform: Clean, organize, and make the data usable
Load: Move the data into the target system
ETL pipelines are used to combine data from multiple sources into a single, consistent data
set. This makes the data easier to analyze and ensures that everyone accessing the information
has the most current data.
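A short sketch of that extract-transform-load flow, assuming pandas is available and using two made-up in-memory sources with inconsistent field names in place of real CRM and web-reporting systems; a SQLite file stands in for the warehouse:

```python
import sqlite3
import pandas as pd

# Extract: pull records from two hypothetical sources with inconsistent schemas.
crm_records = pd.DataFrame([{"CustomerId": 1, "Email": "a@example.com", "Spend": "120.50"}])
web_records = pd.DataFrame([{"customer_id": 2, "email": "b@example.com", "spend": 75.0}])

# Transform: standardize column names and types so the sources can be combined.
crm_clean = crm_records.rename(
    columns={"CustomerId": "customer_id", "Email": "email", "Spend": "spend"})
crm_clean["spend"] = crm_clean["spend"].astype(float)
combined = pd.concat([crm_clean, web_records], ignore_index=True)

# Load: write the consolidated data set to a SQLite table standing in for the warehouse.
with sqlite3.connect("analytics.db") as con:
    combined.to_sql("customers", con, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM customers", con))
```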
The key components of a data pipeline include:
1. Origin: The entry point for data in the pipeline, including data sources
(e.g., IoT sensors, APIs, social media) and storage systems (data warehouse
or lake).
2. Destination: The endpoint where data is transferred for visualization,
analysis, or storage.
3. Dataflow: Movement of data from origin to destination via ETL (Extract,
Transform, Load):
o Extract: Retrieve data from source systems.
o Transform: Reformat data in a staging area.
o Load: Save processed data to its final destination.
4. Storage: Systems for preserving data during pipeline stages. Choices
depend on data volume, query frequency, and use cases.
5. Processing: Activities to ingest, store, transform, and deliver data, e.g.,
database replication or streaming.
6. Workflow: Sequence of tasks in the pipeline and their dependencies (see the scheduler sketch after this list):
o Upstream: Source tasks that must complete first.
o Downstream: Dependent tasks or destinations.
7. Monitoring: Ensures efficiency, accuracy, and consistency in the pipeline
while preventing data loss.
8. Technology: Tools and infrastructure supporting the pipeline:
o ETL tools (Informatica, Apache Spark).
o Data warehouses (Amazon Redshift, Snowflake).
o Data lakes (Azure, IBM).
o Workflow schedulers (Airflow, Luigi).
o Streaming tools (Kafka, Flink).
o Programming languages (Python, Java).
9. Delivery: Making the final, processed data available to its intended users and applications.
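To illustrate the workflow component, the sketch below defines a small DAG with Apache Airflow, one of the schedulers listed above. The task bodies are placeholders, the DAG and task names are hypothetical, and Airflow 2.4 or later is assumed; the >> operators declare the upstream/downstream ordering described in item 6:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")        # placeholder task body

def transform():
    print("reformat data in the staging area")       # placeholder task body

def load():
    print("save processed data to the destination")  # placeholder task body

with DAG(
    dag_id="daily_sales_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Upstream tasks must complete before their downstream dependents run.
    extract_task >> transform_task >> load_task
```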
Traditional data pipelines followed the ETL (Extract, Transform, Load) approach, where data
is extracted from sources, transformed into a specific format in a staging area, and then loaded
into storage. However, with advancements in cloud computing, storage, and processing
technologies, the paradigm has shifted to ELT (Extract, Load, Transform).
Key Differences Between ETL and ELT:
1. Order of Operations:
o ETL: Transformation occurs before loading into storage.
o ELT: Data is loaded in its raw form and transformed post-loading (see the sketch after this list).
2. Processing Power:
o ETL: Relies on separate ETL tools for transformations, which can be resource-
intensive.
o ELT: Leverages the computational capabilities of modern data warehouses or
lakes for transformation.
3. Scalability:
o ETL: May struggle with massive data volumes as transformations happen
before storage.
o ELT: Handles large datasets efficiently by taking advantage of scalable cloud
storage and computing.
4. Flexibility:
o ETL: Requires pre-defined transformations, limiting flexibility for future use
cases.
o ELT: Stores raw data, enabling flexible transformations as new needs arise.
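A minimal ELT sketch: raw records are loaded as-is into a SQLite database standing in for a cloud warehouse, and the transformation runs afterwards as SQL inside that store. Table and column names are illustrative:

```python
import sqlite3

raw_events = [
    ("2024-01-01", "IN", "85.0"),
    ("2024-01-01", "us", "40.5"),
    ("2024-01-02", "IN", "19.9"),
]

con = sqlite3.connect(":memory:")

# Load: land the raw, untransformed data in the warehouse first.
con.execute("CREATE TABLE raw_events (event_date TEXT, country TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: run inside the warehouse, after loading, using its own compute.
con.execute("""
    CREATE TABLE daily_revenue AS
    SELECT event_date,
           UPPER(country)            AS country,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    GROUP BY event_date, UPPER(country)
""")

for row in con.execute("SELECT * FROM daily_revenue ORDER BY event_date, country"):
    print(row)
con.close()
```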
Advantages of ELT:
Faster Data Ingestion: Directly loading raw data reduces initial latency.
Cost-Effectiveness: Utilizes existing cloud storage and processing platforms instead of
dedicated ETL infrastructure.
Adaptability: Raw data supports diverse use cases like analytics, machine learning, or
ad hoc reporting.
Improved Performance: Modern data platforms like Snowflake and BigQuery
optimize transformation processing.
Use Cases:
ETL: Still relevant for legacy systems or highly regulated environments with strict data
requirements.
ELT: Ideal for modern, cloud-based systems with large-scale data and real-time
processing needs.
Chapter 5
Virtualization, Containerization & Elasticity in Cloud Computing:
Key Features of Cloud Elasticity:
Scalability: Resources scale up/down automatically with workload.
On-Demand: Pay only for what you use.
Automation: Tools like auto-scaling adjust resources as needed.
Examples:
Compute: Auto-scaling VMs or containers (e.g., AWS EC2).
Storage: Expanding/shrinking storage (e.g., Amazon S3).
Networking: Load balancers manage traffic surges.
Benefits:
Reduces costs by avoiding overprovisioning.
Enhances performance during demand spikes.
Ensures high availability and reliability.
Use Cases:
E-commerce sales traffic.
Streaming live events.
Big data analytics.
Explain Containerization in Cloud Computing
Containerization is a lightweight method of packaging software applications and their
dependencies (libraries, configurations, binaries) into isolated units called containers. These
containers can run consistently across different computing environments, from development to
production, ensuring seamless deployment.
1) Public container registries are generally the faster and easier route when setting up a container
registry.
2) Public registries are also generally seen as easier to use.
3) However, they may also be less secure than private registries.
4) They suit smaller teams and work well for standard, open source images pulled from public
registries.
1) Docker is a containerization platform used to package your application and all its dependencies
together in the form of containers, making sure that your application works seamlessly in any environment,
whether in development, testing or production.
2) Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
3) Docker is the world's leading software container platform. It was launched in 2013 by a company
called Dotcloud, Inc., which was later renamed Docker, Inc. It is written in the Go language.
Docker architecture consists of the Docker client, the Docker daemon running on the Docker host, and
the Docker Hub registry.
Docker has a client-server architecture in which the client communicates with the Docker daemon
running on the Docker host using a combination of REST APIs, UNIX sockets, and TCP.
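As a hedged illustration of the client talking to the daemon, the sketch below uses the Docker SDK for Python (the docker package, an assumed dependency) and requires a running Docker daemon on the host; the image name is just an example:

```python
import docker

# The client connects to the Docker daemon using the environment's settings
# (typically the local UNIX socket or a TCP endpoint).
client = docker.from_env()

# Ask the daemon to pull the image if needed and run a short-lived container.
output = client.containers.run("alpine:3.19", ["echo", "hello from a container"], remove=True)
print(output.decode().strip())

# The daemon also answers queries about the containers it manages.
for container in client.containers.list(all=True):
    print(container.name, container.status)
```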
What are the components of Docker?
Docker Clients and Servers: Docker has a client-server architecture. The Docker daemon/server
hosts and manages all containers.
The Docker daemon/server receives requests from the Docker client through the CLI or REST APIs and
processes them accordingly. The Docker client and daemon can be present on the same host or on
different hosts.
Advantages of Docker
1. Speed: Containers are lightweight and start quickly, reducing build and deployment
time.
2. Portability: Docker ensures consistent performance across environments, making it
easy to move applications.
3. Scalability: Docker can be deployed on multiple servers, data centers, and cloud
platforms, with seamless transitions between them.
4. Density: Docker uses resources efficiently, allowing more containers to run on a single
host with minimal overhead.
KUBERNETES IN CLOUD W.R.T. SCALING, PIPELINES
AND MICROSERVICES
Kubernetes in Cloud: Scaling (4 Marks)
1. Machine Learning Studio publishes models as web services that can easily be consumed
by custom apps or BI tools.
2. Amazon SageMaker is a fully-managed service that covers the entire machine learning
workflow: label and prepare your data, choose an algorithm, train the model, tune and
optimize it for deployment, make predictions, and take action.
3. It enables data scientists and developers to quickly and easily build, train, and deploy
machine learning models at any scale.
4. Your models get to production faster with much less effort and lower cost.