
Chapter 4: Data management using Cloud Computing

What is data pipelining?


 A data pipeline is a process that moves data from one system or format to another.

 The data pipeline typically includes a series of steps for extracting data from a source,
transforming and cleaning it, and loading it into a destination system, such as a database or a data
warehouse.
 Data pipelines can be used for a variety of purposes, including data integration, data
warehousing, automating data migration, and analytics.

What is the purpose of data pipelining?


 The data pipeline is a key element in the overall data management process.
 Its purpose is to automate and scale repetitive data flows and associated data collection,
transformation and integration tasks.
 A properly constructed data pipeline can accelerate the processing that’s required as data is
gathered, cleansed, filtered, enriched and moved to downstream systems and applications.
 Well-designed pipelines also enable organizations to take advantage of big data assets that often
include large amounts of structured, unstructured and semi-structured data.
 In many cases, some of that is real-time data generated and updated on an ongoing basis. As
the volume, variety and velocity of data continue to grow in big data systems, the need for data
pipelines that can linearly scale -- whether in on-premises, cloud or hybrid cloud environments -
- is becoming increasingly critical to analytics initiatives and business operations.

Who needs data pipelining?


A data pipeline is needed for any analytics application or business process that requires regular
aggregation, cleansing, transformation and distribution of data to downstream data consumers. Typical
data pipeline users include the following:

1) Data scientists and other members of data science teams.


2) Business intelligence (BI) analysts and developers.
3) Business analysts, senior management and other business executives.
4) Marketing and sales teams.
5) Operational workers.
To make it easier for business users to access relevant data, pipelines can also be used to feed it into BI
dashboards and reports, as well as operational monitoring and alerting systems.
How a data pipeline works

 The data pipeline development process starts by defining what, where and how data is
generated or collected. That includes capturing source system characteristics, such as data
formats, data structures, data schemas and data definitions -- information that’s needed to plan
and build a pipeline.

Who builds data pipelines?

1) Many data pipelines are built by data engineers or big data engineers.

2) To create effective pipelines, it’s critical that they develop their soft skills -- meaning their
interpersonal and communication skills.

3) This will help them collaborate with data scientists, other analysts and business
stakeholders to identify user requirements and the data needed to meet them before
launching a data pipeline development project.

4) Such skills are also necessary for ongoing conversations to prioritize new development plans
and manage existing data pipelines.

Other best practices on data pipelines include the following:

1) Manage the development of a data pipeline as a project, with defined goals and delivery dates.

2) Document data lineage information so the history, technical attributes and business meaning
of data can be understood.
3) Ensure that the proper context of data is maintained as it’s transformed in a pipeline.

4) Create reusable processes or templates for data pipeline steps to streamline development.

5) Avoid scope creep that can complicate pipeline projects and create unrealistic expectations
among users.

Once the pipeline is in place, it typically involves the following steps:

Data ingestion: Raw data from one or more source systems is ingested into the data pipeline. Depending
on the data set, data ingestion can be done in batch or real-time mode.

Data integration: If multiple data sets are being pulled into the pipeline for use in analytics or
operational applications, they need to be combined through data integration processes.

Data Cleansing: For most applications, data quality management measures are applied to the raw data
in the pipeline to ensure that it’s clean, accurate and consistent.
Data filtering: Data sets are commonly filtered to remove data that isn’t needed for the particular
applications the pipeline was built to support.

Data transformation: The data is modified as needed for the planned applications.
Examples of data transformation method include aggregation, generalization, reduction and smoothing.

Data enrichment: In some cases, data sets are augmented and enriched as part of the pipeline through
the addition of more data elements required for applications.

Data validation: The finalized data is checked to confirm that it is valid and fully meets the application
requirements.

Data loading: For BI and analytics applications, the data is loaded into a data store so it can be accessed
by users. Typically, that’s a data warehouse, a data lake or a data lakehouse, which combines elements
of the other two platforms.
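
To make these steps concrete, here is a minimal sketch in Python using pandas; the file names, column names and filter conditions are hypothetical placeholders, not part of any specific pipeline:

import pandas as pd

# Data ingestion: read raw data from a source file (batch mode)
raw = pd.read_csv("orders_raw.csv")          # hypothetical source
customers = pd.read_csv("customers.csv")     # second hypothetical source

# Data integration: combine the two data sets on a shared key
df = raw.merge(customers, on="customer_id", how="left")

# Data cleansing: drop duplicates and rows with missing key fields
df = df.drop_duplicates().dropna(subset=["customer_id", "amount"])

# Data filtering: keep only the records the application needs
df = df[df["status"] == "completed"]

# Data transformation: aggregate order amounts per customer
summary = df.groupby("customer_id", as_index=False)["amount"].sum()

# Data enrichment: add a derived attribute
summary["is_high_value"] = summary["amount"] > 1000

# Data validation: basic checks before loading
assert summary["amount"].ge(0).all(), "negative amounts found"

# Data loading: write the result to a destination (stand-in for a warehouse)
summary.to_csv("customer_summary.csv", index=False)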

Many data pipelines also apply machine learning and neural network algorithms to create more
advanced data transformations and enrichments. This includes segmentation, regression analysis,
clustering and the creation of advanced indices and propensity scores.
In addition, logic and algorithms can be built into a data pipeline to add intelligence.
As machine learning -- and, especially, automated machine learning (AutoML) -- processes become
more prevalent, data pipelines likely will become increasingly intelligent. With these processes,
intelligent data pipelines could continuously learn and adapt based on the characteristics of source
systems, required data transformations and enrichments, and evolving business and application
requirements.
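
As an illustration of this kind of enrichment, the sketch below adds a cluster label (a simple form of segmentation) using scikit-learn; the feature names, sample values and number of clusters are arbitrary assumptions made only for the example:

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer metrics produced earlier in the pipeline
df = pd.DataFrame({
    "total_spend": [120, 950, 40, 1800, 300, 75],
    "order_count": [3, 12, 1, 25, 6, 2],
})

# Enrichment step: segment customers into clusters and add the label as a new column
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["segment"] = kmeans.fit_predict(df[["total_spend", "order_count"]])

print(df)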
Types of data pipeline architecture and their use cases. Some of the most
common types include:

Batch Processing: Data is processed in batches at set intervals, such as daily or weekly.

Real-Time Streaming: Data is processed as soon as it is generated, with minimal delay.

Lambda Architecture: A combination of batch and real-time processing, where data is first processed
in batch and then updated in real-time.

Kappa Architecture: Similar to Lambda architecture, but data is processed only once, and all data is
ingested in real time.

Microservices Architecture: Data is processed using loosely coupled, independently deployable
services.

ETL (Extract, Transform, Load) Architecture: Data is extracted from various sources, transformed
to fit the target system, and loaded into the target system.
A data pipeline architecture is essential for several reasons:

Scalability: Data pipeline architecture should allow for the efficient processing of large amounts of data,
enabling organizations to scale their data processing capabilities as their data volume increases.

Reliability: A well-designed data pipeline architecture ensures that data is processed accurately and
reliably. This reduces the risk of errors and inaccuracies in the data.

Efficiency: Data pipeline architecture streamlines the data processing workflow, making it more
efficient and reducing the time and resources required to process data.

Flexibility: It allows for the integration of different data sources and the ability to adapt to changing
business requirements.

Security: Data pipeline architecture enables organizations to implement security measures, such as
encryption and access controls, to protect sensitive data.

Data Governance: Data pipeline architecture allows organizations to implement data governance
practices such as data lineage, data quality, and data cataloguing that help maintain data accuracy,
completeness, and reliability.

Data pipelines can be compared to the plumbing system in the real world. Both are crucial channels that
meet basic needs, whether it’s moving data or water. Both systems can malfunction and require
maintenance.
In many companies, a team of data engineers will design and maintain data pipelines.

Data pipelines should be automated as much as possible to reduce the need for manual supervision.
However, even with data automation, businesses may still face challenges with their data pipelines:
1. Complexity: In large companies, there could be a large number of data pipelines in operation.
Managing and understanding all these pipelines at scale can be difficult, such as identifying which
pipelines are currently in use, how current they are, and what dashboards or reports rely on them.
In an environment with multiple data pipelines, tasks such as complying with regulations and
migrating to the cloud can become more complicated.

2. Cost: Building data pipelines at a large scale can be costly. Advancements in technology, migration
to the cloud, and demands for more data analysis may all require data engineers and developers to
create new pipelines. Managing multiple data pipelines may lead to increased operational expenses
as time goes by.

3. Efficiency: Data pipelines may lead to slow query performance depending on how data is replicated
and transferred within an organization. When there are many simultaneous requests or large
amounts of data, pipelines can become slow, particularly in situations that involve multiple data
replicas or use data virtualization techniques.

What are data pipeline design patterns? (Designing pipelines)

Data pipeline design patterns are templates used as a foundation for creating data pipelines. The choice
of design pattern depends on various factors, such as how data is received, the business use cases, and
the data volume. Some common design patterns include:

1) Raw Data Load: This pattern involves moving and loading raw data from one location to
another, such as between databases or from an on-premises data center to the cloud. However, this
pattern only focuses on the extraction and loading process and can be slow and time-consuming
with large data volumes. It works well for one-time operations but is not suitable for recurring
situations.

2) Extract, Transform, Load (ETL): This is a widely used pattern for loading data into data
warehouses, lakes, and operational data stores. It involves the extraction, transformation, and
loading of data from one location to another. However, most ETL processes use batch processing
which can introduce latency to operations.

3) Streaming ETL: Similar to the standard ETL pattern but with data streams as the origin, this
pattern uses tools like Apache Kafka or StreamSets Data Collector Engine for the complex ETL
processes.
4) Extract, Load, Transform (ELT): This pattern is similar to ETL, but the transformation
process happens after the data is loaded into the target destination, which can reduce
latency. However, this design can affect data quality and violate data privacy rules.
5) Change Data Capture (CDC): This pattern introduces freshness to data processed
using the ETL batch processing pattern by detecting changes that occur during the ETL
process and sending them to message queues for downstream processing.
6) Data Stream Processing: This pattern is suitable for feeding real-time data to high-performance
applications such as IoT and financial applications.

Data is continuously received from devices, parsed and filtered, processed, and sent to various
destinations like dashboards for real-time applications.
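
The sketch below illustrates the data stream processing pattern in plain Python: a generator stands in for a real stream source such as a Kafka consumer loop, and each event is parsed, filtered and routed to a destination. All device names and thresholds are hypothetical:

import json
import time

def sensor_stream():
    # Stand-in for a real stream source (e.g., a Kafka consumer loop)
    readings = [
        '{"device": "pump-1", "temp": 71}',
        '{"device": "pump-2", "temp": 104}',
        '{"device": "pump-1", "temp": 69}',
    ]
    for raw in readings:
        yield raw
        time.sleep(0.1)  # simulate continuous arrival

def process(raw_event):
    event = json.loads(raw_event)           # parse
    if event["temp"] <= 100:                # filter out normal readings
        return None
    event["alert"] = "overheat"             # process / enrich
    return event

for raw in sensor_stream():
    result = process(raw)
    if result:
        # Route to a destination, e.g., a real-time dashboard or alert queue
        print("ALERT:", result)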

Difference between ETL and Data Pipeline

1) Both data pipelines and ETL are responsible for transferring data between sources and storage
solutions, but they do so in different ways.

2) Data pipelines can work with ongoing data streams in real time.

3) An ETL pipeline refers to a set of integration-related batch processes that run on a scheduled
basis. ETL jobs extract data from one or more systems, do basic data transformations and load
the data into a repository for analytics or operational uses.

4) A data pipeline, on the other hand, involves a more advanced set of data processing activities for
filtering, transforming and enriching data to meet user needs.
5) As mentioned above, a data pipeline can handle batch processing but also run in real-time mode,
either with streaming data or triggered by a predetermined rule or set of conditions. As a result,
an ETL pipeline can be seen as one form of a data pipeline.
Difference between ETL and ELT
Q) What is an ETL pipeline?

The purpose of an ETL pipeline is to prepare data for analytics and business intelligence. To provide
valuable insights, source data from various systems (CRMs, social media platforms, web
reporting, etc.) needs to be moved, consolidated and altered to fit the parameters and
functions of the destination database. An ETL pipeline is helpful for:
 Centralizing and standardizing data, making it readily available to analysts and
decision-makers
 Freeing up developers from technical implementation tasks for data movement and
maintenance, allowing them to focus on more purposeful work.
 Data migration from legacy systems to a data warehouse
 Deeper analytics after exhausting the insights provided by basic transformation
Characteristics of an ETL Pipeline
 Provide continuous data processing
 Be elastic and agile
 Use isolated, independent processing resources
 Increase data access
 Be easy to set up and maintain
An ETL pipeline is a set of processes that moves data from one or more sources into a database,
such as a data warehouse. ETL stands for "extract, transform, load," which are the three main
steps in the data integration process:
 Extract: Pull data from the source
 Transform: Clean, organize, and make the data usable
 Load: Move the data into the target system

ETL pipelines are used to combine data from multiple sources into a single, consistent data
set. This makes the data easier to analyze and ensures that everyone accessing the information
has the most current data.
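
A minimal, self-contained sketch of these three steps using pandas and Python's built-in sqlite3 module as a stand-in for a data warehouse (the CSV file, column names and table name are hypothetical):

import sqlite3
import pandas as pd

# Extract: pull data from the source
df = pd.read_csv("sales_raw.csv")                     # hypothetical source file

# Transform: clean, organize, and make the data usable
df = df.dropna(subset=["order_id"])
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["quantity"] * df["unit_price"]

# Load: move the data into the target system
conn = sqlite3.connect("warehouse.db")                # stand-in for a warehouse
df.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()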

Q) Can you describe the components of a typical data pipeline?

Data Pipeline Components

1. Origin: The entry point for data in the pipeline, including data sources
(e.g., IoT sensors, APIs, social media) and storage systems (data warehouse
or lake).
2. Destination: The endpoint where data is transferred for visualization,
analysis, or storage.
3. Dataflow: Movement of data from origin to destination via ETL (Extract,
Transform, Load):
o Extract: Retrieve data from source systems.
o Transform: Reformat data in a staging area.
o Load: Save processed data to its final destination.
4. Storage: Systems for preserving data during pipeline stages. Choices
depend on data volume, query frequency, and use cases.
5. Processing: Activities to ingest, store, transform, and deliver data, e.g.,
database replication or streaming.
6. Workflow: Sequence of tasks in the pipeline:
o Upstream: Source tasks that must complete first.
o Downstream: Dependent tasks or destinations.
7. Monitoring: Ensures efficiency, accuracy, and consistency in the pipeline
while preventing data loss.
8. Technology: Tools and infrastructure supporting the pipeline:
o ETL tools (Informatica, Apache Spark).
o Data warehouses (Amazon Redshift, Snowflake).
o Data lakes (Azure, IBM).
o Workflow schedulers (Airflow, Luigi).
o Streaming tools (Kafka, Flink).
o Programming languages (Python, Java).

9. Delivery: The final stage, where processed data is made available to end users and applications.
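
To illustrate the workflow component (upstream and downstream tasks) together with a workflow scheduler, here is a hedged sketch of a minimal Apache Airflow DAG; it assumes Airflow 2.x is installed, and the task bodies are placeholders rather than real pipeline logic:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract data from source")        # placeholder logic

def transform():
    print("transform data in staging area")  # placeholder logic

def load():
    print("load data into destination")      # placeholder logic

with DAG(
    dag_id="simple_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Upstream tasks must complete before downstream tasks run
    t_extract >> t_transform >> t_load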

Q. Evolving from ETL to ELT

Traditional data pipelines followed the ETL (Extract, Transform, Load) approach, where data
is extracted from sources, transformed into a specific format in a staging area, and then loaded
into storage. However, with advancements in cloud computing, storage, and processing
technologies, the paradigm has shifted to ELT (Extract, Load, Transform).
Key Differences Between ETL and ELT:

1. Order of Operations:
o ETL: Transformation occurs before loading into storage.
o ELT: Data is loaded in its raw form and transformed post-loading.
2. Processing Power:
o ETL: Relies on separate ETL tools for transformations, which can be resource-
intensive.
o ELT: Leverages the computational capabilities of modern data warehouses or
lakes for transformation.
3. Scalability:
o ETL: May struggle with massive data volumes as transformations happen
before storage.
o ELT: Handles large datasets efficiently by taking advantage of scalable cloud
storage and computing.
4. Flexibility:
o ETL: Requires pre-defined transformations, limiting flexibility for future use
cases.
o ELT: Stores raw data, enabling flexible transformations as new needs arise.

Advantages of ELT:

 Faster Data Ingestion: Directly loading raw data reduces initial latency.
 Cost-Effectiveness: Utilizes existing cloud storage and processing platforms instead of
dedicated ETL infrastructure.
 Adaptability: Raw data supports diverse use cases like analytics, machine learning, or
ad hoc reporting.
 Improved Performance: Modern data platforms like Snowflake and BigQuery
optimize transformation processing.

Use Cases:

 ETL: Still relevant for legacy systems or highly regulated environments with strict data
requirements.
 ELT: Ideal for modern, cloud-based systems with large-scale data and real-time
processing needs.
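
To make the ELT ordering concrete, the sketch below loads raw data first and then transforms it with SQL inside the target system; sqlite3 is used here only as a hypothetical stand-in for a cloud warehouse such as Snowflake or BigQuery, and the file, table and column names are placeholders:

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# Extract + Load: land the raw data as-is (no up-front transformation)
raw = pd.read_csv("events_raw.csv")          # hypothetical source file
raw.to_sql("events_raw", conn, if_exists="replace", index=False)

# Transform: run the transformation inside the target system using SQL
conn.executescript("""
    DROP TABLE IF EXISTS events_clean;
    CREATE TABLE events_clean AS
    SELECT user_id,
           COUNT(*)        AS event_count,
           MAX(event_time) AS last_seen
    FROM events_raw
    WHERE user_id IS NOT NULL
    GROUP BY user_id;
""")
conn.close()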
Chapter 5: Virtualization, Containerization & Elasticity in Cloud Computing

Q. Explain Elastic Resources


Elastic resources refer to the ability of a system to dynamically allocate and deallocate computing
resources (e.g., processing power, memory, and storage) based on real-time demand. This
concept is a cornerstone of cloud computing, enabling organizations to optimize resource usage,
reduce costs, and enhance scalability.

Key Features:
 Scalability: Resources scale up/down automatically with workload.
 On-Demand: Pay only for what you use.
 Automation: Tools like auto-scaling adjust resources as needed.
Examples:
 Compute: Auto-scaling VMs or containers (e.g., AWS EC2).
 Storage: Expanding/shrinking storage (e.g., Amazon S3).
 Networking: Load balancers manage traffic surges.
Benefits:
 Reduces costs by avoiding overprovisioning.
 Enhances performance during demand spikes.
 Ensures high availability and reliability.
Use Cases:
 E-commerce sales traffic.
 Streaming live events.
 Big data analytics.
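
As a hedged illustration of automated elasticity on AWS, the sketch below attaches a target-tracking scaling policy to an EC2 Auto Scaling group using boto3; the group name, region, target value and credentials are assumptions, and the call only succeeds against a real AWS account with an existing Auto Scaling group:

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep average CPU utilization of the group near 50% by adding/removing instances
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",          # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)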
Explain Containerization in Cloud Computing
Containerization is a lightweight method of packaging software applications and their
dependencies (libraries, configurations, binaries) into isolated units called containers. These
containers can run consistently across different computing environments, from development to
production, ensuring seamless deployment.

Key Features of Containerization:


1. Isolation: Each container operates independently, ensuring applications don’t interfere
with one another.
2. Portability: Containers work consistently across environments, whether on-premises or
in the cloud.
3. Efficiency: Containers share the host operating system, making them lightweight
compared to virtual machines.
Benefits:
1. Scalability: Containers can be easily replicated and distributed across cloud instances to
handle increased demand.
2. Flexibility: Applications can run on any system that supports the container runtime (e.g.,
Docker, Kubernetes).
3. Cost-Effectiveness: Containers use fewer resources than traditional VMs, reducing
infrastructure costs.
4. Faster Deployment: Containers can be quickly started, updated, or replaced without
affecting others.
Use Cases:
 Microservices Architecture: Containers are ideal for running microservices, where each
service runs in its own container.
 DevOps Pipelines: Simplifies continuous integration and delivery (CI/CD).
 Cloud-Native Applications: Containers maximize cloud resource utilization.
Tools for Containerization:
1. Docker: Popular containerization platform for building and running containers.
2. Kubernetes: Orchestrates container deployment, scaling, and management.
3. Cloud Platforms: Services like AWS ECS, Azure AKS, and Google Kubernetes Engine
provide container management in the cloud.
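
A hedged sketch of building and running a container from Python using the Docker SDK (the docker package); it assumes Docker is installed and a Dockerfile exists in the current directory, and the image tag is a placeholder:

import docker

client = docker.from_env()   # connect to the local Docker daemon

# Build an image from a Dockerfile in the current directory
image, build_logs = client.images.build(path=".", tag="myapp:latest")

# Run the image as an isolated container and capture its output
output = client.containers.run("myapp:latest", remove=True)
print(output.decode())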

Explain Container Registries


A container registry is a centralized repository where container images are stored, managed,
and retrieved for deployment. These images are prepackaged versions of applications and their
dependencies, created using tools like Docker. Registries simplify the sharing and reuse of
container images across teams and environments.
Key Features of Container Registries:
1. Image Storage: Stores container images for public or private access.
2. Versioning: Tracks different versions of an image, enabling rollbacks if needed.
3. Access Control: Offers authentication and role-based access for secure image
management.
4. Integration: Works with tools like Docker, Kubernetes, and CI/CD pipelines for
streamlined deployments.
There are again two types of registries:

1) Public container registries are generally the faster and easier route when setting up a container
registry.
2) Public registries are also seen as easier to use.
3) However, they may be less secure than private registries.
4) They suit smaller teams and work well with standard, open-source images from public
registries.

1) A private container registry is set up by the organization using it.

2) Private registries are either hosted or on premises and are popular with larger organizations or
enterprises that are more committed to using a container registry.
3) Having complete control over the registry in development allows an organization more
freedom in how it chooses to manage it.
4) Private registries are generally seen as the more secure option.
What is Docker and why is it used for containerization?

1) Docker is a containerization platform used to package your application and all its dependencies
together in the form of containers, so that your application works seamlessly in any environment,
whether in development, testing or production.
2) Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
3) Docker is the world's leading software container platform. It was launched in 2013 by a company
called Dotcloud, Inc., which was later renamed Docker, Inc. It is written in the Go language.

Docker architecture consists of Docker client, Docker Daemon running on Docker Host, and
Docker Hub repository.
Docker has a client-server architecture in which the client communicates with the Docker Daemon
running on the Docker Host using a combination of APIs, Socket IO, and TCP.
What are the components of Docker?

Docker Clients and Servers: Docker has a client-server architecture. The Docker Daemon/Server
hosts and manages all containers.
The Docker Daemon/Server receives requests from the Docker client through the CLI or REST APIs and
processes them accordingly. The Docker client and Daemon can be present on the same host or on
different hosts.
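
A small hedged sketch of the client talking to the daemon through the Docker SDK for Python (it assumes the Docker daemon is running locally):

import docker

client = docker.from_env()           # Docker client; connects to the local daemon

print(client.ping())                 # True if the daemon answers the API request
print(client.version()["Version"])   # daemon version reported over the API

# List containers known to the daemon (the daemon manages all containers)
for container in client.containers.list(all=True):
    print(container.name, container.status)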

Advantages of Docker

1. Speed: Containers are lightweight and start quickly, reducing build and deployment
time.
2. Portability: Docker ensures consistent performance across environments, making it
easy to move applications.
3. Scalability: Docker can be deployed on multiple servers, data centers, and cloud
platforms, with seamless transitions between them.
4. Density: Docker uses resources efficiently, allowing more containers to run on a single
host without overhead.
KUBERNETES IN CLOUD W.R.T. SCALING, PIPELINES AND MICROSERVICES
Kubernetes in Cloud: Scaling (4 Marks)

 Automatic Scaling: Adjusts number of pods and nodes based on workload.


 Horizontal Pod Autoscaling: Scales pods based on CPU or memory usage.
 Cluster Autoscaling: Increases or decreases nodes to match demand.
 Vertical and Horizontal Scaling: Supports both resource adjustments and pod
replication for optimal performance.
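
As a hedged illustration of scaling from code, the sketch below uses the official Kubernetes Python client to change a deployment's replica count; the deployment name and namespace are hypothetical, and it assumes a reachable cluster and a valid local kubeconfig:

from kubernetes import client, config

config.load_kube_config()                 # use local kubeconfig credentials
apps = client.AppsV1Api()

# Scale the (hypothetical) "web" deployment to 5 pods
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="default",
    body={"spec": {"replicas": 5}},
)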

Kubernetes in Cloud: Pipeline (4 Marks)

 CI/CD Integration: Automates containerized application deployment.


 Rolling Updates: Gradual deployment of new versions to avoid downtime.
 Automated Rollbacks: Reverts to previous stable versions in case of issues.
 Tool Integration: Works with Jenkins, GitLab, Helm for streamlined pipelines.

Kubernetes in Cloud: Microservices (4 Marks)

 Microservice Isolation: Each service runs in its own container.


 Service Discovery & Load Balancing: Facilitates communication between services.
 Independent Scaling: Microservices can scale individually based on demand.
 Self-Healing & High Availability: Automatically restarts failed services, ensuring
reliability.
Chapter 6: Managed ML Systems
Q) Compare commercial and open-source ML systems:

Q) Explain Jupyter Notebook and Explain Its Workflow? ★★


What is Jupyter?
Ju(lia) + Py(thon) + (e)R
• Jupyter Notebooks are a community standard for communicating and performing
interactive computing. They are a document that blends computations, outputs,
explanatory text, mathematics, images, and rich media representations of objects.
• JupyterLab is one interface used to create and interact with Jupyter Notebooks.
Advantages of Jupyter?
• Useful for data cleaning and transformation, numerical simulation, statistical modelling, data
visualization, machine learning, and much more.
• Language of choice: supports 40+ languages.
• Notebooks can be shared with others using email, Dropbox, GitHub and the Jupyter
Notebook Viewer.
• Your code can produce rich, interactive output: HTML, images, videos, and custom MIME
types.
• Big data integration - Leverage big data tools, such as Apache Spark, from Python, R and
Scala.
Explore that same data with pandas, scikit-learn, ggplot2, TensorFlow.
Limitations of Jupyter:
• It messes with your version control.
• The Jupyter Notebook format is just a big JSON, which contains your code and the outputs of
the code
• Code can only be run in chunks.
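
To see the point about the notebook format being JSON, the short sketch below opens a .ipynb file with the standard json module and lists its cells; the file name is a placeholder:

import json

with open("analysis.ipynb") as f:         # any Jupyter notebook file
    nb = json.load(f)                     # a notebook is just a JSON document

print(nb["nbformat"])                     # notebook format version
for cell in nb["cells"]:
    print(cell["cell_type"], "".join(cell["source"])[:40])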

Q) Explain Azure ML Studio? ★


1. Azure is Microsoft’s cloud computing platform which helps to build solutions to meet
business goals.

2. It supports infrastructure (IaaS), platform (PaaS), and software as a service (SaaS)


computing services.

3. It also supports advanced computing services like artificial intelligence, machine


learning, and IoT.
Azure allows you to build, manage and deploy applications on a global network.

4. Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you


can use to build, test, and deploy predictive analytics solutions on your data.

5. Machine Learning Studio publishes models as web services that can easily be consumed
by custom apps or BI tools.
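
Since Machine Learning Studio publishes models as web services, a client can call them over REST. The hedged sketch below shows a generic call with the requests library; the endpoint URL, API key and payload schema are placeholders that depend entirely on the published service:

import requests

url = "https://example.azureml.net/score"   # placeholder endpoint URL
api_key = "YOUR_API_KEY"                    # placeholder key for the service

payload = {"Inputs": {"input1": [{"feature1": 3.2, "feature2": "A"}]}}  # placeholder schema

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {api_key}"},
)
print(response.json())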

Q) Explain AWS Sagemaker? ★


What is Sagemaker
1. SageMaker provides every developer and data scientist with the ability to build, train, and
deploy machine learning models quickly.

2. Amazon SageMaker is a fully-managed service that covers the entire machine learning
workflow to label and prepare your data, choose an algorithm, train the model, tune and
optimize it for deployment, make predictions, and take action.

3. Your models get to production faster with much less effort and lower cost
4. Amazon SageMaker is a fully-managed service that enables data scientists and developers
to quickly and easily build, train, and deploy machine learning models at any scale.

5. Amazon SageMaker includes modules that can be used together or independently to


build, train, and deploy your machine-learning models.
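
A hedged sketch of the build-train-deploy flow with the SageMaker Python SDK; the S3 path, IAM role, training script and framework version are placeholders and must match a real AWS setup for the code to run:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"   # placeholder IAM role

# Build/train: run a (hypothetical) training script on a managed instance
estimator = SKLearn(
    entry_point="train.py",               # placeholder training script
    role=role,
    instance_type="ml.m5.large",
    framework_version="1.2-1",            # placeholder framework version
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/train/"})        # placeholder S3 path

# Deploy: host the trained model behind a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.endpoint_name)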
