CCD 4,5,6
A data pipeline typically includes a series of steps for extracting data from a source, transforming
and cleaning it, and loading it into a destination system, such as a database or a data warehouse.
Data pipelines can be used for a variety of purposes, including data integration, data
warehousing, automated data migration, and analytics.
Before a pipeline is built, engineers need to understand the source data's characteristics, such as
data formats, data structures, data schemas and data definitions -- information that's needed to plan
and build the pipeline.
Building an effective data pipeline depends on having the right people and skills:
1) Many data pipelines are built by data engineers or big data engineers.
2) To create effective pipelines, it's critical that they develop their soft skills -- meaning their
interpersonal and communication skills.
3) This will help them collaborate with data scientists, other analysts and business
stakeholders to identify user requirements and the data needed to meet them before
launching a data pipeline development project.
4) Such skills are also necessary for ongoing conversations to prioritize new development plans
and manage existing data pipelines.
The following best practices help in developing and managing data pipelines:
1) Manage the development of a data pipeline as a project, with defined goals and delivery dates.
2) Document data lineage information so the history, technical attributes and business meaning
of data can be understood.
3) Ensure that the proper context of data is maintained as it's transformed in a pipeline.
4) Create reusable processes or templates for data pipeline steps to streamline development.
5) Avoid scope creep that can complicate pipeline projects and create unrealistic expectations
among users.
Once it's in place, the data pipeline typically involves the following steps:
Data ingestion: Raw data from one or more source systems is ingested into the data pipeline. Depending
on the data set, data ingestion can be done in batch or real-time mode.
Data integration: If multiple data sets are being pulled into the pipeline for use in analytics or
operational applications, they need to be combined through data integration processes.
Data cleansing: For most applications, data quality management measures are applied to the raw data
in the pipeline to ensure that it's clean, accurate and consistent.
Data filtering: Data sets are commonly filtered to remove data that isn’t needed for the particular
applications the pipeline was built to support.
Data transformation: The data is modified as needed for the planned applications.
Examples of data transformation methods include aggregation, generalization, reduction and smoothing.
Data enrichment: In some cases, data sets are augmented and enriched as part of the pipeline through
the addition of more data elements required for applications.
Data validation: The finalized data is checked to confirm that it is valid and fully meets the application
requirements.
Data loading: For BI and analytics applications, the data is loaded into a data store so it can be accessed
by users. Typically, that's a data warehouse, a data lake or a data lakehouse, which combines elements
of the other two platforms.
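A minimal Python sketch of these steps, assuming a hypothetical orders.csv source file and a SQLite table standing in for the destination warehouse (file, table and column names are illustrative only):

```python
import csv
import sqlite3

def ingest(path):
    """Data ingestion: read raw records from a hypothetical CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_and_filter(rows):
    """Data cleansing and filtering: drop incomplete rows, keep only needed fields."""
    cleaned = []
    for row in rows:
        if row.get("order_id") and row.get("amount"):
            cleaned.append({"order_id": row["order_id"],
                            "amount": float(row["amount"]),
                            "region": row.get("region", "unknown").strip().lower()})
    return cleaned

def transform(rows):
    """Data transformation: aggregate order amounts by region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return [{"region": r, "total_amount": t} for r, t in totals.items()]

def validate(rows):
    """Data validation: confirm the finalized data meets basic expectations."""
    assert all(r["total_amount"] >= 0 for r in rows), "negative totals found"
    return rows

def load(rows, db_path="warehouse.db"):
    """Data loading: write the final data set to a SQLite table standing in for a warehouse."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total_amount REAL)")
    con.executemany("INSERT INTO sales_by_region VALUES (:region, :total_amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(validate(transform(clean_and_filter(ingest("orders.csv")))))
```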
Many data pipelines also apply machine learning and neural network algorithms to create more
advanced data transformations and enrichments. This includes segmentation, regression analysis,
clustering and the creation of advanced indices and propensity scores.
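As one hedged illustration of such an enrichment, the sketch below adds a customer segment label with k-means clustering, assuming scikit-learn is available; the customer records and feature names are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

def enrich_with_segments(records, n_segments=3):
    """Add a segment label to each record using k-means clustering over numeric features."""
    features = np.array([[r["annual_spend"], r["visits_per_month"]] for r in records])
    labels = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(features)
    for record, label in zip(records, labels):
        record["segment"] = int(label)
    return records

customers = [
    {"id": 1, "annual_spend": 120.0, "visits_per_month": 2},
    {"id": 2, "annual_spend": 950.0, "visits_per_month": 9},
    {"id": 3, "annual_spend": 880.0, "visits_per_month": 8},
    {"id": 4, "annual_spend": 150.0, "visits_per_month": 1},
    {"id": 5, "annual_spend": 500.0, "visits_per_month": 5},
]
print(enrich_with_segments(customers))
```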
In addition, logic and algorithms can be built into a data pipeline to add intelligence.
As machine learning -- and, especially, automated machine learning (AutoML) -- processes become
more prevalent, data pipelines likely will become increasingly intelligent. With these processes,
intelligent data pipelines could continuously learn and adapt based on the characteristics of source
systems, required data transformations and enrichments, and evolving business and application
requirements.
Data pipeline architectures vary in their characteristics and use cases. Some of the most
common types include:
Batch Processing: Data is processed in batches at set intervals, such as daily or weekly.
Lambda Architecture: A combination of batch and real-time processing, where a batch layer computes
views over historical data while a speed layer processes newly arriving data in real time, and the two
sets of results are merged when queried (see the sketch after this list).
Kappa Architecture: Similar to Lambda architecture but with a single processing path: all data is
ingested in real time and processed only once, as a stream.
ETL (Extract, Transform, Load) Architecture: Data is extracted from various sources, transformed
to fit the target system, and loaded into the target system.
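A toy sketch of the Lambda pattern, with in-memory lists standing in for the batch store and the incoming stream (no particular framework is implied):

```python
from collections import Counter

def batch_layer(historical_events):
    """Batch view: counts precomputed over all historical data (recomputed on a schedule)."""
    return Counter(e["page"] for e in historical_events)

def speed_layer(recent_events):
    """Speed view: incremental counts over events that arrived since the last batch run."""
    return Counter(e["page"] for e in recent_events)

def serving_layer(batch_view, speed_view):
    """Merge the two views at query time to answer with complete, up-to-date results."""
    return batch_view + speed_view

historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
recent = [{"page": "/home"}, {"page": "/docs"}]
print(serving_layer(batch_layer(historical), speed_layer(recent)))
# Counter({'/home': 3, '/pricing': 1, '/docs': 1})
```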
A data pipeline architecture is essential for several reasons:
Scalability: Data pipeline architecture should allow for the efficient processing of large amounts of data,
enabling organizations to scale their data processing capabilities as their data volume increases.
Reliability: A well-designed data pipeline architecture ensures that data is processed accurately and
reliably. This reduces the risk of errors and inaccuracies in the data.
Efficiency: Data pipeline architecture streamlines the data processing workflow, making it more
efficient and reducing the time and resources required to process data.
Flexibility: It allows for the integration of different data sources and the ability to adapt to changing
business requirements.
Security: Data pipeline architecture enables organizations to implement security measures, such as
encryption and access controls, to protect sensitive data.
Data Governance: Data pipeline architecture allows organizations to implement data governance
practices such as data lineage, data quality, and data cataloguing that help maintain data accuracy,
completeness, and reliability.
Data pipelines can be compared to the plumbing system in the real world. Both are crucial channels that
meet basic needs, whether it’s moving data or water. Both systems can malfunction and require
maintenance.
In many companies, a team of data engineers will design and maintain data pipelines.
Data pipelines should be automated as much as possible to reduce the need for manual supervision.
However, even with data automation, businesses may still face challenges with their data pipelines:
1. Complexity: In large companies, there could be a large number of data pipelines in operation.
Managing and understanding all these pipelines at scale can be difficult, such as identifying which
pipelines are currently in use, how current they are, and what dashboards or reports rely on them.
In an environment with multiple data pipelines, tasks such as complying with regulations and
migrating to the cloud can become more complicated.
2. Cost: Building data pipelines at a large scale can be costly. Advancements in technology, migration
to the cloud, and demands for more data analysis may all require data engineers and developers to
create new pipelines. Managing multiple data pipelines may lead to increased operational expenses
as time goes by.
3. Efficiency: Data pipelines may lead to slow query performance depending on how data is replicated
and transferred within an organization. When there are many simultaneous requests or large
amounts of data, pipelines can become slow, particularly in situations that involve multiple data
replicas or use data virtualization techniques.
Data pipeline design patterns are templates used as a foundation for creating data pipelines. The choice
of design pattern depends on various factors, such as how data is received, the business use cases, and
the data volume. Some common design patterns include:
1) Raw Data Load: This pattern involves moving and loading raw data from one location to
another, such as between databases or from an on-premise data center to the cloud. However, this
pattern only focuses on the extraction and loading process and can be slow and time-consuming
with large data volumes. It works well for one-time operations but is not suitable for recurring
situations.
2) Extract, Transform, Load (ETL): This is a widely used pattern for loading data into data
warehouses, lakes, and operational data stores. It involves the extraction, transformation, and
loading of data from one location to another. However, most ETL processes use batch processing
which can introduce latency to operations.
3) Streaming ETL: Similar to the standard ETL pattern but with data streams as the origin, this
pattern uses tools like Apache Kafka or StreamSets Data Collector Engine for the complex ETL
processes.
4) Extract, Load, Transform (ELT): This pattern is similar to ETL, but the transformation
process happens after the data is loaded into the target destination, which can reduce
latency. However, this design can affect data quality and violate data privacy rules.
5) Change Data Capture (CDC): This pattern adds freshness to data processed with the
batch ETL pattern by detecting changes that occur in the source data and sending them
to message queues for downstream processing (as sketched after this list).
6) Data Stream Processing: This pattern is suitable for feeding real-time data to high-
performance applications such as IoT and financial applications.
Data is continuously received from devices, parsed and filtered, processed, and sent to various
destinations like dashboards for real-time applications.
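A minimal sketch of the CDC idea: diff the current source snapshot against the previously seen one and push only the changes onto a queue for downstream processing. The tables and queue here are illustrative; production CDC tools usually read database transaction logs rather than comparing snapshots:

```python
from queue import Queue

def capture_changes(previous, current, queue):
    """Detect inserts, updates and deletes by diffing two snapshots keyed by primary key."""
    for key, row in current.items():
        if key not in previous:
            queue.put({"op": "insert", "key": key, "row": row})
        elif row != previous[key]:
            queue.put({"op": "update", "key": key, "row": row})
    for key in previous.keys() - current.keys():
        queue.put({"op": "delete", "key": key})

changes = Queue()
previous = {1: {"name": "Ana", "city": "Pune"}, 2: {"name": "Ben", "city": "Delhi"}}
current = {1: {"name": "Ana", "city": "Mumbai"}, 3: {"name": "Chen", "city": "Goa"}}
capture_changes(previous, current, changes)

while not changes.empty():
    print(changes.get())  # downstream consumers would read from the queue instead
```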
1) Both data pipelines and ETL pipelines are responsible for transferring data between sources and storage
solutions, but they do so in different ways.
2) An ETL pipeline refers to a set of integration-related batch processes that run on a scheduled
basis. ETL jobs extract data from one or more systems, do basic data transformations and load
the data into a repository for analytics or operational uses.
3) A data pipeline, on the other hand, involves a more advanced set of data processing activities for
filtering, transforming and enriching data to meet user needs.
4) As mentioned above, a data pipeline can handle batch processing but also run in real-time mode,
either with streaming data or triggered by a predetermined rule or set of conditions. As a result,
an ETL pipeline can be seen as one form of a data pipeline.
Difference between ETL and ELT
Q) What is an ETL pipeline?
The purpose of an ETL pipeline is to prepare data for analytics and business intelligence. To provide
valuable insights, source data from various systems (CRMs, social media platforms, web
reporting, etc.) needs to be moved, consolidated and altered to fit the parameters and
functions of the destination database. An ETL pipeline is helpful for:
Centralizing and standardizing data, making it readily available to analysts and
decision-makers
Freeing up developers from technical implementation tasks for data movement and
maintenance, allowing them to focus on more purposeful work.
Data migration from legacy systems to a data warehouse
Deeper analytics after exhausting the insights provided by basic transformation
Characteristics of an ETL Pipeline
An ETL pipeline should:
Provide continuous data processing
Be elastic and agile
Use isolated, independent processing resources
Increase data access
Be easy to set up and maintain
An ETL pipeline is a set of processes that moves data from one or more sources into a database,
such as a data warehouse. ETL stands for "extract, transform, load," which are the three main
steps in the data integration process:
Extract: Pull data from the source
Transform: Clean, organize, and make the data usable
Load: Move the data into the target system
ETL pipelines are used to combine data from multiple sources into a single, consistent data
set. This makes the data easier to analyze and ensures that everyone accessing the information
has the most current data.
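A short sketch of that extract-transform-load flow, assuming pandas is available and using two made-up in-memory sources with inconsistent field names in place of real CRM and web-reporting systems; a SQLite file stands in for the warehouse:

```python
import sqlite3
import pandas as pd

# Extract: pull records from two hypothetical sources with inconsistent schemas.
crm_records = pd.DataFrame([{"CustomerId": 1, "Email": "a@example.com", "Spend": "120.50"}])
web_records = pd.DataFrame([{"customer_id": 2, "email": "b@example.com", "spend": 75.0}])

# Transform: standardize column names and types so the sources can be combined.
crm_clean = crm_records.rename(
    columns={"CustomerId": "customer_id", "Email": "email", "Spend": "spend"})
crm_clean["spend"] = crm_clean["spend"].astype(float)
combined = pd.concat([crm_clean, web_records], ignore_index=True)

# Load: write the consolidated data set to a SQLite table standing in for the warehouse.
with sqlite3.connect("analytics.db") as con:
    combined.to_sql("customers", con, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM customers", con))
```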
The key components of a data pipeline include:
1. Origin: The entry point for data in the pipeline, including data sources
(e.g., IoT sensors, APIs, social media) and storage systems (data warehouse
or lake).
2. Destination: The endpoint where data is transferred for visualization,
analysis, or storage.
3. Dataflow: Movement of data from origin to destination via ETL (Extract,
Transform, Load):
o Extract: Retrieve data from source systems.
o Transform: Reformat data in a staging area.
o Load: Save processed data to its final destination.
4. Storage: Systems for preserving data during pipeline stages. Choices
depend on data volume, query frequency, and use cases.
5. Processing: Activities to ingest, store, transform, and deliver data, e.g.,
database replication or streaming.
6. Workflow: Sequence of tasks in the pipeline and their dependencies (see the scheduler sketch after this list):
o Upstream: Source tasks that must complete first.
o Downstream: Dependent tasks or destinations.
7. Monitoring: Ensures efficiency, accuracy, and consistency in the pipeline
while preventing data loss.
8. Technology: Tools and infrastructure supporting the pipeline:
o ETL tools (Informatica, Apache Spark).
o Data warehouses (Amazon Redshift, Snowflake).
o Data lakes (Azure, IBM).
o Workflow schedulers (Airflow, Luigi).
o Streaming tools (Kafka, Flink).
o Programming languages (Python, Java).
9. Delivery: Making the final, processed data available to its intended users and applications.
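To illustrate the workflow component, the sketch below defines a small DAG with Apache Airflow, one of the schedulers listed above. The task bodies are placeholders, the DAG and task names are hypothetical, and Airflow 2.4 or later is assumed; the >> operators declare the upstream/downstream ordering described in item 6:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")        # placeholder task body

def transform():
    print("reformat data in the staging area")       # placeholder task body

def load():
    print("save processed data to the destination")  # placeholder task body

with DAG(
    dag_id="daily_sales_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Upstream tasks must complete before their downstream dependents run.
    extract_task >> transform_task >> load_task
```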
Traditional data pipelines followed the ETL (Extract, Transform, Load) approach, where data
is extracted from sources, transformed into a specific format in a staging area, and then loaded
into storage. However, with advancements in cloud computing, storage, and processing
technologies, the paradigm has shifted to ELT (Extract, Load, Transform).
Key Differences Between ETL and ELT:
1. Order of Operations:
o ETL: Transformation occurs before loading into storage.
o ELT: Data is loaded in its raw form and transformed post-loading (see the sketch after this list).
2. Processing Power:
o ETL: Relies on separate ETL tools for transformations, which can be resource-
intensive.
o ELT: Leverages the computational capabilities of modern data warehouses or
lakes for transformation.
3. Scalability:
o ETL: May struggle with massive data volumes as transformations happen
before storage.
o ELT: Handles large datasets efficiently by taking advantage of scalable cloud
storage and computing.
4. Flexibility:
o ETL: Requires pre-defined transformations, limiting flexibility for future use
cases.
o ELT: Stores raw data, enabling flexible transformations as new needs arise.
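A minimal ELT sketch: raw records are loaded as-is into a SQLite database standing in for a cloud warehouse, and the transformation runs afterwards as SQL inside that store. Table and column names are illustrative:

```python
import sqlite3

raw_events = [
    ("2024-01-01", "IN", "85.0"),
    ("2024-01-01", "us", "40.5"),
    ("2024-01-02", "IN", "19.9"),
]

con = sqlite3.connect(":memory:")

# Load: land the raw, untransformed data in the warehouse first.
con.execute("CREATE TABLE raw_events (event_date TEXT, country TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: run inside the warehouse, after loading, using its own compute.
con.execute("""
    CREATE TABLE daily_revenue AS
    SELECT event_date,
           UPPER(country)            AS country,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    GROUP BY event_date, UPPER(country)
""")

for row in con.execute("SELECT * FROM daily_revenue ORDER BY event_date, country"):
    print(row)
con.close()
```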
Advantages of ELT:
Faster Data Ingestion: Directly loading raw data reduces initial latency.
Cost-Effectiveness: Utilizes existing cloud storage and processing platforms instead of
dedicated ETL infrastructure.
Adaptability: Raw data supports diverse use cases like analytics, machine learning, or
ad hoc reporting.
Improved Performance: Modern data platforms like Snowflake and BigQuery
optimize transformation processing.
Use Cases:
ETL: Still relevant for legacy systems or highly regulated environments with strict data
requirements.
ELT: Ideal for modern, cloud-based systems with large-scale data and real-time
processing needs.
Chapter 5
Virtualization, Containerization & Elasticity in Cloud Computing:
Key Features of Cloud Elasticity:
Scalability: Resources scale up/down automatically with workload.
On-Demand: Pay only for what you use.
Automation: Tools like auto-scaling adjust resources as needed.
Examples:
Compute: Auto-scaling VMs or containers (e.g., AWS EC2).
Storage: Expanding/shrinking storage (e.g., Amazon S3).
Networking: Load balancers manage traffic surges.
Benefits:
Reduces costs by avoiding overprovisioning.
Enhances performance during demand spikes.
Ensures high availability and reliability.
Use Cases:
E-commerce sales traffic.
Streaming live events.
Big data analytics.
Explain Containerization in Cloud Computing
Containerization is a lightweight method of packaging software applications and their
dependencies (libraries, configurations, binaries) into isolated units called containers. These
containers can run consistently across different computing environments, from development to
production, ensuring seamless deployment.
1) Public container registries are generally the faster and easier route when setting up a container
registry.
2) Public registries are also generally seen as easier to use.
3) However, they may also be less secure than private registries.
4) They suit smaller teams and work well for standard, open source images pulled from public
registries.
1) Docker is a containerization platform used to package your application and all its dependencies
together in the form of containers, making sure that your application works seamlessly in any environment,
whether in development, testing or production.
2) Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
3) Docker is the world's leading software container platform. It was launched in 2013 by a company
called Dotcloud, Inc., which was later renamed Docker, Inc. It is written in the Go language.
Docker architecture consists of the Docker client, the Docker daemon running on the Docker host, and
the Docker Hub registry.
Docker has a client-server architecture in which the client communicates with the Docker daemon
running on the Docker host using a combination of REST APIs, UNIX sockets, and TCP.
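As a hedged illustration of the client talking to the daemon, the sketch below uses the Docker SDK for Python (the docker package, an assumed dependency) and requires a running Docker daemon on the host; the image name is just an example:

```python
import docker

# The client connects to the Docker daemon using the environment's settings
# (typically the local UNIX socket or a TCP endpoint).
client = docker.from_env()

# Ask the daemon to pull the image if needed and run a short-lived container.
output = client.containers.run("alpine:3.19", ["echo", "hello from a container"], remove=True)
print(output.decode().strip())

# The daemon also answers queries about the containers it manages.
for container in client.containers.list(all=True):
    print(container.name, container.status)
```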
What are the components of Docker?
Docker Clients and Servers: Docker has a client-server architecture. The Docker daemon/server
hosts and manages all containers.
The Docker daemon/server receives requests from the Docker client through the CLI or REST APIs and
processes them accordingly. The Docker client and daemon can be present on the same host or on
different hosts.
Advantages of Docker
1. Speed: Containers are lightweight and start quickly, reducing build and deployment
time.
2. Portability: Docker ensures consistent performance across environments, making it
easy to move applications.
3. Scalability: Docker can be deployed on multiple servers, data centers, and cloud
platforms, with seamless transitions between them.
4. Density: Docker uses resources efficiently, allowing more containers to run on a single
host with minimal overhead.
KUBERNETES IN CLOUD W.R.T. SCALING, PIPELINES
AND MICROSERVICES
Kubernetes in Cloud: Scaling (4 Marks)
1. Machine Learning Studio publishes models as web services that can easily be consumed
by custom apps or BI tools.
2. Amazon SageMaker is a fully-managed service that covers the entire machine learning
workflow: label and prepare your data, choose an algorithm, train the model, tune and
optimize it for deployment, make predictions, and take action.
3. It enables data scientists and developers to quickly and easily build, train, and deploy
machine learning models at any scale.
4. Your models get to production faster with much less effort and lower cost.