DE Unit-2
Data Engineering Life Cycle: Data Life Cycle Versus Data Engineering Life Cycle,
Generation: Source System, Storage, Ingestion, Transformation, Serving Data. Major
undercurrents across the Data Engineering Life Cycle: Security, Data Management,
DataOps, Data Architecture, Orchestration, Software Engineering.
1. Data Life Cycle Versus Data Engineering Life Cycle
Stages/phases of the Data Life Cycle:
1. Data Creation
2. Data Collection
3. Data Processing
4. Data Storage
5. Data Analysis
6. Data Sharing
7. Data Archiving/Deletion
Stages/phases of the Data Engineering Life Cycle:
1. Requirements Gathering
2. Data Ingestion
3. Data Storage Design
4. Data Processing
5. Data Monitoring & Optimization
2. Generation: Source Systems
Each source system has unique data volume and frequency characteristics. A data engineer should know how data is generated and any specific quirks of the system. It's also crucial to understand the system's limitations, such as whether running analytical queries could slow down its performance.
One of the most challenging variations of source data is the schema. The schema defines the
structure of data, from the overall system down to individual tables and fields. Handling
schema correctly is crucial, as the way data is structured impacts how data is ingested and
processed. There are two common approaches:
1. Schemaless Systems: In these systems, the schema is dynamic and defined as data is
written. This is often the case with NoSQL databases like MongoDB or when data is
written to a messaging queue or a blob storage.
2. Fixed Schema Systems: In more traditional relational database systems, the schema
is predefined, and any data written to the database must conform to it.
Data engineers need to adapt to schema evolution, as source systems may change over time.
For example, in an Agile development process, the schema may evolve to accommodate new
requirements, and data engineers must ensure that their data pipelines can handle these
changes without disrupting downstream analytics.
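For illustration, here is a minimal Python sketch (the field names, casting rules, and sample record are hypothetical) of one common way to absorb schema evolution: coerce schemaless source records into a fixed target schema and preserve unexpected new fields rather than failing the pipeline.

from datetime import datetime

# Hypothetical fixed target schema: field name -> casting function.
TARGET_SCHEMA = {
    "order_id": int,
    "amount": float,
    "created_at": datetime.fromisoformat,
}

def conform(record):
    """Cast a schemaless source record to the fixed target schema.

    Fields the source added later (schema evolution) are kept under
    'extra' instead of breaking downstream processing.
    """
    row = {}
    for field, cast in TARGET_SCHEMA.items():
        value = record.get(field)          # missing fields become None
        row[field] = cast(value) if value is not None else None
    row["extra"] = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
    return row

# Example: the source system recently added a 'coupon_code' field.
print(conform({"order_id": "42", "amount": "19.99",
               "created_at": "2024-01-05T10:30:00", "coupon_code": "SAVE10"}))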
3. Storage
When selecting a storage system for a data warehouse, lakehouse, database, or object storage, key questions include:
Is this storage solution compatible with the architecture’s required write and read
speeds?
Will storage create a bottleneck for downstream processes?
Do you understand how this storage technology works? Are you utilizing the storage
system optimally or committing unnatural acts?
Will this storage system handle anticipated future scale?
Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
Are you capturing metadata about schema evolution, data flows, data lineage, and so
forth?
Is this a pure storage solution (object storage), or does it support complex query
patterns (i.e., a cloud data warehouse)?
Is the storage system schema-agnostic, flexible schema, or enforced schema?
How are you tracking master data, golden records, data quality, and data lineage for
governance?
Data access frequency determines the "temperature" of the data:
Hot Data: Frequently accessed, often multiple times per day or even per second. Stored for fast retrieval, suitable for real-time systems.
Lukewarm Data: Accessed occasionally, such as weekly or monthly.
Cold Data: Rarely accessed, typically stored for compliance or backup purposes. Traditionally stored on tapes, but cloud vendors now offer low-cost archival options with high retrieval costs.
There is no universal storage solution—each technology comes with trade-offs, and the
choice should align with the specific needs of the data architecture.
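To make the "temperature" idea concrete, the sketch below maps a dataset's observed access frequency to a storage tier. The thresholds and tier descriptions are illustrative assumptions, not a standard; real decisions also weigh cost, SLAs, and retrieval fees.

def choose_storage_tier(accesses_per_day):
    """Map observed access frequency to a storage tier (illustrative thresholds)."""
    if accesses_per_day >= 1:            # hot: accessed at least daily
        return "hot - fast, SSD-backed storage, highest cost per GB"
    if accesses_per_day >= 1 / 30:       # lukewarm: roughly weekly or monthly
        return "lukewarm - standard object storage"
    return "cold - archival storage: cheap to keep, slow and costly to retrieve"

print(choose_storage_tier(500))      # hot
print(choose_storage_tier(0.1))      # lukewarm (about three times a month)
print(choose_storage_tier(0.001))    # cold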
4. Data Ingestion
Data ingestion is the next phase of the data engineering lifecycle, involving the collection of data from various source systems. After understanding the data sources and their characteristics, it becomes essential to ensure smooth and reliable data flow into storage, processing, and serving systems.
Source systems and the ingestion process often form bottlenecks in data pipelines, as these
systems may become unresponsive, provide poor-quality data, or encounter unexplained
failures. Any disruption in ingestion affects the entire data lifecycle. Addressing key
questions about source systems can help minimize these issues.
1. Use Cases: How will the ingested data be used? Can the same data be reused to avoid
multiple versions?
2. Data Reliability: Are the systems generating and ingesting data reliable? Is the data
available when required?
3. Data Destination: Where will the data be stored after ingestion?
4. Access Frequency: How often will the data be accessed?
5. Data Volume: What is the typical volume of incoming data?
6. Data Format: Can downstream storage and transformation systems handle this
format?
7. Data Quality: Is the data ready for immediate use? If so, for how long?
8. Data Transformation: If the data is streaming, does it need in-flight transformation
before reaching its destination?
Batch Ingestion
Processes data in large chunks based on a fixed time interval or size threshold.
Suitable for traditional systems, analytics, and machine learning (ML) tasks.
Introduces inherent latency for downstream consumers.
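A minimal sketch of batch ingestion, assuming a generic source iterator and a write_batch sink (both hypothetical): a batch is flushed when either the size threshold or the time interval is reached.

import time

BATCH_MAX_RECORDS = 1000     # size threshold (illustrative)
BATCH_MAX_SECONDS = 300      # time interval (illustrative)

def ingest_in_batches(source_iter, write_batch):
    """Accumulate records and flush a batch on a size or time threshold."""
    buffer, started = [], time.monotonic()
    for record in source_iter:
        buffer.append(record)
        if (len(buffer) >= BATCH_MAX_RECORDS
                or time.monotonic() - started >= BATCH_MAX_SECONDS):
            write_batch(buffer)                  # e.g., load into the warehouse
            buffer, started = [], time.monotonic()
    if buffer:                                   # flush the final partial batch
        write_batch(buffer)

# Toy stand-ins for the source and the sink:
ingest_in_batches(range(2500), lambda batch: print(f"loaded {len(batch)} records"))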
Streaming Ingestion
1. Data Flow Rate: Can downstream storage systems handle real-time data flow?
2. Timeliness Requirements: Is millisecond latency essential, or will minute-level
micro-batches suffice?
3. Use Cases: What are the benefits of real-time data over batch?
4. Cost vs. Benefit: Will a streaming-first approach be more expensive in time,
maintenance, and opportunity cost?
5. System Reliability: Is the streaming pipeline reliable with redundancy?
6. Tool Selection: Should you use managed services (like AWS Kinesis or Google
Cloud Pub/Sub) or deploy tools like Kafka or Flink?
7. Machine Learning: Does real-time ingestion benefit online predictions or continuous
training?
8. Source Impact: What is the impact of ingestion on the production instance?
Push Model
The source system sends data directly to a target (database, object store, or
filesystem).
Common in streaming ingestion for IoT sensors and real-time applications.
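As a sketch of the push model, the snippet below simulates an IoT sensor pushing readings toward a target as they occur. A simple in-memory queue stands in for a managed message queue or streaming platform so the example stays self-contained; the sensor name and fields are hypothetical.

import queue
import random
import time

events = queue.Queue()    # stand-in for a message queue / streaming topic

def sensor_push(sensor_id, n_readings):
    """Push model: the source emits events toward the target as they occur."""
    for _ in range(n_readings):
        events.put({"sensor": sensor_id,
                    "temp_c": round(random.uniform(18, 25), 2),
                    "ts": time.time()})

sensor_push("sensor-01", 3)
while not events.empty():
    print(events.get())       # the ingestion side simply receives what was pushed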
Pull Model
1. ETL Process: Typically uses a pull model, querying a source database snapshot
periodically.
2. Change Data Capture (CDC):
o Push-based CDC: A message is triggered and sent to a queue for ingestion
when a row changes in the database.
o Pull-based CDC: Timestamp-based queries pull rows that changed since the
last update.
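The pull-based CDC pattern above can be sketched as follows. The orders table, its updated_at column, and the watermark value are hypothetical, and sqlite3 from the Python standard library stands in for the source database.

import sqlite3

# Hypothetical source table with an updated_at column to support pull-based CDC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 50.0, "2024-01-01T09:00:00"),
                  (2, 75.0, "2024-01-02T14:30:00")])

def pull_changes(last_watermark):
    """Pull only the rows changed since the last successful extraction."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,)).fetchall()

changed = pull_changes("2024-01-01T12:00:00")
print(changed)    # only the row updated after the watermark
new_watermark = changed[-1][2] if changed else "2024-01-01T12:00:00"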
By carefully considering batch versus streaming and push versus pull models, data engineers
can design efficient, reliable ingestion systems tailored to specific use cases.
5. Transformation
Transformation is a crucial stage in the data engineering lifecycle, occurring after data
ingestion and storage. It involves converting raw data into a structured form that is useful for
downstream applications, such as reporting, analysis, or machine learning. Without proper
transformations, data remains inert and cannot provide value to users.
Immediately after ingestion, basic transformations are applied, such as mapping data into
correct types by converting string data into numeric or date types. Records are standardized
into consistent formats, and invalid or corrupt records are removed. In later stages,
transformations may include schema changes and data normalization. Large-scale
aggregation may be used for reporting, while data featurization prepares data for machine
learning processes.
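The basic transformations described above can be sketched with pandas (assuming pandas is installed; the field names and sample records are hypothetical): strings are cast into numeric and date types, and records that fail to parse are dropped.

import pandas as pd

# Hypothetical raw records as they might arrive from ingestion (all strings).
raw = pd.DataFrame({
    "order_id":   ["1001", "1002", "bad-id", "1004"],
    "amount":     ["30.00", "12.50", "99.99", "not-a-number"],
    "ordered_at": ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"],
})

# Map fields into correct types and remove invalid or corrupt records.
clean = raw.assign(
    order_id=pd.to_numeric(raw["order_id"], errors="coerce"),
    amount=pd.to_numeric(raw["amount"], errors="coerce"),
    ordered_at=pd.to_datetime(raw["ordered_at"], errors="coerce"),
).dropna()

print(clean)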
Key Considerations for Transformation
When transforming data, it’s essential to address the following:
Cost and ROI: What’s the cost and return on investment (ROI) of the
transformation? What is the associated business value?
Simplicity and Isolation: Is the transformation as simple and self-isolated as
possible?
Business Rules: What business rules do the transformations support?
Data Movement: Am I minimizing data movement between the transformation and
the storage system during transformation?
Batch vs. Streaming Transformations
Batch Processing: Processes large volumes of data at once. It remains widely
popular.
Streaming Processing: Handles continuous data streams in real time. As streaming
data grows, this method is expected to gain dominance, potentially replacing batch
processing in certain areas.
Transformation Entanglement
In practice, transformation often overlaps with other phases of the data lifecycle:
In Source Systems: Data may be transformed before being ingested (e.g., a source
system adding timestamps).
In Flight: Data may be enriched with additional fields or calculations while in a
streaming pipeline before being stored. Transformations occur throughout the
lifecycle, including data preparation, wrangling, and cleaning tasks that enhance data
value.
Role of Business Logic in Transformation
Business logic often drives data transformation, particularly in data modeling. Business rules
translate raw data into meaningful information. For example:
A retail transaction might show "someone bought 12 picture frames for $30 each,
totaling $360."
Proper transformation ensures this data is modeled with accounting logic to give a
clear picture of the business's financial health.
To ensure consistency, it's crucial to follow a standardized approach to implementing
business logic across transformations.
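As a toy sketch of standardized business logic, the function below applies the same accounting rule everywhere an order total is computed; the function and field names are illustrative.

def apply_order_accounting(line_item):
    """Business rule: order total = quantity x unit price."""
    return {**line_item,
            "order_total": line_item["quantity"] * line_item["unit_price"]}

print(apply_order_accounting({"item": "picture frame",
                              "quantity": 12, "unit_price": 30.0}))
# -> includes 'order_total': 360.0, matching the example above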
Data Featurization for ML
Featurization is another important transformation process aimed at extracting features from
data for ML models. It requires:
Domain Expertise: Understanding which features are valuable for predictions.
Data Science Knowledge: Experience in data manipulation and modeling.
Once data scientists determine the necessary features, data engineers can automate the
featurization processes as part of the transformation stage.
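A small sketch of automated featurization follows; the raw fields and derived features are hypothetical examples of what a data scientist might specify and a data engineer would then automate in the transformation stage.

from datetime import datetime

def featurize(transaction):
    """Derive ML features from a raw transaction record."""
    ts = datetime.fromisoformat(transaction["ts"])
    return {
        "amount": transaction["amount"],
        "hour_of_day": ts.hour,                          # time-based feature
        "is_weekend": int(ts.weekday() >= 5),            # behavioral signal
        "high_value": int(transaction["amount"] > 100),  # thresholded flag
    }

print(featurize({"amount": 360.0, "ts": "2024-03-02T15:45:00"}))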
Transformation is a complex but essential phase, turning raw data into actionable information
that supports business and technical needs. Further discussions on queries, data modeling,
and transformation techniques offer a more detailed understanding of this vital process.
6. Serving Data
The final stage of the data engineering lifecycle involves using data that has been ingested,
stored, and transformed into valuable structures. The value of data depends on its practical
usage. If data remains unused, it becomes meaningless. In the past, many companies amassed
vast datasets that were never effectively utilized. This risk persists with modern cloud-based
systems. Data projects should be intentional and focused on achieving specific business
outcomes.
Data Serving
Data serving is where the true potential of data is realized. It enables analytics, machine
learning (ML), and reverse ETL.
Analytics
Analytics is fundamental to data projects. Once data is stored and transformed, it can be used to generate reports, dashboards, and ad hoc analyses. Analytics encompasses several forms, including business intelligence, operational analytics, and embedded analytics.
8. Security:
Security is a critical concern for data engineers, and ignoring it can lead to serious consequences. It is therefore the first key undercurrent in the data engineering lifecycle.
Data engineers must have a strong understanding of both data security and access
security, adhering to the principle of least privilege—granting users or systems only
the minimum access necessary to perform their tasks.
A common mistake inexperienced data engineers make is granting admin access to
all users, which is a recipe for disaster. Instead, access should be limited strictly to
what is needed for current job functions.
Security vulnerabilities often stem from human errors and organizational issues.
Many high-profile security breaches are the result of neglected security practices,
phishing attacks, or irresponsible actions by internal staff.
Therefore, building a security-conscious culture is essential. Everyone with access to
data must recognize their role in protecting sensitive company and customer
information.
Timely access control is another vital aspect—granting data access only to those who
need it and only for as long as necessary.
Data protection should involve methods such as encryption, tokenization, data
masking, and robust access control mechanisms to prevent unauthorized visibility.
Data engineers need to be competent security administrators, understanding security
practices for both cloud environments and on-premises systems.
Key areas to master include user and identity access management, roles, policies,
network security, password protocols, and encryption techniques.
Throughout the data engineering lifecycle, maintaining a focus on security is essential
to safeguard data and systems effectively.
9. Data Management
Data management is defined as the development, execution, and supervision of plans,
policies, programs, and practices that ensure the control, protection, and enhancement
of data throughout its lifecycle. The goal is to maximize the value of data and
information assets, ensuring they are properly handled from creation to usage.
Evolving Role of Data Engineers:
Historically, data management practices were reserved for large enterprises, but now
even smaller companies are adopting these best practices, such as data governance,
master data management, and data quality management.
As tools become simpler and more accessible, data engineers can focus more on
strategic management of data, becoming integral to the organization’s broader data-
driven goals.
Importance of Data Management:
Data is treated as an asset, just like financial resources or physical goods. Proper data
management ensures its value is realized, maintained, and protected. Without a
structured data management framework, data engineers risk operating without a clear
vision, reducing the effectiveness of their work.
Key Areas of Data Management:
Data Governance: Ensures data quality, integrity, security, and usability. It aligns
people, processes, and technology to maximize data value while safeguarding it.
Data Modeling and Design: Converts raw data into a structured, usable form that can
support business analytics and decision-making.
Data Lineage: Tracks the flow and transformation of data across systems, providing
transparency and auditing capabilities.
Data Integration and Interoperability: Ensures smooth communication between
different data systems and platforms.
Data Lifecycle Management: Manages the data from its creation to its archival or
deletion, ensuring that it remains usable throughout its life.
Data Systems for Advanced Analytics and ML: Supports advanced analytics and
machine learning by ensuring that data is appropriately structured and stored.
Ethics and Privacy: Ensures that data is collected, stored, and used in a way that
complies with privacy laws and ethical guidelines.
Data Governance in Practice:
Data governance ensures that data is not only accurate and complete but also properly
secured and accessible. Poor data governance practices can lead to issues such as
untrusted data or security breaches, which undermine the company’s ability to make
informed decisions.
The key pillars of data governance include discoverability, security, and
accountability, which are supported by practices such as metadata management and
data quality control.
Metadata Management:
Metadata (data about data) is essential for making data discoverable and manageable. It is produced in two ways: autogenerated (by systems) and human-generated (drawn from internal knowledge). Data engineers use metadata to understand the context, relationships, and lineage of data.
Metadata falls into four categories:
o Business Metadata: Describes how data is used within the business context.
o Technical Metadata: Covers the technical aspects, such as schema, data
lineage, and system dependencies.
o Operational Metadata: Includes process logs, error logs, and job statistics.
o Reference Metadata: Provides lookup data, such as geographic codes or
standard measurement units.
Data Accountability:
Data accountability assigns responsibility to individuals who manage specific data
assets. These individuals coordinate with other stakeholders to ensure data quality and
adherence to governance practices.
It’s crucial that the accountable parties are not just data engineers but may also
include product managers, software engineers, and others who touch the data.
Data Quality:
Data quality refers to the trustworthiness and completeness of data. Data engineers are
responsible for ensuring that data is accurate, complete, and timely. This includes
handling challenges like distinguishing between human and machine-generated data
or dealing with late-arriving data, such as in the case of delayed ad views in video
platforms.
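A minimal sketch of automated data-quality checks a data engineer might run after ingestion (the rules, thresholds, and field names are illustrative):

from datetime import datetime, timedelta, timezone

def check_quality(rows, max_lag_hours=24):
    """Return data-quality issues: completeness, validity, and timeliness."""
    issues = []
    now = datetime.now(timezone.utc)
    for i, row in enumerate(rows):
        if row.get("user_id") is None:
            issues.append(f"row {i}: missing user_id (completeness)")
        if not isinstance(row.get("views"), int) or row["views"] < 0:
            issues.append(f"row {i}: invalid view count (validity)")
        if now - row["event_time"] > timedelta(hours=max_lag_hours):
            issues.append(f"row {i}: late-arriving data (timeliness)")
    return issues

sample = [
    {"user_id": 1, "views": 3, "event_time": datetime.now(timezone.utc)},
    {"user_id": None, "views": -1,
     "event_time": datetime.now(timezone.utc) - timedelta(days=3)},
]
print(check_quality(sample))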
Data Modeling and Design:
To derive insights from data, it must be transformed into a usable format through data
modeling and design. While traditionally handled by database administrators (DBAs)
and ETL developers, data modeling is increasingly a part of the data engineer’s role.
10. DataOps
DataOps is an approach that applies the best practices of Agile, DevOps, and
Statistical Process Control (SPC) to manage data, aiming to improve the speed,
quality, and consistency of data products in the same way DevOps enhances software
products.
Data products, unlike software products, focus on business logic and metrics to
support decision-making or automated actions, which means data engineers need to
understand both technical software development and the business aspects of data.
A key aspect of DataOps is its cultural foundation, which involves constant
communication and collaboration between data teams and the business, breaking
down silos, learning from both successes and failures, and iterating rapidly.
Once these cultural practices are in place, the use of technology and tools becomes
more effective. Depending on a company's data maturity, DataOps can be integrated
from scratch (greenfield) or added incrementally to existing workflows.
The first steps often involve improving monitoring and observability before
automating processes and refining incident response.
The three technical pillars of DataOps are automation, monitoring and observability, and incident response. These elements guide data engineering work across the lifecycle.
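To illustrate the monitoring-and-observability pillar, here is a minimal sketch of a check that would feed incident response; the thresholds and metrics are hypothetical.

import logging
from datetime import datetime, timedelta, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dataops-monitor")

def monitor_table(row_count, last_loaded_at, min_rows=1000, max_staleness_hours=6):
    """Alert when data volume or freshness falls outside expected bounds."""
    healthy = True
    if row_count < min_rows:
        log.error("volume alert: %d rows loaded, expected at least %d",
                  row_count, min_rows)
        healthy = False
    if datetime.now(timezone.utc) - last_loaded_at > timedelta(hours=max_staleness_hours):
        log.error("freshness alert: last successful load is too old")
        healthy = False
    return healthy     # incident response would page the on-call engineer on False

monitor_table(42, datetime.now(timezone.utc) - timedelta(hours=12))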
11. Orchestration
Orchestration is the process of coordinating multiple tasks or jobs to run efficiently
and on schedule.
It involves managing job dependencies, typically through a directed acyclic graph
(DAG), which determines the order and timing of tasks.
Orchestration tools like Apache Airflow go beyond simple scheduling tools (like cron)
by tracking job dependencies and managing tasks based on metadata.
Orchestration systems ensure that jobs run on time, monitor for errors, and send alerts
when things go wrong.
They can handle dependencies, such as making sure a report doesn’t run until all
necessary data has been processed. These systems also offer features like job history,
visualization, and backfilling (filling in missing tasks if needed).
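As a sketch (assuming a recent Apache Airflow 2.x installation; the DAG name, schedule, and tasks are illustrative), a simple DAG expresses dependencies so that the report task runs only after ingestion and transformation succeed:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="daily_sales_pipeline",        # illustrative pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # run once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest",
                            python_callable=lambda: print("ingesting"))
    transform = PythonOperator(task_id="transform",
                               python_callable=lambda: print("transforming"))
    report = PythonOperator(task_id="report",
                            python_callable=lambda: print("building report"))

    # The dependencies form a directed acyclic graph (DAG):
    # the report waits for all upstream tasks to finish.
    ingest >> transform >> report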
Orchestration has long been important for data processing but was once accessible mainly to large companies. Earlier tools like Apache Oozie were used for job management but were complex to operate and difficult for smaller companies to adopt.
Airflow, developed by Airbnb in 2014, was open-source, written in Python, and
highly adaptable, making it widely popular. Other orchestration tools, like Prefect and
Dagster, aim to improve portability and testability of tasks, while Argo and Metaflow
focus on different aspects like Kubernetes integration and data science.
However, orchestration is a batch process, meaning it handles tasks in chunks. For
real-time data processing, streaming DAGs are used, but they’re more complex to
build. New platforms, like Pulsar, are working to make streaming orchestration easier.