
UNIT-2

Data Engineering Life Cycle: Data Life Cycle Versus Data Engineering Life Cycle,
Generation: Source System, Storage, Ingestion, Transformation, Serving Data. Major
undercurrents across the Data Engineering Life Cycle: Security, Data Management,
DataOps, Data Architecture, Orchestration, Software Engineering.

1. Data Life Cycle Vs Data Engineering Life Cycle


The Data Life Cycle and Data Engineering Life Cycle are related but focus on different
aspects of handling data. Here's a breakdown of their differences:

Definition
 Data Life Cycle: The stages data goes through from creation to disposal.
 Data Engineering Life Cycle: The process of designing, building, and maintaining data infrastructure.

Focus
 Data Life Cycle: Data management and usage over time.
 Data Engineering Life Cycle: Data infrastructure and processing workflows.

Stages/Phases
 Data Life Cycle: 1. Data Creation, 2. Data Collection, 3. Data Processing, 4. Data Storage, 5. Data Analysis, 6. Data Sharing, 7. Data Archiving/Deletion.
 Data Engineering Life Cycle: 1. Requirements Gathering, 2. Data Ingestion, 3. Data Storage Design, 4. Data Processing, 5. Data Monitoring & Optimization.

Objective
 Data Life Cycle: Ensuring data is useful and properly handled.
 Data Engineering Life Cycle: Ensuring data is useful and properly handled.

Stakeholders
 Data Life Cycle: Business users, data scientists, compliance officers.
 Data Engineering Life Cycle: Data engineers, software developers, IT administrators.

Tools Used
 Data Life Cycle: Excel, SQL, BI tools (Power BI, Tableau).
 Data Engineering Life Cycle: ETL tools (Apache NiFi, Airflow), databases (Snowflake, Redshift), big data tools (Hadoop, Spark).

End Goal
 Data Life Cycle: Meaningful insights and compliance.
 Data Engineering Life Cycle: Efficient data pipelines and infrastructure.

2. Generation: Source System


A source system is the origin of data used throughout the data engineering process. Examples
include IoT devices, application message queues, and transactional databases. Data engineers
consume data from these systems but do not typically own or control them. They must
understand how source systems generate data, the speed and frequency of data flow, and the
variety of data types involved.
Maintaining communication with source system owners is crucial to handle changes that may
affect data pipelines and analytics. Changes such as modifications to application code or
migration to a new database can impact data structures and flows.
Examples of Source Systems
1. Traditional Source System: Application Database
 Consists of application servers connected to a relational database management
system (RDBMS).
 This pattern has been in use since the 1980s and remains popular today with
microservices architecture, where each service has its own database.
 Example: An e-commerce platform where product and customer data are
stored in a MySQL or PostgreSQL database.

2. Modern Source System: IoT Swarm


 Comprises numerous IoT devices (e.g., sensors, smart appliances) sending
data to a central system via message queues or cloud services.
 These systems generate high-velocity, real-time data streams that need
processing and analysis.
 Example: A network of weather sensors sending temperature and humidity
data to a cloud-based collection system.
Key Considerations for Evaluating Source Systems

When working with a source system, data engineers should ask:

1. What type of system is it?
o Is it an application, IoT devices, or something else?
2. How is data stored?
o Is the data stored long-term or deleted after a short time?
3. How fast is data generated?
o Are there millions of events per second, or a few per hour?
4. Is the data reliable?
o Are there missing values, incorrect formats, or duplicates?
5. How often do errors occur?
o Does the system have frequent failures?
6. Does data arrive late?
o Some messages might be delayed due to network issues.
7. What is the data structure (schema)?
o Is the data spread across multiple tables or systems?
o How are schema changes handled?
8. How often should data be collected?
o Is data collected in real-time or at fixed intervals?
9. Will reading data slow down the system?
o Extracting data could impact system performance.

Understanding Source System Limits

Each source system has unique data volume and frequency characteristics. A data
engineer should know how data is generated and any specific quirks of the system. It's
also crucial to understand the system's limitations, such as whether running analytical
queries could slow down its performance.

One of the most challenging variations of source data is the schema. The schema defines the
structure of data, from the overall system down to individual tables and fields. Handling
schema correctly is crucial, as the way data is structured impacts how data is ingested and
processed. There are two common approaches:

1. Schemaless Systems: In these systems, the schema is dynamic and defined as data is
written. This is often the case with NoSQL databases like MongoDB or when data is
written to a messaging queue or a blob storage.
2. Fixed Schema Systems: In more traditional relational database systems, the schema
is predefined, and any data written to the database must conform to it.

Data engineers need to adapt to schema evolution, as source systems may change over time.
For example, in an Agile development process, the schema may evolve to accommodate new
requirements, and data engineers must ensure that their data pipelines can handle these
changes without disrupting downstream analytics.
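
As a concrete illustration, the minimal Python sketch below shows one hedged way a pipeline might tolerate schema evolution: unknown fields are set aside instead of breaking ingestion. The field names and the split_record helper are illustrative, not part of any specific tool.

```python
# Minimal sketch of tolerating schema evolution during ingestion.
# EXPECTED_FIELDS and split_record are illustrative names, not a standard API.
EXPECTED_FIELDS = {"id", "email", "created_at"}

def split_record(record: dict):
    """Separate fields the pipeline already knows about from newly added ones."""
    known = {k: v for k, v in record.items() if k in EXPECTED_FIELDS}
    extras = {k: v for k, v in record.items() if k not in EXPECTED_FIELDS}
    return known, extras

# A record from an evolved source schema: "loyalty_tier" is a new field.
known, extras = split_record({"id": 1, "email": "a@example.com", "loyalty_tier": "gold"})
# known  -> loaded into the existing table
# extras -> logged or landed in a catch-all column until the schema is updated
```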

3. Storage in Data Engineering


Once data is ingested, it must be stored appropriately. Selecting the right storage solution is
crucial for the success of the entire data lifecycle, yet it is one of the most complex stages due
to several factors.

1. Complexity of Storage Solutions:
o Cloud architectures often use multiple storage solutions.
o Many storage solutions offer not only storage but also data transformation
capabilities (e.g., Amazon S3 Select).
o Storage overlaps with other lifecycle stages such as ingestion, transformation,
and serving.
2. Impact Across the Data Lifecycle:
o Storage occurs at multiple points in a data pipeline and affects processes
across the lifecycle.
o Cloud data warehouses can store, process, and serve data.
o Streaming platforms like Apache Kafka and Pulsar serve as ingestion, storage,
and query systems, with object storage as a common data transmission layer.

Key Considerations for Evaluating Storage Systems:

When selecting a storage system for a data warehouse, lakehouse, database, or object storage,
key questions include:

 Is this storage solution compatible with the architecture’s required write and read
speeds?
 Will storage create a bottleneck for downstream processes?
 Do you understand how this storage technology works? Are you utilizing the storage
system optimally or committing unnatural acts?
 Will this storage system handle anticipated future scale?
 Will downstream users and processes be able to retrieve data within the required service-level agreement (SLA)?
 Are you capturing metadata about schema evolution, data flows, data lineage, and so
forth?
 Is this a pure storage solution (object storage), or does it support complex query
patterns (i.e., a cloud data warehouse)?
 Is the storage system schema-agnostic, flexible schema, or enforced schema?
 How are you tracking master data, golden records, data quality, and data lineage for
governance?

Understanding Data Access Frequency ("Data Temperatures")

Data is accessed at different frequencies, leading to classification based on "temperature":

 Hot Data: Frequently accessed, often multiple times per day or second. Stored for
fast retrieval, suitable for real-time systems.
 Lukewarm Data: Accessed occasionally, such as weekly or monthly.
 Cold Data: Rarely accessed, typically stored for compliance or backup purposes.
Traditionally stored on tapes, but cloud vendors now offer low-cost archival options
with high retrieval costs.
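
To make the hot/cold distinction concrete, the sketch below shows one way cold data might be transitioned to archival storage on a cloud object store. It assumes an AWS environment with boto3 installed and configured credentials; the bucket name, prefix, and 90-day threshold are placeholders.

```python
# Minimal sketch: transition rarely accessed objects to an archival storage class.
# Bucket name, prefix, and the 90-day threshold are illustrative placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                # After 90 days, move objects to low-cost archival storage.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```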

Selecting the Right Storage Solution

The choice of storage depends on various factors:

 Use Cases: Different storage types suit different needs.
 Data Volume: Large volumes may require scalable solutions.
 Ingestion Frequency: High-frequency ingestion may need specialized storage.
 Data Format and Size: The structure and size influence storage decisions.

There is no universal storage solution—each technology comes with trade-offs, and the
choice should align with the specific needs of the data architecture.

4. Data Ingestion
Data ingestion is the stage of the data engineering lifecycle in which data is collected
from various source systems. After understanding the data sources and their
characteristics, it becomes essential to ensure smooth and reliable data flow into storage,
processing, and serving systems.

Source systems and the ingestion process often form bottlenecks in data pipelines, as these
systems may become unresponsive, provide poor-quality data, or encounter unexplained
failures. Any disruption in ingestion affects the entire data lifecycle. Addressing key
questions about source systems can help minimize these issues.

Key Engineering Considerations for Ingestion

When designing an ingestion system, here are crucial questions to address:

1. Use Cases: How will the ingested data be used? Can the same data be reused to avoid
multiple versions?
2. Data Reliability: Are the systems generating and ingesting data reliable? Is the data
available when required?
3. Data Destination: Where will the data be stored after ingestion?
4. Access Frequency: How often will the data be accessed?
5. Data Volume: What is the typical volume of incoming data?
6. Data Format: Can downstream storage and transformation systems handle this
format?
7. Data Quality: Is the data ready for immediate use? If so, for how long?
8. Data Transformation: If the data is streaming, does it need in-flight transformation
before reaching its destination?

Batch vs. Streaming Ingestion

Batch Ingestion

 Processes data in large chunks based on a fixed time interval or size threshold.
 Suitable for traditional systems, analytics, and machine learning (ML) tasks.
 Introduces inherent latency for downstream consumers.

Streaming Ingestion

 Provides continuous, real-time data to downstream systems.
 Real-time typically means data availability less than one second after production.
 Offers immediate insights but demands more robust infrastructure.
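
As a hedged illustration of streaming ingestion, the sketch below consumes events from a Kafka topic as they arrive. It assumes the kafka-python client is installed; the topic name, broker address, and event fields are placeholders.

```python
# Minimal streaming-ingestion sketch using a Kafka consumer (kafka-python).
# Topic, broker address, and field names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:  # yields events continuously as they are produced
    event = message.value
    # Hand each event to downstream storage or processing within the latency budget.
    print(event.get("device_id"), event.get("temperature"))
```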

Key Considerations for Choosing Batch or Streaming

1. Data Flow Rate: Can downstream storage systems handle real-time data flow?
2. Timeliness Requirements: Is millisecond latency essential, or will minute-level
micro-batches suffice?
3. Use Cases: What are the benefits of real-time data over batch?
4. Cost vs. Benefit: Will a streaming-first approach be more expensive in time,
maintenance, and opportunity cost?
5. System Reliability: Is the streaming pipeline reliable with redundancy?
6. Tool Selection: Should you use managed services (like AWS Kinesis or Google
Cloud Pub/Sub) or deploy tools like Kafka or Flink?
7. Machine Learning: Does real-time ingestion benefit online predictions or continuous
training?
8. Source Impact: What is the impact of ingestion on the production instance?

Push vs. Pull Models

Push Model

 The source system sends data directly to a target (database, object store, or
filesystem).
 Common in streaming ingestion for IoT sensors and real-time applications.

Pull Model

 The ingestion system retrieves data from the source system.
 Common in ETL processes where data snapshots are queried on a schedule.
Examples of Push and Pull in Ingestion

1. ETL Process: Typically uses a pull model, querying a source database snapshot
periodically.
2. Change Data Capture (CDC):
o Push-based CDC: A message is triggered and sent to a queue for ingestion
when a row changes in the database.
o Pull-based CDC: Timestamp-based queries pull rows that changed since the
last update.
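
The pull-based variant can be illustrated with a short sketch that queries rows whose update timestamp is newer than the last extraction. The connection URL, table, and column names below are assumptions, not any specific system's schema.

```python
# Minimal pull-based CDC sketch: fetch rows changed since the last extraction.
# Connection URL, table, and column names are placeholders.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@host/db")

def pull_changes(last_seen):
    query = sa.text(
        "SELECT * FROM orders WHERE updated_at > :last_seen ORDER BY updated_at"
    )
    with engine.connect() as conn:
        rows = conn.execute(query, {"last_seen": last_seen}).fetchall()
    # Persist the newest updated_at value so the next run resumes from there.
    return rows
```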

In modern data pipelines, push-based streaming is increasingly favored as it simplifies real-time processing and allows tailored messages for downstream analytics.

By carefully considering batch versus streaming and push versus pull models, data engineers
can design efficient, reliable ingestion systems tailored to specific use cases.

5. Transformation
Transformation is a crucial stage in the data engineering lifecycle, occurring after data
ingestion and storage. It involves converting raw data into a structured form that is useful for
downstream applications, such as reporting, analysis, or machine learning. Without proper
transformations, data remains inert and cannot provide value to users.
Immediately after ingestion, basic transformations are applied, such as mapping data into
correct types by converting string data into numeric or date types. Records are standardized
into consistent formats, and invalid or corrupt records are removed. In later stages,
transformations may include schema changes and data normalization. Large-scale
aggregation may be used for reporting, while data featurization prepares data for machine
learning processes.
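A minimal pandas sketch of these basic post-ingestion transformations, with illustrative column names and sample values:

```python
# Minimal sketch of post-ingestion cleanup: cast types and drop corrupt records.
# Column names and sample values are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "order_id":   ["1001", "1002", "bad-id"],
    "amount":     ["30.00", "12.50", None],
    "ordered_at": ["2024-01-05", "2024-01-06", "not a date"],
})

clean = raw.copy()
clean["order_id"] = pd.to_numeric(clean["order_id"], errors="coerce")      # string -> numeric
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
clean["ordered_at"] = pd.to_datetime(clean["ordered_at"], errors="coerce") # string -> date
clean = clean.dropna()  # remove invalid or corrupt records
```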
Key Considerations for Transformation
When transforming data, it’s essential to address the following:
 Cost and ROI: What’s the cost and return on investment (ROI) of the
transformation? What is the associated business value?
 Simplicity and Isolation: Is the transformation as simple and self-isolated as
possible?
 Business Rules: What business rules do the transformations support?
 Data Movement: Am I minimizing data movement between the transformation and
the storage system during transformation?
Batch vs. Streaming Transformations
 Batch Processing: Processes large volumes of data at once. It remains widely
popular.
 Streaming Processing: Handles continuous data streams in real time. As streaming
data grows, this method is expected to gain dominance, potentially replacing batch
processing in certain areas.
Transformation Entanglement
In practice, transformation often overlaps with other phases of the data lifecycle:
 In Source Systems: Data may be transformed before being ingested (e.g., a source
system adding timestamps).
 In Flight: Data may be enriched with additional fields or calculations while in a
streaming pipeline before being stored. Transformations occur throughout the
lifecycle, including data preparation, wrangling, and cleaning tasks that enhance data
value.
Role of Business Logic in Transformation
Business logic often drives data transformation, particularly in data modeling. Business rules
translate raw data into meaningful information. For example:
 A retail transaction might show "someone bought 12 picture frames for $30 each,
totaling $360."
 Proper transformation ensures this data is modeled with accounting logic to give a
clear picture of the business's financial health.
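As a minimal sketch, that business rule can be encoded once so every pipeline computes it the same way (the function name is illustrative):

```python
# Minimal sketch: a business rule (line total = quantity x unit price) defined once
# and reused wherever the transformation runs. The function name is illustrative.
def line_total(quantity: int, unit_price: float) -> float:
    return quantity * unit_price

assert line_total(12, 30.0) == 360.0  # the picture-frame example above
```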
To ensure consistency, it's crucial to follow a standardized approach to implementing
business logic across transformations.
Data Featurization for ML
Featurization is another important transformation process aimed at extracting features from
data for ML models. It requires:
 Domain Expertise: Understanding which features are valuable for predictions.
 Data Science Knowledge: Experience in data manipulation and modeling.
Once data scientists determine the necessary features, data engineers can automate the
featurization processes as part of the transformation stage.
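A minimal sketch of automated featurization, assuming the data scientists have asked for simple per-customer spend features (column and feature names are illustrative):

```python
# Minimal featurization sketch: derive per-customer features from raw transactions.
# Column and feature names are illustrative; in practice data scientists define them.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount":      [360.0, 45.0, 120.0],
    "ordered_at":  pd.to_datetime(["2024-01-05", "2024-02-01", "2024-01-20"]),
})

features = (
    transactions.groupby("customer_id")
    .agg(total_spend=("amount", "sum"),
         order_count=("amount", "count"),
         last_order=("ordered_at", "max"))
    .reset_index()
)
```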
Transformation is a complex but essential phase, turning raw data into actionable information
that supports business and technical needs. Further discussions on queries, data modeling,
and transformation techniques offer a more detailed understanding of this vital process.

6. Serving Data
The final stage of the data engineering lifecycle involves using data that has been ingested,
stored, and transformed into valuable structures. The value of data depends on its practical
usage. If data remains unused, it becomes meaningless. In the past, many companies amassed
vast datasets that were never effectively utilized. This risk persists with modern cloud-based
systems. Data projects should be intentional and focused on achieving specific business
outcomes.
Data Serving
Data serving is where the true potential of data is realized. It enables analytics, machine
learning (ML), and reverse ETL.
Analytics
Analytics is fundamental to data projects. Once data is stored and transformed, it can be used
to generate reports, dashboards, and ad hoc analyses. Analytics encompasses several forms:

1. Business Intelligence (BI)


BI uses data to understand past and present business conditions by applying business
logic. This logic can be embedded in the data during the transformation stage or
applied at query time. Mature companies often transition to self-service analytics,
allowing employees to explore data independently without IT support. However, this
is challenging due to data quality issues and organizational barriers.
2. Operational Analytics
Operational analytics provides real-time insights that users can act upon immediately,
such as live inventory views or website monitoring dashboards. Data here is
consumed directly from source systems or real-time pipelines, focusing on present
insights rather than historical trends.
3. Embedded Analytics (Customer-Facing Analytics)
Embedded analytics serve customers through SaaS platforms. These systems face
unique challenges, such as handling high request volumes and maintaining strict
access controls to prevent data leaks.
Machine Learning (ML)
ML is a revolutionary technology that organizations can adopt after achieving data maturity.
Data engineers play a significant role in supporting ML pipelines, orchestrating tasks, and
maintaining metadata systems. Feature stores are essential tools that combine data and ML
engineering, making feature management and sharing easier.
Data engineers should have a foundational understanding of ML techniques, model use cases,
and analytics requirements to foster collaboration with other teams. Before investing heavily
in ML, organizations should ensure their data foundation is robust.
Key considerations for ML in data serving include:
 Is the data of sufficient quality to perform reliable feature engineering?
 Is the data discoverable? Can data scientists and ML engineers easily find valuable
data?
 Where are the technical and organizational boundaries between data engineering and
ML engineering?
 Does the dataset properly represent ground truth? Is it unfairly biased?
Reverse ETL
Reverse ETL involves feeding processed data back into source systems, enabling businesses
to apply analytics insights and model outputs in production environments. Historically seen
as an undesirable pattern, it has become increasingly important due to the rise of SaaS
platforms.
For example, marketing analysts once had to upload bid data from a warehouse to Google Ads manually; now, tools like Hightouch and Census automate this process. While reverse ETL is
evolving, it is likely to remain a crucial data practice, helping ensure that transformed data is
appropriately reintegrated into operational systems with correct lineage and business context.
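
As a hedged sketch of the pattern, the code below reads an aggregate from the warehouse and pushes it back to an operational tool. The connection string, table, and REST endpoint are hypothetical; in practice tools like Hightouch or Census, or the vendor's official API, would handle this step.

```python
# Minimal reverse-ETL sketch: read an aggregate from the warehouse and push it back
# to an operational system. Connection string, table, and endpoint are hypothetical.
import requests
import sqlalchemy as sa

engine = sa.create_engine("snowflake://...")  # placeholder warehouse connection

def sync_bids():
    with engine.connect() as conn:
        rows = conn.execute(
            sa.text("SELECT campaign_id, bid FROM analytics.campaign_bids")
        ).fetchall()
    for campaign_id, bid in rows:
        requests.post(
            "https://ads.example.com/api/bids",  # hypothetical endpoint
            json={"campaign_id": campaign_id, "bid": float(bid)},
            timeout=10,
        )
```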

7. Major Undercurrents Across the Data Engineering Lifecycle:


Data engineering is evolving rapidly. Earlier phases focused mainly on the technology layer.
However, advancements in tools and practices have led to greater abstraction and
simplification, shifting the focus beyond technology.
Now, data engineering extends higher up the value chain, incorporating traditional enterprise
practices like data management and cost optimization, as well as newer methods such as
DataOps.
These evolving practices, referred to as undercurrents, play a vital role throughout the data
engineering lifecycle. They include:
 Security: Ensuring the protection of data systems and information.
 Data Management: Handling the organization, storage, and governance of data.
 DataOps: Streamlining operations and collaboration in data processes.
 Data Architecture: Designing the structure and integration of data systems.
 Orchestration: Managing the execution of data workflows.
 Software Engineering: Applying engineering principles to data-related software
systems.

8. Security:
 Security is a critical concern for data engineers and ignoring it can lead to serious
consequences. Therefore, it is the first key undercurrent in the data engineering
lifecycle.
 Data engineers must have a strong understanding of both data security and access
security, adhering to the principle of least privilege—granting users or systems only
the minimum access necessary to perform their tasks.
 A common mistake inexperienced data engineers make is granting admin access to
all users, which is a recipe for disaster. Instead, access should be limited strictly to
what is needed for current job functions.
 Security vulnerabilities often stem from human errors and organizational issues.
 Many high-profile security breaches are the result of neglected security practices,
phishing attacks, or irresponsible actions by internal staff.
 Therefore, building a security-conscious culture is essential. Everyone with access to
data must recognize their role in protecting sensitive company and customer
information.
 Timely access control is another vital aspect—granting data access only to those who
need it and only for as long as necessary.
 Data protection should involve methods such as encryption, tokenization, data
masking, and robust access control mechanisms to prevent unauthorized visibility.
 Data engineers need to be competent security administrators, understanding security
practices for both cloud environments and on-premises systems.
 Key areas to master include user and identity access management, roles, policies,
network security, password protocols, and encryption techniques.
 Throughout the data engineering lifecycle, maintaining a focus on security is essential
to safeguard data and systems effectively.

9. Data Management
 Data management is defined as the development, execution, and supervision of plans,
policies, programs, and practices that ensure the control, protection, and enhancement
of data throughout its lifecycle. The goal is to maximize the value of data and
information assets, ensuring they are properly handled from creation to usage.
Evolving Role of Data Engineers:
 Historically, data management practices were reserved for large enterprises, but now
even smaller companies are adopting these best practices, such as data governance,
master data management, and data quality management.
 As tools become simpler and more accessible, data engineers can focus more on
strategic management of data, becoming integral to the organization’s broader data-
driven goals.
Importance of Data Management:
 Data is treated as an asset, just like financial resources or physical goods. Proper data
management ensures its value is realized, maintained, and protected. Without a
structured data management framework, data engineers risk operating without a clear
vision, reducing the effectiveness of their work.
Key Areas of Data Management:
 Data Governance: Ensures data quality, integrity, security, and usability. It aligns
people, processes, and technology to maximize data value while safeguarding it.
 Data Modeling and Design: Converts raw data into a structured, usable form that can
support business analytics and decision-making.
 Data Lineage: Tracks the flow and transformation of data across systems, providing
transparency and auditing capabilities.
 Data Integration and Interoperability: Ensures smooth communication between
different data systems and platforms.
 Data Lifecycle Management: Manages the data from its creation to its archival or
deletion, ensuring that it remains usable throughout its life.
 Data Systems for Advanced Analytics and ML: Supports advanced analytics and
machine learning by ensuring that data is appropriately structured and stored.
 Ethics and Privacy: Ensures that data is collected, stored, and used in a way that
complies with privacy laws and ethical guidelines.
Data Governance in Practice:
 Data governance ensures that data is not only accurate and complete but also properly
secured and accessible. Poor data governance practices can lead to issues such as
untrusted data or security breaches, which undermine the company’s ability to make
informed decisions.
 The key pillars of data governance include discoverability, security, and
accountability, which are supported by practices such as metadata management and
data quality control.
Metadata Management:
 Metadata (data about data) is essential for making data discoverable and manageable.
It comes in two categories: autogenerated (from systems) and human-generated (from
internal knowledge). Data engineers use metadata to understand the context,
relationships, and lineage of data.
 There are four categories of metadata:
o Business Metadata: Describes how data is used within the business context.
o Technical Metadata: Covers the technical aspects, such as schema, data
lineage, and system dependencies.
o Operational Metadata: Includes process logs, error logs, and job statistics.
o Reference Metadata: Provides lookup data, such as geographic codes or
standard measurement units.
Data Accountability:
 Data accountability assigns responsibility to individuals who manage specific data
assets. These individuals coordinate with other stakeholders to ensure data quality and
adherence to governance practices.
 It’s crucial that the accountable parties are not just data engineers but may also
include product managers, software engineers, and others who touch the data.
Data Quality:
 Data quality refers to the trustworthiness and completeness of data. Data engineers are
responsible for ensuring that data is accurate, complete, and timely. This includes
handling challenges like distinguishing between human and machine-generated data
or dealing with late-arriving data, such as in the case of delayed ad views in video
platforms.
Data Modeling and Design:
 To derive insights from data, it must be transformed into a usable format through data
modeling and design. While traditionally handled by database administrators (DBAs)
and ETL developers, data modeling is increasingly a part of the data engineer’s role.

10. DataOps
 DataOps is an approach that applies the best practices of Agile, DevOps, and
Statistical Process Control (SPC) to manage data, aiming to improve the speed,
quality, and consistency of data products in the same way DevOps enhances software
products.
 Data products, unlike software products, focus on business logic and metrics to
support decision-making or automated actions, which means data engineers need to
understand both technical software development and the business aspects of data.
 A key aspect of DataOps is its cultural foundation, which involves constant
communication and collaboration between data teams and the business, breaking
down silos, learning from both successes and failures, and iterating rapidly.
 Once these cultural practices are in place, the use of technology and tools becomes
more effective. Depending on a company's data maturity, DataOps can be integrated
from scratch (greenfield) or added incrementally to existing workflows.
 The first steps often involve improving monitoring and observability before
automating processes and refining incident response.
The three technical pillars of DataOps are automation, monitoring and observability, and
incident response. These elements guide the data engineering lifecycle:

1. Automation: Ensures consistency and reliability in the DataOps process, enabling
faster deployment of features and updates. DataOps automation mirrors DevOps
practices, including version control, continuous integration, and deployment. As data
systems grow more complex, tools like Airflow or Dagster are used to manage
dependencies and schedules, improving operational reliability. Automation
continuously improves to reduce manual intervention and operational burdens.
2. Observability and Monitoring: Prevents the risks of "bad data" by ensuring systems
and data pipelines are continually monitored. Data teams use tools to track data
quality, system performance, and errors, identifying issues before they escalate. SPC
techniques can help assess when monitored events fall outside normal bounds,
facilitating early intervention (a minimal check of this kind is sketched after this
list). Methods like DODD (Data Observability Driven
Development) ensure visibility across the data chain, from ingestion to
transformation, helping teams proactively address data problems.
3. Incident Response: DataOps involves rapid identification and resolution of issues,
using automation and observability tools to diagnose and fix problems quickly.
Effective incident response depends on open, blameless communication across teams
and the organization. Data engineers must be prepared for failures, finding and fixing
issues before stakeholders notice them, which builds trust with users. The goal is not
only to address problems as they arise but to anticipate and prevent them whenever
possible.
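
As referenced above, here is a minimal sketch of an SPC-style check that flags a monitored metric falling outside three standard deviations of its recent history; the metric values and threshold are illustrative.

```python
# Minimal SPC-style observability check: flag a metric outside k standard deviations
# of its recent history. Metric values and the threshold k are illustrative.
import statistics

def out_of_control(history, latest, k=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(latest - mean) > k * stdev

daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900]
if out_of_control(daily_row_counts, latest=4_300):
    print("Row count is outside control limits - open an incident")
```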

11. Orchestration
 Orchestration is the process of coordinating multiple tasks or jobs to run efficiently
and on schedule.
 It involves managing job dependencies, typically through a directed acyclic graph
(DAG), which determines the order and timing of tasks.
 Orchestration tools like Apache Airflow go beyond simple scheduling tools (like cron)
by tracking job dependencies and managing tasks based on metadata (a minimal DAG
sketch appears after this list).
 Orchestration systems ensure that jobs run on time, monitor for errors, and send alerts
when things go wrong.
 They can handle dependencies, such as making sure a report doesn’t run until all
necessary data has been processed. These systems also offer features like job history,
visualization, and backfilling (filling in missing tasks if needed).
 Orchestration has been important for data processing but was once only accessible to
large companies. Tools like Apache Oozie were used for job management but were
expensive and difficult for smaller companies to use.
 Airflow, developed by Airbnb in 2014, was open-source, written in Python, and
highly adaptable, making it widely popular. Other orchestration tools, like Prefect and
Dagster, aim to improve portability and testability of tasks, while Argo and Metaflow
focus on different aspects like Kubernetes integration and data science.
 However, orchestration is a batch process, meaning it handles tasks in chunks. For
real-time data processing, streaming DAGs are used, but they’re more complex to
build. New platforms, like Pulsar, are working to make streaming orchestration easier.
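
As referenced above, a minimal Airflow DAG sketch showing dependencies defined as code. The task names and bodies are illustrative placeholders, and the scheduling parameter is spelled schedule in recent Airflow releases (schedule_interval in older ones).

```python
# Minimal Apache Airflow sketch: three tasks whose dependencies form a DAG.
# Task names and bodies are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("clean and model the data")

def load():
    print("write results to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # "schedule_interval" in older Airflow versions
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The report-style dependency chain: load only runs after transform, which
    # only runs after extract.
    t_extract >> t_transform >> t_load
```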

12. Data Architecture


 Data architecture represents the current and future setup of data systems that support
an organization's long-term strategy and data needs. Since data requirements often
change quickly and new tools and practices emerge frequently, understanding good
data architecture is essential for data engineers.
 A data engineer must start by understanding the business needs and gathering
requirements for new use cases. Then, they must design new ways to capture and
deliver data while balancing cost and operational simplicity.
 This involves making trade-offs between design patterns, technologies, and tools
across different stages such as source systems, data ingestion, storage, transformation,
and serving.
 However, this doesn’t mean a data engineer is the same as a data architect, as these
are usually distinct roles.
 When working with a data architect, a data engineer should be able to implement the
architect's designs and offer feedback on the architecture.
13. Software Engineering in Data Engineering
Software engineering has always been an essential skill for data engineers. In the early days
(2000-2010), data engineers worked with low-level frameworks and wrote code in languages
like C, C++, and Java. By the mid-2010s, engineers began using more advanced frameworks
that hid these low-level details, making the process easier. Today, tools like SQL-based cloud
data warehouses and user-friendly systems like Spark continue this trend of abstraction.
However, software engineering remains critical in data engineering. Let’s discuss some key
areas where software engineering plays a role in the data engineering lifecycle.
1. Core Data Processing Code
Even though the process has become easier, data engineers still need to write core
processing code, which is used throughout the data engineering lifecycle (ingestion,
transformation, and serving data). Data engineers must be skilled in tools and
languages like Spark, SQL, and Beam. It's important to understand proper code-testing
practices like unit tests, regression tests, integration tests, and end-to-end tests
(a small unit-test sketch appears after this list).
2. Development of Open-Source Frameworks
Many data engineers contribute to open-source frameworks that help solve specific
problems in data engineering. These frameworks are often adopted to improve the
tools and help in areas like data management, optimization, and monitoring. For
example, Airflow was a popular tool for orchestration in the past, but now newer tools
like Prefect, Dagster, and Metaflow have emerged, addressing some of Airflow’s
limitations. Data engineers should explore existing tools before building new internal
solutions to save on time and costs.
3. Streaming Data
Processing streaming data is more complex than batch processing, and tools for this
are still maturing. Real-time processing requires more intricate software engineering,
especially for tasks like joins, which are easier in batch processing. There are various
frameworks for streaming data processing, such as OpenFaaS, AWS Lambda, and
Spark.
4. Infrastructure as Code (IaC)
IaC applies software engineering principles to managing infrastructure. As companies
move to cloud-based solutions, data engineers use IaC frameworks to automate the
deployment and management of infrastructure. This is especially important for
maintaining consistency and repeatability in the deployment process.
5. Pipelines as Code
Pipelines as code is a key idea in modern orchestration systems. Data engineers use
code to define tasks and dependencies. The orchestration engine then runs these tasks
using available resources. This approach is fundamental for handling the different
stages of the data engineering lifecycle.
6. General-Purpose Problem Solving
Despite using high-level tools, data engineers often face unique problems that require
custom solutions. For instance, when using tools like Fivetran or Airbyte, engineers
might encounter data sources that don’t have existing connectors. In these cases, they
need to write custom code to pull, transform, and handle data. Being skilled in
software engineering is important for solving these kinds of problems.
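
As referenced in item 1 above, here is a small unit-test sketch for a piece of core processing code, using pytest-style test functions; normalize_amount is an illustrative helper, not a library function.

```python
# Minimal unit-test sketch for core processing code (run with pytest).
# normalize_amount is an illustrative transformation, not a library function.
def normalize_amount(value: str) -> float:
    """Convert a raw string amount like '$30.00' into a float."""
    return float(value.replace("$", "").replace(",", ""))

def test_normalize_amount_strips_symbols():
    assert normalize_amount("$30.00") == 30.0

def test_normalize_amount_handles_thousands_separator():
    assert normalize_amount("$1,250.50") == 1250.5
```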
