DAM Unit - III
Extract:
Reads data from multiple data sources and extracts the required set of data
Retrieves the necessary data with optimal use of resources
Should not disturb the data sources, their performance, or their functioning
Transform:
Filtration, cleansing, and preparation of the extracted data, using lookup tables
Validation of records, rejection of invalid data, and integration of data
Data is sorted, filtered, cleansed, standardized, translated, or verified for
consistency
Load:
Writes the data output, after extraction and transformation, to a data warehouse
Either physically inserts each record as a new row in a database table, or links
processes for each record from the main source
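To make the three steps concrete, here is a minimal Python sketch of an ETL flow using pandas and SQLite; the file name (sales.csv), table name, and column names are illustrative assumptions, not a prescribed layout.

# Minimal ETL sketch: extract from a CSV, transform, load into SQLite.
# The file name, table name, and column names are illustrative assumptions.
import sqlite3
import pandas as pd

# Extract: read only the required columns from the source file
raw = pd.read_csv("sales.csv", usecols=["order_id", "region", "amount"])

# Transform: filter, cleanse, and standardize the extracted records
clean = raw.dropna(subset=["amount"])                       # drop incomplete rows
clean["region"] = clean["region"].str.strip().str.upper()   # standardize values
clean = clean[clean["amount"] > 0]                          # keep only valid amounts

# Load: write each transformed record as a new row in a warehouse table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="append", index=False)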
Benefits
Brings out meaningful patterns & insights
Converts assorted data into a consistent format
Helps derive business intelligence from data
Contains readily usable components
Effortlessly manages complex transformations
Offers maximized ROI
Future of ETL
Unified data management architecture
Data lakes
ETL and cloud
Machine learning with data integration
Apache NiFi: Emphasizes data flow automation, data enrichment, and
transformation.
Apache Airflow: Primarily a workflow automation tool that can also
handle ETL tasks through its Directed Acyclic Graph (DAG) workflows.
3. Cloud-Based ETL Tools:
With the rise of cloud computing, many ETL tools are available as cloud
services, offering scalability, flexibility, and ease of use.
Examples include:
AWS Glue: A fully managed ETL service offered by Amazon Web
Services.
Azure Data Factory: A cloud-based data integration service provided
by Microsoft Azure.
Google Cloud Dataflow: A fully managed stream and batch data
processing service on Google Cloud.
4. Specialized ETL Tools:
Some tools are designed for specific data integration needs or industries.
Examples include:
Alteryx: Focuses on self-service data analytics and data preparation.
Talend Big Data: A specialized version of Talend for handling big data
integration challenges.
Matillion: Optimized for ETL within cloud data warehouses like
Snowflake, Amazon Redshift, and Google BigQuery.
5. Data Integration Platforms:
These platforms offer more than just ETL; they encompass a broader range of
data integration tasks, including data migration, data synchronization, data
replication, and more.
Examples include:
Dell Boomi: A unified integration platform as a service (iPaaS) for
connecting cloud and on-premises applications and data.
SnapLogic: An iPaaS that provides both ETL and application
integration capabilities.
It's important to choose an ETL tool that aligns with your organization's specific
requirements, technical environment, budget, and scalability needs. Consider factors
such as ease of use, integration capabilities, performance, support, and whether the
tool can handle your current and future data integration challenges.
The process of merging data from various sources into a single view is known as data
integration. It spans mapping, ingestion, cleansing, and transformation into a
destination sink, making the data valuable and actionable for whoever accesses it.
Talend offers strong data integration tools for performing ETL processes. Because data
integration is otherwise a complex and slow process, Talend addresses the problem by
completing integration jobs about 10x faster than manual programming, and at a much
lower cost.
Talend Data Integration is available in two versions: an open-source edition and a
commercial subscription edition.
ETL processes are essential in the realm of data warehousing, business intelligence,
and data analytics. Talend provides a user-friendly graphical interface for creating ETL
workflows, making it easier to manage the movement and transformation of data
between various sources and targets.
1. Extract (E):
Data extraction involves retrieving data from various sources such as databases,
files (CSV, Excel, XML, JSON, etc.), APIs, and other systems.
Talend provides connectors for a wide range of data sources, allowing you to
easily connect to and extract data.
2. Transform (T):
Data transformation involves cleaning, enriching, and reshaping the extracted
data to meet the requirements of the target system or analysis.
Talend offers a wide array of built-in transformation functions and components
to manipulate data. These could include tasks like data cleansing,
data mapping, data aggregation, data enrichment, and more.
You can design complex transformation logic using Talend's graphical interface
by connecting different components in a visual flow.
3. Load (L):
Data loading involves transferring the transformed data to the target systems,
which could be data warehouses, databases, reporting tools, or other storage
mediums.
Talend supports various loading options, including bulk loading, incremental
loading, and real-time loading (a sketch of incremental loading follows below).
You can define the target structure and mapping in Talend to ensure that the
transformed data fits the destination schema.
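As a rough illustration of incremental loading (independent of Talend's own generated code), the following Python sketch loads only the rows changed since the previous run; the table names (orders_source, orders_target) and the updated_at watermark column are assumptions for the example.

# Incremental loading sketch using a "last loaded" watermark.
# Assumes orders_source and orders_target already exist in the database.
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    # Read the high-water mark left by the previous run (falls back on first run)
    cur = conn.execute("SELECT MAX(updated_at) FROM orders_target")
    watermark = cur.fetchone()[0] or "1970-01-01 00:00:00"

    # Extract only the rows that changed since the last load
    delta = pd.read_sql(
        "SELECT * FROM orders_source WHERE updated_at > ?",
        conn, params=[watermark],
    )

    # Load the delta into the target; a full (bulk) load would instead
    # truncate the target table and reload every row
    delta.to_sql("orders_target", conn, if_exists="append", index=False)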
Key features and concepts in Talend Data Integration related to ETL processes:
Job Design: In Talend, you design ETL processes using the graphical interface by
dragging and dropping components onto a canvas and connecting them to create a
data flow.
Components: Talend provides a vast library of pre-built components for data
extraction, transformation, and loading. Examples include tFileInput, tMap,
tFilterRow, tAggregateRow, tMysqlOutput, and many more.
Connectivity: Talend supports a wide range of data sources and targets, including
various databases, cloud services, APIs, flat files, and more.
Data Quality: Talend offers capabilities for data profiling and cleansing to ensure the
quality of data being processed.
Parallel Execution: ETL jobs can be executed in parallel to optimize performance
and throughput.
Error Handling: Talend provides mechanisms for handling errors during the ETL
process, such as logging, notifications, and retries (see the sketch after this list).
Automation and Scheduling: ETL jobs can be scheduled to run at specific times
or triggered by events.
Version Control: Talend allows you to manage your ETL jobs using version control
systems, ensuring better collaboration and code management.
Job Deployment: You can deploy ETL jobs to various environments, such as
development, testing, and production.
Monitoring and Reporting: Talend provides monitoring tools and dashboards to
track job execution and performance.
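The error-handling mechanisms listed above (logging, notifications, retries) follow a common pattern that can be sketched in a few lines of Python; load_batch here is a hypothetical stand-in for any ETL step, not a Talend component.

# Retry-and-log pattern commonly used for ETL error handling.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def run_with_retries(step, retries=3, delay_seconds=5):
    """Run an ETL step, logging failures and retrying a fixed number of times."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                log.error("step failed permanently; route its rows for review")
                raise
            time.sleep(delay_seconds)

def load_batch():
    # placeholder for a real load step (e.g. a bulk insert)
    return "loaded"

run_with_retries(load_batch)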
Remember that ETL processes can become quite complex depending on the scale of
data and the complexity of transformations required. Talend Data Integration aims to
simplify the design and management of these processes through its intuitive interface
and robust set of features.
SSIS packages can incorporate control flow tasks for conditional branching,
looping, and executing tasks based on events.
Error handling components allow users to handle errors, log details, and
reroute data for further processing.
6. Package Configurations and Parameters:
SSIS packages can be parameterized to allow flexibility and reusability across
different environments.
Package configurations enable externalization of settings and configurations.
7. Debugging and Testing:
SSIS provides debugging capabilities, allowing users to identify and address
issues in the package logic.
The SSIS designer includes tools to step through package execution and inspect
data at various points.
8. Logging and Monitoring:
SSIS supports logging execution details and events, helping users monitor
package execution and troubleshoot issues.
9. Deployment and Execution:
SSIS packages can be deployed to SQL Server Integration Services Catalog or
saved as standalone files.
Packages can be executed manually, scheduled using SQL Server Agent Jobs,
or triggered by external events.
10. Scalability and Performance:
SSIS supports parallel execution, enabling high-performance ETL processes
that can handle large volumes of data.
11. Advanced Features:
SSIS offers data profiling, data quality services, change data capture (CDC), and
support for bulk loading in data warehousing scenarios.
SQL Server Integration Services is widely used across industries to address diverse
data integration needs. It empowers organizations to efficiently manage their data
movement and transformation processes, enabling better decision-making and
insights from data. As businesses continue to rely on data-driven strategies, SSIS
remains a valuable tool for data professionals.
Data Quality
Introduction:
In today's business world, data quality is essential. Businesses rely on data
to carry out core processes, from day-to-day marketing and advertising to key
business strategies.
Data quality refers to the overall utility of a dataset and its ability to be easily processed
and analyzed for other uses.
It is an integral part of data governance that ensures that your organization’s data
is fit for purpose.
Data quality dimensions include completeness, conformity, consistency, accuracy and
integrity. Managing these helps your data governance, analytics and Artificial
Intelligence (AI) / Machine Learning (ML) initiatives deliver reliable and trustworthy
results.
Over the last decade, developments within hybrid cloud, artificial intelligence, the
Internet of Things (IoT), and edge computing have led to the exponential growth of
big data.
As a result, the practice of master data management (MDM) has become more
complex, requiring more data stewards and rigorous safeguards to ensure good data
quality.
Businesses rely on data quality management to support their data analytics initiatives,
such as business intelligence dashboards. Without this, there can be devastating
consequences, even ethical ones, depending on the industry (e.g. healthcare).
4. Regulatory Compliance: Many industries are subject to regulations that require
accurate and complete data reporting. Maintaining data quality ensures compliance
with legal and regulatory requirements.
5. Risk Management: Reliable data is essential for identifying and managing risks
effectively. Inaccurate data can lead to poor risk assessment and inadequate
mitigation strategies.
6. Strategic Planning: Organizations rely on data for strategic planning and
forecasting. Accurate data supports the development of realistic and achievable
business goals.
7. Business Intelligence and Analytics: Quality data serves as the foundation for
meaningful analytics and insights. Clean data enhances the accuracy and reliability
of analytical models and predictions.
8. Reputation and Trust: Consistently delivering accurate and reliable information
to stakeholders, including customers, partners, and investors, builds trust and
enhances the organization's reputation.
9. Cost Savings: Poor data quality can lead to costly mistakes, such as shipping
errors, product recalls, or misallocated resources. Maintaining data quality helps
prevent these types of costly errors.
10. Data Integration: High-quality data is easier to integrate across different systems
and platforms, ensuring seamless data flows and reducing integration challenges.
11. Data Collaboration: Organizations often need to share data with partners,
suppliers, and other stakeholders. Quality data ensures that the information shared
is accurate and reliable.
12. Data-driven Innovation: Organizations looking to innovate and create new
products or services often rely on data. High-quality data supports innovative
efforts by providing a solid foundation for exploration and experimentation.
13. Employee Productivity: Reliable data reduces the time employees spend
correcting errors or searching for accurate information, allowing them to focus on
more valuable tasks.
To achieve and maintain data quality, organizations should implement robust data
governance practices, invest in data validation and cleansing tools, establish clear data
quality standards, and regularly monitor and audit their data sources. Ultimately,
prioritizing data quality contributes to the overall success and competitiveness of an
organization in today's data-driven business landscape.
DQM is a systematic approach that involves identifying and correcting errors and
inconsistencies in the data, as well as implementing policies and procedures to prevent
future errors.
Data quality is one of the aspects of data governance that aims at managing data in a
way to gain the greatest value from it. A senior executive who is in charge of the data
usage and governance on a company level is a chief data officer (CDO). The CDO
is the one who must gather a data quality team.
The number of roles in a data quality team depends on the company size and,
consequently, on the amount of data it manages. Generally, specialists with both
technical and business backgrounds work together in a data quality team. Possible
roles include:
Data owner – controls and manages the quality of a given dataset or several
datasets, specifying data quality requirements. Data owners are generally senior
executives representing the team’s business side.
Data consumer – a regular data user who defines data standards and reports errors
to the team members.
Data producer – captures data ensuring that data complies with data consumers’
quality requirements.
Data steward – usually is in charge of data content, context, and associated business
rules. The specialist ensures employees follow documented standards and guidelines
for data and metadata generation, access, and use. A data steward can also advise on
how to improve existing data governance practices and may share responsibilities with
a data custodian.
Data analyst – explores, assesses, summarizes data, and reports on the results to
stakeholders.
Since a data analyst is one of the key roles within the data quality teams, let’s break
down this person’s profile.
Monitoring and reviewing the quality (accuracy, integrity) of data that users
enter into company systems and of data that is extracted, transformed, and loaded
into a data warehouse
Identifying the root cause of data issues and solving them
Measuring and reporting to management on data quality assessment results
and ongoing data quality improvement
Establishing and overseeing service level agreements, communication
protocols with data suppliers, and data quality assurance policies and
procedures
Documenting the ROI of data quality activities.
Q) What is Data Quality? Explain Data Quality Dimensions to
evaluate data quality.
A) Data quality is a crucial aspect of any data-driven organization, as it affects the
reliability, accuracy, and usability of the data.
Poor data quality can lead to inaccurate insights, wasted resources, and missed
opportunities.
Data quality dimensions are measurement attributes of data, which you
can individually assess, interpret, and improve
There are six primary, or core, dimensions to data quality. These are the metrics
analysts use to determine the data’s viability and its usefulness to the people who need
it.
Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects
and events. Analysts should use verifiable sources to confirm accuracy, which is
determined by how closely the values agree with the verified, correct information sources.
Completeness
Completeness measures whether the dataset successfully delivers all of the mandatory
values that are expected.
Consistency
Data consistency describes the data’s uniformity as it moves across applications and
networks and when it comes from multiple sources. Consistency also means that the
same datasets stored in different locations should be the same and not conflict. Note
that consistent data can still be wrong.
Timeliness
Timely data is information that is readily available whenever it’s needed. This
dimension also covers keeping the data current; data should undergo real-time
updates to ensure that it is always available and accessible.
Uniqueness
Uniqueness means that there is no duplicate or redundant information overlapping
across the datasets: no record in the dataset exists multiple times. Analysts use data
cleansing and deduplication to help address a low uniqueness score.
Validity
Data must be collected according to the organization’s defined business rules and
parameters. The information should also conform to the correct, accepted formats,
and all dataset values should fall within the proper range.
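These dimensions can be measured directly on a dataset. The following Python sketch scores a small, made-up customer table on completeness, uniqueness, and validity; the records and the e-mail format rule are illustrative assumptions.

# Scoring a dataset on completeness, uniqueness, and validity (illustrative data).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

# Completeness: share of mandatory values that are actually present
completeness = customers["email"].notna().mean()

# Uniqueness: share of records that are not duplicated on the key
uniqueness = 1 - customers.duplicated(subset="customer_id").mean()

# Validity: share of values conforming to the accepted format
valid_email = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
validity = valid_email.mean()

print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f} validity={validity:.2f}")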
5. DataRobot Paxata:
DataRobot Paxata specializes in data preparation and data quality for
analytics and machine learning.
It includes features like data profiling, data cleaning, enrichment, and
collaboration.
6. Melissa Data Quality Suite:
Melissa offers a suite of data quality tools that provide address validation,
deduplication, email verification, and data enrichment.
7. Experian Data Quality:
Experian offers a range of data quality solutions for address validation,
deduplication, data enrichment, and more.
8. OpenRefine (formerly Google Refine):
OpenRefine is an open-source tool for data cleaning and transformation. It's
particularly useful for data preparation tasks.
9. Ataccama ONE:
Ataccama ONE offers data quality, master data management, and data
governance capabilities in a single platform.
10. Syncsort Trillium:
Syncsort Trillium provides data profiling, data quality, and data enrichment
features for better data management.
11. SAS Data Quality:
SAS offers data quality tools that include data profiling, data cleansing,
deduplication, and integration with analytics.
These tools help organizations identify and fix data quality issues, leading to
improved decision-making, compliance, and customer satisfaction. When selecting a
data quality tool, consider factors such as your organization's specific needs,
integration with existing systems, ease of use, scalability, and the depth of data
cleansing and enrichment capabilities.
Q) How Do You Improve Data Quality?
A) People looking for ideas on how to improve data quality turn to data quality
management for answers. Data quality management aims to leverage a balanced set of
solutions to prevent future data quality issues and clean (and ideally eventually
remove) data that fails to meet data quality KPIs (Key Performance Indicators).
These actions help businesses meet their current and future objectives.
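Building on such measurements, data quality KPIs can be expressed as thresholds that each metric must meet; the following sketch flags the metrics that fail, with the metric values and threshold values chosen purely for illustration.

# Checking measured quality scores against data quality KPIs (illustrative values).
kpis = {"completeness": 0.98, "uniqueness": 0.99, "validity": 0.95}
measured = {"completeness": 0.92, "uniqueness": 1.00, "validity": 0.97}

failures = {name: score for name, score in measured.items() if score < kpis[name]}

if failures:
    # records failing a KPI would be routed to cleansing or, eventually, removal
    print("KPIs not met:", failures)
else:
    print("all data quality KPIs met")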
Master data management (MDM) is the practice of maintaining and governing an
organization's core master data to ensure its consistency, accuracy, and reliability. By
having a single, authoritative source for master data, organizations can reduce data
inconsistencies, improve decision-making, streamline processes, and enhance overall
operational efficiency.
Q) What is Master Data Management?
A) Master Data Management (MDM) is a comprehensive approach to managing and
maintaining an organization's critical business data, known as "master data."
Master data refers to the core data entities that are common across different
business units, applications, and processes. These entities include customers,
products, suppliers, locations, and more. MDM aims to ensure that master data is
consistent, accurate, and authoritative, regardless of where it is used within the
organization.
Put another way, Master Data Management refers to a set of processes, tools, and
technologies used to create and manage a single, consistent, accurate, and
authoritative source of essential business data within an organization. This data could
include information about customers, products, suppliers, employees, etc. MDM aims
to eliminate data inconsistencies, duplication, and discrepancies that can arise from
different systems or departments using different versions of the same data.
1. Data Accuracy and Consistency: MDM ensures that master data is accurate,
consistent, and reliable across all systems and applications. This consistency is crucial
for making informed business decisions, improving operational efficiency, and
reducing errors.
2. Data Governance: MDM establishes data governance policies and procedures to
define how data should be created, updated, validated, and archived. This helps
maintain data quality, enforce data standards, and ensure compliance with
regulations.
3. Single Source of Truth: MDM provides a single, authoritative source of master data
that all departments and applications can rely on. This reduces the risk of using
conflicting or outdated information and promotes a unified view of critical business
entities.
4. Cross-Departmental Collaboration: MDM facilitates collaboration among
different business units by enabling them to share consistent data. This is particularly
important in large organizations where multiple teams and systems need access to the
same accurate data.
5. Data Integration: MDM helps integrate master data across disparate systems and
applications, which is especially valuable in scenarios where mergers, acquisitions, or
system upgrades have led to a heterogeneous IT landscape.
6. Improved Decision-Making: Accurate and consistent master data supports better
decision-making by providing a reliable foundation for analytics, reporting, and
strategic planning. When everyone is working with the same accurate data, decisions
are more informed and reliable.
7. Customer Experience: MDM contributes to a better customer experience by
ensuring that customer data is consistent and up-to-date across all touchpoints, such
as sales, marketing, and customer support.
8. Regulatory Compliance: MDM helps organizations meet regulatory requirements
related to data accuracy, privacy, and reporting. It ensures that data is handled in
accordance with relevant industry standards and regulations.
9. Efficiency and Cost Savings: By reducing data duplication, errors, and manual
data reconciliation efforts, MDM improves operational efficiency and lowers costs
associated with data maintenance.
10. Data Security: MDM establishes controls and permissions to manage access to
sensitive master data. This helps protect confidential information and maintain data
security.
11. Support for Digital Transformation: MDM is essential for organizations
undergoing digital transformation efforts. It provides a solid foundation for
implementing new technologies, processes, and business models by ensuring that the
underlying data is accurate and consistent.
In summary, Master Data Management is a strategic approach that plays a central role
in maintaining high-quality, consistent, and accurate master data across an
organization. This, in turn, supports better decision-making, improved customer
experiences, regulatory compliance, and overall operational efficiency.
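One core MDM operation, creating a single authoritative ("golden") record from duplicate entries, can be sketched in Python as follows; matching on a normalized e-mail address and keeping the most recently updated record is an illustrative survivorship rule, not a fixed standard.

# Consolidating duplicate customer records into a single "golden" record.
# The sample records and the matching/survivorship rules are illustrative.
import pandas as pd

crm = pd.DataFrame({
    "email": ["jane@example.com", "JANE@example.com "],
    "name": ["Jane Doe", "J. Doe"],
    "updated": ["2023-01-10", "2023-06-01"],
})

# Matching: normalize the e-mail address to form a match key
crm["match_key"] = crm["email"].str.strip().str.lower()

# Survivorship: within each match group, keep the most recently updated record
golden = (
    crm.sort_values("updated")
       .groupby("match_key", as_index=False)
       .last()
)
print(golden[["match_key", "name", "updated"]])

Some widely used MDM tools include the following: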
1. Informatica MDM:
A comprehensive MDM solution that offers data governance, data quality, and
data integration capabilities.
Supports multiple domains, including customer, product, and reference data.
2. SAP Master Data Governance:
Part of the SAP ecosystem, this tool focuses on data governance, data quality,
and data consolidation across business units and systems.
3. IBM InfoSphere MDM:
Offers a versatile MDM platform with capabilities for data integration, data
governance, and data quality.
Supports multidomain MDM scenarios.
4. Talend MDM:
Provides a unified platform for managing, consolidating, and governing master
data across domains.
Integrates with Talend's ETL and data integration tools.
5. Semarchy xDM:
Offers a flexible and agile MDM solution that focuses on data stewardship,
governance, and data quality.
6. Reltio Connected Data Platform:
Combines MDM, data quality, and data governance features to create a holistic
view of master data.
7. Informatica MDM Cloud:
A cloud-based MDM solution that provides data governance, data quality, and
data integration capabilities.
8. Stibo Systems MDM:
Provides a multidomain MDM platform with features for data governance, data
quality, and data modeling.
9. Profisee:
Offers a scalable MDM solution with features for data stewardship, data quality,
and data consolidation.
10. Magnitude MDM:
A comprehensive MDM solution that includes data governance, data quality,
and data integration features.
11. TIBCO EBX:
Provides a multidomain MDM platform with features for data governance, data
quality, and data stewardship.
12. SAS Master Data Governance:
Part of the SAS Data Management suite, it offers MDM capabilities with a focus
on data quality and governance.
13. Kalido MDM (by Magnitude):
Offers MDM solutions with a focus on creating a centralized view of master data
for better decision-making.
When choosing an MDM tool, consider factors such as the tool's flexibility, scalability,
support for multiple domains, data governance capabilities, integration with existing
systems, ease of use, and alignment with your organization's MDM strategy and goals.
Keep in mind that MDM projects often involve significant planning, data modeling,
and collaboration among various stakeholders.