
UNIT-3: ETL (EXTRACT, TRANSFORM AND LOAD)

Data Quality (DQ)


• Meaning & Definition
• Properties or Characteristics or Dimensions of Data Quality
• Role/Importance/Purpose/Need of Data Quality
• Benefits of Data Quality
Data Quality Management (DQM)
• How Do You Improve Data Quality (Data Quality Management - DQM)
• How to Measure Data Quality
• Data Quality Best Practices
• Roles and Responsibilities of Data Quality Team
• Data Quality Tools
Master Data Management (MDM)
• Importance of master data management
• Benefits of Master Data Management
• Challenges of master data management
• Key roles and participants in an MDM initiative (Optional)
• MDM Tools/Solutions/Software
EXTRACT, TRANSFORM AND LOAD (ETL)
• Definition ETL
• Advantages And Disadvantages
• ETL Tools
• Extraction, Load and Transform (ELT)
• Difference between ELT and ETL

Meaning & Definition

The notion of quality is inherently customer-centric. A collection of data is of high quality, in
the customer's eyes, if and only if it meets his, her, or its needs. This is perhaps the simplest way of thinking
about quality, and it is certainly the most powerful; but it has profound implications. It means that data
quality is inherently subjective and is a multidimensional concept that can be defined from several
different perspectives. Data quality is more than simply data accuracy: data quality refers to the state of
qualitative or quantitative pieces of information.
There are many definitions of data quality, but data is generally considered high quality if it is
"fit for its intended uses in operations, decision making and planning". Moreover, data is deemed of high
quality if it correctly represents the real-world construct to which it refers.

Definitions of data quality are as below:


"Data that are fit for use by data consumers is called quality data."

"Data are of high quality if they are fit for their intended uses in operations, decision making, and
planning. Data are fit for use if they are free of defects and possess desired features."
- Wang and Strong

""Conformance to specifications" and "meeting or exceeding consumer expectations" is called
data quality." - Redman

"Data has quality if it satisfies the requirements of its intended use."
- Khan, Strong & Wang

"Data Quality refers to the degree of fulfilment of all those requirements defined for data, which is
needed for a specific purpose." - Olson

Guilherme Morbey says that "In the world of information data resembles the water, which we need
in our everyday lives."

From a more technical perspective, data is of high quality when it is "free of defects" and "conforms
to specifications".

Properties or Characteristics or Dimensions of Data Quality


• Accuracy: The data must conform to actual, real-world scenarios and reflect real-world
objects and events. Analysts should use verifiable sources to confirm the measure of
accuracy, which is determined by how close the values are to the verified, correct information
sources.
• Completeness: Completeness measures whether the data delivers all the mandatory
values successfully; it makes sure that no required data is missing. Ex: an order record in which the
estimated delivery date is missing is incomplete.
• Consistency: Data consistency describes the data’s uniformity as it moves across
applications and networks and when it comes from multiple sources. Consistency also
means that the same datasets stored in different locations should be the same and not
conflict. Note that consistent data can still be wrong.
• Timeliness: Timely data is information that is readily available whenever it’s needed. This
dimension also covers keeping the data current; data should undergo real-time updates to
ensure that it is always available and accessible.
• Uniqueness: Uniqueness means that no duplicated or redundant information overlaps
across the datasets; no record in the dataset exists multiple times. Analysts
use data cleansing and deduplication to help address a low uniqueness score.
• Validity: Data must be collected according to the organization's defined business rules
and parameters. The information should also conform to the correct, accepted formats,
and all dataset values should fall within the proper range; validity makes sure that the
data matches these rules. A validity issue arises when information fails to follow specific company
formats, rules, or processes. For example, many systems may ask for a customer's birth
date. However, if the customer does not enter their birth date using the proper format, the
level of data quality becomes automatically compromised. Therefore, many organizations
today design their systems to reject birth date information unless it is input using the pre-
assigned format.
• Integrity: Integrity is the level at which the information is reliable and trustworthy. Is the data
true and factual? Ex: email addresses that actually belong to the customers they are linked to.
• Reasonability: Is the data reasonable and useful for its intended purpose?
• Accessibility: Data and metadata are presented in an understandable way, and statistics are
up-to-date and easily available
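Several of these dimensions can be checked programmatically. The sketch below is a minimal, illustrative Python example (using pandas, with hypothetical column names such as customer_id and birth_date) of how completeness, uniqueness, and validity might be scored for a small customer dataset; it is not a full data quality framework.

```python
import pandas as pd

# Hypothetical customer records; column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email":       ["a@x.com", None, "b@x.com", "c@x"],
    "birth_date":  ["1990-05-01", "1985-13-40", "1972-11-09", None],
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Uniqueness: share of rows whose key value is not a duplicate.
uniqueness = 1 - df["customer_id"].duplicated().mean()

# Validity: birth_date must parse against the pre-assigned YYYY-MM-DD format
# (missing values also count as invalid in this simple check).
valid_dates = pd.to_datetime(df["birth_date"], format="%Y-%m-%d", errors="coerce")
validity = valid_dates.notna().mean()

print("Completeness per column:\n", completeness)
print("Uniqueness of customer_id:", uniqueness)
print("Validity of birth_date:", validity)
```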

Role/Importance/Purpose/Need of Data Quality


1. Decision-making: Data quality is important because people in any given domain
need accurate, dependable and timely information. Having quality data can help
businesses improve decision-making abilities, reduce risk and increase efficiency,
which leads to the next important points.
2. Productivity: When teams or professionals use quality data, productivity can be
improved dramatically. Having all the required information in one or more quality
data sets means that teams don't have to waste time cobbling together information
to get the full picture or plugging gaps in the data.
3. Marketing: It is often said that "data is the new oil". For example, having quality
data sets for your customer profiles lets you better segment and target your
customers based on their online habits and behaviours, thereby improving the
likelihood of the uptake of your products or services. Decisions powered by quality
and more granular data can lead to higher revenue numbers.
4. Product Development: If you are in the software development industry or
launching a new product, you likely want to be able to track and audit KPIs, such as engagement
with your product. Auditing data points such as button clicks and page views can
help surface actionable insights. This can help inform your product development team
about the next set of features to develop, which features to remove, and even help
determine product viability.
5. Competitive Advantage: A reputation for world-class quality is profitable and a
business maker. High-quality data can be a major business asset and a unique source of
advantage.
6. Reduced Customer and Employee Satisfaction: Poor-quality data can
reduce customer satisfaction. It can also lower employee job satisfaction,
leading to turnover and the resulting loss of key process knowledge.
7. Mistrust: Poor-quality data can also breed organisational mistrust and make it hard
to mount efforts that lead to needed improvement.
8. Distorted Financial Data: Poor-quality data can distort key corporate financial data;
in extreme cases, this can make it impossible to determine the financial condition of a
business.
9. Important at All Levels of Government: Data are also important to all levels of
government. Certainly, the military needs high-quality data for all of its operations,
especially its counter-terrorism efforts. At the local level, high-quality data are needed so
that individuals' residences are assessed accurately for real estate tax purposes.

Benefits of Data Quality: Good data quality enables:

• Better Decision-making: Good quality data not only leads to more accurate and realistic
decision making but also boosts your confidence as you make the decisions. It takes away
the need to guesstimate and saves you the unnecessary costs of trials and errors.
• Better Audience Targeting: As part of the value chain proposition, it is critical that you know
who your prospects are, something that you can only manage by analysing and
understanding data. Data quality also leads to improved audience targeting. Without high-
quality data, marketers are forced to appeal to a broad audience, which is not
efficient. Worse, they may have to guess at who their target audience should be.
When you have high-quality data, you can more accurately determine who
your target audience should be. You can do so by collecting data about your current
audience and then finding potential new customers with similar attributes. You can use this
knowledge to target advertising campaigns and develop products or content that appeal to
the right people more accurately.
• More Effective Content and Marketing Campaigns: In addition to improving targeting,
data quality can also help to improve your content and marketing campaigns themselves.
The more you know about your audience, the more reliably you can create content or ads
that appeal to them. Having good quality data helps you prepare proper marketing content
and avoid spamming and message duplication, which can be annoying.

• Improved Relationships with Customers: You cannot succeed in any industry if you
have poor customer relations; people want to do business with brands they can trust.
Creating that bond with your customers starts with understanding what they want. High-quality data
can help you improve your relationships with customers, which is crucial for success in
any industry. Gathering data about your customers helps you get to know them better. You
can use information about their preferences, interests and needs to provide them with
content that appeals to them and even anticipates their needs.
• Easier Implementation of Data: High-quality data is also much easier to use than poor-
quality data. Having quality data at your fingertips increases your company's efficiency as
well. If your information is not complete or consistent, you have to spend significant amounts of
time fixing that data to make it usable. This takes time away from other activities, and it
means it takes longer for you to implement the insights your data uncovered. Quality data
also helps to keep your company's various departments on the same page so that they can
work together more effectively.
• Competitive Advantage: Being in possession of good-quality data gives you a clear picture of
your industry and its dynamics. Your marketing messages will be more specific and your projections
of market changes will be more accurate. It will also be easier for you to anticipate the
needs of your customers, which will help you beat your rivals to sales.
• Less Time Spent Reconciling Data: Good data quality is a time saver.
Since the data is reliable and accurate, it largely eliminates the process of reconciling
data. Of course, the data still has to be checked, but it can shorten the time spent
validating data against the source.
• Increased Profitability: Ultimately, high-quality data can lead to increased profitability. It
helps you to craft more effective marketing campaigns and increase sales numbers. It also
decreases ad waste, making your marketing campaigns more cost-effective. Similarly, if
you are a publisher, data can show which types of content are the most popular on your
site and which bring in the most revenue. Having this information enables you to focus more
of your time and resources on these kinds of content.

How Do You Improve Data Quality ( Data Quality Management -DQM)


Data analysts who strive to improve data quality need to follow best practices to meet their
objectives. Here are ten critical best practices to follow:
1. Involvement of Top-level Management: Make sure top-level management is involved.
Data analysts can resolve many data quality issues through departmental participation.
2. Including Data Quality Activity Management: Include data quality management as
part of your data governance framework. The framework sets data policies and data
standards, defines the required roles, and offers a business glossary.
3. Root Cause Analysis: Each data quality issue raised should begin with a root cause analysis. If you do
not address the root cause of a data issue, the problem will inevitably appear again. Do not
just address the symptoms of the disease; you need to cure the disease itself.
4. Maintaining a Data Quality Issues Log: Maintain a data quality issues log. Each issue needs an entry,
complete with information regarding the assigned data owner, the involved data steward, the
issue's impact, the final resolution and the timing of necessary proceedings.
5. Roles: Fill data owner and data steward roles from your company's business side, and fill
data custodian roles from either business or IT, whichever is possible and makes the most
sense.
6. Create Awareness: Use examples of data quality disasters to raise awareness about the
importance of data quality. However, while anecdotes are great for illustrative purposes,
you should rely on fact-based impact and risk analysis to justify your solutions and their
required funding.
7. Organisation's Business Glossary: Your organisation's business glossary must serve as
the foundation for metadata management.
8. Avoid Typing of Data: Avoid typing in data where possible. Instead, explore cost-
effective solutions for data onboarding that employ third-party data sources that provide
publicly available data. This data includes items such as names, locations in general,
company addresses and IDs, and in some cases, individual people. When dealing with
product data, use second-party data from trading partners whenever you can.
9. Relevant Process: When resolving data issues, make every effort to implement relevant
processes and technology that stop the problems from arising as close as possible to the
data onboarding point, instead of depending on downstream data cleansing.

10. Establishing Data Quality KPIs: Establish data quality KPIs that work in tandem with the
general KPIs for business performance. Data quality KPIs, sometimes called Data
Quality Indicators (DQIs), can often be associated with data quality dimensions like
uniqueness, completeness, and consistency.
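To illustrate how such KPIs can be monitored, here is a minimal Python sketch that compares measured data quality metrics against target thresholds and reports which KPIs fail. The metric names and threshold values are assumptions for illustration; in practice the measured values would come from profiling jobs like the dimension checks shown earlier.

```python
# Hypothetical measured data quality metrics (fractions between 0 and 1)
# and their target KPIs; names and thresholds are assumptions.
measured = {"completeness": 0.97, "uniqueness": 0.99, "consistency": 0.92}
kpi_targets = {"completeness": 0.98, "uniqueness": 0.99, "consistency": 0.95}

def evaluate_kpis(measured, targets):
    """Return a pass/fail report for each data quality KPI."""
    report = {}
    for name, target in targets.items():
        value = measured.get(name, 0.0)
        report[name] = {"value": value, "target": target, "passed": value >= target}
    return report

for name, result in evaluate_kpis(measured, kpi_targets).items():
    status = "OK" if result["passed"] else "FAIL"
    print(f"{name}: {result['value']:.2%} (target {result['target']:.2%}) -> {status}")
```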

Data Quality Management / Improving or Achieving Data Quality:


Data Quality Management (DQM) is a set of practices that aim at maintaining a high quality of
information. DQM goes all the way from the acquisition of data and the implementation of the
advanced data processes, to an effective distribution of data.
DQM leverages a balanced set of solutions to prevent future data quality issues and to clean
(and eventually remove) data that fails to meet data quality Key Performance Indicators (KPIs).
There are eight mandatory disciplines used to prevent data quality problems and to improve data
quality by cleansing the data:
1. Data Governance: Data governance spells out the data policies and standards that
determine the required data quality KPIs and which data elements should be focused on.
These standards also include what business rules must be followed to ensure data quality.
2. Data Profiling: Data profiling is a methodology employed to understand all data assets
that are part of data quality management. Data profiling is crucial because many of the
assets in question have been populated by many different people over the years, adhering
to different standards.
3. Data Matching: Data matching technology is based on match codes used to determine if
two or more bits of data describe the same real-world thing. For instance, say there’s a
man named Michael Jones. A customer dataset may have separate entries for Mike Jones,
Mickey Jones, Jonesy, Big Mike Jones, and Michael Jones, but they’re all describing one
individual (a simple matching sketch is shown after this list).
4. Data Quality Reporting: Information gathered from data profiling, and data matching
can be used to measure data quality KPIs. Reporting also involves operating a quality
issue log, which documents known data issues and any follow-up data cleansing and
prevention efforts.
5. Master Data Management (MDM): Master Data Management frameworks are great
resources for preventing data quality issues. MDM frameworks deal with product master
data, location master data, and party master data.
6. Customer Data Integration (CDI): CDI involves compiling customer master data
gathered via CRM applications, self-service registration sites and other channels. This information must be
compiled into one single source of truth.
7. Product Information Management (PIM): Manufacturers and sellers of goods need to
align their data quality KPIs with each other so that when customers order a product, it
will be the same item at all stages of the supply chain. Thus, much of PIM involves
creating a standardized way to receive and present product data.
8. Digital Asset Management (DAM): Digital assets cover items like videos, text
documents, images, and similar files used alongside product data. This discipline
involves ensuring that all tags are relevant and that the digital assets themselves are of good quality.
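To make the data matching idea concrete, here is a minimal, illustrative Python sketch. It builds a crude match code (normalized first name plus last name plus postcode) so that records likely describing the same person fall into the same group; real MDM tools use far richer fuzzy matching and survivorship rules, and the field names and nickname map here are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical customer records from different systems (field names are assumptions).
records = [
    {"id": 1, "name": "Michael Jones", "postcode": "30301"},
    {"id": 2, "name": "Mike Jones",    "postcode": "30301"},
    {"id": 3, "name": "M. Jones",      "postcode": "30301"},
    {"id": 4, "name": "Sarah Smith",   "postcode": "10001"},
]

NICKNAMES = {"mike": "michael", "m": "michael"}  # tiny illustrative nickname map

def match_code(rec):
    """Build a crude match code: normalized first/last name plus postcode."""
    parts = re.sub(r"[^a-z ]", "", rec["name"].lower()).split()
    first, last = parts[0], parts[-1]
    first = NICKNAMES.get(first, first)
    return f"{first}:{last}:{rec['postcode']}"

groups = defaultdict(list)
for rec in records:
    groups[match_code(rec)].append(rec["id"])

for code, ids in groups.items():
    print(code, "->", ids)   # ids sharing a code are candidate duplicates
```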

Roles and Responsibilities of Data Quality Team:


Data quality is one of the aspects of data governance, which aims at managing data in a
way that gains the greatest value from it. The senior executive who is in charge of data usage and
governance at a company level is the Chief Data Officer (CDO), who is responsible for gathering the data
quality team.
1. Data Owner: The data owner controls and manages the quality of a given set of data sets,
specifying data quality requirements. Owners are generally senior executives representing
the team's business side.
2. Data Consumer: The data consumer is a regular data user who defines data standards and reports
errors to the team members.
3. Data Producer: The data producer captures data, ensuring that the data complies with the data
consumers' quality requirements.
4. Data Steward: The data steward usually is in charge of data content, context and
associated business rules. This specialist ensures employees follow documented
standards and guidelines for data and metadata generation, access, and use. The data
steward can also advise on how to improve existing data governance practices and may
share responsibilities with a data custodian.
5. Data Custodian: The data custodian manages the technical environment of data
maintenance and storage. The data custodian ensures the quality, integrity, and safety of data
during ETL (extract, transform, and load) activities.
6. Data Analyst: The data analyst explores, assesses, and summarises data, and reports on the
results to stakeholders. The data analyst is one of the key roles within the data quality
team, and the data quality analyst is a multitasker whose duties may vary.
They may perform the data consumer's duties, such as data standard definition and
documentation, or maintain the quality of data before it is loaded into a data warehouse, which
is usually the data custodian's work. The data quality analyst's responsibilities may include:
(i) Monitoring and reviewing the quality (accuracy, integrity) of data that users
enter into company systems, and of data that are extracted, transformed and loaded into a
data warehouse.
(ii) Identifying the root cause of data issues and solving them.
(iii) Measuring and reporting to management on data quality assessment results and
ongoing data quality improvement.
(iv) Establishing and overseeing service level agreements, communication
protocols with data suppliers, and data quality assurance policies and procedures.
(v) Documenting the ROI of data quality activities.
It is up to the organisation how to assign duties across the data quality team. However, any
team should include a person who manages the whole process, one who performs quality
checks, defines data quality rules and develops data models, and a technical specialist who maintains the
flow of data storage across the organisation.

Master Data Management (MDM)


Most organizations today operate a number of different systems, such as CRMs and ERPs, that all contain
important data on customers, the business or other crucial business KPIs.
This leads to data silos, duplicated data, incomplete data and, therefore, a disjointed view of
the business. Since data is in many different places and in many different "languages",
answering simple business questions such as "What services did our customers use the most
last quarter?" or "Who is our most profitable customer?" becomes difficult.
For Master Data Management to work, it must be a group effort and an ongoing
endeavour. Usually, larger organizations will elect a group of people to
establish and enforce best practices for data quality.

Gartner defines Master Data Management (MDM) as a technology-enabled discipline in
which business and IT work together to ensure the uniformity, accuracy, stewardship,
semantic consistency, and accountability of the enterprise’s official shared master data assets.
Master data is the consistent and uniform set of identifiers and extended attributes that
describes the core entities of the enterprise, including customers, prospects, citizens, suppliers,
sites, hierarchies, and the chart of accounts (Gartner Magic Quadrant for Master Data
Management Solutions).
What is Master Data?
Master data is often called a golden record of information in a data domain, which corresponds
to the entity that's the subject of the data being mastered. Data domains vary from industry to
industry. For example, common ones for manufacturers include customers, products, suppliers
and materials. Banks might focus on customers, accounts and products, the latter meaning
financial ones. Thus, it essentially functions as a master file of dates, names, addresses, customer
IDs, item numbers, product specifications and other attributes that are used in transaction
processing systems and analytics applications. As a result, well-managed master data is also
frequently described as a single source of truth -- or, alternatively, a single version of the truth
-- about an organization's data, as well as data from external sources that's ingested into corporate
systems to augment internal data sets.
MDM not only gives companies the opportunity to better manage their key data assets and
thereby improve the overall value and utility the data provides internally; it also involves an entire set of
processes, services, and policies that go along with it. Most MDM experts agree that the three
main reasons used to justify an MDM implementation are cost reduction, risk management, and
revenue growth.
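As an illustration of the "golden record" idea, the following Python sketch merges duplicate customer entries into a single master record using a simple survivorship rule (prefer the most recently updated non-missing value). The field names and the rule are assumptions for illustration; production MDM platforms apply much richer, configurable survivorship rules.

```python
from datetime import date

# Hypothetical duplicate customer records from different source systems.
sources = [
    {"name": "Michael Jones", "email": None,             "phone": "555-0101",
     "updated": date(2022, 3, 1)},
    {"name": "Mike Jones",    "email": "mj@example.com", "phone": None,
     "updated": date(2023, 6, 15)},
]

def build_golden_record(records):
    """Survivorship rule: for each attribute, keep the newest non-missing value."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field != "updated" and value is not None:
                golden[field] = value   # newer records overwrite older ones
    return golden

print(build_golden_record(sources))
# {'name': 'Mike Jones', 'email': 'mj@example.com', 'phone': '555-0101'}
```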

Importance of master data management


Business operations depend on transaction processing systems. BI and analytics increasingly
drive marketing campaigns, customer engagement efforts, supply chain management and other
business processes.
But many companies don't have a single view of their customers. Commonly, that's because
customer data differs from one system to another.
For example, customer records might not be identical in order entry, shipping and customer
service systems due to variations in names, addresses and other attributes. Similar inconsistencies
can also occur in product data and other types of information. Such issues cause business
problems if critical data can't be accessed or is missed by end users.
Master data management programs help avoid that by consolidating data from multiple source
systems into a standard format to provide the needed single view of business entities.

Benefits of Master Data Management


Increase revenue growth
To provide personalized cross-sell and up-sell offers, you need access to reliable and complete
data for all customer touchpoints. MDM can provide a consolidated source of key master data on
customers, products, and relationships between master data entities. With a better understanding
of your customers, you can ensure that the right cross-sell and up-sell offers are sent to the right
customers at the right time.
Improve productivity
With Master Data Management, you can eliminate IT overhead and costs and drive operational
efficiencies by providing a complete, consistent, reliable source of master data across your
organization. MDM will also help improve visibility and control over business activities by
managing sophisticated relationships across products, customers, vendors, and locations.
Optimize your supply chain
Master Data Management offers a centralized perspective on products, and accurate information
on inventory, product returns, and out-of-stock items across the supply chain, improving
inventory management, forecasting, and customer service. You can ensure accurate, timely
information is provided to support decisions and actions made by the applications, processes, and
people that run your business.
Identify and act on insights faster
Master Data Management can speed up time-to-insight and action by allowing business users to
directly access, manage, and visually interact with master data. With a richer source of product,
customer, and vendor data, you can introduce new products and services faster.
Improve Customer Satisfaction
You can also accelerate loyalty and increase sales with Master Data Management by
personalizing interactions, delivering a consistent experience across channels, and tailoring
products and services to your customer’s specific wants and needs.
Improve compliance
Centralized and complete master data helps to reduce the costs associated with compliance reporting
and penalties. With master data management, fewer vendor and product compliance issues lead to
faster new product introductions and vendor onboarding.

Improved Operational Efficiency and Reduced Costs


Replication of the same data is often linked to replication of the activities associated with managing
those data sets, ranging from typical data management routines (backups, maintenance), to
licensing costs for infrastructure (such as RDBMS or Extract/Transform/Load (ETL) product
license and maintenance costs), to specialised application software or services (such as data
enhancement or geo/demographic appends). Formulating a unified view of the data enables the
organisation to reduce operating costs and tasks.
Improved Decision-making: Inconsistency across business intelligence activities often occurs
because of replication or duplication in the underlying data used to drive the decision-making
process. Questions regarding the consistency of reports can confuse management decision-
making, leading to missed business opportunities. The information consistency provided by
MDM across applications reduces data variability, which in turn minimises organisational data
mistrust and allows for clearer (and faster) business decisions.
Better Spend Analysis and Planning
Master Data associated with product, supplier, and vendor data improve the ability to aggregate
purchasing activities, coordinate competitive sourcing, be more predictable about future spend,
and generally improve vendor and supplier management.
Increased Information Quality
Collecting metadata made up of standardised models, value domains, and business rules enables
organisations to more effectively monitor conformance to information quality expectations across
vertical applications, which reduces information scrap and rework.
Quicker Results
A standardised view of the information asset reduces the delays associated with extraction and
transformation of data, speeding the implementation of application migrations, modernisation
projects, and data warehouse/data mart construction.

Challenges of master data management


Despite the benefits it offers, MDM can be a difficult undertaking. These are some of the common
challenges it presents to organizations:

Complexity: The potential benefits of master data management increase as the number and diversity
of systems and applications in an organization expand. For this reason, MDM is more likely to be
of value to large enterprises than small and medium-sized businesses. However, the complexity
of enterprise MDM programs has limited their adoption even in large companies.

Disagreements on enterprise data standards: One of the biggest hurdles is getting different
business units and departments to agree on common master data standards. MDM efforts can lose
momentum and get bogged down if users argue about how data is formatted in their separate
systems.
Project scoping issues: Another often-mentioned obstacle to successful MDM implementations is
project scoping. The efforts can become unwieldy if the scope of the planned work gets out of
control or if the implementation plan doesn't properly stage the required steps.

Incorporating acquired companies into MDM programs: When companies merge, MDM can
help streamline data integration, reduce incompatibilities and optimize operational efficiency in
the newly combined organization. But the challenge of reaching consensus on master data among
business units can be even greater after a merger or acquisition.

Dealing with sets of big data: The growing use of big data systems in organizations can also
complicate the MDM process by adding new forms of unstructured and semi-structured data
stored in a variety of platforms, including Hadoop clusters, other types of data lake systems and
newer data lakehouse environments.

Key roles and participants in an MDM initiative (Optional)


Because of their complexity and their broad impact on business operations, MDM programs
should involve a wide range of people in an organization. The level of involvement varies
depending on the role: Some data management professionals might work full-time on MDM,
while others devote part of their time to it, and business stakeholders usually take part on an
occasional, though regular, basis.

These are some of the key positions and participants in the MDM process:

MDM manager: This person oversees the planning, development and implementation of an MDM
program. The job title could also be director of MDM, MDM program lead or another variation.
In organizations with aligned MDM and data governance programs, a single manager might be
put in charge of both initiatives.

Master data specialist: As the title indicates, this is a technical role that focuses on the creation and
maintenance of master data. Duties typically include upfront data cleansing, matching and
merging, plus troubleshooting of data quality issues on an ongoing basis. In some organizations,
the role is master data analyst or MDM analyst.

Data stewards: As part of overseeing data sets in specific domains, data stewards often are
involved in MDM programs. For example, they can handle some of the data management and
maintenance tasks or work with master data specialists on those functions. They can also ensure
that business units comply with internal master data standards.

Other data management professionals: Various data management team members can play a part
in MDM programs, too. That includes data architects and data modellers, as well as data quality
analysts who help with data cleansing and ETL developers who create extract, transform and load
jobs to pull together master data from different source systems.

Executive sponsor: MDM programs are big and often expensive initiatives, and they can lead to
internal infighting over master data standards and face resistance from business units. As a result,
they commonly need an executive sponsor who can ensure that a program gets required funding,
help to resolve conflicts and promote adoption.

Business stakeholders: Business executives whose operations are affected by an MDM program
need to be involved in master data decision-making activities. Alternatively, they can designate
subject matter experts to represent their business units. In many cases, participating stakeholders
are organized into a steering committee that meets regularly. Organizations with an existing data
governance council might instead use it to make MDM-related decisions.

MDM Tools/Solutions/Software:


1. Ataccama ONE: Ataccama offers an augmented data management platform that features data
discovery and profiling, metadata management and a data catalogue, data quality management,
master and reference data management, and big data processing and integration.
• The product is fully integrated yet modular for any data, user, domain or deployment type.
• Ataccama also includes text analytics and machine learning, as well as data enrichment
with external sources and data lake profiling.
2. IBM Infosphere Master Data Management: IBM Infosphere Master Data Management
manages all aspects of critical enterprise data, regardless of system or model, and delivers it to
application users in a unified view.
• The tool provides compliance with data governance rules and policies, and features a
configurable framework that supports hybrid cloud environments.
• InfoSphere MDM is available in two editions (standard and advanced), and both are
available on-prem or as fully managed cloud offerings.
3. Informatica Multidomain MDM: Informatica offers a modular MDM solution that provides a
single view of data.
• The product enables users to create an authoritative view of business-critical data from
disparate, duplicate and conflicting sources.
• Informatica MDM also features AI and machine learning, and includes data quality, data
integration, business process management, and data security functionality that allows you
to easily enrich master data records with data from external providers.
4. Profisee Platform: Profisee is a cloud-native, multidomain master data management (MDM)
platform that features an intuitive data stewardship UI, visual relationship management,
matching and survivorship, data quality rules and workflow management.

5. SAP Master Data Governance: SAP offers enterprise MDM functionality through its SAP
Master Data Governance product. The solution can be deployed on-prem or in the cloud and
enables users to consolidate and centrally govern master data.
• SAP includes support for all master data domains and implementation styles, with pre-built
data models, business rules, workflows, and user interfaces.
• Master Data Governance also lets you define, validate, and monitor your established
business rules to confirm master data readiness and analyse the performance of data
management.
6. Oracle Enterprise Data Management: Oracle Enterprise Data Management is a multidomain,
enterprise-level MDM platform.
• Formerly a standalone MDM platform, it is now embedded within cloud applications and
consists of several different solutions designed to collect and standardise data as a service
of the application.
7. SAS MDM: SAS MDM is a multi-domain MDM solution from SAS, an enterprise provider of
analytics solutions. SAS MDM offers a complete solution for data management, with a strong
focus on data quality and governance.
8. Syniti Master Data Management: Syniti (formerly BackOffice Associates) lets users choose
from an array of supported MDM implementation styles.
• The product enables the creation of a single point of reference to master and application
data from multiple domains.
• Syniti automatically notifies users when work needs to be performed, and it also monitors
processes. It also links the business semantic layer to all master and application data
mastered by the solution.
9. TIBCO EBX: TIBCO Software offers its MDM capabilities via TIBCO EBX, a single solution
that manages workflow, data quality, and role-specific applications.
• EBX is designed with different user personas in mind by providing self-service, so all
individuals can manage, govern, and consume key data assets.
• Customers can model any master data (including relationships between domains) without
buying separate tools. This means that you can put any domain of master or reference data
inside EBX. It works with SAP, BCP, TMl, and One Stream.
10. Microsoft's Azure Data Catalogue: Microsoft's Azure Data Catalogue is an enterprise-wide
metadata catalogue designed to make data asset discovery straightforward. It is a fully managed
service that lets analysts, data scientists, and developers register, enrich, discover, understand,
and consume data sources.
EXTRACT, TRANSFORM AND LOAD (ETL)
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The ETL process can be broken down into the following
three stages:
• Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from
the source systems and storing it in a staging area.
• Transform: In this stage, the extracted data is transformed into a format that is suitable
for loading into the data warehouse. This may involve cleaning and validating the data,
converting data types, combining data from multiple sources, and creating new data
fields.
• Load: After the data is transformed, it is loaded into the data warehouse. This step
involves creating the physical data structures and loading the data into the warehouse.
The ETL process is an iterative process that is repeated as new data is added to the warehouse.
• The process is important because it ensures that the data in the data warehouse is accurate,
complete, and up-to-date.
• It also helps to ensure that the data is in the format required for data mining and reporting.

ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process
in which an ETL tool extracts the data from various data source systems, transforms it in the
staging area, and then finally, loads it into the Data Warehouse system.
1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source
systems, which can be in various formats like relational databases, NoSQL, XML,
and flat files, is extracted into the staging area. It is important to extract the data
from the various source systems and store it in the staging area first, and not directly
into the data warehouse, because the extracted data is in various formats and can also be
corrupted. Loading it directly into the data warehouse may therefore damage it, and
rollback will be much more difficult. This makes extraction one of the most important
steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard
format. It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States, and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally a key attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse. Sometimes the data is loaded into the
data warehouse very frequently, and sometimes it is done at longer but regular
intervals. The rate and period of loading depend solely on the
requirements and vary from system to system.

The ETL process can also use the pipelining concept: as soon as some data is extracted, it can be
transformed, and during that period some new data can be extracted. And while the transformed
data is being loaded into the data warehouse, the already extracted data can be transformed. A
block diagram of a pipelined ETL process illustrates this overlapping of the three stages.
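The following minimal Python sketch illustrates the three ETL stages on a toy example: it extracts rows from a CSV file, transforms them (cleaning NULL-like values and mapping country name variants to USA, as described above), and loads them into a SQLite table standing in for the warehouse. The file, table, and column names are assumptions for illustration, not a prescribed implementation.

```python
import csv
import sqlite3

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def extract(path):
    """Extract: read raw rows from a source file into a staging list."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean missing values and standardise the country attribute."""
    cleaned = []
    for row in rows:
        country = (row.get("country") or "Unknown").strip()
        cleaned.append({
            "name": (row.get("name") or "Unknown").strip(),
            "country": COUNTRY_MAP.get(country, country),
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the warehouse table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO customers (name, country) VALUES (:name, :country)", rows
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("customers.csv")))   # assumes a customers.csv source file
```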
ETL Tools: The most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse Builder,
CloverETL, and MarkLogic.
Data Warehouses: The most commonly used data warehouses are Snowflake, Redshift,
BigQuery, and Firebolt.

ADVANTAGES AND DISADVANTAGES:

Advantages of ETL process in data warehousing:

1. Improved data quality: ETL process ensures that the data in the data warehouse is
accurate, complete, and up-to-date.
2. Better data integration: ETL process helps to integrate data from multiple sources and
systems, making it more accessible and usable.
3. Increased data security: ETL process can help to improve data security by controlling
access to the data warehouse and ensuring that only authorized users can access the data.
4. Improved scalability: ETL process can help to improve scalability by providing a way
to manage and analyze large amounts of data.
5. Increased automation: ETL tools and technologies can automate and simplify the ETL
process, reducing the time and effort required to load and update data in the warehouse.

Disadvantages of ETL process in data warehousing:

1. High cost: ETL process can be expensive to implement and maintain, especially for
organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be
able to handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be
able to handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large
amounts of data are collected, stored, and analysed.
ETL Tools:
These tools allow users to extract, transform, and load data. This process allows data to be
extracted from any data source, transformed into the proper format for storage, and loaded into
a database.
1. Oracle Data Integrator: Best ETL Tool for Oracle Users: Oracle Data Integrator is an
ETL tool that provides high-performance data movement and transformation among enterprise
platforms with its open and integrated ETL architecture. Its easy-to-use interface,
combined with an extensibility framework, helps the ETL tool reduce development costs and
lower the total cost of ownership.
(a) Advantages:
i)Works with cloud services
ii) Easy to use
iii) Can connect multiple apps
iv) Rich UI and UX
(b) Disadvantages :
i) Limited to SQL language
ii) Requires training and testing
2. Microsoft SQL Server Integration Services (SSIS): Best ETL Tool for Microsoft Users:
This ETL tool is a component of the Microsoft SQL Server database that is used for a range of
data migration tasks. This ETL tool features an import/export wizard that allows the user to
create packages to move data from a single source to a destination. Microsoft also offers other
tools, such as Microsoft Visual Studio, which allows users to edit SSIS packages using a drag-
and-drop interface.
(a) Advantages:
i) Runs third party software
ii) Drag and drop components
iii) Easy to use
iv) Backend coding
(b) Disadvantages :
i) High CPU memory usage
ii) Requires training and testing
3. SAS Data Integration Studio: Best ETL Tool for Visual Learners: This ETL Tool is
unique as it combines visual elements with data migration. SAS Data Integration Studio builds,
implements, and manages data integrations processes regardless of the source of the data. This
tool is easy-to-manage and enables collaboration on large projects with repeatable processes.
SAS Data Integration Studio is only one component in a number of SAS software offerings.
(a) Advantages:
i) Reliable
ii) Consistent data
iii) User friendly data
iv) Powerful backend
(b) Disadvantages:
i) Limited third party integration
ii) Overwhelming

4. Informatica PowerCentre: Best ETL Tool for a Market-Leading Enterprise Data Platform:
Informatica PowerCentre is a strong ETL tool that can form the foundation for all of your data
integration needs. This tool includes analytics and data warehousing, application migration,
consolidation and data governance. Advanced data transformation is also a benefit of this tool.
(a) Advantages:
i) Mapping options
ii) Compatible with different databases
iii) Easy to understand
iv) Strong search filters
(b) Disadvantages:
i) Lagging performance
ii) Lack of scheduling options
5. Talend Open Studio: Best ETL Tool for Direct Marketers: Talend Open Studio is an open-
source ETL tool. It is designed to allow users to extract data sets, standardise them and
transform them into a consistent format to then be loaded into third-party applications. While
this version is a little rougher around the edges than the paid version, the price tag is definitely
a benefit. This ETL tool has several helpful features, including graphical conversion tools,
charts, and database CD tools. It also includes many built-in business intelligence tools.
(a) Advantages :
i) Easy to use
ii) Large amount of connectors
iii) Open Source
iv) Custom component
(b) Disadvantages:
i) Large CPU/memory usage
ii) Lack of documentation

6. Hevo: Best ETL Tool for Every Data Need: Hevo is a "bi-directional data pipeline platform
specifically built for modern ETL, ELT and Reverse ETL needs." This tool helps teams both
automate and streamline org-wide data flows, which can help save up to 10 hours of engineering
time a week. This ETL tool can also provide 10x faster analytics, reporting and decision
making. Hevo also has an internal feature that auto-maps incoming data.
(a) Advantages:
i) Custom API connections
ii) Built-in Python module
iii) Intuitive UI
iv) Fast customer service response
(b) Disadvantages:
i) Limited destination configuration
ii) Limited pipeline schedules
7. Pentaho: Best ETL Tool for Graphic Lovers: Pentaho is an ETL tool that has software
to support all aspects of business decision making, including data warehouse management, data
integration tools, data analysis, management software, and data mining. This tool also allows
developers to manipulate jobs using a graphic creator without the need for coding.
(a) Advantages:
i) Graphic interface
ii) Pre-configured modules
iii) Free community suite
iv) User-friendly
(b) Disadvantages:
i) Simple error screen.
ii) Unclear implementation
8. Dataddo: Best ETL Tool for Non-Technical Users: Dataddo is a cloud-based ELT tool that
is designed to work with online data services such as Salesforce, Facebook, Instagram,
YouTube, and HubSpot. This solution easily connects the client's data with dashboards and
data warehouse systems. This ETL tool is perfect for any data-driven business, but marketing
agencies and e-commerce companies might find the most use out of it.
(a) Advantages:
i) Fully customisable metrics
ii) No-coding necessary
iii) GDPR SOC2 and ISO 27001 compliant
iv) Fast customer service response
(b) Disadvantages:
i) Hard to add custom fields
ii) Hard to cancel subscription

Advantages of ETL Tools:


1. Easy to use.
2. Load data from different targets at the same time.
3. Perform data transformations as needed.
4. Better for complex rules and transformations.
5. Inbuilt error-handling functionality.
6. Based on a GUI and offer visual flow.
7. Save cost and generate higher revenue.

Disadvantages of ETL Tools:


1. Not suitable for near real-time data access.
2. Inclined more towards batch data processing
3. Difficult to keep up with changing requirements.

Extraction, Load and Transform (ELT)





1. Extraction, Load and Transform (ELT): Extraction, Load and Transform (ELT) is the
technique of extracting raw data from the source, storing it in the data warehouse of the target
server, and preparing it for downstream users.
ELT comprises three different operations performed on the data:
1. Extract:
Extracting data is the technique of identifying data from one or more sources. The
sources may be databases, files, ERP, CRM or any other useful source of data.
2. Load:
Loading is the process of storing the extracted raw data in the data warehouse or data
lake.
3. Transform:
Data transformation is the process in which the raw data from the source is transformed into
the target format required for analysis.

Data from the sources is extracted and stored in the data warehouse. The entire data set is not
transformed up front; only the required transformation is done when necessary. Raw data can be
retrieved from the warehouse at any time when required. The data, transformed as required, is
then sent forward for analysis. When you use ELT, you move the entire data set as it exists in
the source systems to the target. This means that you have the raw data at your disposal in
the data warehouse, in contrast to the ETL approach.
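A minimal sketch of the ELT pattern in Python, continuing the earlier toy example: raw rows are first loaded unchanged into a SQLite "warehouse" table, and the transformation (standardising country names) is then performed inside the warehouse with SQL only when needed. The file, table, and column names are assumptions for illustration.

```python
import csv
import sqlite3

def extract_and_load(path, db_path="warehouse.db"):
    """Extract raw rows and load them as-is into a raw staging table in the warehouse."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))   # assumes columns: name, country
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS raw_customers (name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO raw_customers (name, country) VALUES (:name, :country)", rows
    )
    conn.commit()
    return conn

def transform_in_warehouse(conn):
    """Transform later, inside the warehouse: build a cleaned view over the raw data."""
    conn.execute("""
        CREATE VIEW IF NOT EXISTS customers_clean AS
        SELECT name,
               CASE WHEN country IN ('U.S.A', 'United States', 'America')
                    THEN 'USA' ELSE country END AS country
        FROM raw_customers
    """)
    conn.commit()

if __name__ == "__main__":
    conn = extract_and_load("customers.csv")   # assumes a customers.csv source file
    transform_in_warehouse(conn)
    print(conn.execute("SELECT * FROM customers_clean").fetchall())
```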

Difference between ELT and ETL:


1. Hardware: ELT tools do not require additional hardware. ETL tools require specific hardware
with their own engines to perform transformations.
2. Storage: ELT mostly uses Hadoop or NoSQL databases to store data; an RDBMS is rarely used.
ETL uses an RDBMS exclusively to store data.
3. Loading: In ELT, as all components are in one system, loading is done only once. In ETL,
because a staging area is used, extra time is required to load the data.
4. Transformation time: In ELT, the time to transform data is independent of the size of the data.
In ETL, the system has to wait for large sizes of data; as the size of data increases,
transformation time also increases.
5. Cost: ELT is cost-effective and available to all businesses using SaaS solutions. ETL is not
cost-effective for small and medium businesses.
6. Users: In ELT, the transformed data is used by data scientists and advanced analysts. In ETL,
the transformed data is used by users reading reports and by SQL coders.
7. Views: ELT creates ad hoc views, with low cost for building and maintaining them. In ETL,
views are created based on multiple scripts, and deleting a view means deleting the data.
8. Data types: ELT is best for unstructured and non-relational data, ideal for data lakes and suited
for very large amounts of data. ETL is best for relational and structured data, and better for
small to medium amounts of data.
Important Questions:

• Define Data Quality. What is the role of Data Quality in an organisation?
• What is MDM? What is the role of MDM in an organisation?
• Tools of Data Quality and MDM
• Explain the ETL process
• Write about the ELT process
• Tools of ETL
• Data Integration
