What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of
analysis and reporting. According to Kimball, a data warehouse should be subject-oriented,
non-volatile, integrated and time-variant.

What are the benefits of a data warehouse?

Historical data stored in a data warehouse helps to analyze different aspects of the business,
including performance analysis, trend analysis and trend prediction, which ultimately increases the
efficiency of business processes.

Why is a data warehouse used?

A data warehouse facilitates reporting on different key business processes through KPIs (key
performance indicators). A data warehouse can be further used for data mining, which helps with
trend prediction, forecasting, pattern recognition etc.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and
analysis system built on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly
normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval
through SELECT operations.

Explanatory Note:

In a department store, when we pay at the check-out counter, the salesperson keys all the data
into a "Point-Of-Sale" machine. That data is transaction data and the related system is an OLTP
system. On the other hand, the manager of the store might want to view a report on out-of-stock
materials so that he can place purchase orders for them. Such a report will come out of an OLAP
system.

What is data mart?

Data marts are generally designed for a single subject area. An organization may have data
pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse,
and each department may have separate data marts. These data marts can be built on top of the
data warehouse.

What is an ER model?

An ER (entity-relationship) model is a model designed with the goal of normalizing the data.

What is dimensional modelling?

A dimensional model consists of dimension and fact tables. Fact tables store the transactional
measurements and the foreign keys from the dimension tables that qualify the data. The goal of a
dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast
data retrieval.

What is dimension?

A dimension is something that qualifies a quantity (measure).

If I just say "20kg", it does not mean anything. But "20kg of Rice (product) sold to Ramesh
(customer) on 5th April (date)" makes meaningful sense. Product, customer and date are the
dimensions that qualify the measure here. Dimensions are mutually independent.

Technically speaking, a dimension is a data element that categorizes each item in a data set into
non-overlapping regions.

What is fact?

A fact is something that is quantifiable (Or measurable). Facts are typically (but
not always) numerical values that
can be aggregated.

What are additive, semi-additive and non-additive measures?


Non-additive measures are those which cannot be used inside any numeric aggregation function
(e.g. SUM(), AVG() etc.). An example of a non-additive fact is any kind of ratio or percentage, e.g.
5% profit margin or revenue-to-asset ratio. Non-numerical data can also be a non-additive measure
when that data is stored in fact tables.

Semi-additive measures are those on which only a subset of aggregation functions can be applied.
Take account balance: a SUM() on balance across dates does not give a useful result, but MAX() or
MIN() balance might be useful. Similarly, consider a price rate or currency rate: SUM() is meaningless
on a rate, but an average might be useful.

Additive measures can be used with any aggregation function like SUM(), AVG() etc. Sales quantity
is an example.
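
As an illustration, here is a hedged SQL sketch of the difference (the account_balance_fact and
sales_fact tables and their columns are assumptions made up for this example):

-- Semi-additive: summing daily balances across dates is misleading,
-- but the maximum balance per account is still meaningful.
SELECT account_id,
       MAX(balance_amount) AS max_balance,       -- useful
       SUM(balance_amount) AS summed_balances    -- misleading across dates
FROM   account_balance_fact
GROUP BY account_id;

-- Additive: sales quantity can be summed across any dimension.
SELECT product_id,
       SUM(sales_quantity) AS total_quantity
FROM   sales_fact
GROUP BY product_id;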

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a
number of dimension tables, so that the primary keys from all the dimension tables flow into the
fact table (as foreign keys), where the measures are stored. The entity-relationship diagram of this
arrangement looks like a star, hence the name.

Consider a fact table that stores sales quantity for each product and customer on a
certain time. Sales quantity will be
the measure here and keys from customer, product and time dimension tables will
flow into the fact table.

A star-schema is a special case of snow-flake schema.
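
A minimal SQL sketch of such a star schema (the table and column names here are illustrative
assumptions, not taken from any particular system):

-- Dimension tables: one surrogate key plus descriptive attributes each
CREATE TABLE customer_dim (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE product_dim (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE date_dim (
    date_key      INT PRIMARY KEY,
    calendar_date DATE
);

-- Central fact table: foreign keys from every dimension plus the measure
CREATE TABLE sales_fact (
    customer_key   INT REFERENCES customer_dim (customer_key),
    product_key    INT REFERENCES product_dim (product_key),
    date_key       INT REFERENCES date_dim (date_key),
    sales_quantity INT
);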

What is snow-flake schema?


This is another logical arrangement of tables in dimensional modeling where a centralized fact
table references a number of dimension tables; however, those dimension tables are further
normalized into multiple related tables.

Consider a fact table that stores sales quantity for each product and customer on a
certain time. Sales quantity will be
the measure here and keys from customer, product and time dimension tables will
flow into the fact table.
Additionally, all the products can be further grouped under different product families stored in a
separate table, so that the primary key of the product family table also goes into the product table
as a foreign key. Such a construct is called a snow-flake schema, as the product table is further
snow-flaked into product family.

Note: Snow-flaking increases the degree of normalization in the design.
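
Continuing the earlier sketch, the product dimension could be snow-flaked into a hypothetical
product_family_dim table (again, the names are illustrative assumptions):

-- Product family is normalized out of the product dimension
CREATE TABLE product_family_dim (
    product_family_key  INT PRIMARY KEY,
    product_family_name VARCHAR(100)
);

-- The product dimension now carries the family key as a foreign key
CREATE TABLE product_dim (
    product_key        INT PRIMARY KEY,
    product_name       VARCHAR(100),
    product_family_key INT REFERENCES product_family_dim (product_family_key)
);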

What are the different types of dimension?

In a data warehouse model, a dimension can be of the following types:

1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimensions
as:

1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)

What is a 'Conformed Dimension'?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider a
'Customer' dimension. Both the marketing and sales departments may use the same customer
dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different
subject areas. These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one
another are said to be conformed.

What is degenerated dimension?

A degenerated dimension is a dimension that is derived from the fact table and does not have its
own dimension table.

A dimension key such as transaction number, receipt number or invoice number does not have any
further associated attributes and hence cannot be designed as a dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that
those can be removed from other tables and junked into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store all
the combinations of the dimensional attributes which you could not otherwise fit into the different
dimension tables. One may want to read an interesting document, De-clutter with Junk (Dimension).

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with
different contextual meaning.
For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date
of Delivery", or "Date of Hire". This is
often referred to as a 'role-playing dimension'.

What is SCD?

SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly
changing. These can be of many
types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are
most common.

What is rapidly changing dimension?

This is a dimension where data changes rapidly.

Describe different types of slowly changing Dimension (SCD)

Type 0:

A Type 0 dimension is one where dimensional changes are not considered. This does not mean
that the attributes of the dimension do not change in the actual business situation. It just means
that, even if the value of the attributes changes, history is not kept and the table continues to hold
the previously loaded data.

Type 1:

A Type 1 dimension is one where history is not maintained and the table always shows the most
recent data. This effectively means that such a dimension table is always updated with the recent
data whenever there is a change, and because of this update, we lose the previous values.

Type 2:

A Type 2 dimension table tracks historical changes by creating separate rows in the table with
different surrogate keys. Consider that a customer C1 is under group G1 first and later on the
customer is changed to group G2. Then there will be two separate records in the dimension table,
like below:

Key | Customer | Group | Start Date   | End Date
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005
2   | C1       | G2    | 1st Jan 2006 | NULL

Note that separate surrogate keys are generated for the two records. The NULL end date in the
second row denotes that the record is the current record. Also note that, instead of start and end
dates, one could also keep a version number column (1, 2, etc.) to denote different versions of the
record.
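
A hedged SQL sketch of how such a Type 2 change could be applied (the customer_dim table, its
columns and the literal surrogate key value are assumptions for illustration; in practice the key
would come from a sequence):

-- Expire the currently active record for customer C1
UPDATE customer_dim
SET    end_date = DATE '2005-12-31'
WHERE  customer_id = 'C1'
AND    end_date IS NULL;

-- Insert a new record with a new surrogate key and the new group
INSERT INTO customer_dim (cust_key, customer_id, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', DATE '2006-01-01', NULL);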

Type 3:

A Type 3 dimension stores the history in a separate column instead of separate rows. So unlike a
Type 2 dimension, which grows vertically, a Type 3 dimension grows horizontally. See the example
below:

Key | Customer | Previous Group | Current Group
1   | C1       | G1             | G2

This is only good when you do not need to store many consecutive changes and when the date of
change is not required to be stored.

Type 6:

A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3), which acts very similar to Type 2,
except that you add one extra column to denote which record is the current record.

Key | Customer | Group | Start Date   | End Date      | Current Flag
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005 | N
2   | C1       | G2    | 1st Jan 2006 | NULL          | Y

What is a mini dimension?

Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has
a huge number of rapidly changing attributes, it is better to separate those attributes into a
different table called a mini dimension. This is done because if the main dimension table is designed
as SCD Type 2, the table will soon grow too large and create performance issues. It is better to
segregate the rapidly changing members into a different table, thereby keeping the main dimension
table small and performant.

What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This
table will only contain keys from different
dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:

Consider a school, where a single student may be taught by many teachers and a single teacher
may have many students. To model this situation in a dimensional model, one might introduce a
fact-less-fact table joining the teacher and student keys. Such a fact table will then be able to
answer queries like:

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
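
A minimal SQL sketch of the second query against such a fact-less fact (the teacher_student_fact
table and its key columns are assumptions for illustration):

-- Which teacher teaches the maximum number of students?
SELECT teacher_key,
       COUNT(DISTINCT student_key) AS student_count
FROM   teacher_student_fact
GROUP BY teacher_key
ORDER BY student_count DESC;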

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer a negative
query. Again consider the illustration in the above example. A fact-less fact containing the keys of
tutors and students cannot answer queries like the ones below:

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because the fact-less fact table only stores the positive scenarios (like a student being
taught by a tutor), but if there is a student who is not being taught by any teacher, then that
student's key does not appear in this table, thereby reducing the coverage of the table.

A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0
indicates a negative condition and flag = 1 indicates a positive condition. To understand this better,
let's consider a class where there are 100 students and 5 teachers. The coverage fact table will then
ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a
certain student, the corresponding flag for that record will be 0.
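
A hedged SQL sketch of the negative query against such a coverage fact (the coverage_fact table,
its keys and the teach_flag column are illustrative assumptions):

-- Which teacher did not teach any student?
SELECT teacher_key
FROM   coverage_fact
GROUP BY teacher_key
HAVING MAX(teach_flag) = 0;   -- the flag is never 1 for this teacher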

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or
captured) against a specific time and they vary with respect to time. Now it might so happen that
the business is not able to capture all of its measures for every point in time. Those unavailable
measurements can then either be kept empty (NULL) or be filled up with the last available
measurement. The first case is an example of an incident fact and the second one is an example of
a snapshot fact.

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data with the same degree of detail as available in the source.
This "degree of detail" is termed granularity. But not all reporting requirements from that data
warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has
500 shops across Europe. All the shops record detail-level transactions regarding the products they
sell, and those data are captured in a data warehouse.

Each shop manager can access the data warehouse and they can see which products are
sold by whom and in what
quantity on any given date. Thus the data warehouse helps the shop managers with
the detail level data that can be
used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care about which particular
salesperson in London sold the highest number of chopsticks or which shop is the best seller of
'brown bread'. All he is interested in is, perhaps, the percentage increase of his revenue margin
across Europe, or maybe the year-to-year sales growth in Eastern Europe. Such data is aggregated
in nature, because sales of goods in Eastern Europe is derived by summing up the individual sales
data from each shop in Eastern Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
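
For example, a hedged SQL sketch of how detail-level sales rows could be rolled up into the
aggregate the CEO needs (sales_fact, shop_dim, date_dim and their columns are illustrative
assumptions):

-- Year-over-year sales for Eastern Europe, aggregated from shop-level detail
SELECT d.calendar_year,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   shop_dim s ON s.shop_key = f.shop_key
JOIN   date_dim d ON d.date_key = f.date_key
WHERE  s.region = 'Eastern Europe'
GROUP BY d.calendar_year;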

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value (e.g.
Brown Bread) and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of
aggregation.

Slicing and dicing operations are part of pivoting.
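
As a rough SQL analogy only (table and column names are assumptions), slicing resembles fixing a
dimension value in the WHERE clause, and dicing resembles regrouping that same slice along other
dimensions:

-- Slice: total sales for one product value
SELECT SUM(f.sales_quantity) AS total_sales
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
WHERE  p.product_name = 'Brown Bread';

-- Dice: the same slice viewed by customer and month
SELECT c.customer_name,
       d.calendar_month,
       SUM(f.sales_quantity) AS total_sales
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   customer_dim c ON c.customer_key = f.customer_key
JOIN   date_dim d ON d.date_key = f.date_key
WHERE  p.product_name = 'Brown Bread'
GROUP BY c.customer_name, d.calendar_month;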

What is drill-through?

Drill-through is the process of going to the detail-level data from summary data.

Consider the above example on retail shops. If the CEO finds out that sales in Eastern Europe have
declined this year compared to last year, he might then want to know the root cause of the
decrease. For this, he may start drilling through his report to a more detailed level and eventually
find out that even though individual shop sales have actually increased, the overall sales figure has
decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the
CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the
declined sales. The method he followed to obtain the details from the aggregated data is called
drill-through.

Informatica Questions

Welcome to the finest collection of Informatica Interview Questions with standard answers that
you can count on. Read and understand all the questions and their answers below and in the
following pages to get a good grasp of Informatica.

What are the differences between Connected and Unconnected Lookup?

Connected Lookup | Unconnected Lookup
Connected lookup participates in the data flow and receives input directly from the pipeline. | Unconnected lookup receives input values from the result of a :LKP expression in another transformation.
Connected lookup can use both dynamic and static cache. | Unconnected lookup cache can NOT be dynamic.
Connected lookup can return more than one column value (output port). | Unconnected lookup can return only one column value, i.e. one output port.
Connected lookup caches all lookup columns. | Unconnected lookup caches only the lookup output ports in the lookup conditions and the return port.
Connected lookup supports user-defined default values (i.e. the value to return when the lookup conditions are not satisfied). | Unconnected lookup does not support user-defined default values.

What is the difference between Router and Filter?

Router | Filter
Router transformation divides the incoming records into multiple groups based on some condition. Such groups can be mutually inclusive (different groups may contain the same record). | Filter transformation restricts or blocks the incoming record set based on one given condition.
Router transformation itself does not block any record. If a certain record does not match any of the routing conditions, the record is routed to the default group. | Filter transformation does not have a default group. If a record does not match the filter condition, the record is blocked.
Router acts like a CASE..WHEN statement in SQL (or a switch()..case statement in C). | Filter acts like the WHERE clause in SQL.

What can we do to improve the performance of Informatica Aggregator Transformation?

Aggregator performance improves dramatically if records are sorted before passing to the
aggregator and the "Sorted Input" option under the aggregator properties is checked. The record
set should be sorted on those columns that are used in the Group By operation.

It is often a good idea to sort the record set at the database level, e.g. inside a Source Qualifier
transformation, unless there is a chance that the already sorted records from the Source Qualifier
can again become unsorted before reaching the aggregator.

What are the different lookup cache?

Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A
static cache is one which does not modify the cache once it is built, and it remains the same during
the session run. On the other hand, a dynamic cache is refreshed during the session run by inserting
or updating records in the cache based on the incoming source data.

A lookup cache can also be classified as persistent or non-persistent based on whether Informatica
retains the cache even after the session run is complete or not, respectively.

How can we update a record in target table without using Update strategy?

A target table can be updated without using an 'Update Strategy'. For this, we need to define the
key of the target table at the Informatica level and then we need to connect the key and the field
we want to update in the mapping target. At the session level, we should set the target property to
"Update as Update" and check the "Update" check-box.

Let's assume we have a target table "Customer" with fields "Customer ID", "Customer Name" and
"Customer Address". Suppose we want to update "Customer Address" without an Update Strategy.
Then we have to define "Customer ID" as the primary key at the Informatica level and we will have
to connect the Customer ID and Customer Address fields in the mapping. If the session properties
are set correctly as described above, then the mapping will only update the Customer Address field
for all matching Customer IDs.

Under what condition selecting Sorted Input in aggregator may fail the session?

. If the input data is not sorted correctly, the session will fail.
. Also if the input data is properly sorted, the session may fail if the sort order
by ports and the group by ports
of the aggregator are not in the same order.

Why is Sorter an Active Transformation?


Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.

Is lookup an active or passive transformation?

From Informatica 9x, Lookup transformation can be configured as as "Active"


transformation. Find out How to
configure lookup as active transformation

What is the difference between Static and Dynamic Lookup Cache?

Ans. We can configure a Lookup transformation to cache the corresponding lookup table. In case of
a static or read-only lookup cache, the Integration Service caches the lookup table at the beginning
of the session and does not update the lookup cache while it processes the Lookup transformation.

In case of a dynamic lookup cache, the Integration Service dynamically inserts or updates data in
the lookup cache and passes the data to the target. The dynamic cache is synchronized with the
target.

What is the difference between STOP and ABORT options in Workflow Monitor?

Ans. When we issue the STOP command on the executing session task, the Integration Service stops
reading data from the source. It continues processing, writing and committing the data to targets.
If the Integration Service cannot finish processing and committing data, we can issue the ABORT
command.

In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service
cannot finish processing and committing data within the timeout period, it kills the DTM process
and terminates the session.

How to delete duplicate rows using Informatica

Scenario 1: Duplicate rows are present in a relational database

Suppose we have duplicate records in the source system and we want to load only the unique
records into the target system, eliminating the duplicate rows. What will be the approach?

Ans. Assuming that the source system is a relational database, to eliminate duplicate records we
can check the Distinct option of the Source Qualifier of the source table and load the target
accordingly.
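
Behind the scenes, checking the Distinct option makes the Integration Service add a DISTINCT
clause to the default query. A sketch of the equivalent SQL (the source table and columns are
assumptions for illustration):

SELECT DISTINCT customer_id,
                customer_name,
                customer_address
FROM   customers;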

Scenario 2: Deleting duplicate records from a flat file

Deleting duplicate rows for FLAT FILE sources

Now suppose the source system is a flat file. Here, in the Source Qualifier you will not be able to
select the Distinct option, as it is disabled for a flat file source. Hence the next approach may be to
use a Sorter Transformation and check the Distinct option. When we select the Distinct option, all
the columns will be selected as keys, in ascending order by default.
Sorter Transformation DISTINCT clause

Deleting Duplicate Record Using Informatica Aggregator

Another way to handle duplicate records in a source batch run is to use an Aggregator
Transformation and the Group By checkbox on the ports having duplicate data. Here we have the
flexibility to select the last or the first of the duplicate records. Apart from that, using a Dynamic
Lookup Cache of the target table, associating the input ports with the lookup ports and checking
the Insert Else Update option will help to eliminate the duplicate records in the source and hence
load unique records into the target.

For more details on Dynamic Lookup Cache

Loading Multiple Target Tables Based on Conditions

Q2. Suppose we have some serial numbers in a flat file source. We want to load the
serial numbers in two target files
one containing the EVEN serial numbers and the other file having the ODD ones.

Ans. After the Source Qualifier place a Router Transformation. Create two Groups
namely EVEN and ODD, with
filter conditions as MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then
output the two groups
into two flat file targets.
Router Transformation Groups Tab

Normalizer Related Questions

Q3. Suppose in our Source Table we have data as given below:

Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85

We want to load our Target Table as:

Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Sam          | Life Science     | 70
Sam          | Physical Science | 80
John         | Maths            | 75
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Maths            | 80
Tom          | Life Science     | 100
Tom          | Physical Science | 85

Describe your approach.

Ans. Here, to convert the columns into rows we have to use the Normalizer Transformation,
followed by an Expression Transformation to decode the subject name for the column taken into
consideration. For more details on how the mapping is performed please visit Working with
Normalizer.
Q4. Name the transformations which convert one row to many rows, i.e. increase the i/p:o/p row
count. Also, what is the name of its reverse transformation?

Ans. Normalizer as well as Router transformations are Active transformations which can increase
the number of output rows compared to input rows.

Aggregator Transformation is the active transformation that performs the reverse action.

Q5. Suppose we have a source table and we want to load three target tables based on source rows
such that the first row moves to the first target table, the second row to the second target table,
the third row to the third target table, the fourth row again to the first target table, and so on.
Describe your approach.

Ans. We can clearly understand that we need a Router transformation to route or filter the source
data to the three target tables. Now the question is what will be the filter conditions. First of all we
need an Expression Transformation where we have all the source table columns and, along with
that, another i/o port, say SEQ_NUM, which gets the sequence number for each source row from
the NEXTVAL port of a Sequence Generator with start value 0 and increment by 1. Now the filter
conditions for the three router groups will be:

. MOD(SEQ_NUM,3)=1 connected to the 1st target table
. MOD(SEQ_NUM,3)=2 connected to the 2nd target table
. MOD(SEQ_NUM,3)=0 connected to the 3rd target table

Loading Multiple Flat Files using one mapping

Q6. Suppose we have ten source flat files of same structure. How can we load all
the files in target database in a
single batch run using a single mapping.

Ans. After we create a mapping to load data into the target database from flat files, we move on to
the session property of the Source Qualifier. To load a set of source files, we need to create a file,
say final.txt, containing the source flat file names (ten files in our case) and set the Source filetype
option to Indirect. Next, point to this flat file final.txt, fully qualified, through the Source file
directory and Source filename properties.
Session Property Flat File

Q7. How can we implement Aggregation operation without using an Aggregator


Transformation in Informatica.

Ans. We will use the very basic property of the Expression Transformation that, at any point in
time, we can access the previous row's data as well as the currently processed row's data. What we
need are simple Sorter, Expression and Filter transformations to achieve aggregation at the
Informatica level.

For detailed understanding visit Aggregation without Aggregator

Q8. Suppose in our Source Table we have data as given below:

Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Tom          | Maths            | 80
Sam          | Physical Science | 80
John         | Maths            | 75
Sam          | Life Science     | 70
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Life Science     | 100
Tom          | Physical Science | 85

We want to load our Target Table as:

Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85

Describe your approach.

Ans. Here our scenario is to convert many rows to one row, and the transformation which will help
us achieve this is the Aggregator.

Our mapping will look like this:

Sorter Transformation
We will sort the source data based on STUDENT_NAME ascending, followed by SUBJECT ascending.

Now, based on STUDENT_NAME in the GROUP BY clause, the following output subject columns are
populated as:

. MATHS: MAX(MARKS, SUBJECT='Maths')
. LIFE_SC: MAX(MARKS, SUBJECT='Life Science')
. PHY_SC: MAX(MARKS, SUBJECT='Physical Science')
Aggregator Transformation
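
For reference, the grouping logic above is analogous to the following SQL pivot (shown only to
clarify the idea; the student_marks table name is an assumption, and this is not what Informatica
generates):

SELECT student_name,
       MAX(CASE WHEN subject = 'Maths'            THEN marks END) AS maths,
       MAX(CASE WHEN subject = 'Life Science'     THEN marks END) AS life_sc,
       MAX(CASE WHEN subject = 'Physical Science' THEN marks END) AS phy_sc
FROM   student_marks
GROUP BY student_name;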

Revisiting Source Qualifier Transformation

Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ and
why it is an ACTIVE
transformation?

Ans. A Source Qualifier is an Active and Connected Informatica transformation that


reads the rows from a relational
database or flat file source.

. We can configure the SQ to join [Both INNER as well as OUTER JOIN] data
originating from the same
source database.
. We can use a source filter to reduce the number of rows the Integration Service
queries.
. We can specify a number for sorted ports and the Integration Service adds an
ORDER BY clause to the
default SQL query.
. We can choose the Select Distinct option for relational databases and the Integration Service adds
a SELECT DISTINCT clause to the default SQL query.
. Also we can write a Custom/User Defined SQL query which will override the default query in the
SQ by changing the default settings of the transformation properties.
. Also we have the option to write Pre as well as Post SQL statements to be
executed before and after the SQ
query in the source database.
Since the transformation provides us with the Select Distinct property, the Integration Service can
add a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows
returned by the database to the Integration Service; hence it is an Active transformation.

Q10. What happens to a mapping if we alter the datatypes between Source and its
corresponding Source Qualifier?

Ans. The Source Qualifier transformation displays the transformation datatypes. The
transformation datatypes
determine how the source database binds data when the Integration Service reads it.

Now if we alter the datatypes in the Source Qualifier transformation, or the datatypes in the source
definition and the Source Qualifier transformation do not match, the Designer marks the mapping
as invalid when we save it.

Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports
property in the SQ and then we add
Custom SQL Query. Explain what will happen.

Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join,
Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier
transformation. Hence only the user-defined SQL query will be fired in the database and all the
other options will be ignored.

Q12. Describe the situations where we will use the Source Filter, Select Distinct
and Number Of Sorted Ports
properties of Source Qualifier transformation.

Ans. Source Filter option is used basically to reduce the number of rows the
Integration Service queries so as to
improve performance.

Select Distinct option is used when we want the Integration Service to select
unique values from a source, filtering
out unnecessary data earlier in the data flow, which might improve performance.

Number Of Sorted Ports option is used when we want the source data to be in a
sorted fashion so as to use the same
in some following transformations like Aggregator or Joiner, those when configured
for sorted input will improve
the performance.

Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query
and the OUTPUT PORTS
order in SQ transformation do not match?

Ans. A mismatch or a change in the order of the list of selected columns relative to the connected
transformation output ports may result in session failure.

Q14. What happens if in the Source Filter property of SQ transformation we include
keyword WHERE say, WHERE
CUSTOMERS.CUSTOMER_ID > 1000.

Ans. We use source filter to reduce the number of source records. If we include the
string WHERE in the source filter,
the Integration Service fails the session.

Q15. Describe the scenarios where we go for Joiner transformation instead of Source
Qualifier transformation.

Ans. While joining Source Data of heterogeneous sources as well as to join flat
files we will use the Joiner
transformation. Use the Joiner transformation when we need to join the following
types of sources:

. Join data from different Relational Databases.


. Join data from different Flat Files.
. Join relational sources and flat files.

Q16. What is the maximum number we can use in Number Of Sorted Ports for Sybase
source system.

Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase,
do not sort more than 16 columns.

Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to
Target tables TGT1 and
TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?

Ans. If we have multiple Source Qualifier transformations connected to multiple targets, we can
designate the order in which the Integration Service loads data into the targets.

In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier
transformations in a mapping to specify the required loading order.
Target Load Plan
Target Load Plan Ordering

Q18. Suppose we have a Source Qualifier transformation that populates two target
tables. How do you ensure TGT2
is loaded after TGT1?

Ans. In the Workflow Manager, we can Configure Constraint based load ordering for a
session. The Integration
Service orders the target load on a row-by-row basis. For every row generated by an
active source, the Integration
Service loads the corresponding transformed row first to the primary key table,
then to the foreign key table.
Constraint based loading
Hence if we have one Source Qualifier transformation that provides data for
multiple target tables having primary
and foreign key relationships, we will go for Constraint based load ordering.

Revisiting Filter Transformation

Q19. What is a Filter Transformation and why it is an Active one?

Ans. A Filter transformation is an Active and Connected transformation that can filter rows in a
mapping.

Only the rows that meet the Filter Condition pass through the Filter transformation
to the next transformation in the
pipeline. TRUE and FALSE are the implicit return values from any filter condition
we set. If the filter condition
evaluates to NULL, the row is assumed to be FALSE.

The numeric equivalent of FALSE is zero (0) and any non-zero value is the
equivalent of TRUE.

As an ACTIVE transformation, the Filter transformation may change the number of rows passed
through it. A filter
condition returns TRUE or FALSE for each row that passes through the
transformation, depending on whether a row
meets the specified condition. Only rows that return TRUE pass through this
transformation. Discarded rows do not
appear in the session log or reject files.

Q20. What is the difference between the Source Qualifier transformation's Source Filter option and
the Filter transformation?

Ans.

SQ Source Filter | Filter Transformation
Source Qualifier transformation filters rows when read from a source. | Filter transformation filters rows from within a mapping.
Source Qualifier transformation can only filter rows from relational sources. | Filter transformation filters rows coming from any type of source system at the mapping level.
Source Qualifier limits the row set extracted from a source. | Filter transformation limits the row set sent to a target.
Source Qualifier reduces the number of rows used throughout the mapping and hence provides better performance. | To maximize session performance, include the Filter transformation as close to the sources in the mapping as possible, to filter out unwanted data early in the flow of data from sources to targets.
The filter condition in the Source Qualifier transformation only uses standard SQL, as it runs in the database. | Filter transformation can define a condition using any statement or transformation function that returns either a TRUE or FALSE value.

Revisiting Joiner Transformation
Q21. What is a Joiner Transformation and why it is an Active one?

Ans. A Joiner is an Active and Connected transformation used to join source data
from the same source system or
from two related heterogeneous sources residing in different locations or file
systems.

The Joiner transformation joins sources with at least one matching column. The
Joiner transformation uses a
condition that matches one or more pairs of columns between the two sources.

The two input pipelines include a master pipeline and a detail pipeline or a master
and a detail branch. The master
pipeline ends at the Joiner transformation, while the detail pipeline continues to
the target.

In the Joiner transformation, we must configure the transformation properties, namely Join
Condition, Join Type and the Sorted Input option, to improve Integration Service performance.

The join condition contains ports from both input sources that must match for the
Integration Service to join two
rows. Depending on the type of join selected, the Integration Service either adds
the row to the result set or discards
the row.

The Joiner transformation produces result sets based on the join type, condition,
and input data sources. Hence it is
an Active transformation.

Q22. State the limitations where we cannot use Joiner in the mapping pipeline.

Ans. The Joiner transformation accepts input from most transformations. However,
following are the limitations:

. Joiner transformation cannot be used when either of the input pipeline contains
an Update Strategy
transformation.
. Joiner transformation cannot be used if we connect a Sequence Generator
transformation directly before the
Joiner transformation.

Q23. Out of the two input pipelines of a joiner, which one will you set as the
master pipeline?

Ans. During a session run, the Integration Service compares each row of the master
source against the detail source.
The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with
fewer rows as the master
source. The fewer unique rows in the master, the fewer iterations of the join
comparison occur, which speeds the join
process.

When the Integration Service processes an unsorted Joiner transformation, it reads all master rows
before it reads the detail rows. The Integration Service blocks the detail source while it caches rows
from the master source. Once the Integration Service reads and caches all master rows, it unblocks
the detail source and reads the detail rows.

To improve performance for a Sorted Joiner transformation, use the source with
fewer duplicate key values as the
master source.

When the Integration Service processes a sorted Joiner transformation, it blocks data based on the
mapping configuration and it stores fewer rows in the cache, increasing performance.

Blocking logic is possible if master and detail input to the Joiner transformation
originate from different sources.
Otherwise, it does not use blocking logic. Instead, it stores more rows in the
cache.

Q24. What are the different types of Joins available in Joiner Transformation?

Ans. In SQL, a join is a relational operator that combines data from multiple
tables into a single result set. The Joiner
transformation is similar to an SQL join except that data can originate from
different types of sources.

The Joiner transformation supports the following types of joins :

. Normal
. Master Outer
. Detail Outer
. Full Outer
Join Type property of Joiner Transformation

Note: A normal or master outer join performs faster than a full outer or detail
outer join.

Q25. Define the various Join Types of Joiner Transformation.

Ans.

. In a normal join , the Integration Service discards all rows of data from the
master and detail source that do
not match, based on the join condition.
. A master outer join keeps all rows of data from the detail source and the
matching rows from the master
source. It discards the unmatched rows from the master source.
. A detail outer join keeps all rows of data from the master source and the
matching rows from the detail
source. It discards the unmatched rows from the detail source.
. A full outer join keeps all rows of data from both the master and detail sources.

Q26. Describe the impact of number of join conditions and join order in a Joiner
Transformation.

Ans. We can define one or more conditions based on equality between the specified
master and detail sources. Both
ports in a condition must have the same datatype.

If we need to use two ports in the join condition with non-matching datatypes we
must convert the datatypes so that
they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join two sources.

The order of the ports in the join condition can impact the performance of the
Joiner transformation. If we use
multiple ports in the join condition, the Integration Service compares the ports in
the order we specified.

NOTE: Only equality operator is available in joiner join condition.

Q27. How does Joiner transformation treat NULL value matching.

Ans. The Joiner transformation does not match null values.

For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the
Integration Service does not
consider them a match and does not join the two rows.

To join rows with null values, replace null input with default values in the Ports
tab of the joiner, and then join on
the default values.

Note: If a result set includes fields that do not contain data in either of the
sources, the Joiner transformation
populates the empty fields with null values. If we know that a field will return a
NULL and we do not want to insert
NULLs in the target, set a default value on the Ports tab for the corresponding
port.

Q28. Suppose we configure Sorter transformations in the master and detail pipelines
with the following sorted ports
in order: ITEM_NO, ITEM_NAME, PRICE.

When we configure the join condition, what are the guidelines we need to follow to
maintain the sort order?

Ans. If we have sorted both the master and detail pipelines in order of the ports
say ITEM_NO, ITEM_NAME and
PRICE we must ensure that:

. Use ITEM_NO in the First Join Condition.


. If we add a Second Join Condition, we must use ITEM_NAME.
. If we want to use PRICE as a Join Condition apart from ITEM_NO, we must also use
ITEM_NAME in the
Second Join Condition.
. If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort
order and the
Integration Service fails the session.
Mapping using Joiner
Q29. What are the transformations that cannot be placed between the sort origin and the Joiner
transformation so that we do not lose the input sort order?

Ans. The best option is to place the Joiner transformation directly after the sort
origin to maintain sorted data.
However do not place any of the following transformations between the sort origin
and the Joiner transformation:

. Custom
. Unsorted Aggregator
. Normalizer
. Rank
. Union transformation
. XML Parser transformation
. XML Generator transformation
. Mapplet [if it contains any one of the above mentioned transformations]

Q30. Suppose we have the EMP table as our source. In the target we want to view
those employees whose salary is
greater than or equal to the average salary for their departments. Describe your
mapping approach.

Ans. Our Mapping will look like this:

[Mapping image: http://png.dwbiconcepts.com/images/tutorial/info_interview/info_interview10.png]

To start with the mapping we need the following transformations:

After the Source Qualifier of the EMP table, place a Sorter Transformation. Sort based on the
DEPTNO port.
Sorter Ports Tab

Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE
SALARY for each
(GROUP BY) DEPTNO.

When we perform this aggregation, we lose the data for individual employees.

To maintain employee data, we must pass a branch of the pipeline to the Aggregator
Transformation and pass a
branch with the same sorted source data to the Joiner transformation to maintain
the original data.

When we join both branches of the pipeline, we join the aggregated data with the
original data.
Aggregator Ports Tab
Aggregator Properties Tab

So next we need Sorted Joiner Transformation to join the sorted aggregated data
with the original data, based on
DEPTNO. Here we will be taking the aggregated pipeline as the Master and original
dataflow as Detail Pipeline.
Joiner Condition Tab
Joiner Properties Tab

After that we need a Filter Transformation to filter out the employees having
salary less than average salary for their
department.

Filter Condition: SAL>=AVG_SAL


Filter Properties Tab

Lastly we have the Target table instance.
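
For reference, the same requirement expressed directly in SQL (only to clarify what the mapping
computes; EMP with DEPTNO and SAL is assumed to be the familiar sample table referenced above):

SELECT e.*
FROM   emp e
JOIN  (SELECT deptno,
              AVG(sal) AS avg_sal
       FROM   emp
       GROUP BY deptno) d
  ON   e.deptno = d.deptno
WHERE  e.sal >= d.avg_sal;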

Revisiting Sequence Generator Transformation

Q31. What is a Sequence Generator Transformation?

Ans. A Sequence Generator transformation is a Passive and Connected transformation that
generates numeric values. It is used to create unique primary key values, replace missing primary
keys, or cycle through a sequential range of numbers. This transformation by default contains ONLY
two OUTPUT ports, namely CURRVAL and NEXTVAL. We cannot edit or delete these ports, nor can
we add ports to this unique transformation. We can create approximately two billion unique
numeric values, with the widest range being from 1 to 2147483647.

Q32. Define the Properties available in Sequence Generator transformation in brief.

Ans.

Start Value: Start value of the generated sequence that we want the Integration Service to use if
we use the Cycle option. If we select Cycle, the Integration Service cycles back to this value when it
reaches the end value. Default is 0.

Increment By: Difference between two consecutive values from the NEXTVAL port. Default is 1.

End Value: Maximum value generated by the Sequence Generator. After reaching this value the
session will fail if the Sequence Generator is not configured to cycle. Default is 2147483647.

Current Value: Current value of the sequence. Enter the value we want the Integration Service to
use as the first value in the sequence. Default is 1.

Cycle: If selected, when the Integration Service reaches the configured end value for the sequence,
it wraps around and starts the cycle again, beginning with the configured Start Value.

Number of Cached Values: Number of sequential values the Integration Service caches at a time.
Default value for a standard Sequence Generator is 0. Default value for a reusable Sequence
Generator is 1,000.

Reset: Restarts the sequence at the current value each time a session runs. This option is disabled
for reusable Sequence Generator transformations.
Sequence Generator
Q33. Suppose we have a source table populating two target tables. We connect the
NEXTVAL port of the Sequence
Generator to the surrogate keys of both the target tables.

Will the Surrogate keys in both the target tables be same? If not how can we flow
the same sequence values in both of
them.

Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to the
surrogate key columns of the target tables, the sequence numbers will not be the same.

A block of sequence numbers is sent to one target table's surrogate key column. The second target
receives a block of sequence numbers from the Sequence Generator transformation only after the
first target table receives its block of sequence numbers.

Suppose we have 5 rows coming from the source; then the targets will have the sequence values as
TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10) [taking into consideration Start Value 0, Current Value 1 and
Increment By 1].

Now suppose the requirement is like that we need to have the same surrogate keys in
both the targets.

Then the easiest way to handle the situation is to put an Expression Transformation
in between the Sequence
Generator and the Target tables. The SeqGen will pass unique values to the
expression transformation, and then the
rows are routed from the expression transformation to the targets.
Q34. Suppose we have 100 records coming from the source. Now for a target column
population we used a Sequence
generator.

Suppose the Current Value is 0 and End Value of Sequence generator is set to 80.
What will happen?

Ans. End Value is the maximum value the Sequence Generator will generate. After it
reaches the End value the
session fails with the following error message:

TT_11009 Sequence Generator Transformation: Overflow error.

Failing of the session can be handled if the Sequence Generator is configured to Cycle through the
sequence, i.e. whenever the Integration Service reaches the configured end value for the sequence,
it wraps around and starts the cycle again, beginning with the configured Start Value.

Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a
reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable
transformation?

Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the
Number of Cached Values is set to 1000 by default, and the Reset property is disabled.

When we try to set the Number of Cached Values property of a Reusable Sequence
Generator to 0 in the
Transformation Developer we encounter the following error message:

The number of cached values must be greater than zero for reusable sequence
transformation.

Revisiting Aggregator Transformation

Q36. What is an Aggregator Transformation?

Ans. An aggregator is an Active, Connected transformation which performs aggregate calculations
like AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM and VARIANCE.

Q37. How does an Expression Transformation differ from an Aggregator Transformation?

Ans. An Expression Transformation performs calculations on a row-by-row basis. An Aggregator
Transformation performs calculations on groups.

Q38. Does an Informatica Transformation support only Aggregate expressions?

Ans. Apart from aggregate expressions Informatica Aggregator also supports non-
aggregate expressions and
conditional clauses.

Q39. How does Aggregator Transformation handle NULL values?

Ans. By default, the Aggregator transformation treats null values as NULL in aggregate functions.
But we can specify to treat null values in aggregate functions as NULL or zero.

Q40. What is Incremental Aggregation?

Ans. We can enable the session option, Incremental Aggregation for a session that
includes an Aggregator
Transformation. When the Integration Service performs incremental aggregation, it
actually passes changed source
data through the mapping and uses the historical cache data to perform aggregate
calculations incrementally.

For reference check Implementing Informatica Incremental Aggregation

Q41. What are the performance considerations when working with Aggregator
Transformation?

Ans.

. Filter the unnecessary data before aggregating it. Place a Filter transformation
in the mapping before the
Aggregator transformation to reduce unnecessary aggregation.
. Improve performance by connecting only the necessary input/output ports to
subsequent transformations,
thereby reducing the size of the data cache.
. Use Sorted input which reduces the amount of data cached and improves session
performance.

Q42. What differs when we choose Sorted Input for Aggregator Transformation?

Ans. Integration Service creates the index and data caches files in memory to
process the Aggregator transformation.
If the Integration Service requires more space as allocated for the index and data
cache sizes in the transformation
properties, it stores overflow values in cache files i.e. paging to disk. One way
to increase session performance is to
increase the index and data cache sizes in the transformation properties. But when we check Sorted
Input, the Integration Service uses memory to process the Aggregator transformation; it does not
use cache files.

Q43. Under what conditions selecting Sorted Input in aggregator will still not
boost session performance?
Ans.

. Incremental Aggregation, session option is enabled.


. The aggregate expression contains nested aggregate functions.
. Source data is data driven.

Q44. Under what condition selecting Sorted Input in aggregator may fail the
session?

Ans.

. If the input data is not sorted correctly, the session will fail.
. Also if the input data is properly sorted, the session may fail if the sort order
by ports and the group by ports
of the aggregator are not in the same order.

Q45. Suppose we do not group by on any ports of the aggregator what will be the
output.

Ans. If we do not group values, the Integration Service will return only the last
row for the input rows.

Q46. What is the expected value if the column in an aggregator transform is neither
a group by nor an aggregate
expression?

Ans. Integration Service produces one row for each group based on the group by
ports. The columns which are
neither part of the key nor aggregate expression will return the corresponding
value of last record of the group
received. However, if we specify particularly the FIRST function, the Integration
Service then returns the value of the
specified first row of the group. So default is the LAST function.

Q47. Give one example for each of Conditional Aggregation, Non-Aggregate expression
and Nested Aggregation.

Ans.

Use conditional clauses in the aggregate expression to reduce the number of rows
used in the aggregation. The
conditional clause can be any clause that evaluates to TRUE or FALSE.

SUM( SALARY, JOB = 'CLERK' )

Use non-aggregate expressions in group by ports to modify or replace groups.

IIF( PRODUCT = 'Brown Bread', 'Bread', PRODUCT )


The expression can also include one aggregate function within another aggregate
function, such as:

MAX( COUNT( PRODUCT ))

Revisiting Rank Transformation

Q48. What is a Rank Transform?

Ans. Rank is an Active, Connected Informatica transformation used to select a set of top or bottom
values of data.

Q49. How does a Rank Transform differ from Aggregator Transform functions MAX and
MIN?

Ans. Like the Aggregator transformation, the Rank transformation lets us group
information. The Rank Transform
allows us to select a group of top or bottom values, not just one value as in case
of Aggregator MAX, MIN functions.

Q50. What is a RANK port and RANKINDEX?

Ans. The Rank port is an input/output port used to specify the column on which we want to rank
the source values. By default, Informatica creates an output port RANKINDEX for each Rank
transformation. It stores the ranking position for each row in a group.

Q51. How can you get ranks based on different groups?

Ans. Rank transformation lets us group information. We can configure one of its
input/output ports as a group by
port. For each unique value in the group port, the transformation creates a group
of rows falling within the rank
definition (top or bottom, and a particular number in each rank).

Q52. What happens if two rank values match?

Ans. If two rank values match, they receive the same value in the rank index and
the transformation skips the next
value.

Q53. What are the restrictions of Rank Transformation?

Ans.

. We can connect ports from only one transformation to the Rank transformation.
. We can select the top or bottom rank.
. We need to select the Number of records in each rank.
. We can designate only one Rank port in a Rank transformation.

Q54. How does a Rank Cache works?

Ans. During a session, the Integration Service compares an input row with rows in
the data cache. If the input row
out-ranks a cached row, the Integration Service replaces the cached row with the
input row. If we configure the Rank
transformation to rank based on different groups, the Integration Service ranks
incrementally for each group it finds.
The Integration Service creates an index cache to store the group information and a data cache for
the row data.

Q55. How does Rank transformation handle string values?

Ans. Rank transformation can return the strings at the top or the bottom of a
session sort order. When the Integration
Service runs in Unicode mode, it sorts character data in the session using the
selected sort order associated with the
Code Page of IS which may be French, German, etc. When the Integration Service runs
in ASCII mode, it ignores this
setting and uses a binary sort order to sort character data.

Revisiting Sorter Transformation

Q56. What is a Sorter Transformation?

Ans. Sorter Transformation is an Active, Connected Informatica transformation used to sort data in ascending or descending order according to specified sort keys. The Sorter transformation contains only input/output ports.

Q57. Why is Sorter an Active Transformation?

Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.

Q58. How does Sorter handle Case Sensitive sorting?

Ans. The Case Sensitive property determines whether the Integration Service
considers case when sorting data.
When we enable the Case Sensitive property, the Integration Service sorts uppercase
characters higher than
lowercase characters.

Q59. How does Sorter handle NULL values?


Ans. We can configure the way the Sorter transformation treats null values. Enable
the property Null Treated Low if
we want to treat null values as lower than any other value when it performs the
sort operation. Disable this option if
we want the Integration Service to treat null values as higher than any other
value.

Q60. How does the Sorter Cache work?

Ans. The Integration Service passes all incoming data into the Sorter Cache before
Sorter transformation performs the
sort operation.

The Integration Service uses the Sorter Cache Size property to determine the
maximum amount of memory it can
allocate to perform the sort operation. If it cannot allocate enough memory, the
Integration Service fails the session.
For best performance, configure Sorter cache size with a value less than or equal
to the amount of available physical
RAM on the Integration Service machine.

If the amount of incoming data is greater than the amount of Sorter cache size, the
Integration Service temporarily
stores data in the Sorter transformation work directory. The Integration Service
requires disk space of at least twice
the amount of incoming data when storing data in the work directory.

Revisiting Union Transformation

Q61. What is a Union Transformation?

Ans. The Union transformation is an Active, Connected, non-blocking, multiple input group transformation used to merge data from multiple pipelines or sources into one pipeline branch. Similar to the UNION ALL SQL statement, the Union transformation does not remove duplicate rows.

Q62. What are the restrictions of Union Transformation?

Ans.

. All input groups and the output group must have matching ports. The precision,
datatype, and scale must be
identical across all groups.
. We can create multiple input groups, but only one default output group.
. The Union transformation does not remove duplicate rows.
. We cannot use a Sequence Generator or Update Strategy transformation upstream
from a Union
transformation.
. The Union transformation does not generate transactions.

General questions

Q63. What is the difference between Static and Dynamic Lookup Cache?

Ans. We can configure a Lookup transformation to cache the corresponding lookup table. In case of a static or read-only lookup cache, the Integration Service caches the lookup table at the beginning of the session and does not update the lookup cache while it processes the Lookup transformation.

In case of a dynamic lookup cache, the Integration Service dynamically inserts or updates data in the lookup cache and passes the data to the target. The dynamic cache is synchronized with the target.

Q64. What is Persistent Lookup Cache?

Ans. Lookups are cached by default in Informatica. Lookup cache can be either non-
persistent or persistent. The
Integration Service saves or deletes lookup cache files after a successful session
run based on whether the Lookup
cache is checked as persistent or not.

Q65. What is the difference between Reusable transformation and Mapplet?

Ans. Any Informatica transformation created in the Transformation Developer, or a non-reusable transformation promoted to reusable from the Mapping Designer, which can be used in multiple mappings, is known as a Reusable Transformation. When we add a reusable transformation to a mapping, we actually add an instance of the transformation. Since the instance of a reusable transformation is a pointer to that transformation, when we change the transformation in the Transformation Developer, its instances reflect these changes.

A Mapplet is a reusable object created in the Mapplet Designer which contains a set
of transformations and lets us
reuse the transformation logic in multiple mappings. A Mapplet can contain as many
transformations as we need.
Like a reusable transformation when we use a mapplet in a mapping, we use an
instance of the mapplet and any
change made to the mapplet is inherited by all instances of the mapplet.

Q66. What are the transformations that are not supported in Mapplet?

Ans. Normalizer, Cobol sources, XML sources, XML Source Qualifier transformations,
Target definitions, Pre- and
post- session Stored Procedures, Other Mapplets.

Q67. What are the ERROR tables present in Informatica?


Ans.

. PMERR_DATA- Stores data and metadata about a transformation row error and its
corresponding source
row.
. PMERR_MSG- Stores metadata about an error and the error message.
. PMERR_SESS- Stores metadata about the session.
. PMERR_TRANS- Stores metadata about the source and transformation ports, such as
name and datatype,
when a transformation error occurs.

Q68. What is the difference between STOP and ABORT?

Ans. When we issue the STOP command on the executing session task, the Integration
Service stops reading data
from source. It continues processing, writing and committing the data to targets.
If the Integration Service cannot
finish processing and committing data, we can issue the abort command.

In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service cannot finish processing and committing data within the timeout period, it kills the DTM process and terminates the session.

Q69. Can we copy a session to new folder or new repository?

Ans. Yes, we can copy a session to a new folder or repository provided the corresponding mapping is already there.

Q70. What type of join does Lookup support?

Ans. A Lookup behaves similar to a SQL LEFT OUTER JOIN.
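To see why, the behaviour of a connected Lookup can be pictured with the SQL sketch below. This is only an illustration under assumed names; SRC and LKP_CUSTOMER are hypothetical tables, not anything defined by Informatica itself.

-- Every source row is returned even when no lookup match exists;
-- in that case the looked-up columns come back as NULL, which is
-- exactly how an unmatched Lookup behaves.
SELECT s.cust_id,
       s.sale_amount,
       l.cust_name      -- NULL when the lookup finds no matching row
FROM   src s
LEFT OUTER JOIN lkp_customer l
       ON l.cust_id = s.cust_id;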


What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This
table will only contain keys from different
dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a
single teacher may have many
students. To model this situation in dimensional model, one might introduce a fact-
less-fact table joining teacher and
student keys. Such a fact table will then be able to answer queries like,

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers? etc.
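A minimal SQL sketch of such a fact-less fact is given below. The table and column names (FACT_STUDENT_TEACHER, DIM_TEACHER, etc.) are illustrative assumptions only.

-- Fact-less fact: only dimension keys, no measures
CREATE TABLE fact_student_teacher (
    student_key  INTEGER NOT NULL,   -- key from the student dimension
    teacher_key  INTEGER NOT NULL,   -- key from the teacher dimension
    PRIMARY KEY (student_key, teacher_key)
);

-- "Which teacher teaches the maximum number of students?"
SELECT t.teacher_name,
       COUNT(*) AS student_count
FROM   fact_student_teacher f
JOIN   dim_teacher t ON t.teacher_key = f.teacher_key
GROUP  BY t.teacher_name
ORDER  BY student_count DESC;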

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' (positive) queries but cannot answer a negative query. Again
consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot
answer queries like the ones below:

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because fact-less fact table only stores the positive scenarios (like
student being taught by a tutor) but if
there is a student who is not being taught by a teacher, then that student's key
does not appear in this table, thereby
reducing the coverage of the table.

Coverage fact table attempts to answer this - often by adding an extra flag column.
Flag = 0 indicates a negative
condition and flag = 1 indicates a positive condition. To understand this better,
let's consider a class where there are
100 students and 5 teachers. So coverage fact table will ideally store 100 X 5 =
500 records (all combinations) and if a
certain teacher is not teaching a certain student, the corresponding flag for that
record will be 0.
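Assuming a hypothetical coverage fact table FACT_COVERAGE with a TEACHES_FLAG column holding 1 or 0, the negative query can then be answered as sketched below.

-- "Which teacher did not teach any student?"
-- teaches_flag = 1 : the teacher actually taught the student
-- teaches_flag = 0 : the combination exists but no teaching happened
SELECT t.teacher_name
FROM   fact_coverage f
JOIN   dim_teacher t ON t.teacher_key = f.teacher_key
GROUP  BY t.teacher_name
HAVING MAX(f.teaches_flag) = 0;   -- the flag is 0 for every student of this teacher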

What are incident and snapshot facts

A fact table stores some kind of measurements. Usually these measurements are
stored (or captured) against a
specific time and these measurements vary with respect to time. Now it might so happen that the business is not able to capture all of its measures always, for every point in time. Then those unavailable measurements can be kept
unavailable measurements can be kept
empty (Null) or can be filled up with the last available measurements. The first
case is the example of incident fact
and the second one is the example of snapshot fact.

What is aggregation and what is the benefit of aggregation?


A data warehouse usually captures data with the same degree of detail as is available in the source. The "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions regarding the products they sell, and those data are captured in a data warehouse.

Each shop manager can access the data warehouse and they can see which products are
sold by whom and in what
quantity on any given date. Thus the data warehouse helps the shop managers with
the detail level data that can be
used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which particular sales girl in London sold the highest number of chopsticks, or which shop is the best seller of 'brown bread'. All he is interested in is, perhaps, checking the percentage increase of his revenue margin across Europe, or maybe year-on-year sales growth in eastern Europe. Such data is aggregated in nature, because sales of goods in East Europe is derived by summing up the individual sales data from each shop in East Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
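Such aggregates are typically pre-computed from the detail-level fact with a simple GROUP BY, as in the hedged sketch below. The table names (FACT_SALES_DETAIL, AGG_SALES_REGION_YEAR, DIM_SHOP, DIM_DATE) are assumptions for illustration only.

-- Build a region/year aggregate from the shop/product/day detail fact
INSERT INTO agg_sales_region_year (region, sales_year, total_revenue)
SELECT d.region,
       t.calendar_year,
       SUM(f.revenue) AS total_revenue
FROM   fact_sales_detail f
JOIN   dim_shop d ON d.shop_key = f.shop_key
JOIN   dim_date t ON t.date_key = f.date_key
GROUP  BY d.region, t.calendar_year;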

What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value (e.g. Brown Bread)
and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.

Slicing and dicing operations are part of pivoting.
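In SQL terms, a slice is usually a WHERE restriction on one dimension value, while a dice re-groups the same data by different dimensions or levels. The sketch below is illustrative only; the fact and dimension names are assumed.

-- Slice: sales for a single product value ('Brown Bread')
SELECT SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
WHERE  p.product_name = 'Brown Bread';

-- Dice: view the same slice by shop and by month (different dimensions / level of aggregation)
SELECT s.shop_name, t.calendar_month, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
JOIN   dim_shop    s ON s.shop_key    = f.shop_key
JOIN   dim_date    t ON t.date_key    = f.date_key
WHERE  p.product_name = 'Brown Bread'
GROUP  BY s.shop_name, t.calendar_month;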

What is drill-through?

Drill through is the process of going to the detail level data from summary data.

Consider the above example on retail shops. If the CEO finds out that sales in East Europe have declined this year compared to last year, he might then want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him to pinpoint the root cause of the declined sales. And the method he has followed to obtain the details from the aggregated data is called drill through.

This article attempts to refresh your Unix skills in the form of a question/answer based Unix tutorial on Unix command lines. The commands discussed here are particularly useful for developers working in middle-tier (e.g. ETL) systems, where they may need to interact with several *nix source systems for data retrieval.

How to print/display the first line of a file?


There are many ways to do this. However the easiest way to display the first line
of a file is using the [head]
command.

$> head -1 file.txt

No prize in guessing that if you specify [head -2] then it would print first 2
records of the file.

Another way can be by using [sed] command. [Sed] is a very powerful text editor
which can be used for various text
manipulation purposes like this.

$> sed '2,$ d' file.txt

How does the above command work? The 'd' parameter basically tells [sed] to delete
all the records from display
from line 2 to last line of the file (last line is represented by $ symbol). Of
course it does not actually delete those lines
from the file, it just does not display those lines in standard output screen. So
you only see the remaining line which
is the 1st line.

How to print/display the last line of a file?

The easiest way is to use the [tail] command.

$> tail -1 file.txt

If you want to do it using [sed] command, here is what you should write:

$> sed -n '$ p' file.txt

From our previous answer, we already know that '$' stands for the last line of the
file. So '$ p' basically prints (p for
print) the last line in standard output screen. '-n' switch takes [sed] to silent
mode so that [sed] does not print
anything else in the output.

How to display n-th line of a file?

The easiest way to do it will be by using [sed] I guess. Based on what we already
know about [sed] from our previous
examples, we can quickly deduce this command:

$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be

$> sed -n '4 p' file.txt

Of course you can do it by using [head] and [tail] command as well like below:

$> head -<n> file.txt | tail -1

You need to replace <n> with the actual line number. So if you want to print the
4th line, the command will be

$> head -4 file.txt | tail -1

How to remove the first line / header from a file?

We already know how [sed] can be used to delete a certain line from the output, by using the 'd' switch. So if we want to delete the first line, the command should be:

$> sed '1 d' file.txt

But the issue with the above command is, it just prints out all the lines except
the first line of the file on the standard
output. It does not really change the file in-place. So if you want to delete the
first line from the file itself, you have
two options.

Either you can redirect the output of the file to some other file and then rename
it back to original file like below:

$> sed '1 d' file.txt > new_file.txt

$> mv new_file.txt file.txt

Or, you can use the inbuilt [sed] switch '-i' which changes the file in-place. See below:

$> sed -i '1 d' file.txt

How to remove the last line/ trailer from a file in Unix script?

Always remember that the [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:

$> sed -i '$ d' file.txt

How to remove certain lines from a file in Unix?

If you want to remove line <m> to line <n> from a given file, you can accomplish
the task in the similar method
shown above. Here is an example:
$> sed -i '5,7 d' file.txt

The above command will delete line 5 to line 7 from the file file.txt

How to remove the last n-th line from a file?

This is a bit tricky. Suppose your file contains 100 lines and you want to remove the
last 5 lines. Now if you know how
many lines are there in the file, then you can simply use the above shown method
and can remove all the lines from
96 to 100 like below:

$> sed -i '96,100 d' file.txt # alternative to command [head -95 file.txt]

But you will not always know the number of lines present in the file (the file may be generated dynamically, etc.). In
that case there are many different ways to solve the problem. There are some ways
which are quite complex and
fancy. But let's first do it in a way that we can understand easily and remember
easily. Here is how it goes:

$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt

As you can see there are two commands. The first one (before the semi-colon) calculates the total number of lines present in the file and stores it in a variable called 'tt'. The second command (after the semi-colon) uses the variable and works in exactly the same way as shown in the previous example.

How to check the length of any line in a file?

We already know how to print one line from a file which is this:

$> sed -n '<n> p' file.txt

Where <n> is to be replaced by the actual line number that you want to print. Now
once you know it, it is easy to
print out the length of this line by using [wc] command with '-c' switch.

$> sed -n '35 p' file.txt | wc -c

The above command will print the length of 35th line in the file.txt.

How to get the nth word of a line in Unix?

Assuming the words in the line are separated by spaces, we can use the [cut] command. [cut] is a very powerful and useful command and it is really easy to use. All you have to do to get the n-th word from the line is issue the following command:

cut -f<n> -d' '


The '-d' switch tells [cut] what the delimiter (or separator) is, which is space ' ' in this case. If the separator
were a comma, we would have written -d',' instead. So, suppose I want to find the 4th word in the string
"A quick brown fox jumped over the lazy cat"; we will do something like this:

$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '

And it will print "fox"

How to reverse a string in unix?

Pretty easy. Use the [rev] command.

$> echo "unix" | rev

xinu

How to get the last word from a line in Unix file?

We will make use of two commands that we learnt above to solve this. The commands
are [rev] and [cut]. Here we
go.

Let's imagine the line is: "C for Cat". We need "Cat". First we reverse the line. We get "taC rof C". Then we cut the
first word; we get "taC". And then we reverse it again.

$>echo "C for Cat" | rev | cut -f1 -d' ' | rev

Cat

How to get the n-th field from a Unix command output?

We know we can do it with [cut]. The below command extracts the first field from the output of the [wc -c] command:

$>wc -c file.txt | cut -d' ' -f1

109

But I want to introduce one more command to do this here, which is the [awk] command. [awk] is a very powerful command for text pattern scanning and processing. Here we will see how we may use [awk] to extract the first field (or first column) from the output of another command. As above, suppose I want to print the first column of the [wc -c] output. Here is how it goes:

$>wc -c file.txt | awk ' ''{print $1}'

109
The basic syntax of [awk] is like this:

awk 'pattern space''{action space}'

The pattern space can be left blank or omitted, like below:

$>wc -c file.txt | awk '{print $1}'

109

In the action space, we have asked [awk] to take the action of printing the first
column ($1). More on [awk] later.

How to replace the n-th line in a file with a new line in Unix?

This can be done in two steps. The first step is to remove the n-th line. And the
second step is to insert a new line in
n-th line position. Here we go.

Step 1: remove the n-th line

$>sed -i'' '10 d' file.txt # d stands for delete

Step 2: insert a new line at n-th line position

$>sed -i'' '10 i This is the new line' file.txt # i stands for insert

How to show the non-printable characters in a file?

Open the file in VI editor. Go to VI command mode by pressing [Escape] and then
[:]. Then type [set list]. This will
show you all the non-printable characters, e.g. Ctrl-M characters (^M) etc., in the
file.

How to zip a file in Linux?

Use inbuilt [zip] command in Linux

How to unzip a file in Linux?

Use inbuilt [unzip] command in Linux.

$> unzip -j file.zip

How to test if a zip file is corrupted in Linux?

Use the '-t' switch with the inbuilt [unzip] command

$> unzip -t file.zip

How to check if a file is zipped in Unix?

In order to know the file type of a particular file use the [file] command like
below:

$> file file.txt

file.txt: ASCII text

If you want to know the technical MIME type of the file, use the '-i' switch.

$>file -i file.txt

file.txt: text/plain; charset=us-ascii

If the file is zipped, following will be the result

$> file -i file.zip

file.zip: application/x-zip

How to connect to Oracle database from within shell script?

You will be using the same [sqlplus] command to connect to the database that you use normally, even outside the shell script. To understand this, let's take an example. In this example, we will connect to the database, fire a query and get the output printed from the Unix shell. Here we go:

$>res=`sqlplus -s username/password@database_name <<EOF

SET HEAD OFF;

select count(*) from dual;

EXIT;

EOF`

$> echo $res

If you connect to the database in this manner, the advantage is that you will be able to pass the values of Unix shell variables to the database. See the below example:

$>res=`sqlplus -s username/password@database_name <<EOF

SET HEAD OFF;

select count(*) from student_table t where t.last_name=$1;


EXIT;

EOF`

$> echo $res

12

How to execute a database stored procedure from Shell script?

$> SqlReturnMsg=`sqlplus -s username/password@database<<EOF

BEGIN

Proc_Your_Procedure( 'your-input-parameters' );

END;

EXIT;

EOF`

$> echo $SqlReturnMsg

How to check the command line arguments in a UNIX command in Shell Script?

In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 prints the command name, $1 prints the first input parameter of the command, $2 the second input parameter of the command, and so on.

How to fail a shell script programmatically?

Just put an [exit] command in the shell script with a return value other than 0. This is because the exit code of a successful Unix program is zero. So, if you write

exit -1

inside your program, then your program will throw an error and exit immediately.

How to list down file/folder lists alphabetically?

Normally the [ls -lt] command lists files/folders sorted by modified time. If you want to list them alphabetically, then you should simply specify: [ls -l]

How to check if the last command was successful in Unix?


To check the status of last executed command in UNIX, you can check the value of an
inbuilt bash variable [$?]. See
the below example:

$> echo $?

How to check if a file is present in a particular directory in Unix?

Using commands, we can do it in many ways. Based on what we have learnt so far, we can make use of the [ls] command and [$?] to do this. See below:

$> ls -l file.txt; echo $?

If the file exists, the [ls] command will be successful. Hence [echo $?] will print
0. If the file does not exist, then [ls]
command will fail and hence [echo $?] will print 1.

How to check all the running processes in Unix?

The standard command to see this is [ps]. But [ps] only shows you the snapshot of
the processes at that instance. If
you need to monitor the processes for a certain period of time and need to refresh
the results in each interval,
consider using the [top] command.

$> ps -ef

If you wish to see the % of memory usage and CPU usage, then consider the below
switches

$> ps aux

If you wish to use this command inside some shell script, or if you want to customize the output of the [ps] command, you may use the '-o' switch as below. By using the '-o' switch, you can specify the columns that you want [ps] to print out.

$>ps -e -o stime,user,pid,args,%mem,%cpu

How to tell if my process is running in Unix?

You can list down all the running processes using the [ps] command. Then you can 'grep' your user name or process name to see if the process is running. See below:

$>ps -e -o stime,user,pid,args,%mem,%cpu | grep "opera"

14:53 opera 29904 sleep 60 0.0 0.0

14:54 opera 31536 ps -e -o stime,user,pid,arg 0.0 0.0

14:54 opera 31538 grep opera 0.0 0.0


How to get the CPU and Memory details in Linux server?

In Linux based systems, you can easily access the CPU and memory details from
the /proc/cpuinfo and
/proc/meminfo, like this:

$>cat /proc/meminfo

$>cat /proc/cpuinfo

Just try the above commands in your system to see how it works

What is a database? A question for both pro and newbie


Remember Codd's rules? Or the ACID properties of a database? Maybe you still hold these basic properties close to your heart, or maybe you no longer remember them. Let's revisit these ideas once again.

A database is a collection of data organized for one or more uses. Databases are usually integrated and offer both data storage and retrieval.

Codd's Rule

Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by
Edgar F. Codd, a pioneer of the
relational model for databases.

Rule 0: The system must qualify as relational, as a database, and as a management system.

For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in the database is to be represented in one and only one way,
namely by values in column positions
within rows of tables.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must
support a representation of "missing
information and inapplicable information" that is systematic, distinct from all
regular values (for example, "distinct
from zero or any other number", in the case of numeric values), and independent of
data type. It is also implied that
such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to
authorized users by means of their
regular query language. That is, users must be able to access the database's
structure (catalog) using the same query
language that they use to access the database's data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

. Has a linear syntax
. Can be used both interactively and within application programs,
. Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insert, update, and delete operators. This
means that data can be retrieved
from a relational database in sets constructed of data from multiple rows and/or
multiple tables. This rule states that
insert, update, and delete operations should be supported for any retrievable set
rather than just for a single row in a
single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked
lists etc.) must not require a change
to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a
change to an application based on
the structure. Logical data independence is more difficult to achieve than physical
data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

. when a distributed version of the DBMS is first introduced; and
. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:


If the system provides a low-level (record-at-a-time) interface, then that
interface cannot be used to subvert the
system, for example, bypassing a relational security or integrity constraint.

Database ACID Property

ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee that database transactions are processed reliably.

Atomicity: Atomicity requires that database modifications must follow an all-or-nothing rule. A transaction is said to be atomic if, when one part of the transaction fails, the entire transaction fails and the database state is left unchanged.

Consistency: The consistency property ensures that the database remains in a consistent state; more precisely, it says that any transaction will take the database from one consistent state to another consistent state. The consistency rule
that any transaction will take the database from one consistent state to another
consistent state. The consistency rule
applies only to integrity rules that are within its scope. Thus, if a DBMS allows
fields of a record to act as references
to another record, then consistency implies the DBMS must enforce referential
integrity: by the time any transaction
ends, each and every reference in the database must be valid.

Isolation: Isolation refers to the requirement that other operations cannot access
or see data that has been modified
during a transaction that has not yet completed. Each transaction must remain
unaware of other concurrently
executing transactions, except that one transaction may be forced to wait for the
completion of another transaction
that has modified data that the waiting transaction requires.

Durability: Durability is the DBMS's guarantee that once the user has been notified
of a transaction's success, the
transaction will not be lost. The transaction's data changes will survive system
failure, and that all integrity
constraints have been satisfied, so the DBMS won't need to reverse the transaction.
Many DBMSs implement
durability by writing transactions into a transaction log that can be reprocessed
to recreate the system state right
before any later failure.
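The classic illustration of atomicity and durability is a funds transfer: either both updates take effect and become permanent at COMMIT, or neither does. The sketch below is generic SQL under assumed names; the ACCOUNTS table is purely hypothetical.

-- Hypothetical funds transfer: both updates commit together or not at all
BEGIN;   -- some databases (e.g. Oracle) start the transaction implicitly instead

UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';

-- If anything failed above, ROLLBACK would undo both updates (atomicity).
-- Once COMMIT returns, the change survives a crash (durability).
COMMIT;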

Why people Hate Project Managers - A must read for would-be managers

"Project Managers" are inevitable. Love them or hate them, but if you are in a
project, you have to accept them. They
are Omnipresent in any project. They intervene too much on technical things without
much knowledge. They create
unrealistic targets and nonsensical methods of achieving them. And they invariably
fail to acknowledge the
individual hard work. Are they of any use?

In a recent online survey by amplicate.com, 51% of the participants expressed hate for project managers and project management. Look around yourself in your office; the scenario is probably the same. So what are the reasons that make people hate their project managers? DWBIConcepts delved deeper into this question and found out the top 5 reasons why project managers are hated.
reasons about why project managers are hated.

Remember, not all project managers are hated! So, of course, the following reasons don't apply to them.

1. Project managers are lazy

Generally project managers are not answerable to their subordinates. They are self-paced and semi-autocratic. These allowances provide them the opportunity to spend time lazily. Many project managers spend more time surfing the internet than evaluating the performance of their subordinates.

The cure for their laziness is pro-activeness which can help them spend quality
time in office.

2. Project Managers snatch other people's credit

I know of a project manager, 'Harry' (name changed), who used to receive work from the client and assign it to his subordinate 'John'. Once John finished the work and sent Harry an email, Harry used to copy the contents of John's mail and reply back to the client. Since Harry never 'forwarded' John's mail directly to the client, the client was always oblivious to the actual person (John) doing their work. The client always used to send appreciation mails to Harry only, and John was never credited for the work he did.

The advice for the would-be project managers here is to remain conscious about the
individual contributions and
give them their due credit whenever possible.

3. Project managers are reluctant to listen to new ideas

There is no one-size-fits-all solution when it comes to project management. Just because a specific idea worked in your earlier project doesn't mean it will work in your next project also. Everybody is good at something or other. Everybody has some idea. Not all of them are good, but some of them are. So be flexible and open to new ideas. Listen carefully to what others have to say, and if you have to discard their ideas, give proper reasons.

4. Project Managers fail to do realistic planning

Proper planning makes things easy. What do you think is the main difference between a NASA space project and a service-industry IT project? The project members in that NASA project are the same kind of engineers that you have in your project; maybe many of them graduated from the same school. Yet the same set of people who made one project a marvellous success fail miserably in some other project. There is nothing wrong with those people, but there is something wrong with the leader leading that set of people. A NASA project succeeds because of meticulous and realistic planning, whereas the other project slogs.

Create a detailed plan and follow it closely.

5. Project Managers don't know the technology well

Don't let new tools and technologies outsmart you. The technology space is ever-changing; try to keep pace with it.

Install the software and tools that are being used in your project on your laptop. Play with them. Know what their features are and what their limitations are. Read blogs on them. Start your own blog and write something interesting in it on a regular basis. Be savvy. Otherwise you will be fooled by your own people.

A road-map on Testing in Data Warehouse

Testing in data warehouse projects is, till date, a less explored area. However, if not done properly, this can be a major reason for data warehousing project failures, especially in the user acceptance phase. Given here is a mind-map that will help a project manager to think through all the aspects of testing in data warehousing.

[Mind-map: DWBI Testing]

Points to consider for DWBI Testing

1. Why is it important?
. To bug-free the code
. To ensure data quality
. To increase credibility of BI Reports
. More BI projects fail after commissioning due to quality issues
2. What constitutes DWBI Testing?
. Performance Testing
. Functional Testing
. Canned Report Testing
. Ad-hoc testing
. Load Reconciliation

3. What can be done to ease it?
. Plan for testing
. Start building DWBI Test competency
. Design code that generates debug information
. Build reconciliation mechanism

4. Why is it difficult?
. Limited Testing Tool
. Automated Testing not always possible
. Data traceability not always available
. Requires extensive functional knowledge
. Metadata management tool often fails
. Deals with bulk data - has performance impact
. Number of data conditions are huge

Use the above mind-map to plan and prepare the testing activity for your data
warehousing project.

Enterprise Data Warehouse Data Reconciliation Methodology

An enterprise data warehouse often fetches records from several disparate systems and stores them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of the data will not degrade in the process of centralization?

Data Reconciliation

Many data warehouses are built on an n-tier architecture with multiple data extraction and data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes as it passes from one tier to the next tier. Data reconciliation is the method of reconciling or tying up the data between any two consecutive tiers (layers).

Why is Reconciliation required?

In the process of extracting data from one source and then transforming the data
and loading it to the next layer, the
whole nature of the data can change considerably. It might also happen that some
information is lost while
transforming the data. A reconciliation process helps to identify such loss of
information.

One of the major reasons of information loss is loading failures or errors during
loading. Such errors can occur due to
several reasons e.g.

. Inconsistent or non-coherent data from source
. Non-integrating data among different sources
. Unclean / non-profiled data
. Un-handled exceptions
. Constraint violations
. Logical issues / inherent flaws in the program
. Technical failures like loss of connectivity, loss over network, space issues etc.

Failure due to any such issue can result in potential information loss, leading to unreliable data quality for business process decision making.

Furthermore, if such issues are not rectified at the earliest, they become even more costly to 'patch' later. Therefore it is highly suggested that a proper data reconciliation process must be in place in any data Extraction-Transformation-Load (ETL) process.

Scope of Data Reconciliation

Data reconciliation is often confused with the process of data quality testing. Even worse, sometimes the data reconciliation process is used to investigate and pinpoint data issues.

While data reconciliation may be a part of data quality assurance, these two things are not necessarily the same.

The scope of data reconciliation should be limited to identifying whether or not there is any issue in the data. The scope should not be extended to automating the process of data investigation and pinpointing the issues.

A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why the data is not correct. The reconciliation process answers the 'what' part of the question, not the 'why' part of the question.

Methods of Data Reconciliation

Master Data Reconciliation

Master data reconciliation is the method of reconciling only the master data
between source and target. Master data
are generally unchanging or slowly changing in nature and no aggregation operation
is done on the dataset. That is -
the granularity of the data remains same in both source and target. That is why
master data reconciliation is often
relatively easy and quicker to implement.

In a business process, 'customer', 'product', 'employee' etc. are some good examples of master data. Ensuring that the total number of customers in the source systems matches exactly with the total number of customers in the target system is an example of customer master data reconciliation.

Some of the common examples of master data reconciliation can be the following
measures,

1. Total count of rows, for example:
. Total customers in source and target
. Total number of products in source and target etc.

2. Total count of rows based on a condition, for example:
. Total number of active customers
. Total number of inactive customers etc.
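A master data reconciliation check of the first kind can be as simple as comparing row counts between the two layers, as in the hedged sketch below. SRC_CUSTOMER and DWH_DIM_CUSTOMER are hypothetical table names.

-- Compare total customer counts between the source layer and the warehouse layer
SELECT 'CUSTOMER'                              AS entity,
       (SELECT COUNT(*) FROM src_customer)     AS source_count,
       (SELECT COUNT(*) FROM dwh_dim_customer) AS target_count,
       (SELECT COUNT(*) FROM src_customer) -
       (SELECT COUNT(*) FROM dwh_dim_customer) AS difference
FROM   dual;   -- FROM dual assumes an Oracle-style database; drop it elsewhere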

Transactional Data Reconciliation

Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data. Transactional data form the very base of BI reports, so any mismatch in transactional data can directly impact the reliability of the report and the whole BI system in general. That is why a reconciliation mechanism must be in place in order to detect such a discrepancy beforehand (meaning, before the data reaches the final business users).

Transactional data reconciliation is always done in terms of total sums. This prevents any mismatch otherwise caused by the varying granularity of qualifying dimensions. Also, this total sum can be done on either the full data set or only on the incremental data set.

Some example measures used for transactional data reconciliation can be:

1. Sum of total revenue calculated from source and target
2. Sum of total products sold calculated from source and target etc.
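In the same spirit, a transactional reconciliation measure compares total sums rather than counts. The sketch below is assumption-based; SRC_SALES and DWH_FACT_SALES are hypothetical names.

-- Reconcile total revenue between the source extract and the loaded fact table
SELECT (SELECT SUM(revenue) FROM src_sales)      AS source_revenue,
       (SELECT SUM(revenue) FROM dwh_fact_sales) AS target_revenue,
       (SELECT SUM(revenue) FROM src_sales) -
       (SELECT SUM(revenue) FROM dwh_fact_sales) AS difference
FROM   dual;   -- again assuming an Oracle-style DUAL table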
Automated Data Reconciliation

For large warehouse systems, it is often convenient to automate the data reconciliation process by making it an integral part of data loading. This can be done by maintaining separate loading metadata tables and populating those tables with reconciliation queries. The existing reporting architecture of the warehouse can then be used to generate and publish reconciliation reports at the end of the loading. Such automated reconciliation will keep all the stakeholders informed about the trustworthiness of the reports.

Top 10 things you must know before designing a data warehouse

This paper outlines some of the most important (and equally neglected) things that one must consider before and during the design phase of a data warehouse. In our experience, data warehouse designers often miss out on these items merely because they thought them too trivial to deserve attention. At the end of the day such neglect costs them heavily, as it cuts short the overall ROI of the data warehouse.

Here we outline some data warehouse gotchas that you should be aware of.

1. ETL solution takes more time to design than analytical solutions

In a top-down design approach people often start by visualizing the end data, and so they grasp the complexity associated with data analytics first. As they see more details of it, they tend to devote more time to the design of the analytical or reporting solutions and less time to the design of the background ETL work that deals with data extraction / cleaning / transformation etc. They often live under the assumption that it will be comparatively easy to map the source data from the existing systems, since users already have a better understanding of the source systems; moreover, the need for and complexity of cleansing / profiling the source data will be less, since the data is already coming from standard source systems.

Needless to say, these assumptions often turn void when it comes to actually coding the ETL layer to feed the data warehouse. Almost always, mapping, cleaning and preparing data turns out to be significantly more time consuming compared to the design of the Reporting / Analytics layer.

From budgeting and costing standpoints also, an architect prefers to make the case for data reporting and analytics over background ETL, as the former can be more easily presented to senior management than the latter in order to get them to sanction the budget. This leads to a disproportionate budget between background ETL and front-end reporting tasks.

2. Data Warehouse scope will increase along the development

Users often do not know what they want from the data until they start to see the data. As and when development progresses and more and more data visualization becomes possible, users start wishing for even more out of their data. This phenomenon is unavoidable, and designers must allocate extra time to accommodate such ad-hoc requirements.

Many requirements that were implicit in the beginning become explicit and indispensable in the later phase of the project. Since you cannot avoid it, make sure that you have adequate time allocated in your project plan beforehand.

3. Issues will be discovered in the source system that went undetected till date

The power of an integrated data warehouse becomes apparent when you start discovering discrepancies and issues in the existing, supposedly stable, source systems. The real problem, however, is that designers often make the wrong assumption that the source or upstream systems are fault free, and that is why they do not allocate any time or resource in their project plan to deal with those issues.

Data warehouse developers do discover issues in the source systems, and those issues take a lot of time to get fixed. More often than not those issues are not even fixed in the source (to minimize the impact on business) and some workaround is suggested to deal with them at the data warehouse level directly (although that is generally not a good idea). Source system issues confuse everybody and require more administrative time (than technical time) to resolve, as DW developers need to identify the issue(s) and make their case to prove to the source systems that the issue(s) do exist. These are huge time wasters and are often not incorporated in the project plan.

4. You will need to validate data not being validated in source systems

Source systems do not always give you correct data. A lot of validations and checks are not done at the source system level (e.g. OLTP systems), and each time a validation check is skipped, it creates the danger of sending unexpected data to the data warehouse level. Therefore, before you can actually process data in the data warehouse, you will need to perform some validation checks at your end to ensure the expected data availability.

This is again unavoidable. If you do not make those checks, you will face issues at your side, including data loading errors, reconciliation failures and even data integrity threats. Hence ensure that proper time and resource allocation are there to work on these items.

5. User training will not be sufficient and users will not put their training to
use

You will face natural resistance from the existing business users, who will show huge inertia against accepting the new system. In order to ease things, adequate user training sessions are generally arranged for the users of the data warehouse. But you will notice that "adequate" training is not "sufficient" for them (mainly because they need to unlearn a lot of things in order to learn the use of the new data warehouse).

Even if you arrange adequate training for the users, you will find that the users do not really put their training to use when it comes to doing things in the new data warehouse. That's often because facts and figures from the new data warehouse challenge their existing convictions, and they are reluctant to accept them wholeheartedly.

User training and acceptance is probably the single most important non-technical challenge that makes or breaks a data warehouse. No matter what amount of effort you put in as a designer to design the data warehouse, if the users are not using it, the data warehouse is as good as a failure. As the old Sanskrit saying goes, 'a tree is known by the name of its fruit'; the success of a data warehouse is measured by the information it produces. If the information is not relevant to the users and they are reluctant to use it, you have lost the purpose. Hence make all possible efforts to connect to the users and train them to use the data warehouse. Mere 'adequate' training is not 'sufficient' here.

6. Users will create conflicting business rules

That is because the users often belong to different departments of the company, and even though each one of them knows the business of her own department pretty well, she may not know the business of the other departments that well. And when you take the data from all these departments and try to combine it into an integrated data warehouse, you will often discover that a business rule suggested by one user is completely opposite to the business rule suggested by another.

Such cases are generally involved and need collaboration between multiple parties to come to a conclusion. It is better to consider such cases early, during the planning phase, to avoid late surprises.

7. Volumetric mis-judgement is more common than you thought

Even a very minutely done volumetric estimate in the starting phase of the project can go awry later. This happens due to several reasons; e.g. a slight change in the standard business metrics may create a huge impact on the volumetric estimates.

For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum. While calculating the volume and size of your data warehouse you have used this measure in several places. Now if the customer base actually increases by 10% instead of 7%, that would mean 30,000 more customers. In a fact table with the granularity customer, product, day, this would mean 30,000 X 10 X 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, then that fact table alone would now require (30,000 X 365 X 10 X 1 KB) / (1024 X 1024) = 100+ GB more disk space.

8. It's IT's responsibility to prove the correctness of your data

When a user looks at one value in your report and says, "I think it's not right", the onus is on you to prove the correctness or validity of that data. Nobody is going to help you prove how right your data warehouse is. For this reason, it is absolutely necessary to build a solid data reconciliation framework for your data warehouse: a reconciliation framework that can trigger an early alarm whenever something does not match between source and target, so that you get enough time to investigate (and if required, fix) the issue.

Such a reconciliation framework, however indispensable, is not easy to create. Not only does it require a huge amount of effort and expertise, it also tends to run on the same production server, at almost the same time as the production load, and eats up a lot of performance. Moreover, such a reconciliation framework is often not a client-side requirement, making it even more difficult for you to allocate time and budget. But not doing it would be a much bigger mistake.

9. Data Warehousing projects incur high maintenance costs

Apart from development and deployment, maintenance also incurs huge costs in data warehousing. Server maintenance, software licensing, regular data purging, database maintenance: all of these incur costs.

It is important to set the expectation at the very beginning of the project about the huge maintenance cost implications.

10. Amount of time needed to refresh your data warehouse is going to be your top
concern

You need to load data into the data warehouse, generally at least daily (although sometimes more frequently than this) and also monthly / quarterly / yearly etc. Loading the latest data into the data warehouse ensures that your reports are all up to date. However, the time required to load data (the refresh time) is going to be more than what you have calculated, and it is going to increase day by day too.

One of the major hindrances to the acceptance of a data warehouse by its users is its performance. I have seen too many cases where reports generated from the data warehouse miss the SLA and severely damage the dependability and credibility of the data warehouse. In fact, I have seen cases where the daily load runs for more than a day and never completes in time to generate the daily report. There have been other famous cases of SLA breach as well.

I cannot stress this enough: performance considerations are hugely important for the success of a data warehouse, and more important than you might think. Do everything necessary to make your data warehouse perform well: reduce overhead, maintain servers, cut off complexities, do regular system performance tests (SPT) and weigh the performance against industry benchmarks, make SPT a part of user acceptance testing (UAT) etc.

Common Mistakes in Data Modelling

A model is an abstraction of some aspect of a problem. A data model is a model that describes how data is represented and accessed, usually for a database. The construction of a data model is one of the most difficult tasks of software engineering and is often pivotal to the success or failure of a project.

There are too many factors that determine the success of a data model in terms of its usability and effectiveness, and not all of them can be discussed here. Moreover, people tend to make different types of mistakes for different types of modelling patterns; some modelling patterns are prone to specific types of issues which might not be prevalent in other types of patterns. Nevertheless, I have tried to compile a list of some widespread mistakes that are commonly found in data modelling patterns.

Avoid Large Data Models

Well, you may be questioning how large is large. The answer: it depends. You must ask yourself if the large size of the model is really justified. The more complex your model is, the more prone it is to contain design errors. As an example, you may want to try to limit your models to no more than 200 tables. To be able to do that, in the early phase of data modelling ask yourself these questions:

. Is the large size really justified?
. Is there any extraneous content that I can remove?
. Can I shift the representation and make the model more concise?
. How much work is it to develop the application for this model, and is that worthwhile?
. Is there any speculative content that I can remove? (Speculative content is content which is not immediately required but is still kept in the model as it 'might be required' in the future.)

If you consciously try to keep things simple, most likely you will also be able to avoid the menace of over-modelling. Over-modelling leads to over-engineering, which leads to over-work without any defined purpose. A person who does modelling just for the sake of modelling often ends up over-modelling.

Watch carefully for the following signs in your data model:

. Lots of entities with no or very few non-key attributes?
. Lots of modelling objects with names which no business user would recognise?
. You yourself have a lot of trouble coming up with the names of the attributes?

All the above are sure signs of over-modelling that only increase your burden (of coding, of loading, of maintaining, of securing, of using).

Lack of Clarity or Purpose

The purpose of the model determines the level of detail that you want to keep in the model. If you are unsure about the purpose, you will definitely end up designing a model that is too detailed or too brief for the purpose.
Clarity is also very important. For example, do you clearly know the data types that you should be using for all the business attributes? Or do you end up using some speculative data types (and lengths)?

Modern data modelling tools come with different concepts for declaring data (e.g. the domain and enumeration concepts in ERWin) that help to bring clarity to the model. So, before you start building, pause for a moment and ask yourself if you really understand the purpose of the model.

Reckless violation of Normal Form

(Applicable for operational data models)

When the tables in the model satisfy higher levels of normal forms, they are less likely to store redundant or contradictory data. But there is no hard and fast rule about maintaining those normal forms. A modeller is allowed to violate these rules for a good purpose (such as to increase performance), and such a relaxation is called denormalization.

But the problem occurs when a modeller violates a normal form deliberately without a clearly defined purpose. Such reckless violation breaks apart the whole design principle behind the data model and often renders the model unusable. So if you are unsure of something, just stick to the rules. Don't get driven by vague purposes.

The above figure shows a general hierarchical relationship between a customer and its related categories. Let's say a customer can fall under the following categories: Consumer, Business, Corporate and Wholesaler. Given this condition, 'ConsumerFlag' is a redundant column on the Customer table.
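Since the figure itself is not reproduced here, the hedged DDL sketch below shows the same idea: with a Category table already linked to Customer, a separate ConsumerFlag column on Customer repeats information that can be derived from the relationship. The table layouts are assumptions for illustration.

CREATE TABLE category (
    category_id   INTEGER PRIMARY KEY,
    category_name VARCHAR(30)      -- 'Consumer', 'Business', 'Corporate', 'Wholesaler'
);

CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    category_id   INTEGER REFERENCES category(category_id),
    consumer_flag CHAR(1)          -- redundant: derivable from category_id
);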

Traps in Dimensional Modelling


When it comes to dimensional modelling, there are some inexcusable mistakes that people tend to make. Here are a few of them:

Snow-flaking between two Type-II slowly changing dimension (SCD) tables

Below is an example of such a modelling.

Theoretically speaking there is no issue with such a model, at least until one
tries to create the ETL programming
(extraction-transformation-loading) code behind these tables.

Consider this: in the above example, suppose something changed in the 'ProductType' table, which created a new row in 'ProductType' (since ProductType is SCD2, any historical change is maintained by adding a new row). This new row will have a new surrogate key. But in the Product table, any existing row is still pointing to the old product type record, hence leading to a data anomaly.
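The anomaly can be seen with a small hypothetical illustration. Assume Product carries PRODUCT_TYPE_SK as a foreign key to the SCD2 ProductType table; when a change inserts a new version row into ProductType, existing Product rows keep pointing to the expired version and quietly drop out of 'current' joins.

-- ProductType (SCD Type 2): the change creates a new surrogate key,
-- but Product still references the old one.
-- Joining Product to only the current ProductType rows then silently loses products:
SELECT p.product_name, t.product_type_desc
FROM   product p
JOIN   producttype t ON t.product_type_sk = p.product_type_sk
WHERE  t.current_flag = 'Y';   -- rows whose type was versioned return nothing here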

Indiscriminate use of Surrogate keys

Surrogate keys are used as unique identifiers to represent entities in the modelled world. Surrogate keys are required when we cannot use a natural key to uniquely identify a record, or when using a surrogate key is deemed more suitable because the natural key is not a good fit for a primary key (the natural key is too long, its data type is not suitable for indexing, etc.).
But surrogate keys also come with some disadvantages. The values of surrogate keys have no relationship with the real-world meaning of the data held in a row. Therefore over-usage of surrogate keys (often in the name of 'standardization') leads to the problem of disassociation and creates unnecessary ETL burden and performance degradation.

Even query optimization becomes difficult when one disassociates the surrogate key from the natural key. The reason: since the surrogate key takes the place of the primary key, the unique index is applied on that column, and any query based on the natural key identifier leads to a full table scan because it cannot take advantage of the unique index on the surrogate key.

Before assigning a surrogate key to a table, ask yourself these questions:

. Am I using a surrogate key only for the sake of maintaining a standard?
. Is there a unique, not-null natural key that I can use as the primary key instead of a new surrogate key?
. Can I use my natural key as the primary key without degrading performance?

If the answer to the above questions is 'YES', don't use the surrogate key.
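As a hedged sketch (the table below is hypothetical), this is what relying on the natural key looks like when that key is short, stable and not null:

-- Natural key used directly as the primary key: no extra surrogate,
-- no extra lookup during ETL, and queries on the identifier use the unique index.
CREATE TABLE CurrencyRate (
    CurrencyCode CHAR(3)       NOT NULL,   -- natural key (e.g. 'USD')
    RateDate     DATE          NOT NULL,
    ExchangeRate DECIMAL(18,6) NOT NULL,
    PRIMARY KEY (CurrencyCode, RateDate)
);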

Data Mining - a simple guide for beginners

This paper introduces the subject of data mining in simple lucid language and moves
on to build more complex
concepts. Start here if you are a beginner.

Data Mining. I have an allergy to this term.

Not because I hate the subject of data mining itself, but because this term is so over-used, misused, exploited and commercialized, and so often conveyed in an inaccurate manner, in inappropriate places, and with intentional vagueness.

So when I decided to write about what data mining is, I was convinced that I first needed to write about what is NOT data mining, in order to build a formal definition of data mining.
What is Data Mining? (And what it is not)

Here is the Wikipedia definition of data mining:

"Data mining ... is the process of discovering new patterns from large data sets"

Now the question is: what does the above definition really mean, and how does it differ from finding information in databases? We often store information in databases (as in data warehouses) and retrieve the information from the database when we need it. Is that data mining? The answer is 'no'. We will soon see why this is so.

Let's start with the big picture first. This all starts with something called "Knowledge Discovery in Databases". Data mining is basically one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is divided into 5 steps:

1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Evaluation

'Selection' is the step where we identify the data, 'pre-processing' is where we cleanse and profile the data, the 'transformation' step is required for data preparation, and then comes data mining. Lastly, we use 'evaluation' to test the result of the data mining.

Notice here the term 'knowledge', as in Knowledge Discovery in Databases (KDD). Why say 'knowledge'? Why not 'information' or 'data'?

This is because there are differences among the terms 'data', 'information' and 'knowledge'. Let's understand this difference through an example.

You run a local departmental store and you log all the details of your customers in
the store
database. You know the names of your customers and what items they buy each day.

For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this information in your store database. This is data. Any time you want to know which visitors buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week in your store, you can again query your database and you'd get the answer - that's also information.

But suppose there are 1000 other customers who also buy candles from you every Sunday (mostly, with some percentage of variation) and all of them are Christian by religion. So, you conclude that Alex, Jessica and Paul must also be Christian.

Now, the religion of Alex, Jessica and Paul was not given to you as data. It could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the 'knowledge' that you discovered. And this discovery was done through a process called 'data mining'.

Now, there is a chance that you are wrong about Alex, Jessica and Paul. But there is a fair chance that you are actually right. That is why it is very important to 'evaluate' the result of the KDD process.

I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for understanding our first question: why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it's still not data mining.

As long as you are not dealing with predictive analysis or discovering 'new' patterns from the existing data, you are not doing data mining.

What are the applications of Data Mining?

When it comes to applying data mining, your imagination is the only barrier (not really - there are technological hindrances as well, as we will see later). But it is true that data mining is applied in almost every field, from genetics to human rights violations. One of the most important applications is in 'machine learning'. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by the application of machine learning, as solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of an application of data mining that enables a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have any driver and it moves around the road all by itself. The way it maneuvers and overcomes obstacles is by analyzing the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.

Fig. Autonomous vehicle designed at Stanford University, using data mining methods to maneuver (video)

There are notable applications of data mining in subjects such as:

. Voice recognition

Think of Siri on the iPhone. How does it understand your commands? Clearly it is not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts better to your voice as you keep using it more and more.

. Classification of DNA sequences

A DNA sequence contains biological information. One of the many approaches to DNA sequencing is sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.

. Natural Language processing

Consider the following conversation between a customer (Mike) and a shopkeeper (Linda).

Mike: You have playing cards?
Linda: We have one blue stack from Jackson's and also one other from Deborah.
Mike: What is the price?
Linda: Jackson's $4 and Deborah's $7.
Mike: Okay, give me the blue one please.

Now consider this. What if 'Linda' was an automated machine? You could probably still have the same kind of conversation, but it would probably be much more unnatural.

Mike: You have playing cards?
Robot: Yes.
Mike: What type of playing cards do you have?
Robot: We have Jackson's and Deborah's playing cards.
Mike: What are the colors of the playing cards?
Robot: Which company's playing card do you want to know the color of?
Mike: What is the color of Jackson's playing cards?
Robot: Blue.
Mike: What are the prices of Jackson's and Deborah's playing cards?
Robot: Jackson's playing cards cost you $4 and Deborah's playing cards cost you $7.
Mike: Ok, then can I buy the blue ones?
Robot: We do not have any product called 'blue ones'.
Mike: Can I have the blue color playing cards please?
Robot: Sure!

I know the above example is a bit of an overshoot, but you get the idea. Machines do not understand natural language, and it is a challenge to make them understand it. Until we do, we won't be able to build a really useful human-computer interface.

Recently, real advancement in natural language processing has come from the application of data mining. Prior implementations of language-processing tasks typically involved the direct hand-coding of large sets of rules. The machine-learning paradigm instead uses general learning algorithms (often, although not always, grounded in statistical inference) to automatically learn such rules through the analysis of large corpora of typical real-world examples.

Methods of data mining

Now, if the above examples interest you, then let's continue learning more about data mining. One of the first tasks is to understand the different approaches used in the field of data mining. The list below shows the most important methods:

Anomaly Detection

This is the method of detecting patterns in a given data set that do not conform to an established normal behavior. It is applied in a number of different fields such as network intrusion detection, stock market fraud detection etc.

Association Rule Learning


This is a method of discovering interesting relations between variables in large databases. Ever seen "Buyers who bought this product, also bought these:" type of messages on e-commerce websites (e.g. on Amazon.com)? That's an example of association rule learning.

Clustering

Clustering is the method of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Cluster analysis is widely used in market research when working with multivariate data. Market researchers often use it to create customer segmentation, product segmentation etc.

Classification

This method is used for the task of generalizing known structure to apply to new
data. For example, an email
program might attempt to classify an email as legitimate or spam.

Regression

This method attempts to find a function which models the data with the least error. The autonomous driving example above uses this method.

Next, we will learn about each of these methods in greater detail, with examples of their applications.

SQL Questions

What is normalization? Explain different levels of normalization?

Check out the article Q100139 from the Microsoft knowledge base; of course, there is much more information available on the net. It is a good idea to get hold of any RDBMS fundamentals text book, especially the one by C. J. Date. Most of the time, it will be okay if you can explain up to third normal form.

What is de-normalization and when would you go for it?

As the name indicates, de-normalization is the reverse process of normalization. It is the controlled introduction of redundancy into the database design. It helps improve query performance as the number of joins can be reduced.

How do you implement one-to-one, one-to-many and many-to-many relationships while designing tables?

One-to-one relationships can be implemented as a single table, and rarely as two tables with primary and foreign key relationships. One-to-many relationships are implemented by splitting the data into two tables with primary key and foreign key relationships. Many-to-many relationships are implemented using a junction table with the keys from both tables forming the composite primary key of the junction table.

It will be a good idea to read up a database designing fundamentals text book.


What's the difference between a primary key and a unique key?

Both primary key and unique key enforce uniqueness of the column on which they are defined. But by default a primary key creates a clustered index on the column, whereas a unique key creates a non-clustered index by default. Another major difference is that a primary key does not allow NULLs, but a unique key allows one NULL only.
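A small illustrative script (the table is hypothetical) showing the default behaviour described above:

-- By default the PRIMARY KEY creates a clustered index and the UNIQUE
-- constraint creates a non-clustered index (both defaults can be overridden).
CREATE TABLE Employee (
    EmployeeId INT         NOT NULL PRIMARY KEY,   -- clustered index, no NULLs
    Email      VARCHAR(50) NULL     UNIQUE         -- non-clustered index, one NULL allowed
);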

What are user defined data types and when should you go for them?

User defined data types let you extend the base SQL Server data types by providing a descriptive name and format to the database. For example, suppose your database has a column called Flight_Num which appears in many tables. In all these tables it should be varchar(8). In this case you could create a user defined data type called Flight_num_type of varchar(8) and use it across all your tables.

See sp_addtype, sp_droptype in books online.
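A hedged example of the Flight_Num scenario above, using the sp_addtype procedure mentioned in the answer (later SQL Server versions prefer CREATE TYPE; the Booking table is hypothetical):

-- Create the user defined data type once...
EXEC sp_addtype 'Flight_num_type', 'varchar(8)', 'NOT NULL';

-- ...and reuse it wherever the flight number appears.
CREATE TABLE Booking (
    BookingId  INT NOT NULL PRIMARY KEY,
    Flight_Num Flight_num_type
);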

What is bit data type and what's the information that can be stored inside a bit
column?

Bit data type is used to store Boolean information like 1 or 0 (true or false).
Until SQL Server 6.5 bit data
type could hold either a 1 or 0 and there was no support for NULL. But from SQL
Server 7.0 onwards, bit
data type can represent a third state, which is NULL.

Define candidate key, alternate key, composite key.

A candidate key is one that can identify each row of a table uniquely. Generally a
candidate key becomes
the primary key of the table. If the table has more than one candidate key, one of
them will become the
primary key, and the rest are called alternate keys.

A key formed by combining at least two or more columns is called composite key.

What are defaults? Is there a column to which a default cannot be bound?


A default is a value that will be used by a column, if no value is supplied to that
column while inserting
data. IDENTITY columns and timestamp columns can't have defaults bound to them. See
CREATE
DEFAULT in books online.

What is a transaction and what are ACID properties?

A transaction is a logical unit of work in which, all the steps must be performed
or none. ACID stands for
Atomicity, Consistency, Isolation, Durability. These are the properties of a
transaction. For more
information and explanation of these properties, see SQL Server books online or
any RDBMS fundamentals text book.

Explain different isolation levels

An isolation level determines the degree of isolation of data between concurrent transactions. The default SQL Server isolation level is Read Committed. Here are the other isolation levels (in ascending order of isolation): Read Uncommitted, Read Committed, Repeatable Read, Serializable. See SQL Server books online for an explanation of the isolation levels. Be sure to read about SET TRANSACTION ISOLATION LEVEL, which lets you customize the isolation level at the connection level.
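For example, the connection-level setting mentioned above can be used as follows (a minimal sketch; Orders is a hypothetical table):

-- Change the isolation level for the current connection, then run a transaction.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;
    SELECT COUNT(*) FROM Orders;
    -- rows read here cannot be changed by other transactions until this one completes
COMMIT TRANSACTION;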

CREATE INDEX myIndex ON myTable (myColumn)

What type of Index will get created after executing the above statement?

Non-clustered index. Important thing to note: by default a clustered index gets created on the primary key, unless specified otherwise.
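To control the index type explicitly you would state it in the statement, for example (illustrative only):

-- Explicitly non-clustered (the same as the default for CREATE INDEX)
CREATE NONCLUSTERED INDEX myIndex ON myTable (myColumn);

-- Explicitly clustered (only one allowed per table)
CREATE CLUSTERED INDEX myClusteredIndex ON myTable (myColumn);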

What is the maximum size of a row?

8060 bytes. Do not be surprised with questions like 'What is the maximum number of
columns per table'.
Check out SQL Server books online for the page titled: "Maximum Capacity
Specifications".

Explain Active/Active and Active/Passive cluster configurations


Hopefully you have experience setting up cluster servers. But if you do not, at
least be familiar with the
way clustering works and the two clustering configurations Active/Active and
Active/Passive. SQL
Server books online has enough information on this topic and there is a good white
paper available on
Microsoft site.

Explain the architecture of SQL Server


This is a very important question and you had better be able to answer it if you consider yourself a DBA. SQL Server books online is the best place to read about SQL Server architecture. Read the chapter dedicated to SQL Server architecture.

What is Lock Escalation?

Lock escalation is the process of converting a lot of low-level locks (like row locks and page locks) into higher-level locks (like table locks). Every lock is a memory structure, so too many locks would mean more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. The lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it is dynamically managed by SQL Server.

What's the difference between DELETE TABLE and TRUNCATE TABLE commands?

DELETE is a fully logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it will not log the deletion of each row; instead it logs the de-allocation of the data pages of the table, which makes it faster. Of course, TRUNCATE TABLE can still be rolled back when it is executed inside a transaction.
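A brief illustration of the two commands (SalesStaging and its LoadDate column are hypothetical):

-- Row-by-row, fully logged, WHERE clause allowed, DELETE triggers fire
DELETE FROM SalesStaging WHERE LoadDate < '2008-01-01';

-- De-allocates data pages, minimally logged, no WHERE clause,
-- and resets any IDENTITY counter on the table
TRUNCATE TABLE SalesStaging;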

Explain the storage models of OLAP

Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more information.

What are the new features introduced in SQL Server 2000 (or the latest release of
SQL Server at the time
of your interview)? What changed between the previous version of SQL Server and the
current version?
This question is generally asked to see how current your knowledge is. Generally there is a section at the beginning of books online titled "What's New", which has all such information. Of course, reading just that is not enough; you should have tried those things out to better answer the questions. Also check out the section titled "Backward Compatibility" in books online, which talks about the changes that have taken place in the new version.

What are constraints? Explain different types of constraints.


Constraints enable the RDBMS to enforce the integrity of the database automatically, without needing you to create triggers, rules or defaults.

Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY

For an explanation of these constraints see books online for the pages titled:
"Constraints" and "CREATE
TABLE", "ALTER TABLE"

What is an index? What are the types of indexes? How many clustered indexes can be
created on a
table? I create a separate index on each column of a table. what are the advantages
and disadvantages
of this approach?

Indexes in SQL Server are similar to the indexes in books. They help SQL Server
retrieve the data quicker.

Indexes are of two types. Clustered indexes and non-clustered indexes. When you
create a clustered
index on a table, all the rows in the table are stored in the order of the
clustered index key. So, there can
be only one clustered index per table. Non-clustered indexes have their own storage
separate from the
table data storage. Non-clustered indexes are stored as B-tree structures (as are clustered indexes), with the leaf-level nodes holding the index key and its row locator. The row locator can be either the RID or the clustered index key, depending upon the absence or presence of a clustered index on the table.

If you create an index on each column of a table, it improves query performance, as the query optimizer can choose from all the existing indexes to come up with an efficient execution plan. At the same time, data modification operations (such as INSERT, UPDATE, DELETE) will become slow, as every time data changes in the table, all the indexes need to be updated. Another disadvantage is that indexes need disk space: the more indexes you have, the more disk space is used.

What is RAID and what are different types of RAID configurations?

RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels, 0 through 5, offering different levels of performance and fault tolerance. MSDN has some information about RAID levels, and for detailed information, check out the RAID advisory board's homepage.

What are the steps you will take to improve performance of a poor performing query?
This is a very open ended question and there could be a lot of reasons behind the
poor performance of a
query. But some general issues that you could talk about would be: No indexes,
table scans, missing or
out of date statistics, blocking, excess recompilations of stored procedures,
procedures and triggers
without SET NOCOUNT ON, poorly written query with unnecessarily complicated joins,
too much
normalization, excess usage of cursors and temporary tables.

Some of the tools/ways that help you troubleshooting performance problems are:

. SET SHOWPLAN_ALL ON,


. SET SHOWPLAN_TEXT ON,
. SET STATISTICS IO ON,
. SQL Server Profiler,
. Windows NT /2000 Performance monitor,
. Graphical execution plan in Query Analyzer.

Download the white paper on performance tuning SQL Server from Microsoft web site.

What are the steps you will take, if you are tasked with securing an SQL Server?

Again this is another open ended question. Here are some things you could talk
about: Preferring NT
authentication, using server, database and application roles to control access to
the data, securing the
physical database files using NTFS permissions, using an unguessable SA password,
restricting physical
access to the SQL Server, renaming the Administrator account on the SQL Server
computer, disabling the
Guest account, enabling auditing, using multi-protocol encryption, setting up SSL,
setting up firewalls,
isolating SQL Server from the web server etc.

Read the white paper on SQL Server security from Microsoft website. Also check out
My SQL Server
security best practices
What is a deadlock and what is a live lock? How will you go about resolving
deadlocks?

Deadlock is a situation when two processes, each having a lock on one piece of
data, attempt to acquire a
lock on the other's piece. Each process would wait indefinitely for the other to
release the lock, unless one
of the user processes is terminated. SQL Server detects deadlocks and terminates
one user's process.

A livelock is one where a request for an exclusive lock is repeatedly denied because a series of overlapping shared locks keeps interfering. SQL Server detects the situation after four denials and refuses further shared locks. A livelock also occurs when read transactions monopolize a table or page, forcing a write transaction to wait indefinitely.

Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books
online. Also
check out the article Q169960 from Microsoft knowledge base.

What is blocking and how would you troubleshoot it?

Blocking happens when one connection from an application holds a lock and a second
connection
requires a conflicting lock type. This forces the second connection to wait,
blocked on the first.

Read up the following topics in SQL Server books online: Understanding and avoiding
blocking, Coding
efficient transactions.

Explain CREATE DATABASE syntax

Many of us are used to creating databases from the Enterprise Manager or by just
issuing the command:

CREATE DATABASE MyDB.

But what if you have to create a database with two file groups, one on drive C and
the other on drive D
with log on drive E with an initial size of 600 MB and with a growth factor of 15%?
That's why being a
DBA you should be familiar with the CREATE DATABASE syntax. Check out SQL Server
books online
for more information.
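As a hedged sketch of the scenario described above (file names, paths and logical names are assumptions):

CREATE DATABASE MyDB
ON PRIMARY
    ( NAME = MyDB_Data1, FILENAME = 'C:\Data\MyDB_Data1.mdf', SIZE = 600MB, FILEGROWTH = 15% ),
FILEGROUP FG2
    ( NAME = MyDB_Data2, FILENAME = 'D:\Data\MyDB_Data2.ndf', SIZE = 600MB, FILEGROWTH = 15% )
LOG ON
    ( NAME = MyDB_Log,   FILENAME = 'E:\Logs\MyDB_Log.ldf',   SIZE = 600MB, FILEGROWTH = 15% );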
How to restart SQL Server in single user mode? How to start SQL Server in minimal
configuration
mode?

SQL Server can be started from the command line using SQLSERVR.EXE. This EXE has some very important parameters with which a DBA should be familiar. -m is used for
starting SQL Server in
single user mode and -f is used to start the SQL Server in minimal configuration
mode. Check out SQL
Server books online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for
database
maintenance?

DBCC CHECKDB,
DBCC CHECKTABLE,
DBCC CHECKCATALOG,
DBCC CHECKALLOC,
DBCC SHOWCONTIG,
DBCC SHRINKDATABASE,
DBCC SHRINKFILE etc.

But there are a whole load of DBCC commands which are very useful for DBAs. Check
out SQL Server
books online for more information.

What are statistics, under what circumstances they go out of date, how do you
update them?

Statistics determine the selectivity of the indexes. If an indexed column has unique values then the selectivity of that index is higher, as opposed to an index with non-unique values. The query optimizer uses these statistics in determining whether or not to choose an index while executing a query.

Some situations under which you should update statistics:

1. If there is significant change in the key values in the index

2. If a large amount of data in an indexed column has been added, changed, or removed (that is, if the distribution of key values has changed), or the table has been truncated using the TRUNCATE TABLE statement and then repopulated
3. Database is upgraded from a previous version

Look up SQL Server books online for the following commands:

UPDATE STATISTICS,
STATS_DATE,
DBCC SHOW_STATISTICS,
CREATE STATISTICS,
DROP STATISTICS,
sp_autostats,
sp_createstats,
sp_updatestats

What are the different ways of moving data/databases between servers and databases
in SQL Server?

There are lots of options available, you have to choose your option depending upon
your requirements.
Some of the options you have are:

BACKUP/RESTORE,
Detaching and attaching databases,
Replication,
DTS,
BCP,
logshipping,
INSERT...SELECT,
SELECT...INTO,
creating INSERT scripts to generate data.

Explain different types of BACKUPs available in SQL Server? Given a particular scenario, how would you go about choosing a backup plan?

Types of backups you can create in SQL Sever 7.0+ are Full database backup,
differential database
backup, transaction log backup, filegroup backup. Check out the BACKUP and RESTORE
commands in
SQL Server books online. Be prepared to write the commands in your interview. Books
online also has
information on detailed backup/restore architecture and when one should go for a
particular kind of
backup.
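Typical command forms you may be asked to write (the database and path names below are hypothetical):

-- Full database backup
BACKUP DATABASE MyDB TO DISK = 'E:\Backup\MyDB_Full.bak';

-- Differential backup (changes since the last full backup)
BACKUP DATABASE MyDB TO DISK = 'E:\Backup\MyDB_Diff.bak' WITH DIFFERENTIAL;

-- Transaction log backup
BACKUP LOG MyDB TO DISK = 'E:\Backup\MyDB_Log.trn';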

What is database replication? What are the different types of replication you can
set up in SQL Server?

Replication is the process of copying/moving data between databases on the same or different servers. SQL Server supports the following types of replication scenarios:

* Snapshot replication
* Transactional replication (with immediate updating subscribers, with queued
updating subscribers)
* Merge replication

See SQL Server books online for in-depth coverage on replication. Be prepared to
explain how different
replication agents function, what are the main system tables used in replication
etc.
How to determine the service pack currently installed on SQL Server?

The global variable @@Version stores the build number of the sqlservr.exe, which is
used to determine the
service pack installed. To know more about this process visit SQL Server service
packs and versions.

What are cursors? Explain different types of cursors. What are the disadvantages of
cursors? How can
you avoid cursors?

Cursors allow row-by-row processing of the resultsets.

Types of cursors:

Static,
Dynamic,
Forward-only,
Keyset-driven.

See books online for more information.

Disadvantages of cursors: Each time you fetch a row from the cursor, it results in
a network roundtrip,
where as a normal SELECT query makes only one round trip, however large the
resultset is. Cursors are
also costly because they require more resources and temporary storage (results in
more IO operations).
Further, there are restrictions on the SELECT statements that can be used with some
types of cursors.

Most of the times, set based operations can be used instead of cursors. Here is an
example:
If you have to give a flat hike to your employees using the following criteria:

Salary between 30000 and 40000 -- 5000 hike


Salary between 40000 and 55000 -- 7000 hike
Salary between 55000 and 65000 -- 9000 hike
In this situation many developers tend to use a cursor, determine each employee's
salary and update his
salary according to the above formula. But the same can be achieved by multiple
update statements or
can be combined in a single UPDATE statement as shown below:

UPDATE tbl_emp SET salary =
CASE WHEN salary BETWEEN 30000 AND 40000 THEN salary + 5000
     WHEN salary BETWEEN 40000 AND 55000 THEN salary + 7000
     WHEN salary BETWEEN 55000 AND 65000 THEN salary + 9000
     ELSE salary
END

Another situation in which developers tend to use cursors: You need to call a
stored procedure when a
column in a particular row meets certain condition. You don't have to use cursors
for this. This can be
achieved using WHILE loop, as long as there is a unique key to identify each row.
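A minimal sketch of that WHILE-loop pattern (the needs_processing column and the usp_process_employee procedure are assumptions for illustration):

DECLARE @empid INT;
SELECT @empid = MIN(empid) FROM tbl_emp WHERE needs_processing = 1;

WHILE @empid IS NOT NULL
BEGIN
    EXEC usp_process_employee @empid;               -- hypothetical stored procedure
    SELECT @empid = MIN(empid) FROM tbl_emp
    WHERE needs_processing = 1 AND empid > @empid;  -- move on to the next qualifying row
END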

Write down the general syntax for a SELECT statements covering all the options.

Here's the basic syntax: (Also checkout SELECT in books online for advanced
syntax).

SELECT select_list
[INTO new_table]
FROM table_source
[WHERE search_condition]
[GROUP BY group_by_expression]
[HAVING search_condition]
[ORDER BY order_expression [ASC | DESC]]

What is a join and explain different types of joins?

Joins are used in queries to explain how different tables are related. Joins also
let you select data from a
table depending upon data from another table.
Types of joins:

INNER JOINs,
OUTER JOINs,
CROSS JOINs

OUTER JOINs are further classified as


LEFT OUTER JOINS,
RIGHT OUTER JOINS and
FULL OUTER JOINS.

For more information see pages from books online titled: "Join Fundamentals" and
"Using Joins".

Can you have a nested transaction?

Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT

What is an extended stored procedure? Can you instantiate a COM object by using T-
SQL?

An extended stored procedure is a function within a DLL (written in a programming language like C or C++ using the Open Data Services (ODS) API) that can be called from T-SQL, just the way we call normal stored procedures using the EXEC statement. See books online to learn how to create extended stored procedures and how to add them to SQL Server.

Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-
SQL by
using sp_OACreate stored procedure.

Also see books online for sp_OAMethod, sp_OAGetProperty, sp_OASetProperty, sp_OADestroy.

What is the system function to get the current user's user id?
USER_ID(). Also check out other system functions like

USER_NAME(),
SYSTEM_USER,
SESSION_USER,
CURRENT_USER,
USER,
SUSER_SID(),
HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a
trigger on demand?

Triggers are special kind of stored procedures that get executed automatically when
an INSERT,
UPDATE or DELETE operation takes place on a table.

In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one
for UPDATE and one
for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could
create multiple
triggers per each action. But in 7.0 there's no way to control the order in which
the triggers fire. In SQL
Server 2000 you could specify which trigger fires first or fires last using
sp_settriggerorder

Triggers cannot be invoked on demand. They get triggered only when an associated
action (INSERT,
UPDATE, DELETE) happens on the table on which they are defined.

Triggers are generally used to implement business rules, auditing. Triggers can
also be used to extend the
referential integrity checks, but wherever possible, use constraints for this
purpose, instead of triggers, as
constraints are much faster.

Till SQL Server 7.0, triggers fire only after the data modification operation
happens. So in a way, they are
called post triggers. But in SQL Server 2000 you could create pre triggers also.
Search SQL Server 2000
books online for INSTEAD OF triggers.

Also check out books online for 'inserted table', 'deleted table' and
COLUMNS_UPDATED()
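A simple hedged example of an AFTER INSERT trigger that uses the inserted pseudo-table (the emp and emp_audit tables and their columns are assumed for illustration):

CREATE TRIGGER trg_emp_audit
ON emp
AFTER INSERT
AS
BEGIN
    -- copy the newly inserted rows into an audit table
    INSERT INTO emp_audit (empid, empname, audit_date)
    SELECT i.empid, i.empname, GETDATE()
    FROM inserted i;
END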

There is a trigger defined for INSERT operations on a table, in an OLTP system. The
trigger is written to
instantiate a COM object and pass the newly inserted rows to it for some custom
processing.

What do you think of this implementation? Can this be implemented better?

Instantiating COM objects is a time consuming process and since you are doing it
from within a trigger, it
slows down the data insertion process. Same is the case with sending emails from
triggers. This scenario
can be better implemented by logging all the necessary data into a separate table,
and have a job which
periodically checks this table and does the needful.

What is a self join? Explain it with an example.


Self join is just like any other join, except that two instances of the same table
will be joined in the query.
Here is an example: Employees table which contains rows for normal employees as
well as managers. So,
to find out the managers of all the employees, you need a self join.

CREATE TABLE emp
(
    empid int,
    mgrid int,
    empname char(10)
)

INSERT emp SELECT 1,2,'Vyas'
INSERT emp SELECT 2,3,'Mohan'
INSERT emp SELECT 3,NULL,'Shobha'
INSERT emp SELECT 4,2,'Shridhar'
INSERT emp SELECT 5,2,'Sourabh'

SELECT t1.empname [Employee], t2.empname [Manager]
FROM emp t1, emp t2
WHERE t1.mgrid = t2.empid

Here is an advanced query using a LEFT OUTER JOIN that even returns the employees
without
managers (super bosses)

SELECT t1.empname [Employee], COALESCE(t2.empname, 'No manager') [Manager]
FROM emp t1
LEFT OUTER JOIN emp t2
ON t1.mgrid = t2.empid
SQL interview questions and answers

1. What are two methods of retrieving SQL?


2. What cursor type do you use to retrieve multiple recordsets?
3. What is the difference between a "where" clause and a "having" clause? - WHERE is a kind of restriction statement: you use the WHERE clause to restrict the rows coming from the database, and it is applied before the results are grouped. The HAVING clause is applied after the data has been retrieved and grouped; it is a kind of filtering command that works on grouped (aggregated) results.
4. What is the basic form of a SQL statement to read data out of a table? The basic form to read data out of a table is "SELECT * FROM table_name;". An answer such as "SELECT * FROM table_name WHERE xyz = 'whatever';" cannot be called the basic form because of the WHERE clause.
5. What structure can you implement for the database to speed up table reads? - Following the rules of DB tuning, we have to: 1] properly use indexes (different types of indexes); 2] properly locate different DB objects across different tablespaces, files and so on; 3] create a special space (tablespace) to locate some of the data with special data types (for example CLOB, LOB and so on).
6. What are the tradeoffs with having indexes? - 1. Faster selects, slower updates.
2. Extra storage space to
store indexes. Updates are slower because in addition to updating the table you
have to update the index.
7. What is a "join"? - �join� used to connect two or more tables logically with or
without common field.
8. What is "normalization"? "Denormalization"? Why do you sometimes want to
denormalize? -
Normalizing data means eliminating redundant information from a table and
organizing the data so that
future changes to the table are easier. Denormalization means allowing redundancy
in a table. The main
benefit of denormalization is improved performance with simplified data retrieval
and manipulation. This is
done by reduction in the number of joins needed for data processing.
9. What is a "constraint"? - A constraint allows you to apply simple referential
integrity checks to a table.
There are four primary types of constraints that are currently supported by SQL
Server:
PRIMARY/UNIQUE - enforces uniqueness of a particular table column. DEFAULT -
specifies a default
value for a column in case an insert operation does not provide one. FOREIGN KEY -
validates that every
value in a column exists in a column of another table. CHECK - checks that every
value stored in a column is
in some specified list. Each type of constraint performs a specific type of action. DEFAULT is not really a constraint. NOT NULL is one more constraint, which does not allow values in the specified column to be null; it is also the only constraint that is not a table-level constraint.
10. What types of index data structures can you have? - An index helps to search values in tables faster. The three most commonly used index types are: - B-Tree: builds a tree of possible values with a list of row IDs that have the leaf value. Needs a lot of space and is the default index type for most databases. - Bitmap: a string of bits for each possible value of the column; each bit string has one bit for each row. Needs only a little space and is very fast (however, the domain of values cannot be large, e.g. SEX(m,f), degree(BS,MS,PHD)). - Hash: a hashing algorithm is used to assign a set of characters to represent a text string such as a composite of keys or partial keys, and compresses the underlying data. Takes longer to build and is supported by relatively few databases.
11. What is a "primary key"? - A PRIMARY INDEX or PRIMARY KEY is something which
comes mainly from
database theory. From its behavior is almost the same as an UNIQUE INDEX, i.e.
there may only be one of
each value in this column. If you call such an INDEX PRIMARY instead of UNIQUE, you
say something
about
your table design, which I am not able to explain in few words. Primary Key is a
type of a constraint
enforcing uniqueness and data integrity for each row of a table. All columns
participating in a primary key
constraint must possess the NOT NULL property.
12. What is a "functional dependency"? How does it relate to database table design?
- Functional dependency
relates to how one object depends upon the other in the database. for example,
procedure/function sp2 may
be called by procedure sp1. Then we say that sp1 has functional dependency on sp2.
13. What is a "trigger"? - Triggers are stored procedures created in order to
enforce integrity rules in a database.
A trigger is executed every time a data-modification operation occurs (i.e.,
insert, update or delete). Triggers
are executed automatically on the occurrence of one of the data-modification operations. A trigger is a database object directly associated with a particular table. It fires whenever a specific statement/type of statement is issued against that table; the types of statements are insert, update and delete statements. Basically, a trigger is a set of SQL statements. A trigger is a solution to the restrictions of a constraint. For instance: 1. A database constraint cannot carry pseudo-columns as criteria, whereas a trigger can. 2. A database constraint cannot refer to old and new values for a row, whereas a trigger can.
14. Why can a "group by" or "order by" clause be expensive to process? - Processing
of "group by" or "order
by" clause often requires creation of Temporary tables to process the results of
the query. Which depending
of the result set can be very expensive.
15. What is "index covering" of a query? - Index covering means that "Data can be
found only using indexes,
without touching the tables"
16. What types of join algorithms can you have? - The common join algorithms are nested loop joins, merge (sort-merge) joins and hash joins.
17. What is a SQL view? - The output of a query can be stored as a view. A view acts like a small table which meets our criterion. A view is a precompiled SQL query which is used to select data from one or more tables. A view is like a table but it doesn't physically take any space. A view is a good way to present data in a particular format if you use that query quite often. Views can also be used to restrict users from accessing the tables directly. A minimal example follows below.
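A minimal example of a view (the table and column names are assumed):

-- A view that presents only selected columns and rows to reporting users
CREATE VIEW vw_active_customers
AS
SELECT CustomerId, CustomerName, City
FROM   Customer
WHERE  Status = 'ACTIVE';
GO

-- Users query the view just like a table
SELECT * FROM vw_active_customers WHERE City = 'Pune';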

Linux command line Q&A


1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In order to
view its contents without first having to decompress it, use the _________ utility.

zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat
displays a file.
4. You suspect that you have two commands with the same name as the command is not
producing the
expected results. What command can you use to determine the location of the command
being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command can you use to
determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory so you issue the command ln /data /home/bob/datalink but the command fails. What option should you use in this command line to be successful?
Use the -F option
In order to hard link a directory you must use the -F option (in practice, a symbolic link created with ln -s /data /home/bob/datalink is the more common approach).
7. When you issue the command ls -l, the first character of the resulting display represents the file's ___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar. What command
should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users' home directories. What utility should you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to split it into
four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command should you
type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used and each user's defined quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.

What is the difference between Oracle, SQL and SQL Server?

. Oracle is an RDBMS product.
. SQL is the Structured Query Language.
. SQL Server is another RDBMS product, provided by Microsoft.

Why do you need indexing? Where is it stored, and what do you mean by a schema object? For what purpose do we use a view?

We can't create an index on an index. Index information is stored in the USER_INDEXES data dictionary view. Every object that has been created in a schema is a schema object, such as a table, view etc. If we want to share particular data with various users, we can create a virtual table over the base table; that is a view.

Indexing is used for faster searches, i.e. to retrieve data faster from the tables. A schema contains a set of tables; basically a schema means a logical separation of the database. A view is created for convenient retrieval of data; it is a customized virtual table, and we can create a single view over multiple tables. The only drawback is that a view needs to be re-queried (or refreshed, in the case of a materialized view) to retrieve updated data.

Difference between Store Procedure and Trigger?

. We can call a stored procedure explicitly.
. A trigger, however, is invoked automatically when the action defined in the trigger occurs, e.g. CREATE TRIGGER ... AFTER INSERT ON <table>; such a trigger is invoked after we insert something into that table.
. A stored procedure cannot be made inactive, but a trigger can be disabled.
. Triggers are used to initiate a particular activity after a certain condition is fulfilled; they need to be defined once and can be enabled or disabled according to need.

What is the advantage of using triggers in your PL/SQL?

Triggers are fired implicitly on the tables/views on which they are created. There
are various advantages
of using a trigger. Some of them are:
. Suppose we need to validate a DML statement (insert/update/delete) that modifies a table; then we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed on that table.
. Another reason for using triggers is the automatic update of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created.
. Triggers can be used to enforce constraints. For example: insert/update/delete statements should not be allowed on a particular table after office hours. To enforce this constraint, triggers should be used.
. Triggers can be used to publish information about database events to subscribers. A database event can be a system event, like database startup or shutdown, or it can be a user event, like user logon or logoff.

What is the difference between UNION and UNION ALL?

UNION will remove the duplicate rows from the result set while UNION ALL doesn't.

What is the difference between TRUNCATE and DELETE commands?

Both will result in deleting all the rows in the table. A TRUNCATE call cannot be rolled back in Oracle, as it is a DDL command, and all the space for that table is released back to the server; TRUNCATE is much faster. A DELETE call is a DML command and can be rolled back.

Which system table contains information on constraints on all the tables created?

USER_CONSTRAINTS. This data dictionary view contains information on the constraints on all the tables created.

Explain normalization?
Normalization means reducing redundancy and maintaining data consistency. There are four commonly discussed normal forms: first normal form, second normal form, third normal form and fourth normal form.

How to find out the database name from the SQL*Plus command prompt?
SELECT * FROM global_name;
This will give the database name to which you are currently connected.

What is the difference between SQL and SQL Server ?


SQL Server is an RDBMS from Microsoft, just like Oracle and DB2, whereas Structured Query Language (SQL), pronounced "sequel", is a language that provides an interface to relational database systems. It was developed by IBM in the 1970s for use in System R. SQL is a de facto standard, as well as an ISO and ANSI standard. SQL is used to perform various operations on an RDBMS.

What is the difference between a correlated subquery and a nested subquery?

Correlated subquery runs once for each row selected by the outer query. It contains
a reference to a value
from the row selected by the outer query.

Nested subquery runs only once for the entire nesting (outer) query. It does not
contain any reference to
the outer query row.

For example,

Correlated Subquery:

select e1.empname, e1.basicsal, e1.deptno from emp e1
where e1.basicsal = (select max(basicsal) from emp e2 where e2.deptno = e1.deptno)

Nested Subquery:

select empname, basicsal, deptno from emp where (deptno, basicsal) in (select
deptno, max(basicsal) from
emp group by deptno)

What operator performs pattern matching?

The pattern matching operator is LIKE, and it is used with two wildcard characters:

1. % and
2. _ (underscore)

% matches zero or more characters and the underscore matches exactly one character.
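For example (the Customer table is hypothetical):

-- Names starting with 'Ra', followed by anything
SELECT * FROM Customer WHERE CustomerName LIKE 'Ra%';

-- Four-character names beginning with 'Jo': each underscore matches exactly one character
SELECT * FROM Customer WHERE CustomerName LIKE 'Jo__';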

1) What is the difference between Oracle and MS Access?
2) What are the disadvantages of Oracle and MS Access?
3) What are the features and advantages of Oracle and MS Access?
Oracle's features for distributed transactions, materialized views and replication
are not available with
MS Access. These features enable Oracle to efficiently store data for multinational
companies across the
globe. Also these features increase scalability of applications based on Oracle.

What is a database?
A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated. See this URL: http://www.webopedia.com/TERM/d/database.html

What is a clustered index and a non-clustered index?

Clustered index: a clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore a table may have only one clustered index. Non-clustered index: a non-clustered index is a special type of index in which the logical order of the index does not match the physically stored order of the rows on disk. The leaf nodes of a non-clustered index do not consist of the data pages; instead the leaf nodes contain index rows.

How can I hide a particular table name in our schema?

You can hide the table name by creating a synonym.

e.g) you can create a synonym y for table x

create synonym y for x;

What is the difference between DBMS and RDBMS?

The main difference between a DBMS and an RDBMS is that an RDBMS is based on the relational model and supports normalization, i.e. refining redundant data and maintaining consistency, whereas a plain DBMS has no such normalization concept.

What are the advantages and disadvantages of primary key and foreign key in SQL?

Primary key

Advantages

1) It is a unique key on which all the other candidate keys are functionally
dependent

Disadvantage

1) There can be more than one key on which all the other attributes are dependent.

Foreign Key
Advantage

1) It allows referencing another table using the primary key of the other table.

Which date function is used to find the difference between two dates?
datediff

For example: select datediff(dd, '2007-06-02', '2007-06-07')

The output is 5.

This article is a step-by-step instruction for those who want to install Oracle 10g
database on their
computer. This document provides guidelines to install Oracle 10g database on
Microsoft Windows
environment. If you use an operating system other than Microsoft Windows, the process is not too different from that on Microsoft Windows, since Oracle uses the Oracle Universal Installer to install its software.

For more information about installing Oracle 10g under operating systems other than
Microsoft
Windows, please refer to this URL :

http://www.oracle.com/pls/db102/homepage

How to get Oracle 10g :

You can download the Oracle 10g database from www.oracle.com. You must register and create an account before you can download the software. The example in this document uses Oracle Database 10g Release 2 (10.2.0.1.0) for Microsoft Windows.

How to uninstall Oracle database software :

1. Uninstall all Oracle components using the Oracle Universal Installer (OUI).
2. Run regedit.exe and delete the HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE key. This contains the registry entries for all Oracle products.
3. Delete any references to Oracle services left behind in the following part of the registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ora*. It should be pretty obvious which ones relate to Oracle.
4. Reboot your machine.
5. Delete the C: \Oracle directory, or whatever directory is your Oracle_Base.
6. Delete the C:\Program Files \Oracle directory.
7. Empty the contents of your c:\temp directory.
8. Empty your recycle bin.

Installing Oracle 10g database software :

1. Insert Oracle CD , the autorun window opens automatically. If you are installing
from network or
hard disk, click setup.exe in the installation folder.
2. The Oracle Universal Installer (OUI) will run and display the Select Installation Method window.

3. Choose Basic Installation:

Select this option to quickly install Oracle Database 10g. This method requires minimal user input. It installs the software and optionally creates a general-purpose database based on the information you provide.
For basic installation, you specify the following:

Oracle Home Location - Enter the directory in which to install the Oracle Database 10g software. You must specify a new Oracle home directory for each new installation of Oracle Database 10g. Use the default value, which is:

c:\oracle\product\10.2.0\db_1

Installation Type - Select Enterprise Edition:

If you have limited space, select Standard Edition. Personal Edition installs the same software as the Enterprise Edition, but supports only a single-user development and deployment environment.

Create Starter Database - Check this box to create a database during installation. Oracle recommends that you create a starter database for first-time installations. Choose a Global Database Name, like cs157b, or just use the default value. Type a password. Don't lose this password, since you will need it to connect to the database server.
Click next

4. The Product-Specific Prerequisite Checks window appears: Click next


5. A summary screen appears showing information such as your global settings, space requirements and the new products to be installed. Click Install to start the installation.
6. The Install window appears showing installation progress.
7. At the end of the installation phase, the Configuration Assistants window appears. This window lists the configuration assistants that are started automatically.
If you are creating a database, then the Database Configuration Assistant starts automatically in a separate window.

At the end of database creation, you are prompted to unlock user accounts to make the accounts accessible. The SYS and SYSTEM accounts are already unlocked. Click OK to bypass password management.

Note: Oracle 10g still keeps the scott/tiger username and password (UID=scott, PWD=tiger) from the old versions of Oracle. In the old versions of Oracle, the scott/tiger user ID is available by default, but not in Oracle 10g. If you want to use the scott/tiger account, you must unlock it by clicking "Password Management" at the last window.
The Password Management window will appear like the one shown below. Find the user name "scott" and uncheck the "Lock Account?" column for that user name.

8. Your installation and database creation is now complete. The End of Installation
window
displays several important URLs, one of which is for Enterprise Manager.
9. You can navigate to this URL in your browser and log in as the SYS user with the
associated
password, and connect as SYSDBA. You use Enterprise Manager to perform common
database
administration tasks
Note: you can access Oracle Enterprise Manager by typing the URL shown above in your browser. Instead of typing the IP address, you can also access Enterprise Manager by typing http://localhost:1158/em or http://[yourComputerName]:1158/em, or by clicking "Start >> All Programs >> Oracle - [YourOracleHome_home1] >> Database Control - [yourOracleID]" in the Windows menu.
By default, use the user ID "SYSTEM", with the password that you chose at the beginning of the installation, to connect to the database, SQL*Plus, etc. If you want to use another user ID, you may create a new user.

Data Modeling

What is Data Model

A data model is a logical map that represents the inherent properties of the data independent of software, hardware, or machine performance considerations. The model shows data elements grouped into records, as well as the associations among those records.

Since the data model is the basis for data implementation regardless of software or hardware platforms, the data model should describe the data in an abstract manner that does not mention detailed information specific to any hardware or software, such as bit manipulation or index addition.

There are two generally accepted meanings of the term data model. The first is that the
data model is a theory about the formal description of the data's structure and use,
without any mention of heavy technical terms related to information technology.
The second is that a data model instance is the application of that theory in order
to meet the requirements of some application, such as those used in a business
enterprise.

The structural part of a data model theory refers to the collection of data structures which
make up the data when it is being created. These data structures represent entities and
objects in the database model. For instance, the data model may be that of a business
enterprise involved in the sale of toys.

The real-life things of interest would include customers, company staff and of course the toy
items. Since the database which will keep the records of these things of interest cannot
understand the real meaning of customers, company staff and toy items, a data
representation of these real-life things has to be created.

The integrity part of a data model refers to the collection of rules which govern the
constraints on the data structures so that structural integrity can be achieved. In the
integrity aspect of a data model, an extensive set of rules and the consistent application
of data are formally defined so that the data can be used for its intended purpose.

Techniques are defined on how to maintain data in the data resource and to ensure that the
data consistently contains values which are faithful to their source while at the same time
accurate in their destination. This is to ensure that data will always have data value
integrity, data structure integrity, data retention integrity, and data derivation integrity.

The manipulation part of a data model refers to the collection of operators which can be applied
to the data structures. These operations include the query and update of data within the
database. This is important because not all data can be altered or deleted. The
data manipulation part works hand in hand with the integrity part so that the data model
can result in a high-quality database for the data consumers to enjoy.

As an example, let us take the relational model. The structural part of this data model is
based on the mathematical concept of a relation: data is represented as n-ary relations,
each of which is a subset of the Cartesian product of n domains.

The integrity part is expressed in first-order logic, and the manipulation part
corresponds to relational algebra as well as tuple and domain calculus.
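
To make the three parts concrete, here is a minimal, hedged sketch in SQL; the table and column
names are invented for the toy-store example above and are not taken from any particular product:

-- Structural part: a relation (table) whose rows are tuples drawn from the
-- Cartesian product of the column domains.
CREATE TABLE sale (
  customer VARCHAR2(100),
  product  VARCHAR2(100),
  quantity NUMBER,
  -- Integrity part: a rule constraining which tuples are allowed.
  CONSTRAINT positive_quantity CHECK (quantity > 0)
);

-- Manipulation part: a relational-algebra-style selection and projection.
SELECT customer, quantity
FROM   sale
WHERE  product = 'Toy car';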

The process of defining a data model is extremely important in any database
implementation, in that a single data model is the basis for a wide variety of data
implementations.
Hence, any database management system such as Access, Oracle or MySQL implements and
maintains a database based on one data model only.

What is Data Modeling

Data Modeling is a method used to define and analyze data requirements needed to
support
the business functions of an enterprise. These data requirements are recorded as a
conceptual data model with associated data definitions. Data modeling defines the
relationships between data elements and structures.

Data modeling can be used for a wide array of purposes. It is an act of exploring data-
oriented structures without considering any specific applications that the data will be used
in. It provides a conceptual definition of an entity and its real-life counterpart, which is
anything that is of interest to the organization implementing a database.

Data models are the products of data modeling. In general, there are three data model styles,
namely the conceptual data model, the logical data model and the physical data model.

The conceptual data model is often called the domain model. It describes the
semantics of a
business organization as this model consists of entity classes which represent
things of
significance to an organization and the relationships of these entities.
Relationships are
defined as assertions about associations between various pairs of entity classes.
The
conceptual data model is commonly used to explore domain concepts with project
stakeholders. Conceptual models may be created to explore high-level static business
structures and concepts, but they can also be used as precursors or alternatives to logical
data models.

The logical data model is used in exploring domain concepts and other key areas
such as
relationships and domain problems. The logical data models could be defined for the
scope
of a single project or for the whole enterprise. The logical data model describes
semantics
related to particular data manipulation methods and such descriptions include those
of
tables, columns, object oriented classes, XML tags and many other things. The
logical data
model depicts some logical entity types, the data attributes to describe those
entities and
relations among the entities.

The physical data model is used in the design of the internal database schema. This
design
defines data tables, data columns for the tables and the relationships among the
tables.
Among other things, the physical data model is concerned with descriptions of the
physical means by which data should be stored. This storage aspect embraces concerns about
hard disk partitioning, CPU usage optimization, creation of table spaces and
others.
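
As an illustration, a fragment of a physical data model for the toy-store example used earlier could
be written as DDL; the table and column names here are hypothetical, not taken from any real system:

CREATE TABLE customer (
  customer_id   NUMBER        PRIMARY KEY,
  customer_name VARCHAR2(100) NOT NULL
);

CREATE TABLE toy (
  toy_id   NUMBER        PRIMARY KEY,
  toy_name VARCHAR2(100) NOT NULL
);

-- The sale table records which customer bought which toy and when;
-- the foreign keys implement the relationships among the tables.
CREATE TABLE sale (
  sale_id     NUMBER PRIMARY KEY,
  customer_id NUMBER REFERENCES customer (customer_id),
  toy_id      NUMBER REFERENCES toy (toy_id),
  sale_date   DATE,
  quantity    NUMBER
);

Decisions such as which tablespace each table lives in, or which columns get indexes, also belong
to the physical model but are left out of this sketch.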

Application developers need to understand the fundamentals of data modeling so that their
applications can be optimized. It should be noted that the tasks involved in data modeling
may be performed in an iterative manner. These data modeling tasks include the following:
identifying entity types, identifying attributes, applying naming conventions, identifying
relationships, applying data model patterns, assigning keys, normalizing to reduce data
redundancy and de-normalizing to improve performance.

Data modeling also focuses on the structure of the data within a domain. This structure is
described in a dedicated grammar, as if for an artificial language used for that domain.
But as always, the description of the data structure will never mention a specific
implementation from any database management system vendor.

Sometimes, having different data modelers could lead to confusion as they could
potentially
produce different data models within the same domain. The difference could stem
from
different levels of abstraction in the data models. This can be overcome by coming
up with
generic data modeling methods.

For instance, generic data modeling could take advantage of generic patterns in a
business
organization. An example is the concept of a Party which includes Persons and
Organizations. A generic data model for this entity may be easier to implement without
creating conflicts along the way.

What is Common Data Model

This data model represents events, entities and objects in the real world that are
of interest
to the company. It is subject oriented and includes all aspects of the real world,
primarily
activities pertaining to the business.

To use lay terms, a data model can be considered a road map to get one employee
from
point A to point B in the least mileage, most scenery and shortest time of travel.

In the science of computing, data models define structured and organized data structures that
are implemented in a database management system. Aside from defining and organizing
business data, data modeling also includes implicitly and explicitly imposing constraints and
limitations on the data within the data structure.

A data model may be an instance of a conceptual schema, a logical schema or a physical schema.

A conceptual schema is a description of the semantics of an organization. All the terms of
the business, from the most minute details such as staff information to the most complex
business transactions, are defined and translated as entity classes. Relationships
among entities are also defined in a conceptual schema.

A logical schema is a description of the semantics in the conceptual schema. It can be
represented by a particular technology for data manipulation. This schema is composed of
particular descriptions of columns, tables, XML tags, object-oriented classes and others.
Later on, these descriptions will be used in the software application's implementation to
simulate real-life scenarios of activities in the business.

The physical schema, as the name implies, is the description of the physical means
for
storing data. This can include definitions for storage requirements in hard terms
like
computers, central processing units, network cables, routers and others.

Data Architects and Business Analysts usually work hand in hand to make an
efficient data
model for an organization. To come up with a good Common Data Model output, they
need
to be guided by the following:

1. They have to be sure about database concepts like cardinality, normalization and optionality;
2. They have to have in-depth knowledge of the actual rules of the business and its requirements;
3. They should be more interested in the final resulting database than the data
model.

A data model describes the structure of the database within a business and, in effect, the
underlying structure of the business as well. It can be thought of as a grammar for an
artificial language describing the business or any other undertaking.
In the real world, the kinds of things of interest are represented as entities in the data
model. These entities can hold any information or attribute as well as relationships.
Irrespective of how data is represented in the computer system, the data model describes
the company data.

It is always advised to have a good conceptual data model to describe the semantics
of a
given subject area. A conceptual data model is a collection of assertions
pertaining to the
nature of information used by the company. Entities should be named with natural
language
instead of a technical term. Relationships which are properly named also form
concrete
assertions about the subject.

In large data warehouses, it is imperative that a Common Data Model be consistent
and stable. Since companies may have several databases around the world feeding data to
a central warehouse, a Common Data Model takes a lot of load off the central processing of
the warehouse, because disparities among database sources have already been made seamless.

What is Common Data Modeling

Common Data Modeling is defining the unifying structure used to allow heterogeneous
business environments to interoperate. A Common Data Model is very critical to a
business organization.

Especially in today's business environment, where it is common to have multiple
applications, a Common Data Model seamlessly integrates seemingly unrelated data into
useful information to give a company a competitive advantage over its competitors. Data
Warehouses make intensive use of data models to give companies a real, up-to-date picture
of how the business is faring.

In Common Data Modeling, Business Architects and analysts need to face the data
first
before defining a common data or abstraction layer so that they will not be bound
to a
particular schema and thus make the Business Enterprise more flexible.

Business Schemas are the underlying definition of all business-related activities. Data
Models are actually instances of Business Schemas: conceptual, logical and physical
schemas. These schemas have several aspects of definition and they usually form a
concrete basis for the design of Business Data Architecture.

Data Modeling is actually a vast field, but having a Common Data Model for a certain domain
can address the problems of many different models operating in the same environment.
To make Common Data Models, modelers need to focus on one standard of Data
Abstraction. They need to agree on certain elements to be concretely rendered so that
uniformity and consistency are obtained.

Generic patterns can be used to attain a Common Data Model. Some of these patterns
include using entities such as "party" to refer to persons and organizations, or
"product
types", "activity type", "geographic area" among others. Robust Common Data Models
explicitly include versions of these entities.

A good approach to Common Data Modeling is to have a generic Data Model which
consists of generic types of entity like class, relationship, individual thing and others. Each
instance of these classes can have subtypes.
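
As a hedged sketch of the "party" pattern mentioned above (all names here are illustrative only),
persons and organizations can be folded into one generic entity, and specific roles then refer to it:

CREATE TABLE party (
  party_id   NUMBER PRIMARY KEY,
  party_type VARCHAR2(20) CHECK (party_type IN ('PERSON', 'ORGANIZATION')),
  party_name VARCHAR2(200)
);

-- A role such as "customer" references the generic party rather than
-- separate person and organization tables.
CREATE TABLE customer_role (
  party_id   NUMBER REFERENCES party (party_id),
  start_date DATE
);
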
The Common Data Modeling process may obey some of these rules:
1. Attributes are to be treated as relationships with other entities.
2. Entities are defined under the very nature of a Business Activity, rule, policy
or structure
but not the role that it plays within a given context.
3. Entities must have a local identifier in an exchange file or database. This
identifier must
be unique and artificial but should not use relationships to be part of the local
identifier.
4. Relationships, activities and effects of events should not be represented by attributes but
by a type of entity.
5. Types of relationships should be defined on a generic or high level. The highest
level is
defined as a relationship between one individual thing with another individual
thing.

Data Modeling often uses the Entity-Relationship Model (ERM). This model is a
representation of structured data. This type of Data Modeling can be used to
describe any
ontology (the term used to describe the overview and classification of terms and
their
respective relationships) for a certain area of interest.

What is Common Data Modeling Method

Common Data Modeling is one of the core considerations when setting up a business
data
warehouse. Any serious company wanting to have a data warehouse will have to be
first
serious about data models. Building a data model takes time and it is not unusual
for
companies to spend two to five years just doing it.

Data Models should reflect practical and real-world operations, and that is why a common
data modeling method combining forward, reverse and vertical methods makes perfect
sense: it seamlessly integrates disparate data coming in, whether top-down or bottom-up,
from different sources and triggering events.

Professionals involved in Enterprise Data Modeling projects understand the great importance of
accurately reflecting what exactly happens in an industry without having to create entities
artificially. It can be easy to overlook and side-step issues which are analytically
difficult, which people have no experience of, or which may be politically sensitive.
When these are side-stepped, data models can become seriously flawed.

Business Architects, analysts and data modelers work together to look around and
look for
the best practices found in the industry. These best practices are then synthesized
into the
enterprise model to reflect the current state of the business and the future it
wants to get
into.

A good Enterprise Data Model should strike a balance between conceptual entities
and
functional entities based on practical, real and available industry standard data.
Conceptual
entities are defined within the company and will take on the values of the data as
defined by the company. Examples of conceptual entities are product status, marital
status, customer types, etc.

On the other hand, functional entities refer to entities that are already well
defined, industry
standard data ready to be placed into database tables. Examples of functional
entities are
D&B Paydex Rating and FICO Score.

Businesses usually start simply and grow more complex as they progress. A business may start
by selling goods or providing services to clients. The goods and services delivered, as well
as the money received, are recorded and then reused. So over time, transactions pile up on one
another and the set-up can get more and more complex. Despite the complexity, the
business is still essentially a simple entity that has just grown in complexity.

This happens when the business does not have a well-defined common data modeling
method. Many software applications cannot provide ways to integrate real-world data with
data within the data architecture.

This scenario, where there is no common business model, can worsen when multiple disparate
systems are used within the company and each of the systems has differing views on the
underlying data structures.

Business Intelligence can perform more efficiently with a Common Data Modeling Method. As
its name implies, Business Intelligence processes billions of data items from the data warehouse
so that a variety of statistical analyses can be reported and recommendations on innovation,
which give the company more of a competitive edge, can be presented.

With a Common Data Modeling Method, processes can be made faster because the internal
structure of the data is closer to reality than it would be without the data model. It
should be noted that the common set-up of today's business involves having data sources
from as many geographical locations as possible.

A Look at the Entity-Relationship

Entity-Relationship

The Entity-Relationship or E-R model is a model which deals with real-world entities; it includes a
set of objects and the relationships among them. An entity is an object that exists and is easily
distinguishable from others. Like people, entities can easily be distinguished from one another
through various methods; for instance, you can distinguish people from one another by
social security numbers.

Also, an entity can be concrete, like a book, person, or place, while it can also be abstract,
like a holiday for example. An entity set is a set of entities that share something; for
example, the multiple holders of a bank account would be considered an entity set. Also,
entity sets do not need to be disjoint.

Here is an example: the entity set employee (all employees of a bank) and the entity
set customer (all customers of the bank) may have members in common, such as an
employee who is also a customer of the bank. This puts that person into both sets.

We must keep in mind that an entity or entity set is defined by a set of attributes. An
attribute is a function which maps an entity set into a domain. Every entity is described by a
set of (attribute, data value) pairs. There is one pair for each attribute of the
entity set.

To illustrate in simpler terms, consider the following.

. A bank has employees, and customers.

. Employees are an entity set defined by employee numbers.

. Customers are an entity set defined by account numbers.

. Employees can be customers as well, and will have possession of both employee
numbers and account numbers.
Relationships and Relationship Sets

A relationship is an association among several entities. A relationship set is a set of
relationships of the same type, that is, a set of associations among entities through one or
more aspects.
This is where relationship and relationship set start to differ, because a relationship set is
basically a mathematical relation. Consider the following:

If E1, E2, ..., En are entity sets, then a relationship set R is a subset of
E1 x E2 x ... x En, the Cartesian product of those entity sets. In other words, each
element of R is a tuple (e1, e2, ..., en), with e1 taken from E1, e2 from E2, and so on,
and each such tuple expresses an actual relationship. Though it appears a little
complicated, with some practice it is no more challenging than reading a simple
sentence.

One should remember that the role of an entity is the function it has in a relationship.
Consider an example: the relationship "works-for" could be ordered pairs of different
employee entities. The first employee entity takes the role of a manager or
supervisor, whereas the other one takes on the role of worker or associate.

Relationships can also have descriptive attributes. This can be seen in the example of a date
(as in the last date of access to an account); this date is an attribute of the customer-
account relationship set.
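
A hedged sketch of how such a relationship set with a descriptive attribute might be realized as
tables; the table and column names (depositor, access_date, and so on) are made up for illustration:

CREATE TABLE customer (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(100)
);

CREATE TABLE account (
  account_number NUMBER PRIMARY KEY,
  balance        NUMBER
);

-- The relationship set links customers to accounts; access_date is a
-- descriptive attribute of the relationship itself, not of either entity.
CREATE TABLE depositor (
  customer_id    NUMBER REFERENCES customer (customer_id),
  account_number NUMBER REFERENCES account (account_number),
  access_date    DATE,
  PRIMARY KEY (customer_id, account_number)
);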

Attributes

A particular set of entities and the relationships between them can be defined in a number
of ways. The differentiating factor is how you deal with the attributes. Consider a set of
employees as an entity; this time let us say that the set's attributes are employee
name and phone number.

In some instances the phone number should be considered an entity on its own, with its own
attributes being the location and the uniqueness of the number itself. Now we have two
entity sets, and the relationship between them goes through the attribute of the phone
number. This defines the association not only between the employees but also between
the employees and their phone numbers. This new definition allows us to more accurately
reflect the real world.
Basically what constitutes an entity and what constitutes an attribute depends
largely on
the structure of the situation that is being modeled, as well as the semantics
associated
with the attributes in question.
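
A hedged sketch of the second design; the names below are invented for this example. Here the
phone number becomes an entity of its own, linked to employees through an association table
(in the simpler design it would just be a column on employee):

CREATE TABLE employee (
  employee_number NUMBER PRIMARY KEY,
  employee_name   VARCHAR2(100)
);

-- The phone number as an entity with its own attributes
CREATE TABLE phone (
  phone_number VARCHAR2(20) PRIMARY KEY,
  location     VARCHAR2(50)
);

-- The association between employees and their phone numbers
CREATE TABLE employee_phone (
  employee_number NUMBER REFERENCES employee (employee_number),
  phone_number    VARCHAR2(20) REFERENCES phone (phone_number),
  PRIMARY KEY (employee_number, phone_number)
);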

Let us now look at the components of an entity-relationship diagram.


We can easily express the logical structure of a database pictorially with an entity-
relationship diagram. Yet this is only possible when we keep in mind what components
are involved in creating this type of model.

The vital components of an entity relationship model include:

* Rectangles, representing entity sets.
* Ellipses, representing attributes.
* Diamonds, representing relationship sets.
* Lines, linking attributes to entity sets and entity sets to relationship sets.

Together, these components illustrate how an Entity-Relationship model works.

Entity-Relationship Diagram Styles

Some of the different variations of the Entity-Relationship diagram you will see are:
* Diamonds are omitted - a link between entities indicates a relationship.
Fewer symbols mean a clearer picture, but what happens with descriptive attributes? In
this case, we have to create an intersection entity to hold the attributes instead.

* There can be numbers instead of arrowheads to indicate cardinality.


The symbols 1, n and m can be used, e.g. 1 to 1, 1 to n, n to m. Some feel this is easier
to understand than arrowheads.

* Also we can use a range of numbers to indicate the different options of a relationship.

E.g. (0, 1) is used to indicate minimum zero (optional), maximum 1. We can also use
(0,n),
(1,1) or (1,n). This is typically used on the near end of the link - it is very
confusing at first,
but this structure gives us more information.

* Multi-valued attributes can be indicated in a certain manner.


This means attributes are able to have more than one value. An example of this is
hobbies.
Still this structure has to be normalized at a later date.

* Extended Entity-Relationship diagrams allow more details or constraints in the real world
to be recorded.

This allows us to map composite attributes and record derived attributes. We can
then use
subclasses and super classes. This structure is generalization and specialization.

Summary

Entity-Relationship diagrams are a very important data modeling tool that can help
organize
the data in a project into categories defining entities and the relationships
between entities.
This process has proved time and again to allow the analyst to create a nice
database
structure and helps to store the data correctly.

Entity

The data entity represents both the real and abstract things about which data is being stored.
Entities fall into classes (roles, events, locations, and concepts). These could be
employees, payments, campuses, books, and so on. Specific examples of an entity are
called instances.

Relationship

A relationship is a natural association that exists between one or more entities,
such as employees processing payments. Cardinality is the number of occurrences of a single
entity for one occurrence of the related entity; for example, an employee may process many
payments but might not process any, depending on the nature of their job.

Attribute

An attribute represents a common characteristic of a particular entity or entity set. The
employee number and pay rate are both attributes. An attribute or combination of
attributes that identifies one and only one instance of an entity is called a primary key or an
identifier. For example, an employee number is an identifier.
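
A hedged sketch tying these three ideas together; the names are hypothetical and follow the
employee/payment example above:

CREATE TABLE employee (
  employee_number NUMBER PRIMARY KEY,   -- the identifier (primary key) attribute
  pay_rate        NUMBER                -- another attribute
);

-- The relationship "employee processes payments": one employee may process
-- many payments (or none), so the foreign key sits on the payment side.
CREATE TABLE payment (
  payment_id      NUMBER PRIMARY KEY,
  amount          NUMBER,
  employee_number NUMBER REFERENCES employee (employee_number)
);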

Concept Oriented Model

What is a Concept Oriented Model?

The Concept-oriented model is proclaimed to be the next level in data modeling. The method is
based upon the assumption that data aspects are living concepts, where each is a
combination of a number of super-concepts.

The belief goes on to complement the top concept and the bottom concept structures.
This
particular structure constitutes a lattice which is also described as an order
satisfying certain
properties. Each item is then defined as a mixture of some other super items that
are taken
from the related super-concepts.
In simple words, the Concept Oriented model is based on lattice theory, or an order of
sets. Each component is defined as a mixture of its super-concepts. The top concept
structure provides the most abstract view, with no items, whereas the bottom concept
structure is more specific and provides a much more detailed representation of the
model.

The syntax structure offers what is commonly referred to as a multi-dimensional hierarchical
space where the items are found. The sub-concepts of the top concept are also called
primitive concepts.
The semantics of this model are represented by the data items; each item is actually a
combination of its related super-items. Each item has an unspecified number of sub-items
from the corresponding sub-concepts. Moving from model semantics to model dimensionality,
a dimension is a path that leads from the current concept to some of its corresponding
super-concepts.

Each step within this path normally has a name that is in the context of its corresponding
concept. The number of these paths from the top of the model to the bottom of the model is
its dimensionality. Each dimension corresponds to one variable or one attribute, and thus is
supposed to be one-valued.

All one valued attributes are also directed upward within the structure. Yet if we
were to
reverse the direction of the dimensions then we would have what is often called
sub-
dimensions or reverse dimensions.

These dimensions also correspond to attributes or properties, but they take many values from
the sub-concepts, rather than from the super-concepts that form normal dimensions.
After we have explored the dimensionality of the concept model structure, we move forward
to address the relations between the concepts.

When speaking of relations, each concept is related to its super-concepts, yet the super-
concepts also act as relations with regard to this concept; being a relation is therefore a
relative role. More specifically, each item is a single instance of a relation for the
corresponding super-items, and it is an object linked to other objects by means of the
relations of its sub-items, which act as relations to its instances. This brings us to grouping
and/or aggregation.
Let's continue to think of these items in relations; this way we can imagine that each item has a
number of "parents" from the super-concepts as well as a number of sub-items from the
sub-concepts.

An item is interpreted as a group, set, or category for its sub-items. Yet it is also a
member of the sets, groups, and categories represented by the super-items. You
can see the dual functions or roles of the items when you consider them in this light.

Continuing to think of our items in the light we have created, we can now see that each
problem domain that would be represented in a concept model has differing levels of
detail.

Near the top we would find it represented as a single element, like an organization as a
whole. However, we can still spread information from the lower levels to the top level and
create the aggregated feature we need by seeking the sub-items included in the parent
item and then applying the aggregation task.

We will finish up this section by quickly touching on the topic of multi-dimensional analysis
and constraints. First let's address the matter of multi-dimensional analysis.

We can easily indicate multiple source concepts and enforce input constraints upon
them. Then the constraints can be spread in a downward direction to the bottom level,
which is the most specific level of all.

Once we have completed this step, the result is then transported back up the chain of
levels toward a target concept. Then we can begin an operation of moving one of the source
concepts downward, basically choosing one of the sub-concepts with more detail. This is
called Drill Down. We also have an operation known as Roll Up; this is the process of
moving up by selecting some super-concept with less detail.
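
Drill down and roll up are most familiar from SQL reporting, so here is a hedged analogy rather
than the concept-oriented model's own notation; the table and columns are invented for the example:

-- Roll up: aggregate detailed, section-level figures up to the department level
SELECT department, SUM(head_count) AS total_head_count
FROM   org_unit_staffing
GROUP  BY department;

-- Drill down: go back to the more detailed level for one department
SELECT department, subdivision, section, head_count
FROM   org_unit_staffing
WHERE  department = 'Department 1';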

With all this talk of constraints, you are probably wondering what they are as well. Simply put,
for each concept we can indicate constraints that the corresponding items need to
satisfy. That forces us to describe the actual properties by indicating a path resembling a
zigzag pattern in the concept structure.

The zigzag path goes down when needed to get more detailed information; it goes up to
retrieve more general information. Using this we can easily express the constraints in
terms of the other items and where they can be found.

What are Concepts?


The Concept oriented model deals with concepts rather than classes. A concept is a
combination of two classes: one is a reference class and the other is an object class. When the
concept does not define a specific reference class, it is equivalent to a conventional class.

Objects and references both have corresponding structures and behavioral methods. A simple
consequence of having concepts is that objects are presented and accessed in an indirect
manner; this is done by the concept using custom references with domain-specific
structures as well as functions.
Now you might also want to know what a sub-concept or a super-concept is. A
super-concept is a concept that is combined with other concepts in the definition of another
concept. An example of this would be the concept Orders=<Addresses, Customers>, which
has two super-concepts, namely Addresses and Customers.

Remember, there is always an upward-directed arrow in the concept graph from a sub-
concept to any of its corresponding super-concepts. Therefore the sub-concept is
associated with the start of that arrow and the super-concept is associated with the
end of that arrow.

Sub-concepts are defined in the same way and also have two parts; for example,
Order Parts=<Products, Orders> or Order Operations=<Orders, Operations>.

So now we have dissected the main components of the Concept Oriented Model, which is clearly
important to the association of data and the concepts related to that data. We can
now understand the uses and functions of this model with a bit more clarity.

Though the Concept Oriented model is complex, it is definitely worthy of further research. It
is suggested that anyone whose curiosity has been sparked by this article look further into
the model, and perhaps even explore its additional functions and uses, since it can
be applied to many situations.

Object-Relational Model

What is the Object-Relational Model?

The object-relational model is designed to provide a relational database management system that
allows developers to integrate databases with their own data types and methods. It is essentially
a relational model that allows users to integrate object-oriented features into it.
This design is most recently shown in the Nordic Object/Relational Model. The primary
goal of this newer object-relational model is to provide more power, greater flexibility, better
performance, and greater data integrity than those that came before it.

Some of the benefits that are offered by the Object-Relational Model include:

. Extensibility - Users are able to extend the capability of the database server; this can
be done by defining new data types, as well as user-defined patterns. This allows the
user to store and manage data.

. Complex types - It allows users to define new data types that combine one or more of
the currently existing data types. Complex types aid in better flexibility in organizing the
data in a structure made up of columns and tables.

. Inheritance - Users are able to define objects or types and tables that inherit the
properties of other objects, as well as add new properties that are specific to the object
that has been defined.

. A field may also contain an object with attributes and operations.

. Complex objects can be stored in relational tables.

Object-relational database management systems, also known as ORDBMS, add new and
extensive object storage capabilities to the relational models at the center of the more
modern information systems of today.

These systems assimilate the management of conventional fielded data, more complex
objects such as time-series or more detailed geospatial data, and varied binary media
such as audio, video, images, and applets.

This can be done because the model encapsulates methods with data structures; the
ORDBMS server can execute complex analytical and data management operations to
explore and change multimedia and other more complex objects.

What are some of the functions and advantages of the Object-Relational Model?

It can be said that the object-relational model is an evolutionary technology; this approach
has taken on the robust transaction and performance management aspects of its
predecessors and the flexibility of the object-oriented model (we will address this in a later
article).

Database developers can now work with somewhat familiar tabular structures and data
definition languages but with more power and capabilities, all the while assimilating new
object management possibilities. Also, the query and procedural languages and the call
interfaces in object-relational database management systems are familiar.

The main function of the object relational model is to combine the convenience of
the
relational model with the object model. The benefits of this combination range from

scalability to support for rich data types.

However, the relational model has to be drastically modified in order to support the classic
features of object-oriented programming. This creates some specific characteristics for
the object-relational model.

Some of these characteristics include:

. Base Data type extension

. Support complex objects

. Inheritance (which we discussed in more detail above.)

. And finally Rule systems

Object-relational models allow users to define data types, functions, and also operators. As a
direct result of this, the functionality and performance of this model are optimized. The
massive scalability of the object-relational model is its most notable advantage, and it can
be seen at work in many of today's vendor programs.
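
As an Oracle-style sketch of these ideas (the type and table names are chosen only for illustration),
a user-defined object type can be declared and then used as a column type in an ordinary table:

CREATE TYPE address_t AS OBJECT (
  street VARCHAR2(100),
  city   VARCHAR2(50),
  zip    VARCHAR2(10)
);
/

CREATE TABLE customer (
  customer_id NUMBER PRIMARY KEY,
  name        VARCHAR2(100),
  home_addr   address_t          -- a complex object stored inside a relational table
);

The trailing slash is how SQL*Plus ends the CREATE TYPE block; other tools may differ.
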
The History of the Object-Relational Model

As said before, the Object-Relational model is a combination of the Relational Model and the
Object-Oriented Model. The Relational Model made its way into the world of data in the
1970s; it was a hit but managed to leave developers wanting more flexibility and capability.

The Object-Oriented model seemed to move into the spotlight in the 1990s; the idea of
being able to store object-oriented data was a hit, but what happened to the relational data?
Later in the 1990s the Object-Relational model was developed, combining the advantages of
its most successful predecessors, such as user-defined data types, user-defined functions,
and inheritance and sub-classes.

This model grew from the research conducted in the 1990s. The researchers' main goal was
to extend the capabilities of the relational model by including object-oriented concepts. It
was a success.

What about Object Relational Mapping?

Object-Relational mapping is a programming method used to convert data between
incompatible type systems, namely those of relational databases and object-oriented
languages. Here are some basics when it comes to mapping. Java classes can be mapped to
relational database management system tables.

The easiest way to begin mapping between an enduring (persistent) class and a table is
one-to-one. In a case such as this, all of the attributes in the enduring class are represented
by all of the columns of the table. Each instance of a business class is then in turn stored
in a row of that table.

Though this particular type of mapping is pretty straightforward, it can conflict with the
existing object and entity-relation (we discussed entities in the previous article) models.
This is partially due to the fact that the goal of object modeling is to model an
organizational process using real-world objects (we will discuss more about object modeling
in the following article), whereas the goal of entity-relational modeling is to normalize and
retrieve data in a quick manner.

Due to this, two types of class-to-table mapping methods have been adopted by most users.
This was to help overcome the issues caused by differences between the relational and
object models. The two methods are known as SUBSET mapping and SUPERSET mapping.

Let's talk briefly about these two methods.

With SUBSET Mapping, the attributes of a persistent class (described above as an
enduring class) represent either a section of the columns in a table or all of the columns in
the table.
SUBSET Mapping is used mostly when all of the attributes of a class are mapped to
the
same table. This method is useful also when a class is not concerned with a portion
of the
columns of its table in the database due to the fact that they are not a part of
the business
model.

SUBSET Mapping is used to create projection classes as well for tables with a
sizable
number of columns. A projection class contains enough information to enable the
user to
choose a row for complete retrieval from a database.

This essentially reduces the amount of information passed throughout the network. This
type of mapping can also be used to help map a class inheritance tree to a table using
filters.

Now let's consider SUPERSET Mapping. With this method a persistent class holds attributes
taken from columns of more than one table. This particular method of mapping is also known
as table spanning.

Mapping using the SUPERSET method is meant to create view classes that cover the
underlying data model, or to map a class inheritance tree to a database by using a
Vertical
mapping tactic.

The final word

There are many other aspects and advantages to this model. The Object-Relational
model does what no other single model before it could do. By combining the strongest
points of those that came before it, this model has surpassed expectations and taken on
a definitive role in database technology. Whatever models follow it, this model is here to
stay.

The Object Model

What is the Object Model?

The Object model, also referred to as the object-oriented model, was designed to add
database functionality to object programming languages. Object models extend the
semantics of object programming languages such as C++, Smalltalk and Java to
provide full-featured database programming capability, all the while retaining native
language compatibility as well.

A notable benefit of this particular approach is the unification of application and
database development into a complete data structure and language environment.
Applications then require less code, they use more natural data modeling, and the code
bases are much easier to maintain as a result. Developers can then construct whole
database applications with a modest amount of extra effort. Object models are
often also used to show the connections between objects and collections.

Unlike the relational model, where a complicated data structure needs to be flattened to fit
into tables or joined back together from those tables to form the in-memory structure, object
models have little or no performance overhead when storing or retrieving a hierarchy of
inter-related objects. The one-to-one mapping of programming-language objects to database
objects has two major benefits over the older storage methods. First, it offers higher-
performance management of objects. Secondly, it allows for better management of the more
complex inter-relationships between objects. These two aspects make object modeling much
better suited to support applications such as financial portfolio risk analysis systems,
telecommunications service applications, design and manufacturing systems, and even
patient record systems, all of which have very complex relationships between data.

Are there different types of object models?

When you search the web for some concrete information on the object model, don't be
surprised to end up with mismatched results, none of which plainly state "Object Model".
You will instead turn up results for Document Object Models and Component Object Models.
This is because the Object Model has been modified just slightly to apply to different
instances. We will touch on that before moving on.

So what exactly is a Document Object Model? You might see it often referred to as a
DOM. This model is a platform- and language-neutral interface that allows programs or scripts
to dynamically access and update the content, structure, and style of documents. The
document can then be processed further and the results can be incorporated back into the
contents of the page.

The Component Object Model, which is also referred to as COM, is basically a
component software architecture that enables users to build applications and systems alike
from components supplied by different software vendors. The Component Object Model is the
underlying design that forms the foundation for some higher-level software services. Some
of these services include those that are provided by OLE. Any PC user may be surprised to
learn that COM is also behind ActiveX, a technology we are all familiar with, especially
those of us who spend a lot of time surfing the internet.

Many of the traditional operating systems were designed to deal only with application
binaries and not with actual components. Due to this, the benefits of good component-
oriented designs had, until now, never gone beyond the compilation step. In a world that is
object-centric it is confusing that our operating systems still cannot recognize objects;
instead, our operating systems have been dealing only with application binaries, or EXEs.
This prevented objects in one process from communicating with objects in a different
process using their own defined methods.

History

The object model really hit the programming scene in the mid-1990s. Around October of
1998 the first specification of the Document Object Model was released by the W3C; it was
known as DOM 1. DOM 2 followed in 2000; it surpassed its older version by including
specifics on the style sheet object model and style information manipulation. Most recently,
DOM 3 wowed the programming world with its release in 2004. Thus far there have been no
more current releases; as of now we are still using the DOM 3 model, and it has served us
well.

The history of the Component Object Model is a bit lengthier; we will summarize its more
dramatic points. DDE was one of the very first methods of inter-process communication. It
allowed sending and receiving communications or messages between applications. This is
also sometimes referred to as a conversation between applications. At this point it is
important to point out that Windows is the leading Component Object Model vendor, and the
history of COM is based richly on the information and discoveries made by Windows.

The budding technology of COM was the base of OLE, which means Object Linking and
Embedding. This was one of the most successful technologies introduced with Windows. The
technology was soon being added into applications like Word and Excel by 1991, and on into
1992. It was not until 1996 that Windows truly realized the potential of the discovery: they
found that OLE custom controls could expand a web browser's capability enough to
present content.

From that point the vendor has been integrating aspects of COM into many of their
applications, such as Microsoft Office. There is no way to tell how far or how long the
evolution of Object Modeling will travel; we need only sit back and watch as it transforms
our software and applications into tools that help us mold and shape our future in
technology.

Examples of Object models

The first example we will cover is the Document Object Model. This example is a remake of
a more detailed one; the information provided has been reduced in order to emphasize the
important features of the model.

In such an example we can clearly see the process by which the Document Object Model is
used. The sample model is designed to show us the way in which the document is linked to
each element and to the corresponding text that is linked to those elements.

Now we will take a quick look at a simple Component Object Model example. This particular
example is based on one of the models provided by Windows themselves.
In that example there are two different types of arrows. The solid arrows are used to
indicate USES, whereas the dashed arrows are used to represent OPTIONALLY USES. The
boxes with green outlined text represent the aspects provided with WDTF. The blue
highlighted text is your Implement or Modify example. The red expresses the implementation
of your own action interface, and the text highlighted in black indicates your operating
system or driver API. By viewing this sample of the Component Object Model, we can see how
the components are linked and the way in which they communicate with one another.

So after exploring the Object model we can safely come to the conclusion that the
Object
model does serve an important purpose that no model before it was able to grasp.
Though
the model has been modified to fit with specific instances the main use is to model
object
data.

Windows is one of the more notable vendors who have put the Object Model in the
limelight; it will be interesting to see what heights this model reaches with their
assistance.
I will be keeping a close eye out for the next evolutionary change in the Object
Model.

The Associative Model

What is the Associative Data Model?

The Associative data model is a model for databases unlike any of those we spoke of in prior
articles. Unlike the relational model, which is record-based and deals with entities and
attributes, this model works with entities that have a discrete, independent existence, and
their relationships are modeled as associations.
The Associative model is based on a subject-verb-object syntax, with bold parallels to
sentences built in English and other languages. Some examples of phrases that are
suitable for the Associative model could include:

. Cyan is a Color
. Marc is a Musician
. Musicians play instruments
. Swings are in a park
. A Park is in a City (the verbs express the associations)
By studying the examples above it is easy to see that the verb is actually a form of
association. The association's sole purpose is to identify the relationship between the
subject and the object.

The Associative database has two structures: a set of items and a set of links that
are used to connect them together. In the item structure the entries must contain a
unique identifier, a type, and a name. Entries in the links structure must also have a unique
identifier, along with identifiers for the related source, subject, object, and verb.
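
A hedged sketch of those two structures as plain tables; the layout is only illustrative, since a real
associative database engine stores items and links in its own internal format:

-- Items: each entry has a unique identifier, a type and a name
CREATE TABLE item (
  item_id   NUMBER PRIMARY KEY,
  item_type VARCHAR2(30),
  item_name VARCHAR2(200)
);

-- Links: each entry has its own identifier plus identifiers for the
-- related source, subject, verb and object
CREATE TABLE link (
  link_id    NUMBER PRIMARY KEY,
  source_id  NUMBER,
  subject_id NUMBER REFERENCES item (item_id),
  verb_id    NUMBER REFERENCES item (item_id),
  object_id  NUMBER REFERENCES item (item_id)
);

"Marc is a Musician" would then be stored as one row in link pointing at the item rows for Marc,
"is a" and Musician.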

How is the Associative Data Model different?

The Associative model structure is efficient with storage space, for there is no need to
set aside space for data that is not yet available. This differs from the relational
model structure. With the relational model, a minimum of a single null byte is stored for
missing data in any given row. Also, some relational databases set aside the maximum room
for a specified column in each row.

The Associative database makes the storage of custom data for each user, or for other needs,
clear-cut and economical when considering maintenance or network resources. When different
data needs to be stored, the Associative model is able to manage the task more effectively
than the relational model.

With the Associative model there are entities and associations. The entity is identified as
discrete and has an independent existence, whereas the association depends on other
things. Let's try to simplify this a little before moving on.

Let's say the entity is an organization; the associations would be its customers and
employees. It is possible for the entity to have many business roles at the same time, and each
role would be recorded as an association. When the circumstances change, one or more of
the associations may no longer apply, but the entity will continue to endure.

The Associative model is designed to store metadata in the same structures where the data
itself is stored. This metadata describes the structure of the database and how different
kinds of data can interconnect. These simple data structures combine to deliver a database
capable of storing the variety of data that a modern business requires, along with the
protection and management that are important for internet implementation.
The Associative model is built from chapters, and the content of the database that a user can
see is controlled by their profile. The profile is a list of chapters. When links exist between
items in chapters inside as well as outside of a specific profile, those links will not be
visible to the user.

A combination of chapters and profiles can simplify the tailoring of the database
to specific users or even subject groups. The data that is related to one of the user groups
would remain unseen by another, and would be replaced by a different data set.

Are there any potential disadvantages to the Associative Data Model?

With the Associative model there is not record. When assembling all of the current
information on a complex order the data storage needs to be re-visited multiple
times. This
could pose as a disadvantage. Some calculations seem to suggest that Associative
database
would need as many as four times the data reads as the relational database.

All of the changes and deletions to the Associative model are effected by adding
links to the database. However, we must note that a deleted association is not actually
deleted itself. Rather, it is linked to an assertion that it has been deleted. Also, when an
entity is re-named it is not actually re-named but rather linked to its new name.

In order to reduce the complexity that is a direct result of the parameterization required
by heftier software packages, we can rely on chapters, profiles and the continuation of
database engines that expect stored data to differ between individual entities or
associations. To switch on or hold back program functions in a database, the use of "flags"
has begun to be practiced.

The packages that are based on an Associative model would use the structure of the
database along with the metadata to control this process. This can ultimately lead
to the
generalization of what are often lengthy and costly implementation processes.

A generalization such as this would produce considerable cost reductions for users
purchasing or implementing bigger software packages, and it could reduce the risks
associated with post-implementation changes as well.

How well does the Associative Model suit the demands of data?

Some ask if there is still an ongoing demand for a better database. Honestly, there will
always be that demand. The weaker points of the current relational model are now
apparent, because the character of the data we need to store keeps changing. Binary
structures that support multimedia have posed real challenges for relational databases, in
the same way that object-oriented programming methods did.

When we look back on the Object databases we can see that they have not conquered the
market, and neither have their cousins, the hybrid relational products with object extensions.
So will the Associative model solve some of the issues surrounding the relational model?
The answer is not entirely clear; though it may resolve some issues, it is not completely clear
how efficiently the model will manage when set against bigger binary blocks of data.

The security of data is crucial, as is the speed of transactions. User interfaces and database
management facilities must keep pace. When a database is designed to aid in the use of
internet applications it should also allow backups without needing to take the data off-line.

Programming interfaces need to be robust and readily available to a range of development
languages, and the Associative database will need to show that it is good practice to store
data using the subject-verb-object method in every case as well. There will always be
questions about maintaining performance as the database grows; this should be expected.

So what's the verdict?

Areas of the Associative database design do seem simpler than the relational model's; still, as
we have pointed out, there are also areas that call for careful attention. There are issues
related to the creation of chapters that remain daunting at best.

Even so, if the concept of the Associative model proves itself to be genuinely feasible and
is able to bring out a new and efficient database, then others could bring to life products
that are built upon the base ideas that exist within this model.

There is definitely an undeniable demand for a faster operating database model that
will
scale up to bigger servers and down to the smaller devices. It will be an
interesting journey
to witness; I personally would like to see if the future databases built using this
model can
make their mark in the market.

The Hierarchical Model



What is a Hierarchical Model?

The term Hierarchical Model covers a broad concept spectrum. It often refers to a lot of set-
ups, like multi-level models, where there are various levels of information or data all related
by some larger form.

The Hierarchical model is similar to the Network model; it displays a collection of records in
trees, rather than arbitrary graphs.

In one type of conventional Hierarchical model, the supplementing information or details
branch out from the main or core topic, creating a "tree"-like form. This allows for a visual
relationship between the aspects and enables the user to track how the data is related.

There are many other ways to create this type of model; this is one of the simplest and is
used the most often.

An example of information you would use the Hierarchical model to record would be the
levels within an organization, where the information would flow as follows:

. An organization has several departments

. Each department has several subdivisions

. Each subdivision has sections

The Hierarchical model for this scenario is substantially larger. The benefit of the
Hierarchical model is that it allows for continuous growth, though it can take up a lot of room.

With each addition of data a new branch on the "tree" is formed, adding to the information
as a whole as well as to the size.

Hierarchical models allow for a visual parent/child relationship between data sets,
organizational information, or even mathematics.

The idea for these models is to begin with the smallest details; in the example above that
would be the sections.

From the smallest details you would move up (it is often easiest to think of the model as a
hierarchy) to the subdivisions; above the subdivisions you find departments, finally
ending at one "parent", the organization.
Once finished you can sit back and view the entire �family� of data and clearly
distinguish
how it is related.
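To make the tree idea concrete, here is a minimal Python sketch (the department and section names are hypothetical, mirroring the outline above) that stores the hierarchy as nested dictionaries and walks it from the single parent down to the leaf sections:

# A minimal sketch of the organization hierarchy as nested dictionaries.
# Names are hypothetical; each parent maps to the children beneath it.
organization = {
    "Department 1": {
        "Subdivision 1": ["Section 1", "Section 2", "Section 3"],
        "Subdivision 2": ["Section 1", "Section 2"],
    },
    "Department 2": {
        "Subdivision 1": ["Section 1", "Section 2"],
    },
}

def walk(node, depth=0):
    """Print the tree, indenting one step per level below the root."""
    if isinstance(node, dict):
        for name, children in node.items():
            print("  " * depth + name)
            walk(children, depth + 1)
    else:  # a list of leaf sections
        for name in node:
            print("  " * depth + name)

walk(organization)

Each record reaches exactly one parent, which is the defining restriction of the hierarchical form.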

How is the Hierarchical Model used?

The first mainframe database management systems were essentially the birthplace of the Hierarchical model.

The hierarchical relationships between varying data made it easier to seek out and find specific information.

Though the model is ideal for viewing relationships between data, many applications no longer use it. Still, some are finding that the Hierarchical model is ideal for data analysis.

Perhaps the most well-known use of the Hierarchical model is the family tree, but people began realizing that the model could display not only the relationships between people but also those between mathematics, organizations, departments and their employees and employee skills; the possibilities are endless.

Simply put, this type of model displays hierarchies in data, starting from one "parent" and branching into other data according to its relation to the previous data.

Commonly this structure is used with organizational structures to define the relationship between different data sets.

Normally this involves employees, students, skills, and so forth. Yet we are beginning to see the model used in more professional and metadata-oriented environments such as large organizations, scientific studies, and even financial projects.

Though the Hierarchical model is rarely used now, some of its remaining uses include file systems and XML documents.
The tree-like structure is ideal for relating repeated data, and though it is not currently applied often, the model can be applied to many situations.

Issues Related to Hierarchical Models

The Hierarchical model can present some issues when it comes to data analysis. There is the issue of independence of observations: when data is related it tends to share some type of background information linking it together, therefore the data is not entirely independent.

However, most analytic methods require independence of observations as a key assumption for the analysis.

This assumption is violated in the case of hierarchical data; for example, ordinary least squares regressions produce standard errors that are too small.

Subsequently, this usually results in a greater likelihood of rejecting a null hypothesis than if:

(1) a suitable statistical analysis was performed, or

(2) the data truly contained independent observations.

Other Hierarchical Model Structures

Though the tree-like structure is perhaps the simplest and also the most approachable form for new users, there are other structures for this model.
Hierarchy can also be structured as an outline or indented list. It can be found in the indented lists of XML documents.

The example below presents information similar to what we created above, but instead of the tree-like form this Hierarchical model uses indentation.
. ORGANISATION
  o Department 1
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 3
      . Section 1
      . Section 2
      . Section 3
  o Department 2
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3

One thing you must keep in mind at all times is that no matter what type of structure you use for the model, you need to be able to add categories at any time, as well as delete them.
http://www.learn.geekinterview.com/images/dm10.png
One way to ensure that this is possible is to use a list view or tree view with expandable and collapsible categories.

You can also use the model in a visual form, something involving a cylinder, pyramid, or even a cube; this visual presentation of the data would be most suitable for presenting data to a group of professionals.

This form works better for smaller, less detailed levels. Below is an example using some of the same information as above, but shown more compactly.

There are various structures of the Hierarchical model; in fact there are many more than those shown here.
The type you use depends entirely on the data you are using. The methods differ according to whether your data is people related, mathematics related, or just simple statistics.

Review of the Hierarchical Model Facts

1. This model expresses the relationships between pieces of information: how they are related and what they are most closely related to.

2. The Hierarchical Model is often thought of as a hierarchy. The idea is to think of your data as a family.

3. The model has many different structures and forms. Each is best used depending on the type of data being recorded, the amount of data being recorded, and who it is being recorded for.

4. Speaking in parent/child terms, data can have many children but only one parent.

5. The model begins with core data and branches off into supplementary or smaller related data.

6. One must remember to start with the smallest detail and work up from there.

If you keep to these simple, compact guidelines your own Hierarchical model will be successful, clean, clear, and well built. The point is to present information in a simple and easy-to-read manner.
The Multi-Dimensional Model

What is a Multi-dimensional Model?

The multi-dimensional model is an integral aspect of On-line Analytical Processing, also known as OLAP.

Because OLAP is online, it provides information quickly; iterative queries are often posed during interactive sessions.
Due to the analytical nature of OLAP the queries are often complex. The multi-dimensional model is used to answer these kinds of complex queries. The model is important because it applies simplicity.

This helps users understand the databases and enables software to navigate the databases effectively.

Multi-dimensional data models are made up of logical cubes, measures, and dimensions. Within the models you can also find hierarchies, levels, and attributes.

The straightforwardness of the model is essential because it identifies objects that represent real-world entities.

The analysts know what measures they want to see, what dimensions and attributes make the data important, and in what ways the dimensions of their work are organized into levels and hierarchies.

What are Logical Cubes and Logical Measures?

Let us touch on what logical cubes and logical measures are before we move on to more complicated details.

Logical cubes are designed to organize measures that have the exact same dimensions. Measures that are in the same cube have the same relationship to other logical objects; they can easily be analyzed and shown together.
With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions.

The measures are organized according to the dimensions, which include a time dimension.
Analytic databases contain summaries of historical data, taken from data in a legacy system, as well as from other data sources such as syndicated sources. The normally accepted amount of historical data for analytic applications is about three years' worth.

The measures are relatively static; they are also trusted to remain consistent while they are being used to help make informed decisions.

The measures are updated often; most applications update data by adding to the dimensions of a measure. These updates give users a concrete historical record of a specific organizational activity for an interval. This is very productive.

Another productive strategy is that adopted by other applications, which fully rebuild the data rather than perform updates.

The lowest level of a measure is called the grain. Often this level of data is never seen; even so, it has a direct effect on the type of analysis that can be done.

This level also determines whether or not analysts can obtain answers to questions such as: when are men most likely to place orders for custom purchases?

Logical Attributes, Dimensions, Hierarchies and Levels

Logical cubes and measures were relatively simple and easy to digest. Now we will consider logical dimensions, which are a little more complex. Dimensions have a unique set of values that define and categorize the data.

These form the edges of the logical cubes, and through them the measures inside the cubes as well. The measures themselves are usually multi-dimensional; because of this, a value within a measure must be qualified by a member of every dimension in order to be meaningful.
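As a rough illustration of that idea, the sketch below (plain Python, with hypothetical product, region, and month members) treats a logical cube as a mapping from one member of every dimension to a measure value, and rolls the measure up along one dimension:

# A minimal sketch of a logical cube: the sales measure is qualified by one
# member from every dimension (product, region, month). Data is hypothetical.
sales = {
    ("Rice",  "East", "2008-01"): 120,
    ("Rice",  "West", "2008-01"):  80,
    ("Wheat", "East", "2008-02"):  65,
}

# Roll up along the region dimension: total sales per (product, month).
rollup = {}
for (product, region, month), value in sales.items():
    key = (product, month)
    rollup[key] = rollup.get(key, 0) + value

print(rollup)   # {('Rice', '2008-01'): 200, ('Wheat', '2008-02'): 65}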

A hierarchy is a way of organizing data at each level of aggregation.
When looking at data, developers use dimension hierarchies to identify trends at a specific level, drill down to lower levels to see what is causing those trends, and then roll up to higher levels to view how those trends affect larger sections of the organization.

Back to the levels: each level represents a position in the hierarchy, and the levels above the most detailed level contain aggregated values for the levels beneath them.

At different levels, the members of those levels have a hierarchical relation, which we defined in the article related to this topic as a parent/child relationship, where the parent can have many children but a child can have only one parent.

Hierarchies and levels have a many-to-many relationship: a hierarchy usually consists of many levels, and one level can be included in various hierarchies.

Finally, to wrap up this section we will take a quick look at logical attributes. By now we should all know that an attribute provides extra information about the data.

Some of these attributes are used simply for display. You can have attributes like flavors, colors, and sizes; the possibilities are endless.

It is this kind of attribute that can be helpful in data selection and also in answering questions.

Examples of the type of questions that attributes can help answer are: what colors are most popular in abstract painting? What flavor of ice cream do seven-year-olds prefer?

We also have time attributes, which can give us information about the time dimensions we spoke of earlier; this information can be helpful in some kinds of analysis, such as indicating the last day or the number of days in a time period.
That pretty much wraps it up for attributes at this point. We will revisit the topic a little later.

Variables

Now we will consider variables. A variable is basically a value table for data: an array holding a specific type of data and indexed by a particular list of dimensions. Please be sure to understand that the dimensions are not stored in the variable.

Each combination of dimension members defines a data cell. This is true whether a value for that cell is present or not. Therefore, if data is missing or absent, the fact of the absence can either be included in or excluded from analysis.

There is no specific relationship between variables that share like dimensions. Even so, a logical relationship does exist between them, because even though they may store different data, possibly of a different data type, they are identical containers.

When you have variables that contain identical dimensions, they create a logical cube. With that in mind, you can see how if you change a dimension, such as adding time periods to the time dimension, then the variables change as well to include the new time periods; this happens even if the other variables have no data for them.

Variables that share dimensions can be manipulated in an array of ways, including aggregation, allocation, modeling, and calculations.
These are mostly numeric calculations, and this is an easy and fast method in the analytic workspace. We can also use variables to store measures.

In an analytic workspace, factual information is kept in variables, normally with a numeric data type.
Each measure is then stored in its own variable, so that while sales and expense data may have like dimensions and the same data type, they will be stored in distinct variables.

In addition to using variables to store measures, they can be used to store attributes as well. There are major differences between the two.

While attributes are multi-dimensional, only one dimension is the data dimension. Attributes give us information about each dimension member no matter what level it inhabits.

Throughout our journey of learning about the different types of data models, I think that the multi-dimensional model is perhaps one of the most useful.

It takes key aspects from other models like the relational model, the hierarchical model, and the object model, and combines those aspects into one competent database design that has a wide variety of possible uses.

Network Model

What is a Network Model?

Oddly enough, the Network model was designed to do what the Hierarchical model could not.
Though both show how data is related, the Network model allows data to have not only many children but also many parents, whereas the Hierarchical model allowed for only one parent with many children. With the Network model, data relationships must be predefined.

It was in 1971 that the Conference on Data Systems Languages, or CODASYL, formally defined the Network model. This is essentially how CODASYL defined the Network model:
The central data modeling construct in the network model is the set. A set consists of an owner record type, a set name, and a member record type.

A member record type can play the same role in more than one set; because of this, the multi-parent concept is supported. An owner record type can also be a member or owner in another set.

The data model is an uncomplicated network, and link or junction record types may exist, as well as additional sets between them.

Therefore, the entire network of relationships is represented by a number of pairwise sets; within each set one record type is the owner (this will be located at the tail of the relationship arrow; see the figure below for an example) and one or more record types are members (these will be located at the head of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted.
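A minimal sketch of the set idea, in Python with hypothetical set and record names, might look like the following; the point is simply that a member record can appear in more than one set and therefore reach back to more than one owner:

# A minimal sketch of network-model sets. Each set has one owner record and
# many member records; a record can be a member of several sets, so it can
# effectively have more than one "parent". All names are hypothetical.
sets = {
    "DEPT-EMPLOYEES": {"owner": "Dept-Sales",    "members": ["Emp-1", "Emp-2"]},
    "PROJ-EMPLOYEES": {"owner": "Project-Alpha", "members": ["Emp-2", "Emp-3"]},
}

def owners_of(record):
    """Navigate from a member record back to every owner that links to it."""
    return [s["owner"] for s in sets.values() if record in s["members"]]

print(owners_of("Emp-2"))   # ['Dept-Sales', 'Project-Alpha']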

The Traditional Network Model


http://www.learn.geekinterview.com/images/dm11.png

The most notable advantage of the Network model is that, in comparison with the Hierarchical model, it allows for a more natural avenue for modeling relationships between information. Though the model has been widely used, it has failed to dominate the world of data modeling.

This is believed to be because large companies chose to continue using the Hierarchical model, with some alterations to accommodate their individual needs, and because it was made almost obsolete by the Relational Model, which offers a higher-level, more declarative interface.

For a while the performance benefits of the lower-level navigational interfaces used with the Hierarchical and Network models were well suited to most large applications.

Yet as hardware advanced and became faster, the added productivity and flexibility of the newer models proved to be better equipped for data needs.
Soon the Hierarchical and Network models were all but forgotten in corporate enterprise usage.

The OSI Network Model

Open Systems Interconnection or OSI models were created to serve as tools that could be used to describe the various hardware and software components that can be found in a network system.

Over the years we have learned that this is particularly useful for educational purposes, and for expressing the full details of what needs to occur for a network application to be successful.

This particular model consists of seven separate layers, with the hardware placed at the very bottom and the software located at the top.

The arrow indicates that a message originating in an application program in the column listed as #1 must make its way through all of the other layers contained in both computers in order to reach the destination application in the column listed as #2.

This process could easily be compared to that of reading an email. Imagine columns #1 and #2 as computers when exploring the figure below:
http://www.learn.geekinterview.com/images/dm12.png

The first layer, which is clearly labeled as the Physical layer, is used to describe components such as internal voltage levels; it is also used to define the timing for the transmission of individual bits.

The next layer is the Data Link layer, the second layer listed in the example above. This often relates to the sending of a small amount of data, frequently a byte, and it is also often used for the task of error correction.

The Network layer follows the Data Link layer; it defines how to transport the message through and within the network. If you stop a moment and think of this layer as one working with an internet connection, it is easy to imagine that it would be used to add the correct network address.

Next we have the Transport layer. This layer is designed to divide data into smaller sets, or if needed it serves to recombine them into a larger, more complete set. The Transport layer also deals with data integrity; this process often involves a checksum.
Following the Transport layer we find the Session layer. This layer is related to issues that go further than, or are more complicated than, a single set of data.

More to the point, the layer is meant to address resuming transmissions, like those that have been prematurely interrupted or somehow corrupted by some kind of outside influence. This layer also often maintains long-term connections to other remote machines.

Following the Session layer is where we find the Presentation layer. This layer acts as an application interface so that syntax formats and codes are consistent between two networked or connected machines.

The Presentation layer is also designed to provide sub-routines; these are often what the user may call on to access network functions and perform operations such as encrypting data or compressing data.

Finally we have the Application layer. This layer is where the actual user programs can be found. On a computer this could be something as simple as a web browser, surprisingly enough, or it could be a ladder logic program on a PLC.

Network Model Tips

After reading this article it is not hard to see the big differences between the Hierarchical Model and the Network Model. The Network model is by far more complicated and deals with larger amounts of information that can be related in various and complicated ways.

This model is more useful because the data can have many-to-many relationships, not restricting it to the single parent-to-child structure. That is how the Hierarchical Model works with data.
Though the Network model has been effectively replaced by the more accommodating Relational Model, it is not hard to imagine how it can still be used today, and may very well still be in use on machines around the globe, when I think of the Network Model in relation to how we email one another.
After reviewing the information and investigating the facts of the Network model, I have come to the conclusion that it is a sound and relatively helpful model, if a bit complicated.

Its one major downfall is that the data must be predefined; this adds restrictions and is why a more suitable model was needed for more advanced data. Ultimately this one restriction led to the model's untimely replacement within the world of data analysis.

What is a Relational Model?

The Relational Model is a clean and simple model that uses the concept of a relation in the form of a table, rather than a graph or shapes. The information is put into a grid-like structure that consists of columns running up and down and rows that run from left to right; this is where information can be categorized and sorted.

The columns contain information related to name, age, and so on. The rows contain all the data of a single instance of the table, such as a person named Michelle.

In the Relational Model, every row must have a unique identification, or key, for the data it holds. Often, keys are used to join data from two or more relations based on matching identification.
Here is a small example of the grid-like Relational Model:

Social Security Number | Name     | Date of Birth       | Annual Income | Dependents
000-00-0003            | Michelle | June 22nd, 1973     | 39,000        | M-000-00-0002, F-000-00-0001
000-00-0001            | Michael  | December 12th, 1949 | 78,510        |
000-00-0002            | Grehetta | March 5th, 1952     | 0             |
The Relational Model often also includes a concept commonly known as foreign keys: primary keys in one relation that are kept in another relation to allow for the joining of data.

An example of foreign keys is storing your mother's and father's social security numbers in the row that represents you. Your parents' social security numbers are keys for the rows that represent them and are also foreign keys in the row that represents you. Now we can begin to understand how the Relational Model works.
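As a minimal sketch of this, using Python's built-in sqlite3 module and the sample rows from the table above (the table and column names, and the ISO date format, are assumptions made for illustration), the parents' social security numbers act as foreign keys in Michelle's row:

import sqlite3

# A minimal sketch of the person relation above. The SSN is the primary key;
# mother_ssn and father_ssn are foreign keys pointing back at other rows of
# the same table. All names and values are sample data.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE person (
        ssn         TEXT PRIMARY KEY,
        name        TEXT,
        birth_date  TEXT,       -- stored in ISO form for simplicity
        income      INTEGER,
        mother_ssn  TEXT REFERENCES person(ssn),
        father_ssn  TEXT REFERENCES person(ssn)
    )
""")
con.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("000-00-0001", "Michael",  "1949-12-12", 78510, None, None),
        ("000-00-0002", "Grehetta", "1952-03-05", 0,     None, None),
        ("000-00-0003", "Michelle", "1973-06-22", 39000, "000-00-0002", "000-00-0001"),
    ],
)

# Join Michelle's row to her parents' rows through the foreign keys.
row = con.execute("""
    SELECT p.name, m.name, f.name
    FROM person p
    JOIN person m ON p.mother_ssn = m.ssn
    JOIN person f ON p.father_ssn = f.ssn
""").fetchone()
print(row)   # ('Michelle', 'Grehetta', 'Michael')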

How did we get the Relational Model?

Like most other things, the Relational Model was born out of someone's need. In 1969 Dr. Edgar F. Codd published the first description of the relational model. Though it was meant to be no more than a report for IBM, it swept across and through data analysis like nothing before it.

Codd's paper was primarily concerned with what later came to be called the structural part of the relational model; that is, it discusses relations per se (and briefly mentions keys), but it does not get into the relational operations at all (what later came to be called the manipulative part of the model).

Codd's discovery was a breath of fresh air for those digging through data banks, trying to categorize and define data. When he invented this model he truly may not have foreseen what an incredible impact it would have on the world of data.

Known Issues with the Relational Model

Some believe there is a great deal of room for improvement where the Relational Model is concerned. It may be a surprise to find that not everyone supported the relational model. There have been claims that the rectangular tables do not cope well with large amounts of varied data.

Take the example of apples and oranges: both are fruits and therefore related in that way, but apples have different attributes than oranges. At times a user may only want to see one or the other, and then again they may want to view both. Handling this type of data with the relational model can be very tricky.

We are beginning to hear more and more about the need for a better model, a more adequate structure; still, no one has been able to design something that can truly hold its own against the Relational Model.
True, the model could use a bit of tweaking and leaves a little to be desired. Yet what would the perfect model be? What could we use that would apply to as many instances as the Relational model, and still surpass its usefulness?

Advantages of the Relational Model

The Relational Model has survived through the years; though there are those who are always trying to construct a more efficient approach, it has managed to come out the victor thus far. One reason may be its structure: it is big enough to be worth optimizing.

Another notable reason is that relational operations work on sets of data objects, which seems to make it a reasonably adequate model for remote access. Finally, it is a clean and concise model that does not encourage design extravagance, or what has been phrased as "design cuteness."

Some prefer the clean and simple style that the Relational Model offers; they can easily do without colorful shapes and stylish layouts, wanting nothing more than the clear-cut facts and relevant information.

Here are a few of the more obvious and noted advantages of the Relational Model:

. Allows for data independence. This helps to provide a sharp and clear boundary between the logical and physical aspects of database management.

. Simplicity. The model provides a simpler structure than those that came before it. A simple structure is easy to communicate to users and programmers, and a wide variety of users in an enterprise can interact with a simple model.

. A good theoretical background. The model provides a theoretical foundation for the database management field.

Do not be surprised to find that these are nearly the very same advantages that Dr. Codd listed in the debut of this model. It is obvious that he was right, and these advantages have been restated again and again since the first publication of his report.
There has been no other model brought into view that has had the advantages of the Relational Model, though there have been hybrids of the model, some of which we will discuss in later articles.

The Relational Model versus Earlier Models

We began with the Hierarchical Model. This model allowed us to arrange our data in terms of relation, somewhat like a hierarchy; it showed a parent/child type of relation. Its one big downfall was that each "child" could only have one parent, although a parent could have many children.

This model served us well in its time of glory, and sure, there are still systems using it now, though they trust their heftier loads of data to better-equipped models.

Following the Hierarchical Model we investigated the Network Model. This model was closely akin to the Hierarchical Model in that it too allowed for a parent/child view of data; its main advantage over the previous model was that it allowed for many-to-many relationships between data sets.

Still, the data had to be predefined. Though some forms of this model are still used today, it has become somewhat obsolete.

Finally we come to our current model, the Relational Model. Like those before it, it too expresses the relationships between data, only it allows for larger input and does not have to be predefined. This model allows users to record and relate large amounts of data.

This model also allows for multi-level relationships between data sets, meaning they can be related in many ways, or even only one way. It is easy to understand how this model has managed to outlive those that came shortly before it. It is versatile, simple, clean in structure, and applicable to nearly every type of data we use.

What is the Semi-Structured Data Model?

The semi-structured data model is a data model where the information that would normally be connected to a schema is instead contained within the data; this is often referred to as a self-describing model.

With this type of database there is no clear separation between the data and the schema, and the degree to which it is structured depends on the application being used.
Certain forms of semi-structured data have no separate schema, while in others there is a separate schema that places only loose restrictions on the data.

Modeling semi-structured data as graphs whose labels give semantics to the underlying structure is a natural approach. Databases of this type cover the modeling power of other extensions of flat relational databases, from nested databases, which enable the encapsulation of entities, to object databases, which also allow cyclic references between objects.

Semi-structured data has only recently come into view as an important area of study, for various reasons. One reason is that there are data sources, like the World Wide Web, which we often treat as a database but which cannot be constrained by a schema.

Another reason is that it may be advantageous to have a very flexible format for data exchange between dissimilar databases. Finally, even when dealing with structured data it may still be helpful to view it as semi-structured data for the purposes of browsing.

What is Semi-Structured data?

We are familiar with structured data, which is data that has been clearly formed, formatted, modeled, and organized into structures that are easy for us to work with and manage. We are also familiar with unstructured data.

Unstructured data comprises the bulk of information that does not fit into a set of databases. The most easily recognized form of unstructured data is the text in a document, like this article.
What you may not have known is that there is a middle ground for data; this is the data we refer to as semi-structured. These would be data sets where some implied structure is usually followed, but the structure is still not regular enough to meet the criteria needed for the types of management and automation that are normally applied to structured data.

We deal with semi-structured data every day, in both technical and non-technical environments. Web pages follow certain distinctive forms, and the content embedded within the HTML usually has a certain amount of metadata within the tags.
Details about the data are implied immediately when using this information. This is why semi-structured data is so intriguing: though there is no set formatting rule, there is still enough regularity that some interesting information can be extracted from it.

What does the Semi-Structured Data Model do?

Some advantages of the semi-structured data model include:

. Representation of information about data sources that normally cannot be constrained by a schema.

. The model provides a flexible format for data exchange among dissimilar kinds of databases.

. Semi-structured data models are helpful for viewing structured data as semi-structured data.

. The schema is easily changed within the model.

. The data transfer format can be convenient.

The most important trade-off made in using a semi-structured database model is quite possibly that queries cannot be answered as efficiently as in more constrained structures, like the relational model.

Normally the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their specific locations on disk. Because of this, navigational or path-based queries are very efficient, yet for the purpose of searching over many records it is not as practical, because the system is forced to seek across various regions of the disk by following the pointers.

We can clearly see that there are some disadvantages with the semi-structured data model, as there are with all other models; let's take a moment to outline a few of these disadvantages.

Issues with Semi-Structured Data

Semi-structured data needs to be represented, exchanged, stored, manipulated, and analyzed efficiently. Even so, there are challenges in using semi-structured data. Some of these challenges include:
Data diversity: Data diversity in federated systems is a complex issue; it involves areas such as unit and semantic incompatibilities, grouping incompatibilities, and non-consistent overlapping of sets.

Extensibility: It is vital to realize that extensibility as applied to data refers to data representation and not data processing. Data processing should be able to happen without the aid of database updates.

Storage: Transfer formats like XML are universally text or Unicode; they are prime candidates for transfer, yet not so much for storage. The representations are instead stored by underlying, accessible systems that support such standards.

In short, many academic, open source, or other efforts directed at these particular issues have stayed at a surface level of resolving representation, definitions, or even units.

The development of adequate processing engines for efficient and scalable storage and retrieval has been largely missing from the overall push for a semi-structured data model. It is obvious that this needs further study and attention from developers.

Conclusion

We have researched many areas of the semi-structured data model, including the differences between structured data, unstructured data, and semi-structured data. We have also explored the various uses for the model.

After looking at the advantages and the disadvantages, we are now educated enough about the semi-structured model to make a decision regarding its usefulness.
This model is worthy of more research and deeper contemplation; the flexibility and diversity that it offers are more than praiseworthy.

After researching, one can see many conventional and non-conventional uses for this model in our systems. An example semi-structured data model is depicted below.
http://www.learn.geekinterview.com/images/dm13.png

The semi-structured information used above is actually the detail pertaining to this very article. Each line or arrow in the model has a specific purpose. This purpose is clearly listed as Article, Author, Title, and Year.

At the end of each arrow you can find the corresponding information. So this model example expresses information about this article: the title of the article, which is

The Semi-Structured Data Model; the year in which the article was written, which is 2008; and finally who the author is. As you can see from the example, this data model is pretty easy to follow and useful when dealing with semi-structured information like web pages.
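A minimal sketch of the same idea in Python, with hypothetical field values: each record carries its own labels, the records do not all share the same fields, and a query simply works with whatever structure is present:

# A minimal sketch of semi-structured data: each record is self-describing,
# and records need not share the same set of fields. Values are hypothetical.
records = [
    {
        "type": "Article",
        "title": "The Semi-Structured Data Model",
        "year": 2008,
        "author": {"name": "Unknown"},
    },
    {
        "type": "Article",
        "title": "Star Schema",
        "author": {"name": "Unknown", "email": "author@example.com"},
        # note: no "year" field at all
    },
]

# A lookup simply skips records that lack the requested field.
titles_with_year = [(r["title"], r["year"]) for r in records if "year" in r]
print(titles_with_year)   # [('The Semi-Structured Data Model', 2008)]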

Star Schema
What is the Star Schema?

The Star Schema is basically the simplest form of a data warehouse schema. This schema is made up of fact tables and dimension tables. We have covered dimension tables in previous articles, but the concept of fact tables is fairly new.

A fact table contains measurable or factual data about an organization. The information contained in the schema is usually numerical, additive measurements; the tables can consist of numerous columns and an extensive number of rows.

The two tables differ from each other only in the way that they are used in the schema. They are actually made up of the same structure, and the same SQL syntax is used to create them as well.

Interestingly enough, in some schemas a fact table can also play the role of a dimension table in certain conditions, and vice versa. Though they may be physically alike, it is vital that we also understand the differences between fact tables and dimension tables.

A fact table in a sales database, used with the star schema, could hold the revenue for products of an organization from each customer in each market over a period of time.
However, a dimension table in the same database would define the organization's customers, the markets, the products, and the time periods that are found in the fact tables.

When a schema is designed right it will offer dimension tables that enable the user to browse the database and get comfortable with the information that it contains. This helps the user when they need to write queries with constraints, so that the information that satisfies those constraints is returned from the database.
Star Schema Important Issues

As with any other schema, performance is a big deal with the Star Schema. The decision support system is particularly important; users utilize this system to query large quantities of data. Star schemas happen to perform best for decision support applications.

Another issue that is important to mention is the roles that fact and dimension tables play in a schema. When considering the physical databases, the fact table is essentially a referencing table, whereas the dimension table plays the role of a referenced table.
http://www.learn.geekinterview.com/images/dm14a.png

We can correctly come to the conclusion that a fact table has foreign keys to reference other tables and a dimension table is the target of foreign key references from one or multiple tables.

Tables that reference or are referenced by other tables have what is known as a primary key. A primary key is a column or columns whose contents uniquely identify the rows. With simple star schemas, the fact table's primary key can be composed of multiple foreign keys.

A foreign key is a column or a group of columns in a table whose values are identified by the primary key of another table. When a database is developed, the statements used to create the tables should specify the columns that are meant to form the primary keys as well as the foreign keys. Below is an example of a Star Schema.

Simple Star Schema


. Bold column names indicate the primary key

. Lines indicate one-to-many foreign key relationships

. Bold italic column names indicate a primary key that is also a foreign key to another table

Let's point out a few things about the Star Schema above:

. Items listed in the boxes above are columns in the tables with the same names as the box names.

. The primary key columns are in bold text.

. The foreign key columns are in italic text (you can see that the primary key from the green dimension box is also a foreign key in the orange box, and the primary key from the turquoise box is also a foreign key in the orange box).

. Columns that are part of both the primary key and a foreign key are labeled in bold and italic text, like key 1 in the orange box.

. The foreign key relationships are identified by the lines that connect the boxes that represent tables.
Even though a primary key value must be unique in the rows of a dimension table, the value can occur many times in a foreign key of a fact table, as in a many-to-one relationship. The many-to-one relationship exists between the foreign keys of the fact table and the primary keys they refer to in the dimension tables.
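A minimal sketch of such a star schema, using Python's built-in sqlite3 module with hypothetical table and column names, shows a fact table whose foreign keys point at two dimension tables and a query that constrains and aggregates through them:

import sqlite3

# A minimal star-schema sketch: a sales fact table whose foreign keys reference
# a product dimension and a date dimension. All names and values are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'Rice'), (2, 'Wheat');
    INSERT INTO dim_date    VALUES (10, '2008-04-05'), (11, '2008-04-06');
    INSERT INTO fact_sales  VALUES (1, 10, 39.0), (1, 11, 21.0), (2, 10, 15.5);
""")

# Join the fact table to a dimension and aggregate the additive measure.
for row in con.execute("""
        SELECT p.product_name, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY p.product_name
        ORDER BY p.product_name
    """):
    print(row)   # ('Rice', 60.0) then ('Wheat', 15.5)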

The star schema can hold many fact tables as well. Multiple fact tables are present because they hold unrelated facts, like invoices and sales. In some situations multiple fact tables are present simply to support performance.

You can see multiple fact tables serving this purpose when they are used to support levels of summary data, more specifically when the amount of data is large, as with daily sales data.
Referencing tables are also used to define many-to-many relationships between dimensions. Such a table is usually referred to as an associative table or a cross-reference table. This can be seen at work in the sales database as well: each product belongs to one or more groups, and each of those groups also contains many products.

The many-to-many relationship is handled through the establishment of a referencing table that is meant to define the various combinations of the products and groups within the organization.

We can also identify many-to-many relationships by having dimension tables with multicolumn primary keys that serve as foreign key references in fact tables.

A rough example of this would be, yet again, the sales database: as we said before, each product is in one or more groups and each of those groups has multiple products, which is a many-to-many relationship. A minimal sketch of such a cross-reference table follows.
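The sketch below, again using Python's sqlite3 with hypothetical names, shows such an associative (cross-reference) table whose two-column primary key resolves the many-to-many relationship between products and groups:

import sqlite3

# A minimal sketch of an associative (cross-reference) table that resolves a
# many-to-many relationship between products and groups. Names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE product_group (group_id INTEGER PRIMARY KEY, group_name TEXT);
    CREATE TABLE product_group_xref (
        product_id INTEGER REFERENCES product(product_id),
        group_id   INTEGER REFERENCES product_group(group_id),
        PRIMARY KEY (product_id, group_id)
    );
    INSERT INTO product VALUES (1, 'Rice'), (2, 'Olive Oil');
    INSERT INTO product_group VALUES (100, 'Grains'), (200, 'Cooking');
    -- Rice belongs to both groups; Olive Oil belongs only to Cooking.
    INSERT INTO product_group_xref VALUES (1, 100), (1, 200), (2, 200);
""")

for row in con.execute("""
        SELECT g.group_name, COUNT(*) FROM product_group_xref x
        JOIN product_group g ON x.group_id = g.group_id
        GROUP BY g.group_name ORDER BY g.group_name
    """):
    print(row)   # ('Cooking', 2) then ('Grains', 1)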

Designing a Star Schema

When designing a schema for a database we must keep in mind that the design affects the way in which the database can be used as well as its performance.

Because of this, it is vital to make a preliminary investment of time and research in the design so that the database is beneficial to the needs of its users. Let's wrap things up with a few suggestions about things to consider when designing a schema:

. What is the function of the organization? Identify what the main processes are for the organization; it may be sales, product orders, or even product assembly, to name a few. This is a vital step; the processes must be identified in order to create a useful database.

. What is meant to be accomplished? As with all databases, a schema should reflect the organization, in what it measures as well as what it tracks.

. Where is the data coming from? It is imperative to consider the projected input data; its sources will disclose whether the existing data can support the projected schema.

. What dimensions and attributes of the organization will be reflected by the dimension tables?

. Will there be dimensions that may change over time? If the organization contains dimensions that change often, then it is better to measure them as facts rather than store them as dimensions.

. What is the level of detail (grain) of the facts? Each row should contain the same kind of data. Differing data would be addressed with a multiple fact table design, or by modifying the single table so that a flag identifying the differences can be stored with the data. You want to consider the amount of data, the space, and the performance needs when deciding how to deal with different levels of detail in data.

. If there are changes, how will they be addressed, and how significant is historical information?

XML Database

What is an XML database?

The XML database is most commonly described as a data persistence system that enables data to be accessed, exported, and imported. XML stands for Extensible Markup Language.

XML is a meta markup language that was developed by the W3C to handle the inadequacies of HTML. The HTML language began to evolve quickly as more functionality was added to it.

Soon there was a need for a domain-specific markup language that was not full of the unnecessary baggage of HTML; thus XML was brought to life.

XML and HTML are indeed very different. The biggest way in which they differ is that where in HTML the semantics and syntax of tags are fixed, in XML the creator of the document is able to produce tags whose syntax and semantics are particular to the intended application.
The semantics of tags in XML depend on the context of the application that processes the document. Another difference between XML and HTML is that an XML document has to be well formed.

XML's original purpose may have been to mark up content, but it did not take long for users to realize that XML also gave them a way to describe structured data, which in turn made XML significant as a data storage and exchange format as well.

Here are a few of the advantages that the XML data format has:
. Built-in support for internationalization, due to the fact that it uses Unicode.

. Platform independence.

. The human-readable format makes it easier for developers to trace and repair errors than with preceding data storage formats.

. Extensibility, which enables developers to add additional information without breaking applications that were created for older versions of the format.

. A great quantity of off-the-shelf tools for processing XML documents already exists.

Native XML Database

A Native XML Database or NXD defines a model for an XML document, rather than for the data in the document; it stores and retrieves documents according to that model.

At the very least the model will consist of elements, attributes, and document order. The NXD has the XML document as its fundamental unit of storage.

The database is also not obligated to have any specific underlying physical storage model. It can be built on a hierarchical, relational, or even an object-oriented database, all of which we have explored in detail.

It can also use a proprietary storage format such as indexed or compressed files. So we can gather from this information that the database is unique in storing XML data: it stores all aspects of the XML model without breaking them down.

We have also learned that the NXD is not always an independent database. NXDs are not meant to replace actual databases; they are a tool, used to aid the developer by providing robust storage and management of XML documents.

Features of the XML Database

Not all databases are the same, yet there are enough similar features among them to give us a rough idea of the basic structure. Before we continue, let us note that the XML database is still evolving and will continue to do so for some time.
One feature of the database is XML storage. It stores documents as a unit, and creates models that are closely related to XML or a related technology like the DOM.

The model accommodates varying levels of complexity as well as support for mixed content and semi-structured data. Mapping is used to ensure that the XML-specific model of the data is preserved.

After the data is stored the user will need to continue to use the NXD tools. It is not as useful to try to access the data tables using SQL as one might think; this is because the data that would be viewed would be the model of an XML document, not the entities that the data depicts.

It is important to note that the business entity model lives within the XML document domain, not the storage system; in order to work with the actual data you will have to work with it as XML.

Another feature of the database worth mentioning is queries. Currently XPath is the query language of choice. To function as a database query language, XPath is extended somewhat to allow queries across collections of documents.

On a negative note, XPath was not created to be a database query language, so it does not function perfectly in that role.
In order to improve the performance of queries, NXDs support the creation of indexes on the data stored in the collections.

An index can be used to improve the speed of query execution. The fine points of what can be indexed and how the index is built vary between products.
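As a rough illustration of path-based retrieval, the sketch below uses the limited XPath subset available in Python's standard xml.etree.ElementTree module against a small hypothetical document; a full NXD would offer a richer query language, but the idea is the same:

import xml.etree.ElementTree as ET

# A minimal sketch of path-based retrieval over an XML document, using the
# limited XPath subset in Python's standard library. The document is hypothetical.
doc = ET.fromstring("""
<orders>
  <order id="1"><customer>Ramesh</customer><product>Rice</product><qty>20</qty></order>
  <order id="2"><customer>Asha</customer><product>Wheat</product><qty>5</qty></order>
</orders>
""")

# Path query: every order element whose product child equals 'Rice'.
for order in doc.findall(".//order[product='Rice']"):
    print(order.get("id"), order.findtext("customer"))   # 1 Ramesh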
What kind of data types are supported by XML?

You might be surprised to hear that XML does not actually support any data types. The text in an XML document is almost always just text, even if it happens to represent another data type, such as a date or an integer.

Usually the data exchange software converts the data from its text form in the XML document to other forms within the database, and vice versa.

Two methods are most common for determining which conversion to do. The first of these methods is that the software determines the type of the data from the database schema; this works out well because the schema is always available at run time.

The other common method is that the user explicitly provides the data type, for example in the mapping information.

This can be recorded by the user or even generated without human intervention from a database schema or an XML schema.

When it is generated automatically, the data types can be taken from database schemas as well as from certain types of XML schemas.

There is another issue related to conversions as well; this has to do largely with what text formats are recognized when data is imported from XML, or what can be produced when exporting data to XML.

In most situations the number of text formats supported for a specific data type is somewhat restricted, for example a single specific format or the formats supported by a particular JDBC driver.
It is important to also note that dates will usually cause issues; this is largely due to the fact that the range of possible formats is extremely wide. When you also consider numbers with international formats, these can add to the problems as well.
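A minimal sketch of such a conversion step in Python, assuming the target type is already known from a schema or mapping (the type names and the single accepted date format are assumptions for illustration):

from datetime import datetime

# A minimal sketch of converting XML text values into typed values once the
# target type is known from a schema or mapping. The format strings are assumptions.
def convert(text, target_type):
    if target_type == "integer":
        return int(text)
    if target_type == "date":              # only one date format is accepted here
        return datetime.strptime(text, "%Y-%m-%d").date()
    return text                            # fall back to plain text

print(convert("39000", "integer"))         # 39000
print(convert("2008-06-22", "date"))       # 2008-06-22 as a date object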

Concluding statements

XML may seem confusing; however, it is beneficial and even a bit less complicated than HTML. Yet when you are taking the step toward understanding XML after having spent much time working with HTML, the process can be a bit distressing.

Never fear: once you have completed that step, XML is definitely a powerful format. It is also used with almost all of the models we have discussed, making it a vital area to explore in more detail.

Entity Attribute Value (EAV)

What is an Entity-Attribute-Value Model (EAV)?

The Entity-Attribute-Value model or EAV is also sometimes referred to as the Object-Attribute-Value model, or even the open schema. This is a data model that is often used in instances where the number of attributes, properties, or parameters that could be used to describe an entity is potentially limitless, yet the number that will actually apply to a given entity is relatively modest.

The easiest way to understand the Entity-Attribute-Value design is to understand row modeling, of which Entity-Attribute-Value models are a generalized form. Think of a department store database. These databases are responsible for managing endless numbers of products and product brands.

It is obvious that the product names would not be hard-coded as the names of columns in a table. Instead, product descriptions are kept in a product table, and purchases/sales of individual items are recorded in another table as separate rows that reference the product ID.

An Entity-Attribute-Value design normally involves a single table with three columns; these columns most often contain the entity, an attribute, and a value for that attribute.
In this design one row stores a single fact, whereas in a traditional table that has one column per attribute, one row stores a set of facts. The Entity-Attribute-Value design is applicable when the number of parameters that could apply to an entity is significantly larger than the number that truly apply to a single entity.

Where is the Entity-Attribute-Value Model used?

Perhaps the most notable example of the EAV model is in the production databases we see in clinical work. This includes clinical history, present clinical complaints, physical examinations, lab tests, special investigations, and diagnoses: basically all of the aspects that could apply to a patient. When we take into account all of the specialties of medicine, this information can consist of hundreds of thousands of units of data.

However, most people who visit a health care provider have few findings. Physicians simply do not have the time to ask a patient about every possible thing; this is just not the way in which patients are examined. Rather than using the process of elimination against thousands of possibilities, the health care provider focuses on the primary complaints of the patient, and then asks questions related to those complaints.

Now let's consider how someone would attempt to represent a general-purpose clinical record in a database like those we discussed earlier.

Creating a table or even a set of tables with thousands of columns would not be the best course of action: the vast majority of the columns would go unused, and the user interface would be unworkable without extremely elaborate logic to hide groups of columns based on the data entered in previous columns.
To complicate things further, the patient record and medical findings continue to grow. The Entity-Attribute-Value data model is a natural solution for this perplexing issue, and you shouldn't be surprised to find that larger clinical data repositories do use this model.

What is the Structure of the Entity-Attribute-Value Table?

Earlier we covered the fact that the EAV table consists of three columns in which data is recorded. Those columns were the entity, the attribute, and the value. Now we will talk a little more in-depth about each column; a minimal sketch follows the list.
. The Entity. Sticking to the scenario of clinical findings, the entity would be the patient event. This would contain at the very least a patient ID and the date and time of the examination.

. The Attribute, often referred to as the parameter, is a foreign key into a table of attribute definitions. In our example it would be the definitions of the clinical findings. The attributes table should contain the attribute ID, the attribute name, description, data type, units of measurement, and columns aiding input validation.

. The Value of an attribute, which depends on the data type.
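Here is a minimal sqlite3 sketch of the three-column layout together with a small attribute-definition metadata table; all table names, attributes, and values are hypothetical:

import sqlite3

# A minimal EAV sketch: one row stores one fact about one patient event, and a
# separate metadata table defines the attributes. All names and values are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE attribute_def (
        attr_id INTEGER PRIMARY KEY, attr_name TEXT, data_type TEXT, units TEXT
    );
    CREATE TABLE clinical_eav (
        patient_id INTEGER, event_time TEXT,            -- the entity
        attr_id    INTEGER REFERENCES attribute_def(attr_id),
        value      TEXT
    );
    INSERT INTO attribute_def VALUES
        (1, 'systolic_bp', 'integer', 'mmHg'),
        (2, 'chief_complaint', 'text', NULL);
    INSERT INTO clinical_eav VALUES
        (42, '2008-04-05 09:30', 1, '128'),
        (42, '2008-04-05 09:30', 2, 'headache');
""")

# Pivot one patient event back into name/value pairs via the metadata table.
for row in con.execute("""
        SELECT d.attr_name, e.value, d.units
        FROM clinical_eav e JOIN attribute_def d ON e.attr_id = d.attr_id
        WHERE e.patient_id = 42
    """):
    print(row)   # ('systolic_bp', '128', 'mmHg') and ('chief_complaint', 'headache', None)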

Entity-Attribute-Value Database

This database is most commonly called the EAV database; this is a database where a large portion of the data is modeled as EAV. Yet you may still find some traditional relational tables within this type of database.

. We stated earlier what EAV modeling does for certain categories of data, such as clinical findings, where attributes are numerous and sparse. Where these specific conditions do not apply we can use a traditional relational model instead. Using EAV has nothing to do with leaving the common sense and principles of a relational model behind.

. The EAV database is basically un-maintainable without the support of many tables that store supportive metadata. These metadata tables usually outnumber the EAV tables by a factor of about three or more, and they are normally traditional relational tables.

. The Entity in clinical data is usually a clinical event, as we have discussed above. For more general purposes, the entity is a key into an Objects table that is used to record common information about all of the objects in the database. The use of an Objects table does not require EAV; traditional tables can be used to store the category-specific details of each object.

. The Value column coerces all values into strings, as in the EAV data example above; this results in a simple, but not especially scalable, structure. Larger systems use separate EAV tables for each of their data types, including binary large objects, with the metadata for a specific attribute identifying the EAV table in which its data will be stored.

. The Attribute, in the EAV table itself, is no more than an attribute ID; there are normally multiple metadata tables that contain the attribute-related information.

Issues Associated with the EAV Model

There have been a number of issues with the Entity-Attribute-Value model brought to light throughout its lifetime. We will briefly discuss those now. It is important to clarify first that these issues arise when metadata is not used with the EAV model, for metadata is vital to its functionality.

Here are some of the issues:

Flexibility. The flexibility is wonderful, but there comes a point where we no longer have any structure at all. Normally you can no longer rely on built-in database features like referential integrity. To ensure that a column takes only values within an acceptable range, you need to code the integrity checks inside the application. This does not help make the model maintainable.

Designer issues. Adding attributes as you go is tolerable for a prototype, yet if you are unaware of what data you want to use from the start, you are just looking for problems.

The technology of relational databases becomes inaccessible and has to be recreated by a development team; this could include system tables, graphical query tools, fine-grained data security, incremental back-up and restore, exception handling, partitioned tables, and clustered indexes, all of which would otherwise be unavailable.

The actual format is not supported well by the DBMS internals. Standard SQL query optimizers do not handle EAV-formatted data very well, and a lot of time will need to be dedicated to performance tuning for an acceptable production-quality application.

As you can see from the above, there are still a few issues that need to be addressed by developers in order to make EAV optimal. Regardless of those issues, we have also learned that if we use metadata with EAV we can avoid many if not all of these issues.

Entity Relation Diagram


Entity Relation Diagram which is also known as an E-R diagram is a data relation
diagram. The entity relation diagram uses specialized graphical symbols for
illustrating all
of the interrelationships between entities and attributes in the database. It
represents the
arrangement and relationship of data entities for the logical data structure.

There are three general graphical symbols used in entity relation diagram and these

symbols are: box, diamond and oval. The box is commonly used for representing the
entities in the database. The diamond is typically used for representing the
relationships and
finally, the oval is used for representing all the attributes.

In many other entity relation diagrams, the rectangle symbol is used to represent
entity
sets while the ellipse symbol is used to represent attributes. The line is
generally used for
linking attributes to entity sets and entity sets to relationship sets.

The entity relation diagram is used to represent the entire information system for easy management of resources. The diagram helps the people concerned easily identify the concepts or entities which exist in the whole information system, as well as the entire business structure and the complex interrelationships between them.

An entity relation diagram is also often used in visualizing a relational database. Each entity represents a database table, while the relationship lines represent keys in one table pointing to specific records in the related table or tables, depending on the kind of relationship (one to one, one to many, many to one, many to many).
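
As a small, hedged illustration of how such a relationship line maps onto tables, here is a one-to-many relationship (one customer places many orders) in generic SQL; the names are hypothetical.

CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- the key on the "many" side points to a specific record on the "one" side
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  DATE NOT NULL
);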

An entity relation diagram can also be an abstract representation, which does not necessarily mean capturing every table needed within the database but instead serves to diagram the major concepts and relationships.

It may represent a very industry-specific, theoretical overview of the major entities and relationships needed for managing the industry's resources, whatever they may be. It may assist in the design of the database for an e-resource management system but may not necessarily identify every table that would be used.

The "Crow's Foot" notation is an alternative entity relations diagram. In this


diagram
scheme, the relationships are represented with connecting lines between entities
and the
symbols at the ends of the lines are to represent the cardinality of the
relationship.

In representing cardinality, the "Crow's Foot" notation uses three symbols: the ring represents zero; the dash represents one; and the crow's foot represents many.

In representing the four combinations of cardinality which an entity can have in a relationship, the "Crow's Foot" notation uses the following symbols: ring and dash represent zero or one; dash and dash represent exactly one; ring and crow's foot represent zero or more; dash and crow's foot represent one or more.

This diagram scheme may not be as famous and widely used as the symbols above but
it is
fast gaining notice especially now that it is used with Oracle texts and in some
visual
diagram and flowcharting tools such as Visio and PowerDesigner.

Those who prefer the "Crow's Foot" notation say that this technique gives better clarity in identifying the many, or child, side of the relationship compared to other techniques. This scheme also gives a more concise notation for identifying a mandatory relationship, with the use of a perpendicular bar, or an optional relationship, with an open circle.

There are many tools for entity relation diagrams available in the market or the
internet
today. The proprietary tools include Oracle Designer, SILVERRUN ModelSphere,
SmartDraw,
CA ERwin Data Modeler, DB Visual ARCHITECT, Microsoft Visio, PowerDesigner and
ER/Studio. For those who want free tools, their choices include MySQL Workbench,
Open
System Architect, DBDesigner and Ferret.

Entity Structure Chart

An entity structure chart is a visual representation of everything related to the entities behind the business rules and activities, as well as the data model, of the company's database, data warehouse or any general information system implementation. It is a chart that depicts the structure and existence of data attributes and data entities in the common data structure.

An entity structure chart draws the graphical structure for any data within the enterprise common data structure. This graphical drawing is intended to help data document analysts, database administrators, information technology staff and all other staff of the organization visualize the data structures and the information system design. Entity relation diagrams make use of the entity structure chart in order to provide a complete representation of the logical data structure.

Without the aid of an entity structure chart, data entities and all attributes and
relations
pertaining to them, would all be defined in bullet format with details in paragraph
format.
This can be straining to the eyes as well as difficult to analyze because one will
have to dig
through all those words and sentences.
With an entity structure chart, it becomes easy to produce an analysis that shows the position of any selected entity in the structures that have been defined. An entity may exist in many structure charts. The chart is also of great benefit in validating the position or absence of an entity within one or more specific structures.

For instance, there might be a need for identifying where a specific set of
departmental data
exists within the structure of the entire organization. With an entity structure,
it would be
very easy to spot and point to the location of the departmental data being looked
for.

The data entities which are represented in the entity structure chart may include
resource
data, roles of the entity, data from each organizational or departmental unit, data
location,
and any other business items which may apply.

An entity represents any real-world object, and the entity structure is essentially the formalism used in a structural knowledge representation scheme that systematically organizes a family of possible structures of a system. Such an entity structure chart illustrates decomposition, coupling, and taxonomic relationships among entities.

Decomposition is concerned with how an entity may be broken down further into sub-entities until atomicity is reached. Coupling pertains to the specifications detailing how sub-entities may be coupled together to reconstitute the entity.

An entity structure chart directly supports the entity relation diagram.
This diagram is
the visual and graphical representation of all of the interrelationships between
entities and
attributes in the database. And while the entity relation diagram uses graphical
symbols
such as box, diamond, oval and lines for representing the entities and
relationships in the
database, the entity structure chart may or may not use the same symbols in trying
to
visualize the structure of the entity for the database or the enterprise
information system.

It could be said that an entity structure chart is a breakdown of the components of the entity relation diagram. In the entity relation diagram, only the entities and the table relations are specified. In the entity structure chart, the visual illustration may also include a symbol for how the entity is structurally represented.

For instance, let us say the entity is a CUSTOMER. The entity structure details everything about the customer, including name, age and so on, and its relationships with products and other entities, and at the same time the data types describing how the CUSTOMER and its attributes will be physically stored in the database.
External Schema

The word schema as defined in the dictionary means a plan, diagram, scheme or underlying organizational structure. Therefore, very briefly, an external schema is a plan for how to structure data so it can seamlessly integrate with any information system that needs it.

It also means that data needs to integrate with the business schema of the
implementing
organization. External Schema is a schema that represents the structure of data
used by
applications.

A database management system typically has a three-layer architecture composed of the internal schema, the conceptual schema and the external schema. The conceptual schema is the schema which describes the aspects relevant to the universe of discourse of the DBMS.

Each external schema describes the part of the information which is appropriate to the group of users to whom that schema is addressed. Each external schema is derived from the conceptual schema.
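
In a relational setting, one common way (though not the only one) to realize an external schema derived from the conceptual schema is with views; the employee table and hr_employee_view below are purely illustrative assumptions.

-- Conceptual schema: base table shared by the whole organization
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    department  VARCHAR(50),
    salary      NUMERIC(10,2)
);

-- External schema for one group of users: only the information
-- appropriate to them, derived from the conceptual schema
CREATE VIEW hr_employee_view AS
SELECT employee_id, name, department
FROM employee;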

The external schema definitions are all based on a data dictionary. The universe of

discourse of the data dictionary is all information in the use and management of
the
database system.

The external schema definition system is the means by which the external schemas are defined. An external schema must contain only information which is derivable from the conceptual schema.
In systems based on the object oriented paradigm, this does not necessarily mean that the classes included in the schema have to have been previously defined in the conceptual schema. An external schema may include classes that have been defined in the conceptual schema, and it may also contain derived classes.

A derived class is any class which has been directly or indirectly defined on the basis of the conceptual schema classes and has been defined and included in the data dictionary.

The typical external schema definition for the object oriented paradigm includes
three steps
which are: definition of the necessary derived classes; selection of the set of
classes that
will constitute the external schema; and generation of the output external schema.
A general external schema explicitly defines data structure and content in terms of a data model that addresses the structural, integrity and manipulation aspects of the data. As such, the external data schema includes the data vocabulary, which defines the element and attribute names, the content model, which holds the definition of relationships and the corresponding structure, and the data types.

Some of these data types are integer, string, decimal, boolean, double, float, hexBinary,
base64Binary, QName, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay,
gMonth, NOTATION and all others which may be applicable or appropriate for
representing
the entities in the information system.

An external schema is a very important aspect of any information technology application, most especially in the database field.

Because it defines the structure of the data used in applications, it makes it easier for the information system to deal with processing, optimization and troubleshooting. The schema also ensures that the data used in the information system adheres to the rules of the business and follows the framework of the data architecture.

In data modeling, the external schema may refer to the collection of data
structures that
are being used in creating databases representing all objects and entities which
are modeled
by the database.

The same external schema stores the definition of the collection of governing rules and constraints placed on the data structure in order to maintain structural integrity by preventing orphaned records and disconnected data sets. In the same external schema can also be found the collection of operators applicable to the data structures, such as the update, insert and query operations of the database.

Four-Schema Concept

The four-schema concept consists of

a physical schema,
a logical schema,
a data view schema, and
a business schema.
The four-schema concept is used to great advantage in the implementation of service oriented business process integration, and it helps to resolve problems with the three-schema concept.

Today, there is a widely recognized trend in the marketing and business environment towards a scenario in which companies and business enterprises are networked together in order to gain higher profit from collaborative efforts and to improve operational flexibility while reducing operational cost.

This transformation of the current business environment requires companies to adopt more collaborative working practices based on the integration of business processes across a wide range of business players such as business partners, suppliers, vendors, and public bodies.

In order for integration to take place, the underlying architecture must be resolved first so that smooth and nearly seamless integration can happen.

One of the most important aspects, not just of enterprise data integration but of computer science in general, is data modeling.

This is the process of creating a data model with the use of model theory (the formal description of data) so that a data model instance can be created. In the four-schema concept, the schemas of interest are the physical schema, the logical schema, the data view schema, and the business schema.

A typical data model instance corresponds to one of these schemas. The conceptual or business schema describes the semantics pertaining to the business organization. It is in this schema that the entity classes, which represent things of significance to the organization, and the entity relationships, which are assertions about associations between pairs of entity classes, are defined.

The logical schema contains the descriptions of the semantics as represented in a particular data manipulation technology, including descriptions of tables and columns, object oriented classes, and XML tags, among other things.

The physical schema describes the physical means by which data are stored, along with other concerns pertaining to partitions, CPUs, tablespaces and so on.

The addition of the data view schema is what makes the four schema concept complete
and
distinct. The data view schema details how enterprises can offer the information
that they
wish to share with others as well as request what they want.

In real-life business data management implementations, there are many types of data model being developed using many different notations and methods. Some data models are biased towards physical implementation, while others are biased toward understanding business data, and still a few are biased towards business managers.

The four schema concept, as mentioned, is biased towards service oriented business
processes integration.

The typical overall information architecture of an on-demand enterprise collaboration or service oriented business process integration implementation involves on-demand information exchange, which is covered by the four-schema concept.

Some good practices in this area tend to rely on homogeneous semantics, which can be difficult to achieve for independent databases owned by independent enterprises.

The difficulty can be overcome by developing a new information exchange model that extends previous global query results and covers independent databases.

The new model provides a four-schema architecture which allows managed information sharing.

Information matching also has to be considered, and this can be done by employing approaches for query database and export database design.

Service oriented business process integration is fast becoming a standard in enterprise management, and many new schemas are expected to come into development.

Logical Data Model

A logical data model is an important aspect of the design and implementation of a data warehouse in that the efficiency of the databases depends heavily on data models. A logical data model refers to the actual implementation of a conceptual model in a database. It represents the normalized design of the common data model which is required to support the design of an information system.

The logical data model elaborates the representation of all data pertaining to the organization and organizes the enterprise data in data management technology terms.

In 1975 when the American National Standards Institute (ANSI) first introduced the
idea of
a logical schema for data modeling, there were only two choices that time which
were the
hierarchical and network models.

Today there are three choices for a logical data model: relational, object oriented and Extensible Markup Language (XML). The relational option defines the data model in terms of tables and columns. The object oriented option defines data in terms of classes, attributes and associations. Finally, the XML option defines data in terms of tags.

The logical data model is based closely on the conceptual data model, which describes all business semantics in natural language without pointing to any specific means of technical implementation such as particular hardware, software or network technologies.

The process of logical data modeling could be a labor intensive technique depending
on the
size of the enterprise the data model will be used for. The resulting logical data
model
represents all the definition, characteristics, and relationships of data in a
business,
technical, or conceptual environment. In short, logical data modeling is about
describing
end-user data to systems and end-user staff.

The very core of the logical data model is the definition of the three types of data objects which are the building blocks of the data model: entities, attributes, and relationships. Entities refer to persons, places, events or things which are of particular interest to the company.

Some examples of entities are Employees, States, Orders, and Time Sheets.
Attributes refer
to the properties of the entities. Examples of attributes for the Employee entity
are first
name, birthday, gender, address, age and many others. Lastly, relationships refer
to the
way where in the entities relate to each other. An example relationship would be
"customers purchase products" or "students enroll in classes".

The above-mentioned example is a logical data model using the Entity-Relationship (ER) model, which identifies entities, relationships, and attributes and normalizes the data.
A logical data model should be carefully designed because it will have tremendous
impact on
the actual physical implementation of the database and the larger data warehouse.
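
As a hedged sketch only, the "customers purchase products" relationship mentioned above could be rendered in a relational logical model roughly as follows; the names are hypothetical, and the many-to-many relationship is resolved with an associative table.

CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(50),
    last_name   VARCHAR(50)
);

CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(100)
);

-- Associative entity resolving the many-to-many relationship
-- "customers purchase products"
CREATE TABLE purchase (
    customer_id   INTEGER REFERENCES customer(customer_id),
    product_id    INTEGER REFERENCES product(product_id),
    purchase_date DATE NOT NULL,
    PRIMARY KEY (customer_id, product_id, purchase_date)
);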

A logical data model influences the design of data movement standards and techniques, such as the heavily used extract, transform and load (ETL) process in data warehousing and enterprise application integration (EAI), as well as the degree of normalization, the use of surrogate keys and cardinality.

Likewise, it will determine the efficiency in data referencing and in managing all
the
business and technical metadata and the metadata repository. Several pre-packaged
third
party business solutions like enterprise resource planning (ERP) or HR systems have
their
own logical data model and when they are integrated into the overall existing
enterprise
model with a well designed logical data model, the implementation may turn out to
be
easier and less time consuming resulting in saving of money for the company.

Enterprise Data Model


An Enterprise Data Model is a representation of a single definition of the data of an enterprise, and the representation is not biased toward any system or application. It is defined independently of how the data is physically sourced, stored, processed or accessed.

An Enterprise Data Model gives an overall picture from an industry perspective by offering an integrated blueprint view of all the data that is produced as well as consumed in all departments of an enterprise or organization. It helps to resolve potential inconsistencies and parochial interpretations of the data used.

It can also be a framework or an architectural design for data integration which enables the identification of all shareable and/or redundant data across functional and organizational boundaries. It serves to minimize data redundancy, disparity, and errors, which is core to data quality, consistency, and accuracy.

With the Enterprise Data Model acting as a data architectural framework, the business enterprise has a starting point for all data system designs. Its theoretical blueprint can provide provisions, rules and guidance in the planning, building and implementation of data systems.

In the area of enterprise information systems, the Operational Data Store (ODS) and the Data Warehouse (DW) are two of the largest components which need a carefully designed enterprise data model, because data integration is the fundamental principle underlying any such effort and a good model can facilitate data integration, diminishing the data silos inherent in legacy systems.

As the name implies, the very core of an Enterprise Data Model is the data, regardless of where the data comes from and how it will finally be used. The model is meant primarily to give clear definitions on how to come up with efficient initiatives in the areas of Data Quality, Data Ownership, Data System Extensibility, Industry Data Integration, Integration of Packaged Applications and Strategic Systems Planning.
The process of making an enterprise model typically utilizes a top-down, bottom-up approach for all designs of the data systems, including the operational data store, data marts, data warehouse and applications. The enterprise data model is built in three levels of decomposition and forms a pyramid shape.

The first to be created is the Subject Area Model, which sits at the top of the pyramid. It expands down to create the Enterprise Conceptual Model, and finally the Enterprise Conceptual Entity Model is created and occupies the base of the pyramid. The three models are interrelated, but each of them has its own unique purpose and identity.

A fundamental objective of an Enterprise Subject Area Model is segregating the entire organization into several subjects in a manner similar to divide and conquer. Among the aspects of this level are Subject Areas, Subject Area Groupings, Subject Area Data Taxonomy and Subject Area Model Creation.

The Enterprise Conceptual Model, the second level in the pyramid, identifies and defines the major business concepts of each of the subject areas. This is a high-level data model with, on average, several concepts for every subject area. These concepts carry finer details than the subject area level. This model also defines the relationships of each concept.

The Enterprise Conceptual Entity Model represents all things which are important to each business area from the perspective of the entire enterprise. This is the detailed level of the enterprise data model, in which each concept is expanded within its subject area. It is also at this level that the business and its data rules, rather than existing systems, are examined so as to create the major data entities and the corresponding business keys, relationships and attributes.

Star Schema

The star schema, which is sometimes called a star join schema, is one of the simplest styles of data warehouse schema. It consists of a few fact tables that reference any number of dimension tables. The fact tables hold the main data, with the typically smaller dimension tables describing each individual value of a dimension.

Star Schema is characterized by

. simplicity

. allows easy navigation


. has rapid response time

Normalization is not a goal of star schema design. Star schemas are usually divided into fact tables and dimension tables, where the dimension tables supply supporting information.

A fact table contains a compound primary key composed of the relevant dimension keys, while a dimension table has a simple primary key.

A dimension table is also commonly in second normal form, as it consolidates redundant data, while a fact table is commonly in third normal form, as all of its data depend on either one dimension or all of them and not just on combinations of a few dimensions.
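
For illustration only, here is a minimal star schema sketch in generic SQL for retail sales; the fact table's composite key is made up of the dimension keys, and all names are hypothetical.

CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,
    full_date  DATE,
    month_name VARCHAR(20),
    year       INTEGER
);

CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    category    VARCHAR(50)
);

CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,
    city      VARCHAR(50)
);

-- Fact table: measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    quantity     INTEGER,
    sales_amount NUMERIC(12,2),
    PRIMARY KEY (date_key, product_key, store_key)
);

-- A typical star-join query: total sales by month and category
SELECT d.month_name, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month_name, p.category;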

A star schema is a very important aspect of a data warehouse implementation in that it is the best way to implement a multi-dimensional database using any common mainstream relational database. It is also a very simple method: from the perspective of the users, the queries are simple because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on other tables that are possible in a more normalized snowflake schema.

Making a star schema for a database may be relatively easy, but it is still very important to invest time and research, because the schema's effect on the usability and performance of the database matters greatly in the long run.

In a data warehouse implementation, the creation of the star schema database is one
of the
most important and often the final process in implementing the data warehouse. A
star
schema has also a significant importance in some business intelligence processes
such as
on-line transaction processing (OLTP) system and the on-line analytical processing
(OLAP).

On-line Transaction Processing uses a standard, normalized database structure and, as the name implies, it is used for transactions, which means that database table inserts, updates, and deletes must be fast. For instance, let us take the scenario of the organization's call center.

Several call center agents continuously take calls and enter orders, typically involving numerous items, which must be stored immediately in the database. This makes the scenario very critical, and the speed of inserts, updates and deletes should be maximized. In order to optimize performance, the database should hold as few records as possible at any given time.

On the other hand, On-line Analytical Processing, though it may mean many different things to different people, is mainly about analyzing corporate data. In some cases the terms OLAP and star schema are used interchangeably, but a more precise way of thinking would be to regard a star schema database as one kind of OLAP system, which can be any system of read-only, historical, aggregated data.

Both OLAP and OLTP workloads can be optimized with a star schema in a data warehouse implementation. Since a data warehouse is the main repository of a company's historical data, it naturally contains very high volumes of data which can be used for analysis with OLAP. Querying these data may take a long time, but with the help of a star schema the access can be made faster and more efficient.

Reverse Data Modeling

Reverse data modeling is basically a form of reverse engineering applied to IT systems: it is a process wherein an IT expert extracts information from an existing system in order to work backward and derive a physical model, and then works further back to a logical model.

Because of some of the nuances associated with different database systems and development platforms, reverse data modeling is generally a difficult task to accomplish. Today, however, there are software vendors offering solutions that make reverse data modeling relatively easier to do. These reverse data modeling software solutions can take snapshots of existing databases and produce physical models from which IT staff can begin to verify table and column names.

While it is generally relatively easy to document table and column names using such tools, or by reviewing database creation scripts, the thing that is really complicated and difficult to do is dealing with the relationships between the tables in the database.
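
As a hedged example of what such tooling does under the hood, a database that exposes the standard information_schema views (PostgreSQL and MySQL both do) lets you pull table and column names, and even declared foreign keys, with ordinary queries such as these:

-- List every table and column with its data type
SELECT table_name, column_name, data_type
FROM information_schema.columns
ORDER BY table_name, ordinal_position;

-- List declared foreign key constraints (the relationships that are
-- hardest to recover when they were never declared in the first place)
SELECT constraint_name, table_name
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';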

While hard skills like coding and software engineering are very important in reverse data modeling, soft resources like documentation, employee know-how and training can also be invaluable in figuring out exactly how data is being used by the system in question.

One of the biggest factors in a successful reverse data modeling effort is learning as much as possible about how the application is used; only then can you determine how the existing data structures are being used and how this compares with the original design.
Reverse data modeling is a technique that represents a big opportunity for some legacy reporting systems to have their useful life extended, but it should not be construed as a permanent replacement for the data warehousing technology stack.

Reverse data modeling also plays a very big role in helping to improve the efficiency of enterprise applications by uncovering faults in the design, aiding integration of existing systems by pinpointing table and column definitions and the relationships between such tables, and allowing comparisons of existing systems with potential new system solutions.
A data warehouse implementation based on an enterprise data model is a system for analyzing data in at least one data source of an enterprise. The system may comprise various steps, such as a step for providing a meta model for the enterprise and a step for forming the data schema from the meta model. The system may also include creating a database organized according to the data schema, incorporating data into the database, and performing analysis on the data in the database.

Some commercial software offers solutions that can enable true enterprise
application
engineering by storing and documenting data, processes, business requirements, and
objects that can be shared by application developers throughout an organization.

With reverse data modeling and the help of such software, the business organization implementing an enterprise data management system can easily design and control enterprise software for quality, consistency, and reusability in business applications through the managed sharing of meta-data.

Like all other reverse engineering techniques, whether for software, hardware or non-IT implementations, reverse data modeling is very useful at times when there is a deep-seated need to troubleshoot a system with very complicated problems. Reverse data modeling can give IT staff deeper insight into and understanding of the data models, thereby empowering them to come up with actions to improve the management of the system.

Physical Data Model

There are three basic styles of data models: conceptual data model, logical data
model and
physical data model. The conceptual data model is sometimes called the domain model
and
it is typically used for exploring domain concepts in an enterprise with
stakeholders of the
project.

The logical model is used for exploring the domain concepts as well as their
relationships.
This model depicts the logical entity types, typically referred to simply as entity
types, the
data attributes describing those entities, and the relationships between the
entities.
The physical data model is used in the design of the database's internal schema and, as such, it depicts the data tables, the data columns of those tables, and the relationships between the tables. This model represents the data design taking into account the facilities and constraints of a given database management system. The physical data model is often derived from the logical data model, although it can also be reverse engineered from an existing database implementation.
A detailed physical data model contains all the artifacts a database requires for creating relationships between tables or achieving performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters. This model is also often used for calculating estimates for data storage and sometimes includes details on storage allocation for a database implementation.

The physical data model is basically the output of physical data modeling which is
conceptually similar to design class modeling whose main goal is to design the
internal
schema of a database, depicting the data tables, the data columns of those tables,
and the
relationships between the tables.
In a physical data model, the tables in which data will be stored in the database are identified first.

For instance, in a university database, the database may contain the Student table
to store
data about students. Then there may also be the Course table, Professors table, and
other
related table to contain related information. The tables will then be normalized.

Data normalization is the process wherein the data attributes in a data model are organized to reduce data redundancy, increase data integrity, increase the cohesion of tables and reduce the coupling between tables. After the tables are normalized, the columns are identified. A column is the database equivalent of an attribute. Each table will have one or more columns. In our example, the university database may have columns in the Student table such as FirstName, LastName and StudentNumber.

The stored procedures are then identified. Conceptually, a stored procedure is like a global method for the database implementation. An example of a stored procedure would be code to compute a student's average mark, a student's payables or the number of students enrolled and allowable in a certain course. Relationships are also identified in a physical data model.
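
As a hedged sketch of such a routine in PostgreSQL-flavoured SQL (syntax differs by DBMS), assuming a hypothetical enrollment table with student_number and mark columns:

CREATE FUNCTION student_average_mark(p_student_number INTEGER)
RETURNS NUMERIC AS $$
    -- Average mark across all of the student's enrollments
    SELECT AVG(mark)
    FROM enrollment
    WHERE student_number = p_student_number;
$$ LANGUAGE SQL;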

A relationship defines how attributes in one table relate to attributes in another table. Relationships are very important in ensuring that there is data integrity in a database after an update, insert or delete is performed.

Keys are also assigned in the tables. A key is one or more data attributes which uniquely identify a table row, and thus help eliminate data redundancy and increase data integrity.
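
Putting these pieces together, a fragment of a physical data model for the university example might look like the sketch below; the exact names, types and index are assumptions, not the actual schema.

CREATE TABLE student (
    student_number INTEGER     NOT NULL,
    first_name     VARCHAR(50) NOT NULL,
    last_name      VARCHAR(50) NOT NULL,
    CONSTRAINT pk_student PRIMARY KEY (student_number)
);

CREATE TABLE enrollment (
    student_number INTEGER     NOT NULL,
    course_code    VARCHAR(10) NOT NULL,
    mark           NUMERIC(5,2),
    CONSTRAINT pk_enrollment PRIMARY KEY (student_number, course_code),
    CONSTRAINT fk_enrollment_student
        FOREIGN KEY (student_number) REFERENCES student (student_number)
);

-- A purely physical artifact: an index added to meet a performance goal
CREATE INDEX idx_student_last_name ON student (last_name);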

Other specifications indicated in a physical data model include the application of naming conventions and the application of data model patterns.

Database Concepts


Databases Are Fun!


Have you ever wondered about the name of a movie you watched years ago, and although it was on the tip of your tongue, the few words of that short title just wouldn't come out? What about a book you read when you were younger that had a particularly compelling story to tell, and despite every effort on your part to relate that story to a friend, the details simply weren't in your head anymore?

Perhaps you wish you had kept a simple, personal record of these movies and books,
so you
could have a quick look at it and easily identify the title that has been bothering
you all this
time? A database would be perfect for that!

The purest idea of a database is nothing more than a record of things that have
happened.
Granted, most professionals use databases for much more than storing their favorite
movies
and books, but at the most basic level, those professionals are simply recording
events, too.

Every time you join a web site, your new account information ends up in a database.
Have
you ever rented movies from Netflix? Their entire web site is essentially one big
database, a
record of every movie available to rent along with the locations of their copies of
that movie
among their many warehouses and customers' homes. The list goes on! Databases are
everywhere, and they assist us in performing many essential tasks throughout our
daily
lives.

You can easily see how databases affect all aspects of our modern lives, since
everything
you do, from calls you make on your mobile phone to transactions you make at the
bank to
the times you drive through a toll plaza and use your toll tag to pay, is recorded
in a
database somewhere.
If these databases did not exist, our lives would surely be much less convenient.
In the 21st
century, we are so accustomed to using credit cards and printing airline boarding
passes at
home that, if databases were to suddenly disappear, it would almost seem like we
were
cavemen again.

Fortunately, we have databases, and there are many people who are skilled at using
them
and developing software to use alongside them. These people take great pride in
their work,
as database programming is difficult but nonetheless very rewarding.

Consider, for example, the database team at Amazon.com. They built an enormous
database to contain information about books, book reviews, products that are not
books at
all, customers, customers' preferences, and tons of other things.
It must have taken them months to get the database just right and ready for
customers to
use! But, once it started to work well and Amazon.com went live back in 1995, can
you
imagine the sense of pride those developers had as millions of potential customers
poured
onto the web site and began to interact with the product they spent so much time
perfecting? That must have been an incredible feeling for the database team!

By no means do these brief words do justice to the power, complexity, or utility of modern database platforms. However, they are certain to provide at least a little insight
into how
significantly databases have changed our lives and how effective they are at
providing high
quality solutions for difficult problems.

What is a Database?

When you are in a big electronics store buying the latest edition of the iPod, how does that store's inventory tracking system know you just bought an iPod and not, for example, a car stereo or a television?

Let's walk through the process of buying an iPod and consider all the implications
this has
on the inventory database that sits far underneath all the shiny, new gadgets on
the sales
floor.

When you hand the iPod box to the cashier, a barcode scanner reads the label on the
box,
which has a product identification number. In barcode language, this number might
be
something like 885909054336. The barcode representing this number can be seen in
Figure
1.
Figure 1. A sample barcode

The barcode acts as a unique identifier for the product; in this case, all iPods
that are the
same model as the one passing across the barcode reader have the same exact
barcode.

The barcode scanner relays the number represented by the barcode to the register at
the
cashier's station, which sends a request (or a query) to the store's inventory
database. This
database could be in the same store as the register or somewhere across the country
or
even around the world, thanks to the speed and reliability of the Internet.
The register asks the database, "What are the name and price of the product that has this barcode?" To which the database responds, "That product is an iPod, and it costs $200."

You, the customer, pay your $200 and head home with a new toy. Your work in the
store is
finished, but the inventory management system still needs to reconcile your
purchase with
the database!

When the sale is complete, the register needs to tell the database that the iPod
was sold.
The ensuing conversation goes something like the following.

Register: "How many products with this barcode are in our inventory?"
Database: "1,472."

Register: "Now, 1,471 products with this barcode are in our inventory."

Database: "OK."

What Did the Database Do?

Data Retrieval

Of course, this is not the whole story. Much more happens behind the scenes than
simple
conversational requests and acknowledgements.
The first interaction the register had with the database occurred when the request
for the
product name and price was processed. Let's take a look at how that request was
really
handled.

If the database is an SQL database, like MySQL or PostgreSQL or many others, then
the
request would be transmitted in the standard Structured Query Language (SQL). The
software running on the register would send a query to the database that looks
similar to
the following.

SELECT name, price FROM products WHERE id = 885909054336;


This query instructs the database to look in the products table for a row (also
called a
record) in which the id column exactly equals 885909054336.

Every database may contain multiple tables, and every table may contain multiple
rows, so
specifying the name of the table and the row's unique identifier is very important
to this
query. To illustrate this, an example of a small products table is shown in Figure
2.

When the database has successfully found the table and the row with the specified
id, it
looks for the values in the name and price columns in that row. In our example,
those
values would be "iPod" and "200.00", as seen in Figure 2. The execution of the
previous
SELECT statement, which extracts those values from the table, is shown in Figure 3.

The database then sends a message back to the register containing the product's
name and
price, which the register interprets and displays on the screen for the cashier to
see.
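
Since the figures are not reproduced here, a rough SQL equivalent of the small products table they depict might look like this (a sketch; the real figure may differ in details):

CREATE TABLE products (
    id     BIGINT PRIMARY KEY,   -- the barcode number
    name   VARCHAR(100),
    price  NUMERIC(10,2),
    onhand INTEGER
);

INSERT INTO products (id, name, price, onhand)
VALUES (885909054336, 'iPod', 200.00, 1472);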

Data Modification

The second time the register interacts with the database, when the inventory number
is
updated, requires a little more work than simply asking the database for a couple
numbers.
Now, in addition to requesting the inventory number with a SELECT statement, an
UPDATE
statement is used to change the value of the number.

First, the register asks the database how many iPods are in the inventory (or "on hand").

SELECT onhand FROM products WHERE id = 885909054336;

The database returns the number of products on hand, the register decrements that
number
by one to represent the iPod that was just sold, and then the register updates the
database
with the new inventory number.

UPDATE products SET onhand = 1471 WHERE id = 885909054336;

This sequence is presented in Figure 4.

In Figure 4, the database responds to the UPDATE query with UPDATE 1, which simply
means one record was updated successfully.

Now that the number of iPods on hand has been changed, how does one verify the new
number? With another SELECT query, of course! This is shown in Figure 5.

Now, the register has updated the database to reflect the iPod you just purchased
and
verified the new number of iPods on hand. That was pretty simple, wasn't it?
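
One practical aside that is not part of the original walk-through: the read-then-write sequence above can lose a sale if two registers sell iPods at the same instant. A common alternative is to let the database do the arithmetic in a single statement, for example:

-- Decrement the on-hand count atomically, without a separate SELECT
UPDATE products SET onhand = onhand - 1 WHERE id = 885909054336;

The later section on concurrency and reliability looks at this class of problem more generally.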

More on Databases
You now know databases are made of tables, which are, in turn, made of records.
Each
record has values for specific columns, and in many cases, a record can be uniquely identified by the value contained in at least one column.

In our example, the barcode number uniquely identified the iPod, which cost $200,
in the
products table. You have also seen that values in a database can be modified. In
this case,
the number of iPods on hand was changed from 1,472 to 1,471.

Database Systems

Early Databases

In the 1960s, the System Development Corporation, one of the world's first computer software companies and a significant military technology contractor, first used the term "data base" to describe a system to manage United States Air Force personnel. The term "databank" had also been used in the 1960s to describe similar systems, but the public seemed less accepting of that term and eventually adopted the word "database", which is universally used today.

A number of corporations, notably with IBM and Rockwell at the forefront, developed

database software throughout the 1960s and early 1970s. MUMPS (also known as M),
developed by a team at Massachusetts General Hospital in the late 1960s, was the
first
programming language developed specifically to make use of database technology.

In 1970, the relational database model was born. Although this model was more
theoretical
than practical at the time, it took hold in the database community as soon as the
necessary
processing power was available to implement such systems.

The advent of the relational model paved the way for Ingres and System R, which
were
developed at the University of California at Berkeley and IBM, respectively, in
1976. These
two database systems and the fundamental ideas upon which they were built evolved
into
the databases we use today. Oracle and DB2, two other very popular database
platforms,
followed in the footsteps of Ingres and System R in the early 1980s.

Modern Databases

The Ingres system developed at Berkeley spawned some of the professional database
systems we see today, such as Sybase, Microsoft SQL Server, and PostgreSQL.
Now, PostgreSQL is arguably the most advanced and fastest free database system
available,
and it is widely used for generic and specific database applications alike. MySQL
is another
free database system used in roughly the same scope of applications as PostgreSQL.
While
MySQL is owned and developed by a single company, MySQL AB in Sweden, PostgreSQL
has
no central development scheme, and its development relies on the contributions of
software
developers around the world.

IBM's System R database was the first to use the Structured Query Language (SQL), which
which
is also widely used today. System R, itself, however, was all but abandoned by IBM
in favor
of focusing on more powerful database systems like DB2 and, eventually, Informix.
These
products are now generally used in large-scale database applications. For example,
the Wal-
Mart chain of large department stores has been a customer of both DB2 and Informix
for
many years.

The other major player in the database game, Oracle, has been available under a
proprietary license since it was released as Oracle V2 in 1979. It has undergone a
number
of major revisions since then and, in 2007, was released as Oracle 11g. Like DB2
and
Informix, Oracle is mostly used for very large databases, such as those of global
chain
stores, technology companies, governments, and so forth. Because of the similar
client
bases enjoyed by IBM and Oracle, the companies tend to be mutually cooperative in
database and middleware application development.

Microsoft SQL Server, initially based on Sybase, is another full-featured and expensive database system designed to attract large customers. Its primary competitors are
IBM and
Oracle, but Microsoft has, to a great extent, been unable to secure a significant
percentage
of the high-end database market as its client base. As a result, SQL Server caters
mainly to
the lower end of the pool of larger database customers.
Some speculate Microsoft's inability to capture the higher end of the market is a result of SQL Server's dependence on the Microsoft Windows operating system. In many cases,
Windows is seen as less reliable and less stable than UNIX-based operating systems
like
Solaris, FreeBSD, and Linux; all of which support databases like Oracle, DB2 and
Informix,
and MySQL and PostgreSQL.

In order of market share in terms of net revenue in 2006, the leaders in database
platform
providers are Oracle, with the greatest market share; IBM; and Microsoft.
While the database systems with the greatest markets shares use SQL as their query
language, other languages are used to interact with a handful of other relatively
popular
databases. Most developers will never encounter these languages in their daily
work, but for
purposes of being complete, some of these languages are IBM Business System 12,
EJB-QL,
Quel, Object Query Language, LINQ, SQLf, FSQL, and Datalog. Of particular note is
IBM
Business System 12, which preceded SQL but was, for some time, used with System R
instead of SQL due to SQL being relationally incomplete at the time.

Today, organizations with large database projects tend to choose Oracle, DB2,
Informix,
Sybase, or Microsoft SQL Server for their database platforms because of the
comprehensive
support contracts offered in conjunction with those products. Smaller organizations
or
organizations with technology-heavy staff might choose PostgreSQL or MySQL because
they
are free and offer good, community-based support.

Terminology

The term "database" is widely misused to refer to an entire database system. Oracle, for example, is not a database but a full-featured Database Management System (DBMS). In fact, a DBMS can be used to manage many databases, and as such, a database is just one part of a DBMS. In this series of articles, the terms "database system" and "database platform" are used to refer to the idea of a DBMS.

Further, most modern database systems employ the idea of the relational database,
and
they are properly called Relational Database Management Systems (RDBMS). The
distinction
between a DBMS and a RDBMS, unless critical to the understanding of a specific
topic, is not
made in these articles.

Database Interaction

Efficient interaction, efficient storage, and efficient processing are the three
key properties
of a successful database platform. In this article, we explore the first: efficient
interaction.

Interaction Category 1: Command Line Clients

Many database platforms are shipped with a simple command line utility that allows the user to interact with the database. PostgreSQL ships with psql, which gives the user extensive control over the operation of the database and over the tables and schema in the database. Oracle's SQL*Plus and MySQL's mysql are similar utilities. Collectively, these are also called SQL shells.

Interaction Category 2: GUI Clients

Another popular way to interact directly with a database is by using a graphical user interface (GUI) that connects to the database server. Oracle's proprietary SQL Developer software is one of these, although for every database on the market, there are probably at least two or three good, free GUI packages available. Figure 2 shows the "object browser" in pgAdmin III, a free administration tool for PostgreSQL databases.

Interaction Category 3: Application Development

The final method for interacting with a database is through an application. This
indirect
interaction might occur, for example, when a bank customer is withdrawing money
from an
ATM. The customer only presses a few buttons and walks away with cash, but the
software
running on the ATM is communicating with the bank's database to execute the customer's
transaction. Applications that need to interact with databases can be written in
nearly all
programming languages, and almost all database platforms support this form of
interaction.

Command Line Clients

A command line client usually provides the most robust functionality for
interacting with a
database. And, because they are usually developed by the same people who developed
the
database platform, command line clients are typically also the most reliable. On
the other
hand, effectively using a command line client to its full extent requires expert
database skill.
The "help" features of command line clients are often not comprehensive, so
figuring out
how to perform a complex operation may require extensive study and reference on the
part
of the user. Some basic usage of the PostgreSQL command line client is shown in
Figure 1.

All command line clients operate in a similar manner to that shown in Figure 1. For
users
with extensive knowledge of SQL, these clients are used frequently.

One typically accesses an SQL command line client by logging into the database
server and
running it from the shell prompt of a UNIX-like operating system. Logging into
the
server may be achieved via telnet or, preferably, SSH. In a large company, the
Information
Technology department may have a preferred application for these purposes.

GUI Clients and Application Development

GUI Clients
The simplest way to think about a GUI client is to consider it to be a
sophisticated, flashy
wrapper around a command line client. Really, it falls into the third category of
interaction,
application development, but since the only purpose of this application is to
interface with
the database, we can refer to it separately as a GUI client.

The GUI client gives the user an easy-to-use, point-and-click interface to the
internals of the
database. The user may browse databases, schemas, tables, keys, sequences, and,
essentially, everything else the user could possibly want to know about a database.
In most
cases, the GUI client also has a direct interface to a simulated command line, so
the user
can enter raw SQL code, in addition to browsing through the database. Figure 2
shows the
object browser in pgAdmin III, a free, cross-platform GUI client for PostgreSQL .
Figure 2. The object browser in pgAdmin III

With an easy tree format to identify every element of the database and access to
even more
information with a few simple clicks, the GUI client is an excellent choice for
database
interaction for many users.

Figure 3 shows the Server Information page of MySQL Administrator, the standard GUI
tool
for MySQL databases.

Figure 3. The MySQL Administrator Server Information page


Application Development

Application development is the most difficult and time-consuming of the three methods of interacting with a database. This approach is only considered when a computer
program
needs to access a database in order to query or update data that is relevant to the
program.

For example, the software running on an ATM at a bank needs to access the bank's central database to retrieve information about a customer's account and then update that
information while the transaction is being performed.

Applications that require databases can be written in virtually any programming language.
For stand-alone applications, the most popular language for database programming is
C++,
with a growing following in the C# and Java communities. For web applications, Perl
and
PHP are the most popular languages, followed by ASP (and ASP.NET) and Python.
Interest
in using Ruby with the web and databases is growing, as well.

Many database access extensions for modern programming languages exist, and they
all
have their advantages and caveats. The expert database programmer will learn these
caveats, however, and eventually become comfortable and quite skilled at
manipulating
database objects within application code.

Figure 4 and Figure 5 show the code for a simple database application written in
Perl and its
output, respectively.
With all the features of modern programming languages, extremely complex database
applications can be written. This example merely glosses over the connection,
query, and
disconnection parts of a database application.

Getting Ahead with Databases

Database Overview

You have been using databases for a few years, and you think you are at the top of
your
game. Or, perhaps, you have been interested in databases for a while, and you think
you
would like to pursue a career using them, but you do not know where to start. What is
the next
step in terms of finding more rewarding education and employment?

There are two routes people normally take in order to make them more marketable
and, at
the same time, advance their database skills. The first, earning an IT or computer
science
degree, requires more effort and time than the second, which is completing a
certification
program.

If you do not have a science, engineering, or IT degree yet and you want to keep
working
with databases and IT for at least a few years, the degree would probably be worth
the
time. For that matter, if you already have an undergraduate degree, then perhaps a
master's degree would be the right choice for you? Master's degrees typically only
require
three semesters of study, and they can really brighten up a resume. An MBA is a
good
option, too, if the idea of managing people instead of doing technical work suits
your fancy.
Your employees would probably let you touch the databases once in a while, too!

Many universities offer evening classes for students who work during the day, and
the
content of those classes is often focused on professional topics, rather than
abstract or
theoretical ideas that one would not regularly use while working in the IT field.
Online
universities like the University of Phoenix also offer IT degrees, and many busy
professionals have been earning their degrees that way for years now.

Certifications, while quite popular and useful in the late 1990s and early 2000s,
seem to be
waning in their ability to make one marketable. That said, getting a certification
is much
quicker than earning a degree and requires a minimal amount of study if the
certificate
candidate already works in the relevant field.

The certification will also highlight a job applicant's desire to "get ahead" and "stay ahead" in the field. It may not bump the applicant up the corporate food chain like an MBA might, but it could easily increase the dollar amount on the paychecks by five to ten percent or more.

If you feel like you could be making more money based on your knowledge of databases, exploring the degree and certification avenues of continuing education may be one of the best things you can do for your career.

Relational Databases

What is a Relational Database?

Popular, modern databases are built on top of an idea called "relational algebra", which defines how "relations" (e.g. tables and sequences in databases) interact within the entire "set" of relations. This set of relations includes all the relations in a single database.

Knowing how to use relational algebra is not particularly important when using
databases;
however, one must understand the implications certain parts of relational algebra
have on
database design.

Relational algebra is part of the study of logic and may be simply defined as "a set of relations closed under operators". This means that if an operation is performed on one or more members of a set, another member of that same set is produced as a result. Mathematicians and logicians refer to this concept as "closure".

Integers

Consider the set of integers, for example. The numbers 2 and 6 are integers. If you
add 2 to
6, the result is 8, which is also an integer. Because this works for all integers,
it can be said
that the set of integers is closed under addition. Indeed, the set of integers is
closed under
addition, subtraction, and multiplication. It is not closed under division,
however. This can
be easily seen by the division of 1 by 2, which yields one half, a rational number
that is not
an integer.

Database Relations

Using the integer example as a starting point, we can abstract the idea of closure
to
relations. In a relational database, a set of relations exists. For the purposes of
initially
understanding relational databases, it is probably best to simply think of a
relation as being
a table, even though anything in a database that stores data is, in fact, a
relation.

Performing an operation on one or more of these relations must always yield another relation. If one uses the JOIN operator on two tables, for example, a third table is always produced. This resulting table is another relation in the database, so we can see relations are closed under the JOIN operator.

Relations are closed under all SQL operators, and this is precisely why databases of this nature can be called relational databases.
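
As a small illustration of this closure property, consider joining two hypothetical tables (the table and column names below are only placeholders, not taken from the text above):

-- Each of these relations is a table in the database.
-- The result of the JOIN is itself a relation that can be queried further.
SELECT c.customer_name, o.order_date, o.total_amount
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id;

The result set behaves exactly like any other relation: it can be filtered, grouped, or joined again, which is what "closed under the JOIN operator" means in practice.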

Database Concurrency and Reliability

Database Concurrency and Reliability Overview

Concurrency and reliability have long been "hot topics" of discussion among developers and users of distributed systems. The fundamental problem can be seen in a simple example, as follows.

Suppose two users are working on the same part of a database at the same time. They
both
UPDATE the same row in the same table, but they provide different values in the
UPDATE.
The UPDATE commands are sent to the database precisely at the same time. What does
the
database system do about this, and what are the rules governing its decision?

ACID

When discussing concurrency and reliability, developers often talk about the
components of
ACID: atomicity, consistency, isolation, and durability. Together, these properties
guarantee
that a database transaction is processed in a reliable, predictable manner. A
transaction, in
this case, can be defined as any set of operations that changes the state of the
database. It
could be something as simple as reading a value, deciding how to manipulate that
value
based on what was read, and then updating the value.

Atomicity

The atomicity property guarantees that a transaction is either completed in full or not completed at all. Thus, the result of an operation is always success or failure, and no transaction can result in a partial completion. Essentially, by making a transaction "atomic", all the operations involved in the transaction are virtually combined into one single operation.

Two important rules provide transaction atomicity. First, as operations in a transaction occur, those operations must remain unknown to all other processes accessing the database at the same time. Other processes may see only the final product after the transaction is complete, or they will see no changes at all.

The second rule is somewhat of an extension of the first rule. It says that, if any operations involved in a transaction fail, the entire transaction fails, and the database is restored to the state before the transaction began. This prevents a transaction from being partially completed.
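
A minimal, hedged sketch of how an application relies on atomicity, written in generic SQL; the orders and line_items tables and their values are hypothetical, and exact transaction syntax varies slightly between database platforms:

BEGIN;                                            -- start the transaction
INSERT INTO orders (order_id, customer_id, order_date)
VALUES (1001, 7, CURRENT_DATE);
INSERT INTO line_items (order_id, product_id, quantity)
VALUES (1001, 55, 3);
COMMIT;                                           -- both rows appear together
-- If either INSERT fails, issuing ROLLBACK (or the failure itself) restores
-- the state that existed before BEGIN, so no half-finished order is stored.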

Database Consistency

Consistency is probably the most fundamental of the four ACID components. As such,
it is
arguably the most important in many cases. In its most basic form, consistency
tells us that
no part of a transaction is allowed to break the rules of a database.

For example, if a column is constrained to be NOT NULL and an application attempts to add a row with a NULL value in that column, the entire transaction must fail, and no part of the row may be added to the database.

In this example, if consistency were not upheld, the NULL value itself would still not be stored, but the remaining parts of the row would be added. Since no value would have been specified for the NOT NULL column, that column would end up NULL anyway and violate the rules of the database. The subtleties of consistency go far beyond an obvious conflict between NOT NULL columns and NULL values, but this example is a clear illustration of a simple violation of consistency. In Figure 1, we can see that no part of a row is added when we try to violate the NOT NULL constraint.
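
A hedged sketch of the situation described above, using a hypothetical employees table; the INSERT violates the NOT NULL constraint, so the whole statement is rejected and no part of the row is stored:

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    last_name   VARCHAR(50) NOT NULL,   -- the constrained column
    department  VARCHAR(50)
);

-- This INSERT fails: last_name is NULL, so the entire row is refused.
INSERT INTO employees (employee_id, last_name, department)
VALUES (42, NULL, 'Accounting');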

Isolation

The isolation property ensures that, if a transaction is being executed, no processes other than the one executing the transaction see the transaction in a partially completed state. A simple example of this is as follows. Suppose one customer of a bank transfers money to another customer. This money should appear in one customer's account and then in the other customer's account but never in both accounts simultaneously. The money must always be somewhere, and it must never be in two places at the same time.

Formally, isolation requires that the database's transaction history is serializable. This means that a log of transactions can be replayed and have the same effect on the database as they did originally.
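
A hedged sketch of the transfer example, assuming a hypothetical accounts table; because the two updates run inside one transaction, no other session ever sees the money missing from both accounts or present in both at once:

BEGIN;
UPDATE accounts SET balance = balance - 250 WHERE customer_id = 'A';
UPDATE accounts SET balance = balance + 250 WHERE customer_id = 'B';
COMMIT;  -- other sessions see either the state before BEGIN or the state after COMMIT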

Durability

A database system that maintains durability ensures that a transaction, once completed, will persist. This may sound like a vague definition, but it is really quite simple. If an application executes a database transaction, and the database notifies the application that the transaction is complete, then no future, unintended event will be able to reverse that transaction. A popular method of ensuring durability is to write all transactions to a log, which can be replayed from an appropriate time in the case of system failure. No transaction is considered to be complete until it is properly written to the log.

Distributed Databases

Distributed Databases Overview

Suppose you created a database for a web application a few years ago. It started with a handful of users but steadily grew, and now its growth is far outpacing the server's relatively limited resources. You could upgrade the server, but that would only stem the effects of the growth for a year or two. Also, now that you have thousands of users, you are worried not only about scalability but also about reliability.

If that one database server fails, a few sleepless nights and many hours of downtime might be required to get a brand new server configured to host the database again. No one-time solution is going to scale infinitely or be perfectly reliable, but there are a number of ways to distribute a database across multiple servers that will increase the scalability and reliability of the entire system.

Put simply, we want to have multiple servers hosting the same database. This will prevent a single failure from taking the entire database down, and it will spread the database across a large resource pool.

By definition, a distributed database is one that is run by a central management system but which has storage nodes distributed among many processors. These "slave" nodes could be in the same physical location as the "master", or they could be connected via a LAN, WAN, or the Internet. Many times, database nodes in configurations like this have significant failover properties, like RAID storage and/or off-site backup, to improve chances of successful recovery after a database failure.

Distributed databases can exist in many configurations, each of which may be used alone or combined to achieve different goals.

Distributed Database Architecture

A distributed database is divided into sections called nodes. Each node typically
runs on a
different computer, or at least a different processor, but this is not true in all
cases.

Horizontal Fragments

One of the usual reasons for distributing a database across multiple nodes is to
more
optimally manage the size of the database.

For example, if a database contains information about customers in the United States, it might be distributed across three servers, one each for the eastern United States, the mid-western United States, and the western United States.

Each server might be responsible for customers with certain ZIP codes. Since ZIP
codes are
generally arranged from lowest to highest as they progress westward across the
country,
the actual limits on the ZIP codes might be 00000 through 33332, 33333 through
66665,
and 66666 through 99999, respectively.

In this case, each node would be responsible for approximately one third of the data for which a single, non-distributed node would be. If each of these three nodes approached its own storage limit, another node or two nodes might be added to the database, and the ZIP codes for which they are responsible could be altered appropriately. More "intelligent" configurations could be imagined, as well, wherein, for example, population density is considered, and larger metropolitan areas like New York City would be grouped with fewer other cities and towns.

In a distribution like this, either the database application or the database management system, itself, could be responsible for deciding which database node would process requests for certain ZIP codes. Regardless of which approach is taken, the distribution of the database must remain transparent to the user of the application. That is, the user should not realize that separate databases might handle different transactions.

Reducing node storage size in this manner is an example of using horizontal fragments to distribute a database. This means that each node contains a subset of the larger database's rows.
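
One concrete way to express this kind of ZIP-code split is range partitioning, shown here as a hedged, MySQL-flavoured sketch; the table and column names are placeholders, and a real deployment would more likely place each range on a separate server rather than in partitions of a single table:

CREATE TABLE customers (
    customer_id INT NOT NULL,
    name        VARCHAR(100),
    zip_code    INT NOT NULL          -- stored as an integer for this example
)
PARTITION BY RANGE (zip_code) (
    PARTITION east    VALUES LESS THAN (33333),   -- 00000 through 33332
    PARTITION midwest VALUES LESS THAN (66666),   -- 33333 through 66665
    PARTITION west    VALUES LESS THAN (100000)   -- 66666 through 99999
);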

Vertical Fragments

The vertical fragment approach to database distribution is similar in concept to the horizontal fragment approach, but it does not lend itself as easily to scalability. Vertical fragments occur when columns, instead of rows, are distributed across multiple nodes.

A situation that calls for vertical fragments might arise if a table contains information that is pertinent, separately, to multiple applications. Using the previous example of a database that stores customer information, we might imagine an airline's frequent flyer program.

These programs typically track, among other things, customers' personal information, like addresses and phone numbers, along with a list of all the trips they have flown and the miles they have accrued along the way.

These sets of data have different applications: the customer information is used when mailing tickets and other correspondence, and the mileage information is used when deciding how many complimentary flights a customer may purchase or whether the customer has flown enough miles to obtain "elite" status in the program. Since the two sets of data are generally not accessed at the same time, they can easily be separated and stored on different nodes.

Since airlines typically have a large number of customers, this distribution could
be made
even more efficient by incorporating both horizontal fragmentation and vertical
fragmentation. This combined fragmentation is often called the mixed fragment
approach.
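
A hedged sketch of how the frequent flyer data above might be split vertically; the table definitions are hypothetical, and in a real distributed setup each table would live on a different node, linked by the shared customer_id key:

-- Node 1: contact details used for mailing tickets and correspondence
CREATE TABLE customer_contact (
    customer_id INT PRIMARY KEY,
    full_name   VARCHAR(100),
    address     VARCHAR(200),
    phone       VARCHAR(20)
);

-- Node 2: mileage data used for award flights and elite-status calculations
CREATE TABLE customer_mileage (
    customer_id INT PRIMARY KEY,
    total_miles INT,
    elite_level VARCHAR(20)
);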

Other Fragmentation Types

A database can be broken up into many smaller pieces, and a large number of methods for doing this have been developed. A simple web search for something like "distributed databases" would probably prove fruitful for further exploration into other, more complex, methods of implementing a distributed database. However, there are two more terms with which the reader should be familiar with respect to database fragmentation.

The first is homogeneous distribution, which simply means that each node in a distributed database is running the same software with the same extensions and so forth. In this case, the only logical differences among the nodes are the sets of data stored at each one. This is normally the condition under which distributed databases run.

However, one could imagine a case in which multiple database systems might be
appropriate for managing different subsets of a database. This is called
heterogeneous
distribution and allows the incorporation of different database software programs
into one
big database. Systems like this are useful when multiple databases provide
different feature
sets, each of which could be used to improve the performance, reliability, and/or
scalability
of the database system.

Replication

In addition to the distribution situations above, full-database replication is also available for many database platforms. This is really what we mean when we say a database is hosted by multiple servers, but in general, the idea of distributing pieces of a database should be considered before putting much thought into wholesale replication of a database. This is for one simple reason: replication is expensive. It is expensive in terms of finance, time, and data, but for many applications, it truly is the best solution.

Here, we briefly discuss "master-master" replication, which is perhaps the most complicated of all the replication solutions. This is also the most comprehensive replication solution, since each master always has a current copy of the database. Because of this, the entire database will still be available if one node fails.

The Three Expenses in Distributed Databases

Essentially, replication entails creating exact copies of databases on many computers and updating every database simultaneously whenever an update is performed on one database. The pitfalls of this process are explained by the three expenses, below.

Replication has finance expense because every server, every hard drive, every battery-backed RAID card, every network switch, every fast network connection, every battery-backed power supply, and every other piece of associated hardware must be purchased. In addition to that are the costs of bandwidth, maintenance, backup servers, co-location, remote management, and many other things. For a decent-sized database, this could very easily run into the tens of thousands of dollars before even getting to the "every hard drive" part of the list.

Replication has time expense because each operation performed on one node's database must be performed on each other node's database simultaneously. Before the operation can be said to be committed, each other node must verify that the operation in its own database succeeded. This can take a lot of time and produce considerable lag in the interface to the database.

And, replication has data expense because every time the database is replicated,
another
hard drive or two or more fills up with data pertaining to the database. Then,
every time
one node gets a request to update that data, it must transmit just as many requests
as
there are other nodes. And, confirmations of those updates must be sent back to the
node
that requested the update. That means a lot of data is flying around among the
database
nodes, which, in turn, means ample bandwidth must be available to handle it.

How to Initiate Replication

Many of the more popular databases support some sort of native replication. MySQL, for example, ships with built-in master/slave replication: the master grants the REPLICATION SLAVE privilege to a replication account, and each slave is then pointed at the master. PostgreSQL, on the other hand, has traditionally relied on external software for replication, usually in the form of Slony-I, a comprehensive replication suite. Each database platform has a different method for initiating replication services, so it is best to consult that platform's manual before implementing a replication solution.
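
As a hedged illustration only (exact syntax and best practice vary by MySQL version, and the account name, password, host, and log coordinates below are placeholders), classic MySQL master/slave replication is set up roughly like this:

-- On the master: create a replication account and grant it the privilege.
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- On each slave: point it at the master's binary log and start replicating.
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS  = 4;
START SLAVE;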

Considerations

When implementing a distributed database, one must take care to properly weigh the
advantages and disadvantages of the distribution. Distributing a database is a
complicated
and sometimes expensive task, and it may not be the best solution for every
project. On the
other hand, with some spare equipment and a passionate database developer,
distributing a
database could be a relatively simple and straightforward task.

The most important thing to consider is how extensively your database system
supports
distribution. PostgreSQL, MySQL, and Oracle, for example, have a number of native
and
external methods for distributing their databases, but not all database systems are
so
robust or so focused on providing a distributed service. Research must be performed
to
determine whether the database system supports the sort of distribution required.

The field of distributed database management is relatively young, so the appropriate distribution model for a particular task may not be readily available. In a situation like this, designing one's own distributed database system may be the best development option.

Regardless of the approach taken, distributing a database can be a very rewarding process when considering the improvement of the scalability and reliability of a system.

Business Intelligence

What is Business Data

Business data refers to the information about people, places, things, business
rules, and
events in relation to operating a business.

Serious businesses need to consider setting up business intelligence and data warehouses. In pursuing these capabilities, they need to adopt a holistic view coupled with wise investment and careful execution. For a business to really grow, it should consider interrelated areas involving people, strategy, process, applications, metrics, data and architecture.

It is very important to gather business data and base an organization's decisions on statistical reports in order to make precise decisions on how to move the company forward for sustainability.

Knowing things about people and their buying behaviors can make a company generate very important business data. For instance, statisticians and market researchers know that certain age groups have unique buying habits. Races and people from different demographic locations also have buying patterns of their own, so gathering this information in one business database can be a good way to do future target marketing.

In terms of production, business data about where to get raw materials, how much they cost, what the customs and importation policies of the raw materials' country of origin are, and other information are also very important.

There are many software applications that manage business data for easy statistical reporting and spotting of trends and patterns.


The Business Data Catalog functionality in some applications allows users to present line-of-business data. It can search and retrieve information from back end systems such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Advanced Planner and Optimizer (APO) and Supply Chain Management (SCM).

Many companies maintain a business data warehouse where data from several sources are collected and integrated every few minutes. These repositories of business data may supply needed information to generate reports and recommendations in an intelligent manner. Hence the term Business Intelligence is already widely used in the business industry today.

Business intelligence generally refers to technologies and software applications that are used to gather, integrate and analyze business data and other information pertaining to the operation of the company. It can help companies gain more comprehensive and in-depth knowledge of the many factors that can affect their business. This knowledge may include metrics on sales, internal operations and production. With recommendations from business intelligence, companies can make better decisions for the business.

For processing billions of business data records in the data warehouse for business intelligence, companies use high-powered and secure computer systems that are installed with different levels of security access.

Several software applications and tools have been developed to gather and analyze large amounts of unstructured business data ranging from sales statistics and production metrics to employee attendance and customer relations. Business intelligence software applications vary depending on the vendor, but the common attribute in most of them is that they can be customized based on the needs and requirements of the business company. Many companies have in-house developers to take care of business data as the company continues to evolve.

Some examples of business intelligence tools to process business data include scorecarding, business activity monitoring, business performance management and performance measurement, enterprise management systems and Supply Chain Management/Demand Chain Management. Free and open source business intelligence products include Pentaho, RapidMiner, SpagoBI and Palo, an OLAP database.

Business data is the core of the science of analytics. Analytics is the study of business data that uses statistical analysis to know and understand patterns and trends in order to foresee or predict business performance. Analytics is commonly associated with data mining and statistical analysis, although it leans more towards physics-like modeling which involves extensive computation.

What is Business-Driven Approach

A business-driven approach is any process of identifying the data needed to support business activities, acquiring or capturing those data, and maintaining them in the data resource.

Every day, billions of data items and pieces of information are carried across different communications media. The number one medium is of course the internet. Other data communications media are television and mobile phones.

Any non-business individual or entity may not find the real significance of these data. They are merely there because it is innate in people to communicate and get connected with each other.

But individuals or organizations who think about business take advantage of these data in a business-driven approach. They try to collect, aggregate, summarize and statistically analyze data so they know what products people may want to see and who among the people will be their target market.

Many people who started from scratch but were driven by a passion for business have created multinational corporations. Traditional vendors started from the sidewalk and took a business-driven approach to move their wares from the streets to the internet, where millions of potential buyers converge.

Today's businesses are very closely dependent on the use of technology. More and
more
transactions are going online. With the possibility of high security money
transfers, people
can buy items from continents away using just their credit cards. They can also
maintain
online accounts and make fund transfers in seconds at one click of the mouse.

Software application developers and vendors are coming up with innovative tools to help businesses optimize their performance. Large data warehouses, those repositories of all sorts of data getting bulkier every single minute, are being set up, and companies are investing in high-powered computers connected to local area networks and the internet.

To manage these data warehouses, database administrators use relational database technology with the aid of tools like Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Advanced Planner and Optimizer (APO), Supply Chain Management (SCM), Business Information Warehouse (BIW), Supplier Relationship Management (SRM), Human Resource Management System (HRMS) and Product Lifecycle Management (PLM), among thousands of others.

To keep a business stable and credible, especially when dealing with a global market using the internet, it needs to ensure the security of critical data and information. Keeping these data safe is a responsibility which requires coordinated, continuous and very focused efforts. Companies should invest in infrastructures that shield critical resources.

To keep the business afloat in all situations, it has to see to it that procurement and contract management is well maintained and coordinated. It has to wield substantial purchasing power to cope with new and highly competitive contracting methods.

A business needs to constantly evaluate its enterprise resource planning implementation and should have a distinct department to manage finance and human resources with principles that are sound.

Collecting, sharing and reporting of technology information by business organizations should be strategized to effectively manage investment in technology and at the same time reduce the burden of managing redundant and unrelated information.

All these requirements in a business-driven approach should follow a business framework. This framework will guide the company's staff and talents so that all they do will be profitable for the company. The business framework will also make it easy for application developers, especially if they are in-house developers, to tailor their outputs to make the organization function at its optimum.

A business-driven approach involves keen appreciation of tiny details, spotting and tracing of industry trends and patterns, very careful and analytical planning, constant research on the needs and preferences of potential buyers and intensive use of, and investment in, the latest technology.

What are Business Drivers

Business drivers are the people, information, and tasks that support the fulfillment of a business objective. They lead the company, trying to steer it away from pitfalls and turn unforeseen mistakes into good lessons for future success and sustainability.

A business needs to be constantly driven and updated to be at par with its competitors and to be in sync with the latest trends in business technology, which sometimes change very unexpectedly.

Technology is fast evolving and businesses that do not evolve with technology will
suffer
tremendously. The world of business is a competitive sink or swim arena. Every
single day,
a business is exposed to risks and challenges but this is just a natural thing.

The foremost key business drivers of course are the staff and talents: the people. Being the literal brains behind every business, people set the objectives, execute critical decisions and constantly innovate to move the business forward. Human resources departments are scouting for the best people in the business industry every day. Different people have different aptitudes, so it is important to employ only those with business stamina. While most companies prefer people with advanced degrees, specifically master's degrees and doctorates in business administration and the like, there are many people who have not even finished college but have innate skills for running a business effectively.

Technological innovation is another key business driver. It cannot be denied that today's businesses cannot function without the use of information technology. Hundreds of millions of data items come in every day, and for businesses to take advantage of these data to improve their products and services, they need to use technology. People alone, although they are the thinking entities, are more prone to error when it comes to doing repetitive and labor-intensive tasks. Compared to machines, people also work a lot slower.

Software applications automate daily tasks in the business. Manual recording is tedious and very repetitive work, but a software application can overcome this problem with less error and faster processing. And business is not just about recording. Every day, hundreds of computations need to be done. Product sales, procurement, employee salaries, inventory and other things need to be considered. These activities involve millions of figures, and complex formulas are needed to achieve precise results. Individual results are aggregated with other results to come up with a bigger perspective of the company operations. Tools like Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Advanced Planner and Optimizer (APO), Supply Chain Management (SCM), Business Information Warehouse (BIW), Supplier Relationship Management (SRM), Human Resource Management System (HRMS) and Product Lifecycle Management (PLM) are increasingly being employed by business organizations to boost their overall performance.

The fast-rising popularity of open source and open standards has also been a significant business driver in recent years. Open standards make possible the sharing of non-proprietary formats, so portability is no longer an issue. In the past, integration of the internal systems of a company was expensive because different computing platforms could not effectively communicate with each other. Now, with more bulk data being transferred from one data warehouse to another, open standards are making things a lot faster and more efficient.

Open source, on the other hand, gives companies free software or software with a minimal fee and thus saves the company a lot of money that can be used in other investments. Open source software applications are in no way inferior to commercial applications. With open source, anybody can make additions to the source code, so this can be a good way to customize software to the business architecture.

Business Architecture

Business architecture is one of the four layers of an IT architecture. The other three layers are information, applications and technology. Business architecture describes the business processes utilized within the organization.

If one looks at the dictionary, architecture is defined as "a unifying or coherent form or structure." As one of the layers of the IT architecture, it is extremely important to know and understand the business architecture to come up with relevant software systems. Aside from the technical aspects, information system architects should be concerned with the content and usage of the systems they are building for the business organization.

It cannot be denied that today's business is very tightly intertwined with information system technology. Even for the small home business, software applications are very much in use. It is impossible for multinational corporations to operate without business software.

A good analogy for business architecture would be the architecture of real buildings. Building architects need to understand the purpose of the building before doing the design. If they are designing homes, they need to also understand certain patterns and trends, like the behavior of families. If they design skyscrapers, they need to understand weather conditions, among other things.

In the same manner, business architects need to understand basic business concepts. They need to know the requirements of the business. For example, if the business is about manufacturing furniture, they need to know where to get the raw materials and how much they cost. They also need to know who the target clients are and how to deal with competition.

Business architects should also determine the scope of the business. How can the company grow and branch out to new areas or countries? What is the expected annual growth based on product manufacturing and sales revenues?

There are many more considerations to take into account, and once all these are in one place, a business architect starts drawing the design. The design has to cater to all aspects of the business. There is no trivial aspect in business, as tiny details can create a huge impact on the organization as a whole.

As one of the layers of the IT architecture, the business architecture is the very framework on which the other layers, information, applications and technology, are based. Business data constitute information which may be used by business software applications, which are executed by hardware technology. All these other layers operate within the business architecture framework.

Software applications are developed to simulate real-life activities. Manual transactions like recording sales and revenues, which in the past involved tallying the data in books, are automated with the use of business software.

Many business software applications have been developed to adapt to the business architecture. There are many private application software developers selling highly customizable software packages to cater to the needs of an organization. Such technologies as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Advanced Planner and Optimizer (APO), Supply Chain Management (SCM), Business Information Warehouse (BIW), Supplier Relationship Management (SRM), Human Resource Management System (HRMS) and Product Lifecycle Management (PLM), among others, can all be customized based on the requirements indicated in the business architecture.

The nature of business is fast evolving. Many traditional businesses have evolved from a local place to global proportions. One mobile phone company has evolved from wood pulp mills to rubber works to what today is perhaps one of the global leaders in technological innovation. Some insurance companies also consider themselves banks, and vice versa. These complex business evolutions need to be tracked so that the appropriate changes in the business architecture can also be taken care of by the information systems architects.

What are Business Experts

Business experts are people who thoroughly understand the business and the data
supporting the business. They know the specific business rules and processes.

In a typical company setup, a person has a long way to climb up the corporate ladder before he or she lands a top management position. While climbing the corporate ladder, he or she learns valuable lessons about the company's objectives, decision-making guides, internal and external policies, business strategy and many other aspects of the business.

The chief executive officer (CEO) is the business organization's highest ranking
executive.
He is responsible for running the business, carrying out policies of the board of
directors and
making decisions that can highly impact the company. The CEO must not only be a
business expert in general but must also be a business expert in particular about
all the
details of the business he is running. He can be considered the image of the
company and
his relationship both internally and externally with other companies is very vital
for the
success of the business.

Accountants are key business experts responsible for recording, auditing and inspecting the financial records of a business and preparing financial and tax reports. Most accountants give recommendations by laying out projected sales, income, revenue and other figures.

Marketing people, whether marketing managers or staff, are constantly on the lookout for marketing trends. They gather different statistical data and demographics so they know the target market for goods and services. They work closely with advertising people.

Business software developers are among the top notch information technology professionals. Aside from mastering the technical aspects of IT like computer languages and IT infrastructure, they must also know the very framework of the business architecture. Their applications are made to automate business tasks like transaction processing and all kinds of business related reporting.

Customer relations specialists take care of the needs of clients, especially the current and loyal ones. They make sure that clients are satisfied with the products and services. They also act like marketing staff by recommending other products and services to clients. Their main responsibility is keeping the clients happy, satisfied and wanting more.

Human resource staff take care of hiring the best minds suited for the business. Since businesses offer different products and services, the human resource staff are responsible for screening potential employees to handle the business operations. In order to match potential candidates to the business, the HR staff should also know the ins and outs of the business they are dealing with.

There are also business experts who do not want to be employed by other companies; instead they want to have their own business and be their own boss. These people are called entrepreneurs. Entrepreneurs invest their own money in the business of their choice. They conduct feasibility studies first before throwing in their money, or they may hire the services of a business consultant.

A business consultant is a seasoned business expert who has a lot of business success experience under his belt. Most business consultants are not fully attached to one company alone. Business consultants know all aspects of the business. A consultant recommends actions by studying the financial status and transaction history of the company he is offering his services to. Many companies offer business consultancy as their main line of service.

Corporate lawyers focus on laws pertaining to business. They take charge of contracts and represent companies during times of legal misunderstanding with other entities. In all undertakings, whether traditional or new, corporate lawyers have to ensure that the company does not violate any law of a given country.

What is Business Schema

A schema that represents the structure of business transactions used by clients in the real world. It is considered to be unnormalized data.

The dictionary defines the word "schema" as a plan, diagram, scheme or an underlying organizational structure. It can also mean a conceptual framework.

Running a business is not a simple undertaking. In fact, it can get very complex as the business grows, and growth is one of the priority goals of any business.

For a business to be efficiently managed, it has to have an adopted standard of rules and policies. It has to have business semantics detailed in declarative descriptions.

Business rules are very important in defining a business schema. Business rules can influence and guide behavior, lend support to policies and shape responses to environmental events and situations. Rules may be the primary means whereby a business organization directs its movement, defines its objectives and performs appropriate actions.

A business schema can be represented in a data model with un-normalized data. A data model can reflect two and a half of the four different kinds of business rules, which are terms, facts, results of derivations and constraints. A data model can reflect the data parameters that control the rules of the business.

The terms in a business schema are the precise definitions of words used in the business rules. Order, Product Type and Line Item are terms that refer to entity classes, which are things of significance to the business. Attributes are terms that describe an entity class. For example, total number and total value are attributes of an Order. Attributes of Product Type may include manufacturer, unit price and materials used. Quantity and extended value are attributes of the Line Item entity class.

Facts, another kind of business rule in the schema, describe a thing, such as the role a thing plays and other descriptions. A data model has three kinds of facts, which are relationships, attributes and super-types / sub-types.

A derivation can be any attribute that is derived from other attributes or system variables. For example, the extended value attribute of the Line Item entity class can be determined by multiplying the quantity of the line item by the unit price of the product type.

Constraints refer to conditions which determine what values a relationship or an attribute can or cannot have.
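
To make the derivation and constraint ideas concrete, here is a hedged SQL sketch using hypothetical Product Type and Line Item tables; the derived extended value is computed in a view rather than stored, and CHECK constraints express simple value rules (support for CHECK varies by platform):

CREATE TABLE product_type (
    product_type_id INT PRIMARY KEY,
    manufacturer    VARCHAR(100),
    unit_price      DECIMAL(10,2) CHECK (unit_price >= 0)    -- constraint
);

CREATE TABLE line_item (
    order_id        INT,
    product_type_id INT REFERENCES product_type (product_type_id),
    quantity        INT CHECK (quantity > 0)                  -- constraint
);

-- Derivation: extended value = quantity * unit price of the product type.
CREATE VIEW line_item_value AS
SELECT li.order_id,
       li.product_type_id,
       li.quantity * pt.unit_price AS extended_value
FROM line_item AS li
JOIN product_type AS pt ON pt.product_type_id = li.product_type_id;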

Many companies hire consultants to document and consolidate the standards, rules, policies and practices. This documentation is then handed to IT and database consultants so that it can be transformed into database rules that follow the business schema.

Many independent or third-party consultancy firms, research institutions and software application vendors and developers offer rich business schema solutions. Even though business rules are constantly changing and software applications are already in final form, these software applications can be customized to stay in sync with the constantly changing business rules in particular and the schema in general.

A business rules engine is a widely used software application that is used to manage and automate business rules so that they follow legal regulations. For example, the rule that states "An employee can be fired for any reason or no reason but not for an illegal reason" is ensured to be followed by the software.

The rule engine's most significant function is to help classify, register and manage business and legal rules and verify them for consistency. Likewise, it can infer some rules based on other existing rules and relate them to IT applications. Rules in IT applications can automatically detect unusual situations arising from the operations.

With a business schema clearly defined, there is little room for mistake in running
a
business successfully.

Business-Driven Data Distribution

Business-driven data distribution refers to the situation where the business need
for data at
a specific location drives the company to develop a data site and the distribution
of data to
the said newly developed site. This data distribution is independent of a
telecommunications
network.

Having a data warehouse can have a tremendous positive impact on a company because the data warehouse allows people in the business to get timely answers to many business related questions and problems. Aside from answers, data warehouses can also help the company spot trends and patterns in the spending lifestyles of both their existing clients and potential customers.

Building a data warehouse can be expensive. With today's businesses relying heavily on data, it is not uncommon for companies to invest millions of dollars in IT infrastructure alone. Aside from that, companies still need to spend on hiring IT staff and consultants.

For many companies, the best way to grow is to utilize the internet as the main marketplace for their goods and services. With the internet, a company can be exposed to the global marketplace. People from different places can have easy access to goods or make inquiries when they are interested. Customer relationships can be sealed a lot faster without staff having to travel to far places. The company can have a presence in virtually all countries around the world.

But with a wide market arena comes greater demand for data. A single data warehouse location may not be able to handle the bulk of data that needs to be gathered, aggregated, analyzed and reported.

This is where business-driven data distribution comes in. For instance, a company that has mainly operated in New York and London has one data warehouse to serve both offices. But since many people on the internet have come to know of the company, availed themselves of its services and have since become loyal clients, many different locations are starting to generate bulks of data. If, say, Tokyo has an exponentially increasing pool of clients, the company should already consider creating a new data warehouse in that location. This data warehouse can be the source of data to distribute to other data warehouses located in other areas but maintained by the same company.

Companies considering establishing new data warehouses for business-driven data distribution have many things to consider before jumping into the venture. As mentioned, setting up and maintaining a data warehouse does not come cheap. Some of the basic questions to answer before building the data warehouse are "Will this bring efficiency to the company operations?" or "Will this save the company money in the long run and bring in more revenue, new markets and products, competitive advantage and improved customer service?"

If all these questions come up with positive answers, then the next logical step would be to design and implement the new data warehouse that will become the new business-driven data distribution center. As with most data warehouses, the appropriate high-powered computers and servers will have to be bought and installed. Relational database software applications and other user-friendly tools should also be acquired.

As time progresses, a company may have several locations that have data warehouses acting as business-driven data distribution centers. These separate warehouses continually communicate with each other, transferring data updates every minute and synchronizing and aggregating data so business staff can get the most recent trends and patterns in the market.

With this setup, a company can have a very competitive advantage over its competitors. It can formulate new policies, strategies and business rules based on the demand revealed by the reports of the many business-driven data distribution centers around the world.

What is Business Activity

Business activities refer to the component of the information technology infrastructure representing all the business activities in a company, whether they are manual or automated.

Business activities utilize all data resources and platform resources in order to perform specific tasks and duties of the company.

For a company to survive in the business, it must have a competitive edge in the
arena
where multitudes of competitors exist. Competitive advantages can be had by having
rigorous and highly comprehensive methods for current and future rules, policies
and
behaviors for the processes within the organization.

Although business activities broadly relate to optimization practices in business done by people in relation to business management and organizational structure, they are closely and strictly associated with information technology implementation within the company and all its branches if they exist.

Business activities are part of the Information Technology Infrastructure. The Information Technology Infrastructure is the framework of the company dwelling on the approaches for best practices to be translated into software and hardware applications for optimum delivery of quality IT services.

Business activities are the bases for setting management procedures that support the company in achieving the best financial quality and value in IT operations and services.

Different companies generally have common business activities. In general, business activities may be broadly categorized into accounting, inventory, materials acquisition, human resource development, customer relationship management, and products and services marketing.

Although these broad categories may apply to many businesses, companies have differing needs within the categories. For instance, a law firm may not need to acquire raw materials as furniture companies do. Hotels need intensive marketing, while an environmental non-profit organization may not need marketing at all.

Several vendors sell IT solutions to take care of business activities and expedite manual work by automating it through computers.

An Enterprise Resource Planning (ERP) system is a software based application that integrates all data processes of a company into one efficient and fast system. Generally, ERP uses several components of the software application and hardware system to achieve the integration. An ERP system is highly dependent on relational databases and involves huge data requirements. Companies set up data warehouses to feed the ERP.

Data warehouses in business companies are often called Business Information Warehouses. They are intelligent data warehouses capable of analysis and reporting. Every day, they extract, transform and load data into the database in an intelligent fashion.

Customer Relationship Management (CRM) is another automated process, taking care of the business activity aspect of handling the clients and customers of the company. This system captures, stores and analyzes customer data so the company can come up with innovative moves to further please the clients to their satisfaction. Again, the data is stored in the data warehouse or business information warehouse.

Supply Chain Management (SCM) covers the planning, implementation and control of operations related to the storage of raw materials, work-in-process inventory and the movement of finished products from point of origin to point of consumption.

A supply management system takes care of the methods and processes involved in institutional or corporate buying. Corporate buying may include purchasing of raw materials or already finished goods to be resold.

If not for the benefits of installing an information technology infrastructure, these business activities would take a long time to finish. Doing this business work manually would not just take a long time to accomplish but could also lead to many errors and inconsistencies.

With today's business trends going towards online marketing and selling, having a good business application software system can bring many advantages and benefits to companies. Investing in an information technology infrastructure may be initially expensive, but the return on investment will be long term.

Data Tracking

Data tracking involves tracing data from the point of its origin to its final state. Data tracking is very useful in a data warehouse implementation. As is very well known, a data warehouse is a very complex system that involves disparate data coming from various operational data sources and data marts.

Hence, data keeps traveling from one server to another. Data tracking helps develop data collection proficiency at each site when proper management actions are being taken to ensure data integrity.

Some servers handle data processes by archiving raw, computed and test data automatically. Raw data are stored in exactly the format in which they were received, including the header, footer and all other information about the data.

Data tracking can be employed to improve the quality of transactions. For example, I make a withdrawal using an automated teller machine (ATM) and something unexpected happens to my transaction, which results in the machine not dispensing any money but deducting the amount from my balance as reflected in the receipt printed by the machine.

When I report this incident to the bank authorities, they can easily trace the series of events by tracking the data I entered into the machine and the activities that I and the ATM machine performed. Because the data is tracked, they can easily spot the patterns which led to the problem, and from there, they can immediately take action to improve the service.

Data tracking can also be used in cases where fraud is being committed. Inside the company, if there is an erring person, the data tracking process may involve inspecting the audit trail logs.
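
A hedged sketch of what such an audit trail might look like at the database level; the account and account_audit tables, column names and trigger syntax (MySQL-style here) are illustrative only:

CREATE TABLE account_audit (
    audit_id    INT AUTO_INCREMENT PRIMARY KEY,
    account_id  INT NOT NULL,
    old_balance DECIMAL(12,2),
    new_balance DECIMAL(12,2),
    changed_by  VARCHAR(64),
    changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Every balance change is recorded, so suspicious activity can be traced later.
CREATE TRIGGER trg_account_update
AFTER UPDATE ON account
FOR EACH ROW
INSERT INTO account_audit (account_id, old_balance, new_balance, changed_by)
VALUES (OLD.account_id, OLD.balance, NEW.balance, CURRENT_USER());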

Some websites offer aggregated data through data tracking by acquiring certain fields of data using remote connection technologies. For instance, web applications can track the sales of a company, refreshed regularly, so while a staff member is on business travel or simply on holiday, he can still see what is happening in the business operation.

In another example, a website may offer a service such as an hourly update of the global weather, and this can be done by tracking data from data sites in different geographical locations.

But despite these uses, there are also issues associated with data tracking. The issue of security and privacy is one of the biggest areas of concern with data tracking. Many websites try to install small pieces of code on web browsers so that they can track the return visits of an internet user and bring up the preferences he has specified during his previous visits.

This information about the preferences is tracked from the small piece of code copied onto the computer. This code is called a "cookie". Cookies in themselves are intended for good use, but there are many coders who have exploited them by making them track sensitive information and steal it for bad purposes.

There are many specialized tools for data tracking available as free downloads or as commercial products that come with a fee. These data tracking tools have easy-to-use graphical dashboards and very efficient back-end programs that can give users the data they need to track on a real-time basis.

Graphical data tracking applications like these make perfect monitoring tools for database or data warehouse administrators who want to keep track of the data traveling from one data mart or operational data source to another. The graphical presentation can help the administrator easily spot the erring data and have an instant idea of where that data is currently located.

With today's fast paced business environment, data tracking tools and devices can
greatly
enhance the information system of any organization in order for them to make wise
decisions and corresponding moves in the face of a challenging situation.

Intelligent Agent

An intelligent agent is a sub-system of artificial intelligence, but with far less functionality and "intelligence". An intelligent agent is simply a software tool designed to assist end users in performing repetitive tasks related to computing processes. The intelligent part is that it can act on the user's behalf when configured to respond to specific events. An intelligent agent is sometimes referred to as a bot (short for robot) and is often used for processing assistance in data mining.

There are two general types of intelligent agents: the physical agent and the temporal agent. The physical agent refers to an agent which uses sensors and other less abstract and more tangible means to do its job. On the other hand, the temporal agent may be pure code that uses time-based stored information and is triggered depending on its configuration.

From those two general types of intelligent agents, there may be five classes of
intelligent
agents based on the degree of their functionalities and capabilities. These five
are simple
reflex agents, model-based reflex agents, goal-based agents, utility-based agents
and
learning agents.

The simple reflex agent functions on the basis of its most current perception and on condition-action rules of the form "if condition then action". The success of the agent's job depends on how fully observable the environment is.

The model-based agent has its current state stored inside the agent, which maintains structures describing the parts of the world that are unseen, so this kind of agent can handle environments which are only partially observable. The behavior of this kind of agent requires information about the way the world works and behaves, and thus it is sometimes considered to have a world view model.

The goal-based agent is actually a model-based agent, but it also stores information about desirable situations and circumstances (goals), which allows the agent to choose the good options from among many possibilities.

The utility based agent uses a function that can map a state to a certain measure
of the
utility of the state.

Finally, a learning agent is a self-governing intelligent agent that can learn and adapt to constantly changing situations. It can quickly learn even from large amounts of data, and its learning can be online and in real time.

Intelligent agents have become the new paradigm for software development. The concept behind intelligent agents has been hailed as "the next significant breakthrough in software development". Today, intelligent agents are used in an increasingly wide variety of applications intended for a wide variety of industries. These applications range from comparatively small systems such as email filters to large, open, complex, mission critical systems such as air traffic control.

In large data-intensive applications like internet web servers, an example of an intelligent agent would be software for determining the rank of websites. This is very important because ranking is the basis for advertisement rates and the overall value of a website. An intelligent agent may be used to audit a website's ranking in the leading search engines around the world.

Perhaps the most ubiquitous example of an intelligent agent is found on our computers, in our antivirus and anti-spyware systems. These intelligent agents are constantly on the lookout for new virus definitions so our computers can be protected all the time.

Data Warehouse Basics

What is Operational Database

Operational Database is the database-of-record, consisting of system-specific reference data and event data belonging to a transaction-update system. It may also contain system control data such as indicators, flags, and counters. The operational database is the source of data for the data warehouse. It contains detailed data used to run the day-to-day operations of the business. The data continually changes as updates are made and reflects the current value of the last transaction.

An operational database contains enterprise data which are up to date and modifiable. In an enterprise data management system, an operational database can be seen as the counterpart of a decision support database, which contains non-modifiable data extracted for the purpose of statistical analysis. For example, a decision support database provides data from which the average salary of many different kinds of workers can be determined, while the operational database contains the same data used to calculate the amount of the workers' pay checks depending on the number of days they have reported in a given period of time.

An operational database, as the name implies, is the database that is currently and progressively in use, capturing real-time data and supplying data for real-time computations and other analytical processes.

For example, an operational database is the one used for taking orders and fulfilling them in a store, whether it is a traditional store or an online store. Other areas in business that use an operational database include catalog fulfillment systems and Point of Sale systems used in retail stores. An operational database is used for keeping track of payments and inventory. It takes information and amounts from credit cards, and accountants rely on the operational database because it must balance up to the last penny.

An operational database is also used to support IRS tax filings and regulations, which is why it is sometimes managed by IT for the finance and operations groups in a business organization. Companies can seldom run successfully without an operational database, as this database holds their accounts and transactions.

Because of the very dynamic nature of an operational database, there are certain issues that need to be addressed appropriately. An operational database can grow very fast in size and bulk, so database administrators and IT analysts must invest in high-powered computer hardware and top-notch database management systems.

Most business organizations have regulations and requirements that dictate storing data for longer periods of time for operations. This can create an even more complex setup in relation to database performance and usability. With ever increasing operational data volume, operational databases come under additional stress when processing transactions, which slows things down. As a general trend, the more data there is in the operational database, the less efficient the transactions running against the database tend to be.

There are several reasons for this. One of the most obvious is that table scans need to reference more pages of data to return results. Indexes also grow in size to support larger data volumes, and with this increase, access through the index can degrade as there are more levels to traverse. Some IT professionals address this problem with solutions that offload older data to archive data stores.

Operational databases are just part of the entire enterprise data management and
some of
the data that need to be archived go directly to the data warehouse.

What is Operational Data Store (ODS)

An Operational Data Store (ODS) is an integrated database of operational data. Its sources include legacy systems and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data.

An operational data store is basically a database that serves as an interim area for a data warehouse. As such, its primary purpose is to handle data which are progressively in use, such as transactions, inventory and data collected from Point of Sale systems. It works with a data warehouse, but unlike a data warehouse, an operational data store does not contain static data. Instead, an operational data store contains data which are constantly updated through the course of business operations.

An ODS is specially designed so that it can quickly perform relatively simple queries on smaller volumes of data, such as finding the orders of a customer or looking for available items in a retail store. This is in contrast to the structure of a data warehouse, where one needs to perform complex queries on high volumes of data. As a simple analogy, the operational data store may be a company's short-term memory, storing only the most recent information, while the data warehouse is the long-term memory which also serves as the company's historical data repository, whose stored data are relatively permanent.
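
As a rough illustration of the kind of short, current-data query an ODS is tuned for, the statement below looks up one customer's open orders. The table and column names (orders, customer_id, order_status) are hypothetical and serve only as a sketch.

-- a typical lightweight ODS lookup: one customer's open orders
SELECT order_id, order_date, order_status
FROM orders
WHERE customer_id = 1047
  AND order_status = 'OPEN'
ORDER BY order_date DESC;
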
The history of the operational data store goes back to as early as 1990, when the original ODS systems were developed and used as reporting tools for administrative purposes. But even then, the ODS was already dynamic in nature and was usually updated every day as it provided reports about daily business transactions such as sales totals or orders being filled.

The ODS of that era is now referred to as a Class III ODS. As information technology evolved, so did the ODS, with the coming of the Class II ODS, which was already capable of tracking more complex information such as product and location codes, and of updating the database more frequently (perhaps hourly) to reflect changes. And then came the Class I ODS systems, which grew out of the development of customer relationship management (CRM).

For many years, IT professionals had great problems with integrating legacy applications, as the process consumed many resources for maintenance, and other efforts had done little to take care of the needs of legacy environments. After all the experimentation and development of new technologies, there was little left for company IT resources. As IT people who have worked with legacy applications put it, the legacy environment has become the child consuming its parent.

There were many approaches taken to respond to the problems associated with legacy systems. One approach was to model data and apply information engineering, but this proved to be slow in delivering tangible results. With the growth of legacy systems came growth in complexity as well as in the data model.

Another response to legacy system problems was the establishment of a data warehouse, and this has proven to be beneficial, but a data warehouse only addresses the informational aspect of the company.

The development of the operational data store has greatly addressed the problems associated with legacy systems. Much like in a data warehouse, data from legacy systems are transformed and integrated into the operational data store; once there, the data ages and is then passed into the data warehouse. One of the main roles of the ODS is to present a collective, integrated view of the up-to-the-second operations of the company. It is very useful for corporate-wide, mission-critical applications.

On-Line Analytical Processing

On-Line Analytical Processing is processing that supports the analysis of business trends and projections. It is also known as decision support processing and OLAP. OLAP software enables companies to have real-time analysis of data stored in a database. An OLAP server is typically a separate component of an information system which contains specially coded algorithms and indexing tools to efficiently process data mining tasks with minimal impact on database performance.

OLAP uses multidimensional views of aggregate data to provide quick access to strategic information for further analysis. With this, a data user can have a fast and very efficient view of data analysis, as OLAP turns raw data into information that can be understood by users and manipulated in various ways. The multidimensional views of data that OLAP requires also come with packaged calculation-intensive capabilities and time intelligence.

OLAP is part of the wider category of business intelligence that includes ETL (extract, transform, load), relational reporting and data mining. Some critical areas of a business enterprise where OLAP is heavily used include business reporting for sales and marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.

Databases that are planned and configured for use with OLAP use a multidimensional data model which enables complex analytical and ad-hoc queries with rapid execution time. Outputs from an OLAP query are displayed in a matrix or pivot format, with dimensions forming the rows and columns of the matrix and the measures forming the values.

OLAP is a function of business intelligence software which enables data end users to easily and selectively extract data and view it from different points of view. In many cases, OLAP has aspects designed for managers who want to make sense of enterprise information as well as understand how the company fares against the competition in a certain industry. OLAP tools structure data in a hierarchical manner, which is exactly the way many business managers think of their enterprises. But OLAP also allows business analysts to rotate data and change relationships and perspectives so they get deeper insights into corporate information, enabling them to analyze historical as well as future trends and patterns.

The OLAP cube is found in the core of any OLAP system. The OLAP cube is also
referred to
as the multidimensional cube or a hypercube. This cube contains numeric facts which
are
called measures and these measures are further categorized into several dimensions.
The
metadata of the cube is often made from a snowflake schema or star schema of tables
in a
relational database. The hierarchy goes from measures which are derived from the
records
in the fact table and dimensions which are derived from dimension tables.

A claim has it that, for complex queries, OLAP cubes have the power to produce answers in around 0.1% of the time the same query would take on OLTP relational data. Aggregation is the key for OLAP to achieve this performance, and in OLAP, aggregations are built from the fact table. This is done by changing the granularity on specific dimensions and aggregating up the data along these dimensions.
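
As a sketch of how such an aggregation might be built from a fact table, the query below rolls daily sales facts up to the month and product level. The sales_fact and date_dim tables and their columns are hypothetical, and a real OLAP engine would normally materialize a result like this ahead of time rather than compute it on every request.

-- roll the daily grain of the fact table up to month and product
SELECT d.calendar_month,
       f.product_key,
       SUM(f.sales_amount)  AS total_sales,
       SUM(f.quantity_sold) AS total_units
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
GROUP BY d.calendar_month, f.product_key;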

There are different kinds of OLAP, such as Multidimensional OLAP (MOLAP), which uses database structures that are optimal for attributes such as time period, location, product or account code; Relational OLAP (ROLAP), wherein the base data and the dimension tables are stored as relational tables and new tables are created to hold aggregated information; and Hybrid OLAP (HOLAP), which is a combination of OLAP types. Many software vendors have their own versions of OLAP implementations.

On-Line Transaction Processing

On-Line Transaction Processing is processing that supports the daily business operations. It is also known as operational processing and OLTP. An OLTP database must typically allow the real-time processing of SQL transactions to support traditional retail processes, e-commerce and other time-critical applications. OLTP also refers to a class of programs that help manage or facilitate transaction-oriented applications, such as data entry and retrieval transactions, in a number of industries including banking, airlines, mail order, supermarkets, and manufacturing.
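
A minimal sketch of the kind of short transaction an OLTP system processes is shown below: it records one order line and adjusts stock as a single unit of work. The table names are hypothetical, and the exact transaction syntax (START TRANSACTION/COMMIT) varies slightly between database products.

START TRANSACTION;
-- record the order line and adjust stock as one unit of work
INSERT INTO order_lines (order_id, product_id, quantity) VALUES (50123, 907, 2);
UPDATE inventory SET quantity_on_hand = quantity_on_hand - 2 WHERE product_id = 907;
COMMIT;  -- both changes succeed together or not at all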

In today's business environment, it is impossible to run a business without having to rely on data. Processing online transactions these days increasingly requires support for transactions spanning a large network or even the global internet and may involve many companies. Because of this great demand, many new OLTP software implementations use client-server processing and brokering of software applications that enable transactions to run on various computer platforms within a network.

Today, with the ubiquity of the internet, more and more people, even those in remote areas, are now doing transactions online through an e-commerce environment. The term transaction processing is often associated with the process wherein an online shop or e-commerce website accepts and processes payments through a customer's credit or debit card in real time in return for purchased goods and services.

During the process of an online transaction, a merchant payment system automatically connects to the bank or credit card company of the customer and carries out security, fraud and other validity checks, after which authorization to take the payment follows. It is strongly advised that when a company looks for other companies to handle its online transactions and processing, it should choose one whose system infrastructure is robust, secure and reliable, giving customers a fast, seamless and secure checkout.

An OLTP implementation tends to be very large, involving very high volumes of data at any given time. Business organizations have invested in sophisticated transaction management software like Customer Information Control System (CICS) and database optimization tactics that help OLTP process very large numbers and volumes of concurrent updates on an OLTP-oriented database.

There are also many OLTP brokering programs which can distribute transaction
processing
among multiple computers on a network that can enhance the functions of an OLTP
working
on a more demanding decentralized database system. Service oriented architectures
and
web services are now commonly integrated with OLTP.

The two main benefits of using OLTP are simplicity and efficiency. OLTP helps simplify a business operation by reducing paper trails and helping draw faster and more accurate forecasts of revenues and expenses. OLTP provides a concrete foundation through timely updating of corporate data. For an enterprise's customers, OLTP allows more choices in how they want to pay, giving them more flexibility and enticing them to make more transactions. Most OLTP services are offered to customers 24 hours a day, seven days a week.

But despite the great benefits that OLTP can give to companies and their customers,
there
are certain issues that it needs to address. The main issues pertaining to OLTP are
on
security and economic costs. Because an OLTP implementation is exposed on a
network,
more specifically the internet, the database may be susceptible to hackers and
intruders
who may be waiting on the side to get sensitive information on people and their
bank and
credit card accounts.

In terms of economic cost, when a business goes offline for some steps of a process, buyers and suppliers miss out on the services and benefits of OLTP, and the smallest system disruption may mean loss of time and money. But with proper care and implementation, OLTP will remain a big help to business organizations, especially those operating on a large scale.

What is Aggregation

In the broadest sense of the word, aggregation means collecting and combining data horizontally, vertically and chronologically, and then expressing it in summary form to be used for statistical analysis. In the more technical sense, aggregation is a special kind of association that specifies a part-whole relationship between a component part and the whole.

As opposed to an ordinary association, aggregation is an asymmetric and transitive relationship. Aggregation also implies stronger coupling, and behavior is normally propagated across an aggregation.

In relational databases, aggregation refers to the combination of data from different records. For example, sales records and corporate income from one or several branches can be reported using aggregation. The process of aggregating data from several sources can be done by executing one database query.
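
Such a query might, for example, sum sales per branch in one pass. The sales table and its columns here are hypothetical and only illustrate the idea.

-- combine individual sales records into one summary row per branch
SELECT branch_id,
       SUM(sale_amount) AS total_sales,
       COUNT(*)         AS number_of_sales
FROM sales
GROUP BY branch_id;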

An example of the use of aggregation in internet advertising is to get information and trends on a particular group based on a specific variable like income, age or profession. The information gathered and the patterns spotted may be used for website personalization. This is useful when a company wants to show content to specific users, and advertising can be directly targeted to them. For instance, a site that sells movies and music on CDs and DVDs may recommend or advertise genres based on the age of the user and the aggregate data for that group.

Online analytical processing (OLAP) is a particular method used in data warehouses which uses a simple type of data aggregation, allowing the marketer to use an online reporting strategy to process the gathered information.

Data aggregation can be used for personal data aggregation services to offer a user
one
point for collection of his personal information from other websites. The user can
use a
single master personal identification number (PIN) which he can use to access a
variety of
other accounts like airlines, clubs and financial institutions. This kind of
aggregation is often
called "screen scraping".

Aggregation services are offered as standalone services or in conjunction with other services like bill payment and portfolio tracking, which are provided by other specialized websites. Many big and stable companies that use the internet for web presence and transactions offer aggregation services to entice visitors.

During the course of time, large amounts of aggregated account data from provider
to
server are transferred and may develop into a comprehensive database of user
profiles with
details of balances, securities transactions, credit information and other
information. Privacy
and security become a major issue but there are independent companies offering
these
related services.

Because of the possibility of liabilities that may arise from activities related to data aggregation, such as security issues and infringement of intellectual property rights, aggregators may agree on a data feed arrangement at the discretion of the end users or customers. This may involve using the Open Financial Exchange (OFX) standard in requesting and delivering the information to the customer. Such an agreement provides an opportunity for institutions to protect their customers' interests and for aggregators to come up with more robust services. Screen scraping without the content provider's consent can allow subscribers to see any account opened through a single website.

Another form of aggregation which is ubiquitous on the internet today is RSS syndication. RSS, which stands for Really Simple Syndication, is a small feed file that contains headlines and descriptions of news or other information on a website. RSS aggregates data from several specified sources on other websites, and the content is automatically updated at a central point on one's syndicated website. Content from RSS is read using a feed reader or an aggregator.

Automatic Data Partitioning

Automatic data partitioning is the process of breaking down large chunks of data
and
metadata at a specific data site into partitions according to the request
specification of the
client.

Data sites contain multitudes of varied data which can be extremely useful as a statistical basis for determining many trends in business. Because data in the data sites can grow at a very fast rate, the demand on internet traffic also increases. Good software with partitioning capability should be employed to manage the data warehouse. Many software applications handling data also have advanced functions like traffic shaping and policing so that sufficient bandwidth can be maintained.

Relational database management systems (RDBMS) effectively manage data sites. This type of database system follows the relational model introduced by E. F. Codd, in which data is stored in tables while the relationships among data are stored in other tables. This is in contrast to flat files, where all data is stored in one contiguous area.

Since RDBMS data is not stored in one contiguous area but is instead broken down into tables, it becomes easy to partition data, whether manually or automatically, for easy sharing and distribution.

The biggest advantage of data partitioning is that it can divide large tables and indexes into smaller parts; as a result, the system's performance can be greatly improved while contention is reduced and data availability and distribution are increased. Automatic data partitioning makes the job of the database administrator a lot easier, especially for labor-intensive jobs such as doing backups, loading data, recovering and processing a query.

Data partitioning is commonly done either by splitting selected elements or by creating smaller separate databases, each containing the basic components like tables, indexes, and transaction logs.

Horizontal partitioning is a technique where different rows are placed into different tables. For example, customers with zip codes less than 25000 are placed in a table called CustomerEast, while those with zip codes of 25000 or greater are placed in a table called CustomerWest. If users want to view a complete list of records, the database uses a view with a union, as in the sketch below.
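
A rough sketch of that view, assuming the two hypothetical tables from the example above hold the same columns, could look like this; the view stitches the horizontal partitions back together for users who need the full list.

-- rows are split by zip code into two physical tables
CREATE VIEW AllCustomers AS
SELECT * FROM CustomerEast   -- zip codes below 25000
UNION ALL
SELECT * FROM CustomerWest;  -- zip codes 25000 and above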

Vertical partitioning is another technique wherein tables are created with fewer columns, with additional separate tables used to store the remaining columns. Usually, the process involves the use of different physical storage.

Data partitioning is used in a distributed database management system, a software system which allows the management of a distributed database. A distributed database is a collection of many databases which are logically interrelated and distributed over many computers in a network. This allows certain clients to view only the data they need per their specifications, while the rest of the viewers can see all the data as one, not partitioned.

Most of today's popular relational database management systems have different criteria for partitioning data. Their only similarity is that they take a partition key and assign a partition based on some criteria.

Some of the partitioning methods used as criteria include range partitioning, list
partitioning, hash partitioning and composite partitioning.

In range partitioning, the database system selects a partition if the partitioning key is within a certain given range. For example, a partition could include all the rows where a zip code column has values between 60000 and 69999.

List partitioning is a method where a partition is assigned a specific list of values, like a list of all countries in Southeast Asia.

Hash partitioning uses the value taken from a hash function. For instance, if there are four partitions, the value returned by the function could be from 0 to 3.

Composite partitioning takes a combination of the above-mentioned partitioning methods.
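
As a hedged sketch only, since the exact DDL differs between vendors, the statements below follow MySQL-style PARTITION BY syntax for range and hash partitioning; the tables and columns are hypothetical.

-- range partitioning on a zip code column
CREATE TABLE customers_by_zip (
  customer_id INT,
  zip_code    INT
)
PARTITION BY RANGE (zip_code) (
  PARTITION p_east  VALUES LESS THAN (25000),
  PARTITION p_west  VALUES LESS THAN (70000),
  PARTITION p_other VALUES LESS THAN MAXVALUE
);

-- hash partitioning: the hash of the key assigns each row to one of four partitions (0 to 3)
CREATE TABLE orders_hashed (
  order_id    INT,
  customer_id INT
)
PARTITION BY HASH (customer_id) PARTITIONS 4;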

What is Cache

A cache is a type of dynamic, high-speed memory that is used to supplement the function of the central processing unit and the physical disk storage. The cache acts as a buffer when the CPU tries to access data from the disk, so the data traveling between the CPU and the physical disks can move at a synchronized speed. Disk reading and writing is generally slower than CPU operation.

In computer science theory, a cache is any collection of data that duplicates original values which are stored elsewhere in the computer. The original data may be expensive to fetch because of the disparity in access time between components, so a cache can act as temporary storage where the most frequently accessed data are kept for fast processing. In future processing, the CPU may just access the duplicated copy instead of getting it from the slower physical disk storage, where performance would suffer.

A cache can be either a reserved section of the computer's memory or a separate storage device with very high speed. In personal computers, there are two common types of caching, namely memory caching and disk caching.

A memory cache is sometimes known as a RAM cache or cache store. This is a portion of the random access memory (RAM) which is made of high-speed static RAM (SRAM). SRAM is faster than dynamic RAM (DRAM). When computers are executing, most programs access the same data or instructions repetitively, so storing these data or instructions in the memory cache makes performance more effective.

Other memory caches are built directly into the main body of the microprocessor. For instance, the old Intel 80486 microprocessor has 8K of memory cache while the Pentium had 16K. This cache is also called a Level 1 (L1) cache. Modern computers come with external cache memory which is called Level 2 (L2) cache. The L2 cache is situated between the CPU and the DRAM.

On the other hand, disk caching is similar to memory caching except that the disk cache uses conventional main memory instead of high-speed SRAM. Frequently accessed data from the disk storage device are stored in a memory buffer. A program first checks whether the data is in the disk cache before getting it from the hard disk. This method significantly increases performance because access speed in RAM can be as much as thousands of times faster than access speed on hard disks.

A cache hit is a term used when data is found in the cache. The cache's
effectiveness is
determined by its hit rate. A technique known as smart caching is used by many
cache
systems as well. The technique is able to recognize certain types of data which are
being
frequently used and automatically caches them.

Another kind of cache is the BIND DNS daemon which maps domain names to IP
addresses.
This makes it easier for numeric IP addresses to be matched faster with their
corresponding
domain names.

Web browsers also employ a caching system for recently viewed web pages. With this caching system, a user will not have to wait to get data from remote servers because the latest pages are in his computer's web cache. A lot of internet service providers use proxy caches for their clients to save on bandwidth in their networks.

Some search engines have indexed pages in their cache so when links to these web
pages
are shown in the search results and the actual website is temporarily offline or
inaccessible,
the search engine will give the cached pages to the user.

What is Access Path


In relational database management system (RDBMS) terminology, Access Path refers to
the
path chosen by the system to retrieve data after a structured query language (SQL)
request
is executed.

A query may request at least one variable to be filled up with one value or more. A
query
may look like this:

SELECT family_name FROM users WHERE family_name = 'Smith'

The query tells the computer to select all 'Smith' family names, which may number a few thousand from among tens of thousands within the database table. The database management system will then estimate a filter factor using the given value for the variable, and an access path will have to be determined to get to the data.

Access path selection can make a tremendous impact on the overall performance of the system. The query mentioned above is a very simple query, with one variable being matched to values from one table only. A more complex query may involve looking for many variables which can be matched against many different records in separate tables. Some of these variables may even have complex conditions, such as being greater than or less than some integer value. Many relational database makers have their own algorithms to optimize the choice of access paths while minimizing total access cost.

Optimization of access path selection may be gauged using cost formulas, with I/O and CPU utilization weights usually considered. Generally, query optimizers evaluate the available paths to data retrieval and estimate the cost of executing the statements using the determined paths or a combination of these paths.

In choosing an access path, the RDBMS optimizer examines the WHERE clause and the
FROM clause. It then lays out possible plans of execution using the determined
paths and,
with the use of statistics for the columns, index and tables accessible to the
statement, the
optimizer then estimates the cost of executing the plan.

Access path selection for joins, where data is taken from more than one table, is basically done using the nested loop and merging scan techniques. Because joins are more complex, there are some other considerations for determining access path selections for them.

In general, the most common access path selections include the following:

Full Table Scan - The RDBMS software scans all rows from the table and filters out those that do not match the criteria in the query.

Row ID Scan - This is the fastest retrieval method for a single row. The row identification (rowid) gives the exact location of the row in the specified database.

Index Scan - With this method, the RDBMS retrieves a row of records by traversing the index using the indexed column values required by the query statement. There are many types of index scans, which may include Index Unique Scans, Index Range Scans, Index Skip Scans, Full Scans, Fast Full Index Scans, Index Joins and Bitmap Indexes.

Cluster Access Scan - This is used to retrieve all rows that have the same cluster key value. The rows come from a table stored in an indexed cluster.

Hash Access Scan - This method locates rows in a hash cluster based on a hash value. All rows containing the same hash value are stored within the same data block.
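
Most database products expose the access path the optimizer picked through an explain facility, although the exact command and output format vary by vendor (EXPLAIN in MySQL and PostgreSQL, EXPLAIN PLAN FOR in Oracle). Assuming the users table from the earlier example and an index on family_name, a sketch looks like this:

-- ask the optimizer which access path it would use for the query
EXPLAIN
SELECT family_name FROM users WHERE family_name = 'Smith';
-- a typical plan reports either a full table scan or an index scan
-- on the family_name index, together with estimated row counts and cost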

A newer approach to optimizing access path selection defines both index and segment scans, the two types of scans available for SQL statements. Before the tuples are returned, simple predicates called search arguments (SARGs) are applied against the indexes. There are many more techniques under research by RDBMS vendors like Microsoft (SQL Server), Oracle, MySQL, PostgreSQL and many others.

What is an Ad Hoc Query

An Ad-Hoc Query is a query that cannot be determined prior to the moment the query is issued. It is created in order to get information when the need arises, and it consists of dynamically constructed SQL which is usually built by desktop-resident query tools. This is in contrast to a query which is predefined and performed routinely.

The term ad hoc comes from Latin and means "for this purpose". It generally refers to anything that has been designed to answer a very specific problem. An ad hoc committee, for instance, is created to deal with a particular undertaking, and after the undertaking is finished, the committee is disbanded. In the same manner, an ad hoc query does not reside in the computer or the database manager but is dynamically created depending on the needs of the data user.

In the past, for users to analyze various kinds of data, multiple sets of queries had to be constructed. These queries were predefined under the management of a database or system administrator, so a barrier existed between the users' needs and the canned information. As a result, the end user got a bombardment of unrelated data in his query results. IT resources also took a heavy toll, since a user might have to execute several different queries in any given period.

Today's widely used active data warehouses accelerate the retrieval of vital information to answer interactive queries in mission-critical applications.

Most users of data are in fact non-technical people. Every day, they retrieve seemingly unrelated data from different tables and database sources. Many ad hoc query tools exist so that non-technical users can execute very complex queries without needing to know what happens at the back end. Ad hoc query tools include features to support all types of query relationships, including one-to-many, many-to-one, and many-to-many. End users can easily construct complex queries using graphical user interface (GUI) navigation through object structures in a drag-and-drop manner.

Ad hoc queries are used intensively on the internet. Search engines process millions of queries every single second from different data sources. Keywords typed by the internet user dynamically generate an ad hoc query against virtually any database back end. As the basic structure of an SQL statement consists of SELECT keyword FROM table WHERE conditions, an ad hoc query dynamically supplies the keyword, data source and conditions without the user knowing it.
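
For instance, a query tool might assemble the statement below on the fly from a user's search for a keyword within a price range. The products table, its columns and the chosen conditions are all hypothetical and exist only at the moment the query is issued.

-- built dynamically from the user's current selections, not stored beforehand
SELECT product_name, price
FROM products
WHERE product_name LIKE '%laptop%'
  AND price BETWEEN 500 AND 900
ORDER BY price;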

Although issuing an ad hoc query against a database may be more efficient in terms of computer resources, because using predefined queries may force one to issue more than one query, an ad hoc query can also have a heavy resource impact depending on the number of variables that need to be answered.

To reduce the impact on memory due to the usage of ad hoc queries, the computer must have a huge amount of memory, very fast devices must be provided for temporary disk storage, and the database manager must prevent ad hoc queries with very high memory usage from being executed. Some database managers anticipate huge sort requirements by keeping exact-match pre-calculated result sets.

But in a general ad hoc environment, a user is discouraged from issuing an ad hoc query to produce a report based on millions of transactions from past years. Instead, users may choose data from a given range.

Because of the high potential for performance degradation when a complex ad hoc query is executed, database managers sometimes only provide a copy of the live database, to be refreshed regularly. The copy is sometimes referred to as a data warehouse, and the act of querying it as data mining.

What is Data Administration


Data administration refers to the way in which data integrity is maintained within a data warehouse.

Data warehouses are very large repositories of all sorts of data. These data may be of different formats.

To make these data useful to the company, the database running the data warehouse has to be configured so that it obeys the business data model, which reflects all aspects of the business operation. These aspects include all business rules pertaining to transactions, products, raw materials management, human resource management, customer relations and all others, down to the tiniest of details.

A data warehouse is managed by one or more database administrators depending on the

size of the data warehouse and the bulk of data being processed. The database
administrator has many responsibilities. Among his important duties are creating
and
testing backups for database recoverability; verifying and ensuring data integrity
by making
sure tables, relationships and data access are constantly monitored; maintaining
database
security by defining and implementing access controls on the databases, ensuring
that the
database is always available or uptime is at maximum; performance tuning of the
database
system of software and hardware components; and offering end user support as well
as
coordinating with programmers and IT engineers.

Many database administrators who manage large data warehouses set certain policies so that data gathering, transformation, extraction, loading, and sharing flow with as few problems as possible.

A Data Access Policy defines who has access privileges to what particular data. Certain data, like administrative data, can contain confidential information, so access is restricted from the general public. Other data are freely available to all staff members of the company so that they can view particular trends in the industry and be guided in coming up with needed innovations in their products or services.

A Data Usage Policy is a guide so that data will not be misused or distributed freely when it is not intended to be. The use of data falls into several categories, such as update, read only or external dissemination.

Most data warehouses have a data integration policy in which all the data are represented within a single logical data model from which the physical data models take their data. The database administrator should be extra keen on details in developing such a model and all corresponding data structures and domains. All the enterprise's data needs are considered in the development and future modification of domains, values and data structures.

Data application control policies are implemented to ensure that all data taken from the warehouse are handled with appropriate care to ensure integrity, security and availability. It is not uncommon in business organizations for staff to save data locally on division computers, discs and networks, and these data are often saved in different kinds of applications like Word and Excel. Database administrators are responsible for access controls for the different kinds of data formats so that resources are not overloaded and data warehouses are protected from unauthorized modification or disclosure, which may result in data loss or loss of integrity.

Relational databases, which are the most widely used kind of database today, especially in very large warehouses, have very complex structures and need to be carefully monitored. They need to strictly obey the business rules of the enterprise as well as stay in sync with all the data models. Investing in robust software applications, which are available from a wide array of software developers and vendors, and hiring a highly trained and skillful database administrator can certainly be a company asset.

What is Data Aggregation

In Data Aggregation, value is derived from the aggregation of two or more contributing data characteristics.

Aggregation can be made from different data occurrences within the same data
subject,
business transactions and a de-normalized database and between the real world and
detailed data resource design within the common data architecture.

Reporting and data analysis applications that work closely to tie together company data users and data warehouses need to overcome problems with database performance. Every single day, the amount of data collected increases at exponential proportions. Along with this increase, the demand for more detailed reporting and analysis tools also increases.

In a competitive business environment, the areas that are given more focus to gain a competitive edge over other companies include timely financial reporting, real-time disclosure so that the company can meet compliance regulations, and accurate sales and marketing data so the company can grow a larger customer base and thus increase profitability.

Data aggregation helps company data warehouses piece together different kinds of data within the data warehouse so that they gain meaning that is useful as a statistical basis for company reporting and analysis.


But data aggregation, when not implemented well using good algorithms and tools, can lead to data reporting inaccuracy. An ineffective approach to data aggregation is one of the major factors that can limit the performance of database queries.

Statistics have shown that 90 percent of all business related reports contain
aggregate
information making it essential to have proactive implementation of data
aggregation
solutions so that the data warehouse can substantially generate data for
significant
performance benefits and subsequently open many opportunities for the company to
have
enhanced analysis and reporting capabilities.

There are several approaches to achieving efficient data aggregation. Having robust and high-powered servers will make the database perform incrementally better. Another approach is to do partitioning, de-normalization, and creation of OLAP cubes and derivative data marts. Report caching and broadcasting can also help boost performance. And another method is having a summary table.
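
As a sketch of the summary table method mentioned above, the aggregate below is computed once and then queried directly instead of re-scanning the detail rows. The table and column names are hypothetical, and the CREATE TABLE ... AS SELECT form varies slightly between vendors (SQL Server, for example, uses SELECT ... INTO).

-- pre-compute monthly sales per product into a summary table
CREATE TABLE monthly_sales_summary AS
SELECT product_id,
       sales_month,
       SUM(sale_amount) AS total_sales
FROM sales_detail
GROUP BY product_id, sales_month;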

But while these approaches have been proven and tested, they may have some
disadvantages in the long run. In fact those approaches have already been lumped
among
the traditional techniques by some database and data warehouse professionals.

Top data warehouse experts recommend that a good, well-defined, enterprise-class solution architected to support dynamic business environments brings more long-term benefits to data aggregation. Enterprise-class solutions provide good methods to ensure that the data warehouse has high availability and easy maintenance.

Having a flexible architecture also allows for future growth and flexibility, and most business trends nowadays tend to lean towards exponential growth. The data architecture of data warehouses should use standard industry models so it can support complex aggregation needs. It should also be able to support all kinds of reports and reporting environments. One way to test whether the data warehouse is optimized is whether it can process pre-aggregation along with aggregation on the fly.

Data warehouses should be scalable as the amount of data will definitely grow very
fast.
Especially now that new technologies like RFID can allow gathering of more
transactional data,
scalability will be important for the future data needs of the company.

Data aggregation can really grow into a complex process over time. It is always good to plan the business architecture so that data stay in sync between real activities and the data model simulating the real scenario. IT decision makers need to make careful choices of software applications, as there are hundreds of options that can be bought from software vendors and developers around the world.

What is Data Collection Frequency

Data Collection Frequency, just as the name suggests, refers to the frequency at which data is collected at regular intervals. This can refer to any time of the day or of the year, over any given length of period.

In a data warehouse, the relational database management systems continually gather,

extract, transform and load data onto the storage system. Along with these
processes, there
could be a potentially large number of data consumers simultaneously accessing data

warehouses getting aggregated data reports for statistical analysis of both company
and
industry trends.

Having a log of the data collection frequency of a data warehouse is very important
for a lot
of reasons.
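
For instance, if every load event is written to a hypothetical load_log table, a simple query can report how often data was collected on each day; this is only an illustrative sketch, not part of any particular product.

-- count how many data loads arrived on each day
SELECT load_date, COUNT(*) AS loads_per_day
FROM load_log
GROUP BY load_date
ORDER BY load_date;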

For one, knowledge about data collection frequency is extremely important in recording inventory transactions. A company can take a physical inventory to be used later in reconciling the quantities recorded in stock. The inventory transaction records can be used to record inflows (receipts) and outflows (shipments) of inventory at loading stocks. At the final stage, the inventory transaction data will be batch updated, along with extensive and intensive error checking for quantities, negative inventories, validated warehouse amounts and valid serial numbers.

All these activities are monitored by the data warehouse system, and data collection frequency will be useful in analyzing many things, such as whether transactions were legal or illegal, along with much other related useful information.

In a company data warehouse, data collection solutions are important because they enable the business organization to have real-time information and visibility into the supply chain. This can greatly improve decision-making processes, the accuracy of customer information, product or service sales and material availability, and the reporting of data warehouse operations. The data collection frequency can also help increase return on investment (ROI) through improved equipment and labor productivity.

Data collection frequency is of particular help in advertising and marketing for determining the media exposure of, say, an e-commerce website. An e-commerce website needs to have intensive media exposure. On the internet, where e-commerce takes place, there are thousands of competitors. These competitors will do all they can to get top exposure to internet users and buyers. One way to do this is to get top ranks in search engines.

The data collection frequency record is a good determinant of the media exposure of the e-commerce site and the products and services it offers. The record of the frequency of data collected can be used to calculate the number of prospects which have been reached with different media vehicles at different levels of frequency of exposure.

Sometimes, data warehouses can experience problems both hardware and software in nature. To troubleshoot problems, IT professionals generally look at the logs to see at which point the system encountered the problem. Having a record of data collection frequency can give the troubleshooter some hints about the problem; for instance, at some point data collection may have been so heavy that it made processing intensive enough to cause a hardware breakdown.

Business intelligence is fast evolving and has long been a critical component of a
company's
daily operations. As it continues to evolve, the need for real time data warehouse
which can
provide data consumers with rapid updates becomes even more demanding.

Many companies are finding that they need to refresh their data warehouses on a more frequent basis because business intelligence tools are being used more and more often for decision making in operations. According to many data warehouse specialists, a data warehouse is not just about loading data for business analysts to forecast; it is more about daily decisions.

With real-time data collection, database managers and data warehouse specialists will surely make more room for recording data collection frequency.

What is Data Completeness

In any data resource, it is essential to meet the requirements of current as well as future demand for information. Data completeness assures that this criterion is fulfilled.

Data completeness refers to an indication of whether or not all the data necessary
to meet
the current and future business information demand are available in the data
resource.

It deals with determining the data needed to meet the business information demand and ensuring those data are captured and maintained in the data resource so they are available when needed.

A data warehouse has several main processes. These processes should be carefully carried out by the data warehouse administrator in order to achieve data completeness. The processes are as follows:

� Data Extraction � the data in the warehouse can come from many sources and of
multiple data format and types with may be incompatible from system to system. The
process of data extraction includes formatting the disparate data types into one
type
understood by the warehouse. The process also includes compressing the data and
handling
of encryptions whenever this applies.

� Data Transformation � This processes include data integration, denormalization,


surrogate key management, data cleansing, conversion, auditing and aggregation.

� Data Loading � After the first two process, the data will then be ready to be
optimally
stored in the data warehouse.

� Security Implementation � Data should be protected from prying eyes whenever


applicable as in the case of bank records and credit card numbers. The data
warehouse
administrator implements access and data encryption policies.

- Job Control - this process is the constant job of the data warehouse administrator and
his staff. It includes job definition, time- and event-based job scheduling, logging,
monitoring, error handling, exception handling and notification.
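
To make the first three steps concrete, here is a minimal, illustrative Python sketch of
extraction, transformation and loading. The file name, field names and target table are
hypothetical; a real warehouse load would use dedicated ETL tooling rather than a short
script like this.

    import csv
    import sqlite3

    def extract(path):
        # Read source records from a CSV file (one possible source format).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Standardize formats and drop obviously incomplete records (a tiny cleansing step).
        cleaned = []
        for row in rows:
            if not row.get("customer_id"):
                continue  # drop incomplete records
            cleaned.append({
                "customer_id": row["customer_id"].strip(),
                "amount": round(float(row["amount"]), 2),
            })
        return cleaned

    def load(rows, db_path="warehouse.db"):
        # Store the unified records in a warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("daily_sales.csv")))  # hypothetical input file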

One measure of a data warehouse's performance is the availability of useful data, which
is also an indication of the success of a business organization in reaching its own goals.
All data can be imperfect in some fashion and to some degree. The goal of the data
warehouse manager is to pursue data that is good enough for its consumers without
expending more resources than the value it creates. The data warehouse manager and his
staff should come up with strategies to provide substantial accuracy and timeliness of
data at a reasonable cost, so as not to burden the company with extra expenses.

In most cases, data warehouses are available twenty-four hours a day, seven days a week.
Regular updates should be done so that comprehensive data is gathered, extracted, loaded
and shared within the data warehouse. Parallel and distributed servers target worldwide
availability of data, so data completeness can be supported by investing in high-powered
servers and robust software applications. Data warehouses are also designed for
customer-level analysis, aside from organizational-level analysis and reporting, so
flexible tools should be implemented in the data warehouse database to accommodate new
data sources and support metadata. Reliability can be achieved when all these are
considered.

Success in achieving data completeness in a warehouse is not just dependent on the
current status of the database and its physical set-up. At the planning stage, every
detail about the data warehouse should be carefully scrutinized. All other frameworks of
the data warehouse should also be carefully planned, including the details of the
business architecture, business data, business schema, business activities, data model,
critical success factors, metadata, comprehensive data definitions and other related
aspects of organizational functions.

Having complete data gives accurate guidance to the business organization's decision
makers. With complete data, the statistical reports that are generated will reflect the
accurate status of the company, how it is faring against the trends and patterns in the
industry, and how to make innovative moves to gain competitive advantage over the
competitors.

What is Data Compression

Data Compression is a method by which the storage space required for storing data is
reduced with the help of mathematical techniques. Data compression is also referred to as
source coding: the process of encoding information using fewer bits than the unencoded
representation would require.

As a real-life, non-digital analogy, the word "development" could be compressed to "dev't"
or "dev". Despite using fewer letters, all three forms convey the same meaning to a
person, with the benefit of saving space on the computer and saving paper and ink in
printing.

In the more technical and mathematical sense, data compression is the application of
algorithms that reduce the number of bits in a data file. Most software applications for
compressing data use a variation of the LZ adaptive dictionary-based algorithm to reduce
file sizes without changing the meaning of the data. "LZ" refers to the names of the
creators of the algorithm, Lempel and Ziv.

Data compression is very useful in two main areas: resource management and data
transmission. With data compression, consumption of expensive resources like hard disk
space can be greatly reduced. The downside is that compressed data often needs extra
processing for decompression, so extra hardware may be needed.

In terms of transmission, compressed data helps save bandwidth, and as a result a company
may not need to spend extra money on bandwidth. But as with any communication, a protocol
needs to exist between the sender and receiver to get the message across.

There are two main types of data compression, namely lossless compression and lossy
compression. Lossless compression does not discard any information; it re-encodes the data
so that it needs fewer bits yet can be restored exactly. Lossy compression, as the name
implies, permanently discards some of the bits in order to achieve a smaller size.

Lossless compression lets one recreate the original file exactly, while lossy compression
trades away some of the original information for smaller files that are easier to store
and transmit; what is reconstructed at the target site after transmission is only an
approximation of the original.

In lossy image compression, for instance, a picture may have a nice blue sky, but the
file size is big and the user may want to reduce it without visibly compromising the
quality of the blue color. To make this possible, the program changes the color value of
particular pixels. Because the picture has lots of blue, the program picks one shade of
blue and uses it for every sky pixel. An algorithm rewrites the file so that every sky
pixel refers to the picked shade, and the redundancy of storing many slightly different
shades of blue is removed.

Lossy compression is especially useful in internet applications, where files are broken
into packets for transmission and smaller files mean faster transfers. One problem with
lossy compression is that the result depends on how the receiving application interprets
the compressed data from the source. Data that needs to be reproduced exactly, such as
databases, cannot use lossy compression. But its benefit is the big reduction in file
size.

Some examples of lossless data compression include:

- Entropy encoding
- Burrows-Wheeler Transform
- Prediction by Partial Matching (also known as PPM)
- Dictionary coders (LZ77, LZ78 and LZW)
- Dynamic Markov Compression (DMC)
- Run-length encoding and context mixing

Examples of lossy data compression include:

- Vector quantization
- A-law and Mu-law companders
- Distributed Source Coding Using Syndromes (for correlated data)
- Discrete Cosine Transform
- Fractal compression
- Wavelet compression
- Modulo-N code for correlated data and linear predictive coding
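
As a small illustration of the lossless category, the following Python sketch implements
naive run-length encoding, one of the techniques listed above. It is a teaching example,
not a production codec, and it assumes the input contains no digit characters.

    import re

    def rle_encode(text):
        # Collapse runs of the same character into count+character pairs.
        out = []
        i = 0
        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1
            out.append(f"{j - i}{text[i]}")
            i = j
        return "".join(out)

    def rle_decode(encoded):
        # Rebuild the original text exactly, which is what makes the scheme lossless.
        return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))

    sample = "aaaabbbccd"
    packed = rle_encode(sample)        # "4a3b2c1d"
    assert rle_decode(packed) == sample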

What is Data Concurrency

Data Concurrency ensures that the official data source and any replicated data values are
consistent; that is, whenever data values in the official data source are updated, the
corresponding replicated data values must also be updated via synchronization in order to
maintain consistency.

In a single-user database, each transaction is processed serially, so there is no
contention or interference from other transactions. But in a large data warehouse
environment, there could be hundreds or thousands of users and data consumers from many
different locations trying to access the warehouse simultaneously. Therefore, a
single-user database will not do.

In a multi-user database powering a data warehouse, transactions are executed
simultaneously from a wide array of sources. Each of these transactions has the potential
to interfere with other running transactions within the database. So it is good practice
to isolate transactions from each other within the multi-user environment. But there must
be a way of collating the transaction data so that the data warehouse can come up with
aggregated reports.

Allowing more than one application or other data consumers to access the same data
simultaneously while being able to maintain data integrity and database consistency
is the
main essence of data concurrency.

Because transactions are isolated from each other, data will inevitably be replicated.
For example, two friends and I are simultaneously buying the same item from the same
e-commerce site, along with a thousand others from different parts of the globe. We are
technically performing the same kind of transaction, but unbeknown to us, our
transactions are processed in isolation in the backend database. Yet the database treats
all of us as using the same data simultaneously.

When multiple users attempt to modify data at the same time, some level of control should
be established so that one user's modifications do not adversely affect another's. The
process of controlling this is called concurrency control.

There are three common ways that databases manage data concurrency, and they are as
follows:

1. Pessimistic concurrency control - in this method, a row is unavailable to other users
from the time the record is fetched until it is updated in the database.

2. Optimistic concurrency control - with this method, a row is unavailable to other users
only while the data is actually being updated. During the update, the database examines
the row to determine whether any change has been made since it was read. An attempt to
update a record that has already been changed is flagged as a concurrency violation (a
small sketch of this check appears after this list).

3. Last in wins - with this method, a row is unavailable to other users only while the
data is actually being updated, but no effort is made to compare the update with the
original record. The record is simply written out, potentially overwriting any changes
made by other concurrent users since the record was last refreshed.
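
A minimal sketch of the optimistic approach (method 2 above), assuming a hypothetical
version number on each row: the update succeeds only if the row has not changed since it
was read.

    class ConcurrencyViolation(Exception):
        pass

    # A toy in-memory "table": row id -> {"value": ..., "version": ...}
    table = {1: {"value": "20kg Rice", "version": 1}}

    def read_row(row_id):
        row = table[row_id]
        return row["value"], row["version"]

    def optimistic_update(row_id, new_value, version_read):
        row = table[row_id]
        # If another user changed the row after we read it, flag a concurrency violation.
        if row["version"] != version_read:
            raise ConcurrencyViolation(f"row {row_id} was modified by another user")
        row["value"] = new_value
        row["version"] += 1

    value, ver = read_row(1)
    optimistic_update(1, "25kg Rice", ver)        # succeeds
    try:
        optimistic_update(1, "30kg Rice", ver)    # stale version: rejected
    except ConcurrencyViolation as e:
        print(e)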

Some relational database management systems have multi-version concurrency control. This
works by automatically providing read consistency to a query, so that all data seen by
the query comes from a single point in time, a property known as statement-level read
consistency. Another form is transaction-level read consistency. The RDBMS uses
information stored in rollback segments, which contain old values of data that have been
altered by recently committed or uncommitted transactions, to build the consistent view.
Consistency and concurrency are closely related to each other in database systems.

What is Data Conversion

Data Conversion, as the name implies, deals with changes required to move or
convert data
from one physical environment format to that of another, like moving data from one
electronic medium or database product onto another format.

Every day, data is shared from one computer to another. This is a very common activity,
especially in data warehouses, where database servers gather, extract, transform and load
data from different sources at every moment. Since these data are gathered and shared
among computers which may have different hardware and software platforms, there should be
a mechanism for dealing with the data so that each server receiving them can understand
what information the data contains.

Data conversion is the technical process of changing the bits contained in the data from
one format to another for the purpose of interoperability between computers. The simplest
example of data conversion is a text file converted from one character encoding to
another. More complex conversions involve office file formats and audio, video and image
file formats, which need to take into account the different software applications used to
play or display them.

Data conversion can be a difficult and painstaking process. While it may be easy for a
computer to discard information, it is difficult to add information, and adding
information is not simply padding bits; it sometimes involves human judgment. Upsampling,
the process of converting data to make it more feature-rich, is not about adding data; it
is about making room for additions, which may also require human judgement.

To illustrate, a true-color image is easy to convert to grayscale, but not the other way
around. A Unix text file can be converted to a Windows text file by simply adding a CR
byte before each line break, but adding color information to a grayscale image cannot be
done programmatically, because only human judgment can tell which colors are appropriate
for each section of the image; this is not a rule-based task that a computer can easily
perform.

Despite the fact that data conversion can be done directly from one format to another
desired format, many applications use a pivot encoding when converting data files. For
instance, converting Cyrillic text files from KOI8-R to Windows-1251 is possible with a
direct conversion using a lookup table between the two encodings. But more commonly the
KOI8-R file is first converted to Unicode and then to Windows-1251, because of
manageability benefits: maintaining lookup tables for all permutations of character
encodings would require hundreds of tables.
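
A minimal Python sketch of this pivot approach: the bytes are first decoded from KOI8-R
into Unicode text and then re-encoded as Windows-1251. The sample string is made up for
illustration.

    # Hypothetical KOI8-R encoded bytes for the Russian word "данные" ("data").
    koi8_bytes = "данные".encode("koi8_r")

    # Step 1: decode the source encoding into Unicode (the pivot representation).
    text = koi8_bytes.decode("koi8_r")

    # Step 2: encode the Unicode text into the target encoding.
    cp1251_bytes = text.encode("cp1251")

    # Round-tripping back through Unicode shows no information was lost.
    assert cp1251_bytes.decode("cp1251") == "данные"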

Data conversion sometimes results in loss of information. For instance, converting a
Microsoft Word file to a plain text file results in a lot of data loss because the text
file cannot carry the Word formatting features. To prevent this from happening, the
target format must support the same data constructs and features as the source format.

Inexactitude can also result from data conversion, meaning that the result of the
conversion is conceptually different from the source file. An example is the WYSIWYG
paradigm found in word processors and desktop publishing applications compared with the
structural, descriptive paradigm found in XML and SGML.

It is important to know the workings of both the source and the target format when
converting data. If the format specifications are unknown, reverse engineering can be
applied to carry out the conversion; this can attain a close approximation of the
original specification, although there is no assurance that it is free of errors or
inconsistencies. In any case, there are applications that can detect errors so that
appropriate actions can be taken.

What is Data Fragmentation

A data warehouse is implemented in an organisation with the help of a data architecture
schema. Elements which are specific to the company or organisation are defined in the
data architecture schema. For instance, the administrative structure should be designed
according to the real-life undertakings of the company's administrative department so
that data resources can be managed to mirror that department.

In addition, there should be a database technology description of the methodologies to be
used in storing and manipulating the defined data. A well-designed interface for mapping
data in other systems also helps in implementing a data warehouse, along with an
infrastructure layout describing how to support common data operations such as data
imports, emergency procedures, data backups and external data transfers.

When there is no proper guide or framework for the implementation of a data warehouse,
common mistakes which could easily have been avoided will occur. A common pitfall is
having different operations performed on the data, which makes it difficult to control
the flow within the system. The result is data fragmentation, which can have a tremendous
negative impact, such as increased cost. The problem of data fragmentation is typically
encountered in rapidly growing business organizations or those having different lines of
business within one company.

A company acquiring another business or going through a merger may experience difficulty
with data fragmentation. For instance, in manufacturing and retail, multiple order entry
or order fulfillment systems may be poorly integrated, resulting in fragmented data
stored in different places in different locations. In another instance, poor integration
of delivery services over multiple channels such as the web, retail offices and call
centers can also result in data fragmentation.

The fundamental cause of data fragmentation often lies in the complexity of the IT
infrastructure, especially when there is no integrated architectural foundation to
support the interoperation of large volumes of heterogeneous data from various
applications and business data accumulated over many years. It is not uncommon for
business organizations to undergo significant changes in business rules, so the IT
infrastructure should evolve as well, and this means the company has to invest more.

In many companies, more than 50 percent of the budget for IT operations is spent on
building and maintaining points of integration, especially when dealing with legacy
systems dedicated to the supply chain, finance, customer relationship management and
other mission-critical aspects of the business.

Problems related to data fragmentation can be serious and relatively difficult and costly
to address. But they can be prevented with a good data architecture and an IT
infrastructure design that takes into consideration the future growth of the business
organization.

The data architecture phase of information system planning, when properly and carefully
executed down to the tiniest detail, can force a company to specify and draw the line
between internal and external flows of information. A company should be keen on seeing
patterns developed over the years and trends for the future. At this particular stage, it
is highly possible that a company can already identify costly pitfalls and shortfalls
related to information, disconnections between departments and branches, and
disconnections between current and future business endeavors. At this stage alone, more
than half of the problems stemming from data fragmentation can be prevented.

What is Data Flow Diagram

The Data Flow Diagram (DFD) is commonly used for the visualization of structured-design
data processing: the normal flow of data is represented graphically. A designer typically
draws a context-level DFD first, showing the interaction between the system and the
outside entities. This context-level DFD is then exploded in order to show further
details of the system being modeled.

Larry Constantine invented the first data flow diagrams based on Martin and
Estrin's data
flow graph model of computation.

A DFD is one of the three essential perspectives of the Structured Systems Analysis and
Design Method (SSADM). In this method, both the project sponsors and the end users need
to collaborate closely throughout all stages of the evolution of the system. Having a DFD
makes the collaboration easier because end users can visualize the operation of the
system, gain a better perspective on what the system will accomplish, and see how the
whole project will be implemented.

Project implementation can also be made more efficient, especially in progress
monitoring. The DFD of the old system can be laid side by side with the new system's DFD
so that comparisons can be made, weak points identified and appropriate improvements
developed.

There are four components of a data flow diagram, which are the following:

External Entities / Terminators - these refer to parts outside the system being developed
or modeled. Terminators, depending on whether data flows into or out of the system, are
often called sinks or sources. They represent where the information comes from or where
it goes.

Processes - these modify the inputs to produce the corresponding outputs.

Data Stores - any place or area of storage where data is placed, whether temporarily or
permanently.

Data Flows - the way data is transferred from one terminator to another, or through
processes and data stores.

As a general rule, every page in a DFD should not contain more than 10 components. If
there are more than 10 components in one process, one or more components should be
combined, and another DFD should be made to detail the combination on another page.

Each component needs to be numbered, and so does each subcomponent, so that the diagram
is easy to follow visually. For example, a top-level DFD may have components numbered
1, 2, 3, 4, 5, and the next-level subcomponents of component 2 would be numbered 2.1,
2.2, 2.3 and so on.

There are two approaches to developing a DFD. The first is the Top-Down Approach, where
the DFD starts at the context level and the system is then slowly decomposed until the
graphical detail reaches a primitive level.

The other approach, the Event Partitioning Approach, was described by Edward Yourdon in
Just Enough Structured Analysis. In this approach, a detailed DFD is constructed from the
list of all events. For every event, a process is constructed, and each process is then
linked with other processes through data stores. Each process's reaction to a given event
is modeled by an outgoing data flow.

There are many DFD tools available in the market today, including Microsoft Visio,
ConceptDraw, Dia, SmartDraw and SILVERRUN ModelSphere. Most of these tools have
drag-and-drop capabilities and analysis features that help a designer spot potential
flaws in the diagram instantly so that corrections can be made immediately.

What is Data Dictionary

From a general information technology perspective, a data dictionary is a set of metadata
which contains the definition and representation of data elements. From the perspective
of a database management system, a data dictionary is a set of tables and views which can
only be read, never updated directly by users.

When implementing a data warehouse which is managed by a relational database management
system, having a data dictionary is a requirement. The benefit of having a data
dictionary is that data items will always be consistent in whichever tables of the
enterprise database they may be stored. For instance, telephone numbers may be stored in
several different tables in different locations.

It is a known fact that telephone numbers are written down in different ways by different
people. With a data dictionary, the format of telephone numbers within the whole
organization will always be the same, and hence consistency is maintained.

Most data dictionaries contain a variety of information about the data used in the
enterprise. In terms of the database representation of the data, the data dictionary
defines all schema objects including views, tables, clusters, indexes, sequences,
synonyms, procedures, packages, functions, triggers and many more. This ensures that all
these objects follow the one standard defined in the dictionary. The data dictionary also
records how much space has been allocated for, and/or is currently in use by, each of the
schema objects.

Other information defined in a typical data dictionary related to the database
implementation includes default values for database columns, the names of the database
users, the users' privileges and limitations, database integrity constraint information,
and other general information.

A data dictionary is in fact itself a small database, as it contains data about data. It
is typically structured in tables and views just like the other data in a database. Most
data dictionaries are central to a database and are a very important tool for all kinds
of users, from data consumers to application designers to database developers and
administrators.

A data dictionary is used when finding information about users, objects, schemas and
storage structures. Every time a data definition language (DDL) statement is issued, the
data dictionary is modified.
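
This behaviour can be observed in most database products. As an illustrative sketch,
SQLite exposes its catalog through the built-in sqlite_master table, and issuing a DDL
statement from Python updates it automatically (sqlite_master is specific to SQLite;
other systems expose catalog views such as an information schema instead).

    import sqlite3

    con = sqlite3.connect(":memory:")

    # Issue a DDL statement; the engine records the new object in its catalog.
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, phone TEXT)")

    # Read the data dictionary (catalog) to see the definition that was stored.
    for name, sql in con.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
        print(name, "->", sql)

    con.close()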

Organizations trying to develop an enterprise-wide data dictionary need to provide both
representational definitions and semantics for data elements. Semantics refers to the
aspects of meaning expressed in language; in the same manner, the semantics component of
an enterprise-wide data dictionary focuses on creating a precise meaning for each data
element. The representational definition, on the other hand, defines the way a data
element is stored in the computer, such as its data type (string, integer, float, double)
or data format.

Glossaries are similar to data dictionaries except that glossaries are less precise and
contain only terms and definitions, not detailed representations of data structures. Data
dictionaries may start with a simple collection of data columns and definitions of the
meaning of each column's content, and may then grow at a high rate.

Data dictionaries should not be confused with data models, because the latter usually
include more complex relationships between elements of data.

When discrete logic is added to the definitions of data elements, a data dictionary can
evolve into a full ontology.

What is Data Dimension


Data dimensions are mainly used in data warehouse implementations. A data warehouse is
implemented so that organizations can profit from data-driven operations, which
constitute a major component of running businesses these days.

To be effective with data-driven operations, the data which is the basis for statistical
results and trending should be accurate and timely. To achieve timeliness, a company
should invest in top-of-the-line server hardware technologies, including fast computers
and network equipment, a task which is relatively easy as long as there is money.

But in order to achieve accuracy, the data warehouse should be based on a carefully
planned data architecture that reflects real-life business rules. This process is not
just expensive; it also takes a lot of time and careful attention to the tiniest of
details so that the data architecture reflects real-life business operations.

The data dimensions employed in a data warehouse are designed to compartmentalize the
data in the warehouse. Data dimensions provide structured labeling for otherwise
unordered numeric measures. To illustrate this, think of a sales receipt. A sales receipt
may contain several pieces of information. Data such as "Date", "Customer Name" and
"Product" are all data dimensions which give meaning to the measures on a sales receipt.
A dimension is, to some degree, similar to a categorical variable in statistics.

There are three main functions of a data dimension: filtering, grouping and labeling. For
instance, in a company data warehouse, each person, regardless of whether this person is
a client, a staff member or a company official, is categorized according to gender -
male, female or unknown. If a data consumer wants a report by gender category, say all
males, the data warehouse has a fast and efficient means of sifting through the big bulk
of data it holds.
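
A minimal sketch of filtering and grouping on such a dimension, using an in-memory list
of hypothetical fact rows:

    from collections import defaultdict

    # Hypothetical fact rows, each qualified by a "gender" dimension value.
    facts = [
        {"person": "A", "gender": "male", "amount": 120.0},
        {"person": "B", "gender": "female", "amount": 80.0},
        {"person": "C", "gender": "male", "amount": 45.5},
        {"person": "D", "gender": "unknown", "amount": 10.0},
    ]

    # Filtering: keep only the rows whose dimension value is "male".
    males = [row for row in facts if row["gender"] == "male"]

    # Grouping: aggregate the measure for every value of the dimension.
    totals = defaultdict(float)
    for row in facts:
        totals[row["gender"]] += row["amount"]

    print(len(males), dict(totals))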

In general, each dimension found in the data warehouse can have one or more hierarchies.
For example, the "Date" dimension may contain several hierarchies such as Day >
Month > Year, or Week > Year. It is up to the design of the data warehouse how the
hierarchies in a data dimension are laid out.

A concept in data warehousing called the "role-playing dimension" is used when a
dimension is recycled by multiple applications within the same database. For example, the
"Date" dimension can be used for "Date of Delivery" as well as "Date of Sale" or "Date of
Hire". This helps the database save storage space.

A dimension table is used in a data warehouse as one of a set of companion tables to a
fact table (which contains the business facts). A dimension table contains attributes or
fields which are used to constrain and group data when performing a query.

Another related term used in data warehousing is the degenerate dimension. This is a
dimension derived from the fact table that does not have its own dimension table. It is
generally used in cases where the grain of the fact table represents transactional-level
data and a user wants to maintain specific system identifiers such as invoice or order
numbers. When one wants to provide a direct reference back to a transactional system
without the overhead of maintaining a separate dimension table, a degenerate dimension is
the way to go.

What is Data Dissemination

The best example of dissemination is the ubiquitous internet. Every single second
throughout the year, data gets disseminated to millions of users around the world. Data
may sit on millions of servers located in scattered geographical locations.

Data dissemination on the internet is possible through many different kinds of
communications protocols. The internet protocols are the most popular non-proprietary,
open-system protocol suite in the world today. They are used for data dissemination
through various communication infrastructures across any set of interconnected networks.
Despite the name internet protocol, they are also well suited for local area network
(LAN) and wide area network (WAN) communication.

Using the internet, there are several ways data can be disseminated. The world wide web
is an interlinked system where documents, images and other multimedia content can be
accessed via the internet using web browsers. It uses a markup language called hypertext
markup language (HTML) to format disparate data for the web browser.

Email (electronic mail) is also one of the most widely used systems for data
dissemination, using the internet and electronic media to store and forward messages.
Email is based on the Simple Mail Transfer Protocol (SMTP) and can also be used by
companies within an intranet so that staff can communicate with each other.

The more traditional means of data dissemination which are still in wide use today are
telephone systems, including fax. They provide fast and efficient ways to communicate in
real time. Some telephone systems have been simulated in internet applications using the
voice over internet protocol (VoIP).

Through this protocol, free or minimally charged international phone calls are already
available. These simulated phone calls are possible using a computer with a microphone
and speakers or headphones. When a video camera is used, video conferencing is also
possible.

Of course, the use of non-digital materials for data dissemination can never be totally
eliminated, despite the meteoric rise of electronic communication media. Paper memos are
still widely used to disseminate data, and the newspaper is still in wide circulation,
communicating vital everyday information in news and feature items.

Despite the efficiency of electronic means of data dissemination, there are still
drawbacks which may take a long time to overcome, if they are overcome at all. Privacy is
one of the most common problems with electronic data dissemination; the internet has
thousands of loopholes through which people can peep into the private lives of others.
Security is a related problem. Every year, millions of dollars are lost to electronic
theft and fraud, and every time a solution is found for a security problem, another
malicious program springs up somewhere in the globe.

Many companies set up precautionary measures against security intrusions in their
information systems. Some set up user accounts with varying data access privileges. Many
set up internet firewalls and anti-virus software on their computers to prevent
intrusions.

Data dissemination is a very substantial aspect of business operations. Most of today's
businesses are data driven. It is a common scenario for business organizations to invest
millions in data warehouses, including hardware, software and manpower costs, to make
data dissemination fast, accurate and timely. Information gathered from disseminated data
forms the basis for spotting industry trends and patterns and for decision making in
companies.

What is Data Duplication

The definition of what constitutes a duplicate has somewhat different interpretations.
For instance, some define a duplicate as having the exact same syntactic terms and
sequence, whether or not there are formatting differences. In effect, there are either no
differences or only formatting differences, and the contents of the data are exactly the
same.

In any case, data duplication happens all the time. In large data warehouses, data
duplication is an inevitable phenomenon as millions of records are gathered at very short
intervals.

Data warehousing involves a process called ETL, which stands for extract, transform and
load. During the extraction phase, multitudes of data come into the data warehouse from
several sources, and the system behind the warehouse consolidates the data so that each
separate system format can be read consistently by the data consumers of the warehouse.

A data warehouse is basically a database, and unintentional duplication of records
created from the millions of records coming from other sources can hardly be avoided. In
the data warehousing community, the task of finding duplicated records within large
databases has long been a persistent problem and has become an area of active research.
There have been many research undertakings to address the problems caused by duplicate
contamination of data.

Several approaches have been implemented to counter the problem of data duplication. One
approach is manually coding rules so that data can be filtered to avoid duplication.
Other approaches apply the latest machine learning techniques or more advanced business
intelligence applications. The accuracy of the different methods varies, and for very
large data collections some of the methods may be too complex and expensive to deploy at
full capacity.

Despite all these countermeasures against data duplication, and despite the best efforts
to clean data, the reality remains that data duplication will never be totally
eliminated. So it is extremely important to understand its impact on the quality of a
data warehouse implementation. In particular, the presence of data duplication may skew
content distributions.

Some application systems have duplication detection functions. These work by calculating
a hash value for a piece of data or a group of data such as a document. Each document,
for instance, is examined for duplication by comparing its hash value against those
already held in an in-memory hash table or a persistent lookup system. Some of the most
commonly used hash functions are MD2, MD5 and SHA. These are preferred due to their
desirable properties: they are easily calculated over data or documents of arbitrary
length and they have a low collision probability.
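
A minimal sketch of this idea in Python, using the standard hashlib module with a SHA
digest; real systems would normalize the documents first and handle hash collisions
explicitly.

    import hashlib

    seen = set()   # in-memory lookup of hashes already loaded

    def is_duplicate(document: bytes) -> bool:
        # Hash the document; identical content always yields an identical digest.
        digest = hashlib.sha256(document).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
        return False

    print(is_duplicate(b"20kg of Rice sold on 5th April"))   # False: first time seen
    print(is_duplicate(b"20kg of Rice sold on 5th April"))   # True: exact duplicate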

Data duplication is also related to problems like plagiarism detection and clustering.
Plagiarism may involve either exact duplication or mere similarity to certain documents;
documents considered plagiarized may copy the abstract idea rather than the word-for-word
content. Clustering, on the other hand, is a method used to group data with broadly
similar characteristics, and is used for fast retrieval of relevant information from a
database.

Careful planning of the data warehouse implementation, which includes a clear definition
of the data architecture and investment in a robust IT hardware and software
infrastructure, can help minimize the problems brought about by data duplication.

What is Data Layer

Spatial data is a kind of data that reflects a real world which has become too complex
for the direct and immediate understanding of data consumers. Spatial data is used to
create models of reality, designed to have some similarity with selected aspects of the
real world, including the status and nature of that reality.

A spatial database is therefore a collection of spatially referenced data which acts as a
model of the real world, in that it selects and approximates particular phenomena. These
phenomena are then converted into digital form and may represent past, present or future
time.

Data layers are sets of spatial objects, coverages or themes. A data layer may represent
a single entity type from the spatial database or a group of related entity types which
are conceptually the same.

For example, a layer representing a real-life landscape may contain only stream segments,
or it may contain streams, coastlines, lakes and swamps.

The entity set to be included in a data layer depends on the system as well as the
database model, although in some cases the database may have been built by combining all
the entities into a single data layer.

The basic elements of a spatial database are much the same as those of a regular
database. It also has entities, where an entity refers to "a phenomenon of interest in
reality that is not further subdivided into phenomena of the same kind", such as a "city"
entity, which could be broken down into component parts, or a "forest" entity, which
could be subdivided into smaller forests.

An Entity also has an attribute as in the case of regular databases.


An Attribute refers to any characteristic of an entity selected for representation.

An Object is "a digital representation of all or part of an entity".

In a spatial database, the distinct difference between an entity and an object is that an
entity is the element in reality, while the object is that element as it is represented
in the physical database. Both of these elements are modeled in a Geographic Information
System (GIS) database.

To better illustrate a data layer in a spatial database, take the example of a spatial
database dealing with the geographical data of a particular place. The spatial database
may contain a data layer consisting of a specific type of geographical data such as a
soil map, a cadastral map or an image.

In an object-relational DBMS implementation of data storage, each layer may be associated
with a set of data tables storing both the spatial and non-spatial components. A
different table will be associated with each data layer for storing each of the different
geometric elements such as lines, points, polygons and rasters.

Accessing the spatial data within the DBMS may be possible through a generic Application
Programming Interface (API). The API can encapsulate internal differences among database
systems, and can map spatial data types onto the specific DBMS implementation with the
use of spatial indexing or the DBMS's built-in optimization facilities.

It is common nowadays for data warehouses to have database systems which integrate
spatial data types in object-relational database management systems.

New advancements in GIS technology will make this setup even more popular in the future.
Companies can generate reports not just as traditional tables but also as graphical maps
reflecting data about the company.

What is Data Loading

A data warehouse is not just a rich repository of company data. It is also an overall
strategy and process for building a cutting-edge decision support system. One of the main
objectives of a data warehouse is to bring together information from several sources
whose platforms could be totally different from one another; the data warehouse has the
responsibility of putting all these disparate data formats into a unified data system
that can be used for making business decisions.

In data warehousing, there is a common term called ETL, which stands for Extract,
Transform and Load.

Data loading is dependent on the specifications of the database management system that
powers the data warehouse.

In general, before data can be loaded, the database and the tables to be loaded must
already have been created. There are many utility programs available which can build
databases and define the user tables with SQL CREATE TABLE statements. When the load
process begins, the database system typically builds primary key indexes for each of the
tables that have a primary key. User-defined indexes are also built.

In really large databases, especially those used in data warehouses, it is common to go
through several stages when loading data. It is also common in data warehouse
implementations to have data loaded into the database from an input file.

A data warehouse typically employs automated data loading. During the input stage of the
loading process, the database validates syntax and control statements. It then reads
records and monitors progress and status, which is reflected by the error handling and
cleanup functions.

At the conversion stage, input records are transformed into row format. Data is then
validated and checked for referential integrity. Arithmetic and conditional expressions
are applied as defined in each input column specification. Finally, the data is written
into the rows of the table.
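
These stages can be pictured with a small, illustrative Python sketch that loads rows
from an input file into a pre-created table. The file name and columns are hypothetical,
and a production warehouse would use a bulk loader or ETL tool instead.

    import csv
    import sqlite3

    con = sqlite3.connect("warehouse.db")

    # The target table is created beforehand, as the loading process assumes.
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (order_id INTEGER PRIMARY KEY, amount REAL)")

    # Input stage: read records from the input file and validate them.
    # Conversion stage: transform each record into row format before writing it.
    with open("sales_extract.csv", newline="") as f:
        rows = []
        for rec in csv.DictReader(f):
            try:
                rows.append((int(rec["order_id"]), float(rec["amount"])))
            except (KeyError, ValueError):
                print("rejected record:", rec)   # simple error handling / cleanup

    con.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?)", rows)
    con.commit()
    con.close()
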
Data loading can become a critical process when the design and implementation of a data
warehouse is not done well or not performed in a controlled environment. Contingency
measures must be prepared for the data loading process in case of an administrative
failure.

When such a failure occurs, the administrator should be ready with knowledge of the
structure of the processes, of the database in particular and of the data warehouse in
general. All traces of the processes being executed should be tracked down.

In fact, due to the complexity of the data warehouse loading process, there are many
specialized Extraction, Transformation, Loading (ETL) software applications on the market
today.

The most important benefits derived from these tools include easy identification of
relevant information inside the data source; easy extraction or retrieval of that
information; simple customization and integration of different kinds of data coming from
a wide array of disparate data sources into a unified common format; fast cleaning of the
resulting data set based on defined business rules; and efficient propagation of the data
to the data warehouse or data marts.

Data loading is part of a larger and more complex component of the data warehouse
architecture called data staging. Complex programming is often involved in data staging.
This component also often involves data quality analysis and filters which can identify
certain patterns and data structures within the existing operational data.

But whether a database administrator uses data loading tools or writes his own code, one
of the most effective ways to manage a data warehouse is to develop a good data loading
strategy.

What is Data Collection

A database is a vast, shared collection of data which are logically related to each
other. Businesses rely heavily on data, and databases are used for managing the
business's day-to-day tasks, so data collection happens every single day.

Collecting data may seem a simple and trivial task, but databases have come a long way
from simply being able to define, create, maintain and control data access. Today, most
complex applications cannot function without data and database managers, and data
collection is one of the most critical tasks handled by companies and their IT staff.

Two popular approaches to constructing database management systems emerged in the 1970s.
The first approach, exemplified by IBM, involved a data model which requires that all
data records be assembled into collections called trees.

As a consequence, some records were roots while all others had unique parent records. An
application programmer could query and navigate from the root to the record of interest,
one record at a time. This process was rather slow, but at the time records were stored
on serial storage devices, particularly magnetic tape.

The other approach at the time was the Integrated Data Store (IDS), developed at General
Electric. This approach led to the development of a new kind of database system called
the network database management system.

This design could represent more complex data relationships than the hierarchical
database systems such as IBM's, but query navigation still involved moving from a
specific entry point to the record of interest.

Most of today's dominant databases, if not all, are based on the relational database
model proposed by E. F. Codd. This design tried to overcome the shortcomings of the
earlier databases, such as their inefficient data retrieval schemes.

With relational databases, data is represented in table structures called relations, and
access to the data is through a high-level, non-procedural query language used in a
declarative manner.

The problem with earlier databases, whose algorithms obtained the desired records one at
a time, is overcome in relational databases by specifying only a predicate that
identifies the desired records or combination of records.
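
The contrast can be seen in a small sketch: instead of navigating record by record, the
caller states only a predicate and the relational engine finds the matching rows. The
table and values are hypothetical.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                    [(1, "Ramesh", "Pune"), (2, "Mahima", "Delhi"), (3, "Asha", "Pune")])

    # Declarative access: only the predicate is specified, not how to reach the records.
    for row in con.execute("SELECT name FROM customer WHERE city = ?", ("Pune",)):
        print(row[0])

    con.close()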

A relational database management system (RDBMS) has a query optimizer that translates the
predicate specification into a process for accessing the database and solving the query.
Relational databases maximize data independence and minimize redundancy.

Today's business data warehouses are sometimes regarded as islands of databases. They are
geographically separated and have incompatible hardware architectures and communication
protocols, but they can be held together seamlessly for data collection with distributed
database management systems.

A distributed database management system (DDBMS) manages a single database which is split
into several fragments stored on several computers under the control of separate database
management systems. These computers are connected over a network. Each computer has local
autonomy, but it can also process data stored on other computers within the network.

Software applications are specifically written to tie these autonomous DBMSs together.
Local applications manage data that do not come from other sites, while global
applications manage data from other sites. Data collection can be made seamless by the
application despite the geographic distance between two or more systems.

Big, competitive companies invest money in data collection systems that incorporate
advanced numeric and text string searches, table handling methods, relational navigation,
and user-defined rules to help spot relationships between data elements.

What is Combined Data

In a company, a database contains millions of atomic data items. Atomic data are pieces
of information that cannot be broken down further. For example, a product name is atomic
data because it can no longer be broken down, whereas a product's raw material can be
broken down further into components depending on the good. An individual product sale is
another piece of atomic data.

But business organizations are not only interested in the minute details; they are also
interested in the bigger picture. So atomic data are combined and aggregated. When this
is done, the company can determine regional or total sales, total cost of goods, selling,
general and administrative expenses, operating income, receivables, inventories,
depreciation, amortization, debt, taxes and other figures.
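
A minimal sketch of combining atomic data into an aggregate, rolling hypothetical
individual sales up into regional and grand totals:

    from collections import defaultdict

    # Atomic data: individual product sales that cannot be broken down further.
    sales = [
        ("North", "Rice", 120.0),
        ("North", "Wheat", 75.0),
        ("South", "Rice", 200.0),
        ("South", "Sugar", 40.0),
    ]

    # Combined data: aggregate the atomic rows into regional and grand totals.
    regional = defaultdict(float)
    for region, product, amount in sales:
        regional[region] += amount

    total = sum(regional.values())
    print(dict(regional), total)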

Data mining, or extracting knowledge from the vast repository of the data warehouse, uses
combined data intensively. Software applications, in conjunction with a good relational
database management system, have been developed to provide efficient ways to store and
access data gathered over time and space for statistical analysis.


Data Mining is technically described as "the nontrivial extraction of implicit,
previously
unknown, and potentially useful information from data" and "the science of
extracting useful
information from large data sets or databases". It is a process involving large
amounts of
data being sorted to pick out relevant data from potentially non-relevant sources.

One of the biggest problems with data mining is the level of data aggregation. For
example, in an online survey by a private organization on the smoking trends of one
region, one data set may contain records of those who currently smoke, another of those
who have quit smoking, and another of those who have never smoked at all.

The collection within each data set continuously grows as data from other sources keeps
coming in. The traditional ways to combine these data are either to use an ad hoc method
or to fit each data set to a certain model and then combine the models.

Newer methods have been developed to combine data from various sources efficiently. Data
coming from various tables and databases can now be combined into a single information
table. One method is a likelihood procedure which provides an estimation technique to
address identifiable problems with aggregated data from tables related to other tables.

Companies find it valuable to invest in business intelligence technologies. Business
intelligence combines the vast repository of the business data warehouse with software
systems that analyze and report on the gathered business data.

An example of a business intelligence technology is Online Analytical Processing, or
OLAP. OLAP can quickly provide answers to analytic queries which are multidimensional in
nature. It can combine data from different sources and generate reports for sales,
marketing, financial forecasting, budgeting, and other related aspects of the business.

OLAP can run complex ad hoc analytical queries on a database configured for OLAP use, and
execution has to be very fast given that a server needs to answer many users at a time
from different geographical locations. OLAP combines data into a matrix output format,
with dimensions forming the rows and columns and the cells holding the values and
measures.

Combined data is also heavily used in data farming, a process where high-performance
computers or computing grids run simulations billions of times across a large parameter
and value space to produce a landscape of output data used for analyzing trends, insights
and anomalies across many dimensions. It can be compared to a real plant farm, where a
harvest of data comes after some time.

What is Change Data Capture

Change Data Capture refers to the process of capturing changes made to a production data
source. It is typically performed by reading the database management software's logs at
the source. Some of the features of Change Data Capture are:

- It consolidates units of work
- It ensures that data is synchronized with the original source
- It reduces data volume in a data warehousing environment

In a data warehousing environment, events often require that relational data be extracted
and transported from one or more source databases and then loaded into the data warehouse
for processing and analysis. Change Data Capture immediately identifies such an event and
processes only the data that has changed. It does not move the entire table, but makes
the changed data available for whatever use.

If Change Data Capture were not implemented, extracting business data from a database
would be an extremely difficult and cumbersome process. It would involve moving the
entire contents of the tables into flat files and loading those files into the data
warehouse. This is not just cumbersome but also expensive.

Change Data Capture does not depend on intermediate flat files to hold data temporarily
outside the relational database. Changed data resulting from INSERT, UPDATE and DELETE
operations is captured and then stored in a database object called a change table. The
changed data is then made available, in a controlled manner, to any applications that
need it.

Some of the terminology describing Change Data Capture components includes the following:

Source System - the production database containing the source tables from which Change
Data Capture will capture the changes.

Source Table - the table in the database which contains the data the user wants to
capture. Any changes made to the source table are reflected in the change table.

Change Set - the collection of change tables.

Change Table - the database table which contains the changed data resulting from DML
statements made against a single source table. It consists of the change data itself and
system metadata; the change data is stored as ordinary rows, while the system metadata is
needed for maintaining the change table (a simplified sketch follows this list).
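
Here is a simplified sketch of the source table / change table relationship, using SQLite
triggers from Python. The table names and columns are hypothetical, and real Change Data
Capture implementations usually read the database logs rather than relying on triggers.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);          -- source table
        CREATE TABLE orders_ct (id INTEGER, status TEXT, operation TEXT,
                                changed_at TEXT DEFAULT CURRENT_TIMESTAMP); -- change table

        CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
            INSERT INTO orders_ct (id, status, operation) VALUES (NEW.id, NEW.status, 'INSERT');
        END;
        CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
            INSERT INTO orders_ct (id, status, operation) VALUES (NEW.id, NEW.status, 'UPDATE');
        END;
        CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
            INSERT INTO orders_ct (id, status, operation) VALUES (OLD.id, OLD.status, 'DELETE');
        END;
    """)

    con.execute("INSERT INTO orders VALUES (1, 'new')")
    con.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
    con.execute("DELETE FROM orders WHERE id = 1")

    # Only the changed rows, not the whole table, are available to subscribers.
    for row in con.execute("SELECT id, status, operation FROM orders_ct"):
        print(row)

    con.close()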

Change tables need to be managed so that their size does not grow without limit. This is
done by managing the data in the change tables and automatically purging change data
which is no longer needed. A procedure can be set up to be called periodically to remove
data from the change table that is no longer required.

Security is imposed on the change data process by having data subscribers - any user
applications that want to get data - register with the database management system. They
then specify their interest in one or more source tables, and the database manager or
administrator grants the subscribers the desired permissions, privileges or access.

The Change Data Capture environment is very dynamic. The data publisher can add and
remove change tables at any time. Depending on the database application, subscribers may
not get explicit notification when the publisher makes changes to a table, but
subscribers can use views to check. Many more mechanisms are employed so that subscribers
can always adjust to changes in the database where their subscription is active.

Change Data Capture is an indispensable feature of a relational database management
system (RDBMS), especially one used in a large data warehouse. It makes sure the
servicing of data is fast and efficient, that changes are monitored for easy
troubleshooting and analysis, and that referential integrity is always maintained between
tables.

Classic Data Warehouse Development

Classic Data Warehouse Development is the process of building an enterprise business
model, creating a system data model, defining and designing a data warehouse
architecture, constructing the physical database, and lastly, populating the warehouse
database.

In a real business environment, the data warehouse is the main repository of the
company's historical data as well as data subscribed from other sources, so that the
company can come up with statistical analyses to better understand the patterns and
trends of the industry in which it operates. With a clear understanding of the industry
trends, it can adjust its business rules and policies as well as come up with innovations
in its products and services to gain competitive advantage over other companies within
the same industry.

In Classic Data Warehouse Development, the first step is to define the enterprise
business model.

During this phase, all real-life business activities are gathered and listed, and a case
model for the entire business is drawn. This includes the interactions between the
business and its external stakeholders. For an enterprise business model to be
consistent, business requirements are identified using a very systematic approach.

Some enterprise business modelers do not base the functions on the organizational
structure, as it is prone to change over time with fast-changing business trends and
potential growth. What is essential is that a consistent framework for the business is
defined that can last a long period.

An enterprise business model shows how the business workers and other entities work
together to realize business processes. The object model can be made from the aggregate
collection of all the processes, people and events involved.

When the enterprise business model is in place, the next step is to create a system data
model. This is an abstract data model describing how data is used; it represents the
entities, business events, transactions and other real-life activities defined by the
enterprise business model.

In a technical sense, the system data model is used in the actual implementation of the
database. Its elements are the technical counterparts of the entities created in the
enterprise business model.

The next step is defining the data warehouse architecture. The data warehouse
architecture is a framework describing in detail how the elements and services of the
warehouse fit together and how to manage the data warehouse's growth through time. Just
as in building a house, there should be a set of plans, documents, specifications and
blueprints.

When all is set, planned and documented, it will be time to set up the physical database. The demands of the data warehouse dictate the requirements for the physical database system. Computer hardware is one of the biggest considerations in setting up a physical database.

The processing power of the computer should be able to handle labor-intensive processing. The storage devices should be able to hold large volumes of data which get updated every few minutes. Networking support should be fast and efficient.

Another consideration in setting up the physical database is what software application to use and which vendor to buy from. There are plenty of relational database software applications available on the market. Some of these include Microsoft SQL Server, Sybase, PostgreSQL, Informix and MySQL.

When the physical database is set, measures and dimensions have already been laid out. Measures are individual facts and dimensions refer to how facts need to be broken down. For example, a data warehouse for a grocery may have dimensions for customers, managers and branches, and measures of revenue and costs. The next step in classic data warehouse development is to populate the fact and dimension tables with appropriate data. The database may be set to be populated hourly, weekly or at any interval depending on the need.
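
As a minimal illustration of the fact and dimension tables just described, the sketch below defines a simplified grocery star schema; the table and column names (dim_customer, dim_branch, fact_sales, staging_sales) are illustrative assumptions rather than part of any standard design.

-- Simplified star schema for the grocery example: two dimensions and one fact table.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    age_bracket   VARCHAR(20)
);

CREATE TABLE dim_branch (
    branch_key  INTEGER PRIMARY KEY,
    branch_name VARCHAR(100),
    region      VARCHAR(50)
);

-- Each fact row records one measurable business event, qualified by the dimensions.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    branch_key   INTEGER REFERENCES dim_branch (branch_key),
    sale_date    DATE,
    revenue      DECIMAL(12,2),   -- measure
    cost         DECIMAL(12,2)    -- measure
);

-- Populating the fact table (hourly, weekly, and so on) typically loads from a staging area.
INSERT INTO fact_sales (customer_key, branch_key, sale_date, revenue, cost)
SELECT customer_key, branch_key, sale_date, revenue, cost
FROM   staging_sales;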

What is Consumer Profile

It is very important to document consumer profiles in the data warehouse. Consumer profiles constitute an essential component when the organization needs reports on operating trends and patterns and on how the organization is performing.

If the organization is in a competitive business, having customer profiles in the database can answer questions such as:

- Based on product gross margin and customer cost of service, who are the most profitable customers today?
- Were these the same customers who were also the most profitable last month or last year?
- Which products are selling best and which ones are the weakest in sales performance?
- Which particular products are popular among a certain age bracket?
- What products need to be reinvented and what new products can be derived to cater to the taste of the market aged 21-30 years old?

If the organization is a government entity, consumer profiles may answer questions like:

- Who are the country's citizens?
- What jobs are available for the people?
- Who among the people earn less than the minimum wage?
- Which regional areas are the poorest and have the least efficient delivery of basic services to the people?
- What is the crime rate of a certain big city?

People on the internet cruise from one site to another, read articles, register on their favorite websites and make purchases online through websites that they trust. As they do these activities, they are actually giving information about themselves.

It may surprise some people that when they open certain web pages, the ads that appear are often related to their tastes and interests.

The capture and use of consumer profiles based on web activities have been a great fueling force in e-commerce. Many online websites set up separate databases whose exclusive function is to serve as recommendation engines.

Using consumer profiles for e-commerce sites can be a very complicated activity.
People do
not just stick to one website. And online companies will have to make sure that
they
recommend the appropriate product to the right market. But consumer profiles are
constantly changing.

There can be several reasons why consumer profiles get out of date. It could be that the consumer has been away from the site or has simply changed preferences. Sometimes consumer profiles change not because the consumers themselves caused the change but because poor algorithms on the servers could not come up with correct analytical processes.

Most data warehouses for e-commerce sites have engines that observe behavioral
activities
on the site. These engines track purchases, registrations, visited product reviews
and all
other activities which they may get information about. This way, consumer profiles
are
constantly updated.

Getting consumer profiles can be a heavy workload on data warehouse servers. Servers need to weed out irrelevant data. More sophisticated data warehouse setups use a complex combination of content, age, frequency and other unique factors to deliver the best possible way of targeting advertising to the right markets.

There are several software applications on the market that specifically deal with consumer profiles. Some business intelligence solutions are composed of a suite of solutions covering many aspects of the business, and consumer profiling is among them.

What are Critical Success Factors

Critical Success Factors are areas of activity in which favorable results are necessary for a company to reach its goal. Critical Success Factors are intensively used in business organizations as essential guides for the company or project to achieve its mission and goals.

For example, one of the Critical Success Factors of a company involved in developing information technology solutions is user involvement. Some general critical success factors include financial factors like positive cash flow, profit margins and revenue growth; customer satisfaction factors; product development factors and many others.

D. Ronald Daniel first presented the idea of Critical Success Factors in the 1960s.
A decade
later, John F. Rockart of MIT's Sloan School of Management popularized the idea and
since
then, the idea has been extensively used in helping business organizations
implement
projects and industry strategies.

Today, there are already different ways in which the concept of Critical Success Factors is being implemented, and these will probably continue to evolve.

According to Rockart, Critical Success Factors refer to "The limited number of areas in which results, if they are satisfactory, will ensure successful competitive performance for the organization. They are the few key areas where things must go right for the business to flourish. If results in these areas are not adequate, the organization's efforts for the period will be less than desired."

To illustrate the concept of Critical Success Factors, let us say someone wants to set up a bookstore. The person defines his mission as "To be the number one bookstore in town by offering the widest selection of books and sustaining a customer satisfaction rating of 90%."

From the mission, the objectives are:


- To have a wide array of books for sale
- Sustain a 90% customer rating
- Expand book store space for future growth

From the objectives, the activities to realize them would then be listed and laid out in a very clear perspective. This will give the company better focus, and performing well at these activities will be the critical success factors of the company.


The idea of identifying critical success factors is the primary basis for determining the company's information needs. If the objectives are not met because the information needs were not acquired, then the organization will fail.

In the bookstore case mentioned above, we can already identify some information needs in a few minutes, although identifying these needs in detail should take time and participation from different staff of the company. To have a wide array of books, the information needed would be where to find book suppliers, how to build strong and stable relationships with publishers, how to come up with a fast and efficient shipment system, and others.

In sustaining a 90% customer rating, the needed information would be which topics buyers like and what promotional activities the bookstore will undertake. For the bookstore expansion, the needed information would be to which location the bookstore will expand, what the physical setup will be like and what IT needs will be taken into consideration.

It is also essential to identify the constraints of the critical success factors. In understanding the constraints, defensive measures for the critical success factors can be derived. Knowing the constraints will eliminate guesswork, which can bring about greater risks to the company's success.

Knowing the critical success factors in the operation of the business can really strengthen management strategy. The risk management process can be more focused, many issues will be corrected and the probability of failure greatly reduced. Every single activity within the organization will be directed towards achieving the overall success of the company.

What is Crosstab

Crosstab, or Cross Tabulation, is a process or function that combines and/or summarizes data from one or more sources into a concise format for analysis or reporting. Crosstabs display the joint distribution of two or more variables and they are usually represented in the form of a contingency table in a matrix.

A crosstab should never be mistaken for a frequency distribution, because the latter provides the distribution of one variable only. In a cross table, each cell shows the number of respondents who give a particular combination of replies.

An example of cross tabulation would be a 3 x 2 contingency table. One variable would be age group, which has three age ranges: 12-20, 21-30, and 31-up. Another variable would be the choice of polo shirt or t-shirt. With a crosstab, it would be easy for a company to see what the choices of shirts are for the three age groups. For instance, the table might show that 20% of those aged 12-20 prefer polo shirts, while only 10% of those aged 31-up prefer t-shirts. With this information, the company can come up with moves which will be beneficial to the success of the business.
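
As a minimal sketch of how such a contingency table can be produced with SQL, the query below pivots survey rows into the 3 x 2 layout using conditional aggregation; the table and column names (survey_responses, age_group, shirt_choice) are illustrative assumptions.

-- Pivot individual responses into a 3 x 2 contingency table.
SELECT age_group,
       SUM(CASE WHEN shirt_choice = 'polo'    THEN 1 ELSE 0 END) AS polo_count,
       SUM(CASE WHEN shirt_choice = 't-shirt' THEN 1 ELSE 0 END) AS tshirt_count
FROM   survey_responses
GROUP  BY age_group
ORDER  BY age_group;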

Cross tabulations are popular choices for statistical reporting because they are very easy to understand and they are laid out in a clear format. They can be used with any level of data, whether the data is ordinal, nominal, interval or ratio, because the crosstab will treat all of them as if they were nominal data. Crosstab tables provide more detailed insights than a single statistic in a simple way, and they solve the problem of empty or sparse cells.

Since cross tabulation is widely used in statistics, there are many statistical processes and terms that are closely associated with it. Most of these processes are methods to test the strength of crosstab associations, which is needed to maintain consistency and come up with accurate data, because data laid out using crosstabs may come from a wide variety of sources. The Lambda Coefficient is a method of testing the strength of association of crosstabs when the variables are measured at the nominal level. Cramer's V is another method that tests the strength of crosstab associations while adjusting for the number of rows and columns. Other ways to test the strength of crosstab associations include the Chi-square, Contingency Coefficient, Phi Coefficient and Kendall's tau.

Companies find the services of a data warehouse indispensable. But inside the data warehouse can be found billions of data items, most of which are unrelated. Without the aid of tools, these data will not make any sense to the company. These data are not homogenous. They may come from various sources, often from other data suppliers and other warehouses which may be located in other branches in other geographical locations.

Software applications like relational database management systems have cross tabulation functionalities which allow end users to correlate and compare any piece of data. Crosstab analysis engines can examine dozens of tables quickly and efficiently, and these engines can even create full statistical outputs with just a few clicks of the mouse or keystrokes.

Relational database applications have a crosstab query function. This function can transform rows of data into columns of any table for statistical reporting. With a crosstab query, one can send a command to the database server and the server can aggregate data, for example breaking reports down by month, regional sales, product shipment and many more.

Many advanced database systems have dynamic crosstab features. This is very useful when dealing with columns that do not have a static number. Crosstabs are heavily used in quantitative marketing research.

What are Data Access Tools


Data access is the process of entering a database to store or retrieve data. Data access tools are end-user oriented tools that allow users to build structured query language (SQL) queries by pointing and clicking on the list of tables and fields in the data warehouse.

Throughout computing history, there have been different methods and languages used for data access, and these varied depending on the type of data warehouse. The data warehouse contains a rich repository of data pertaining to organizational business rules, policies, events and histories, and these warehouses store data in different and incompatible formats, so several data access tools have been developed to overcome problems of data incompatibility.

Recent advancements in information technology have brought about new and innovative software applications that have more standardized languages, formats, and methods to serve as interfaces among different data formats. Some of these more popular standards include SQL, ODBC, ADO.NET, JDBC, XML, XPath, XQuery and Web Services.

Structured Query Language is a computer language used in Relational Database Management Systems (RDBMS) for retrieving and managing data. Although SQL was developed to be a declarative query and data manipulation language, several vendors have created SQL DBMSs and added their own procedural constructs, data types and other proprietary features. SQL is standardized both by ANSI and ISO.
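
As a brief illustration of the kind of declarative statement such data access tools generate, the query below totals revenue by month from the illustrative fact_sales table sketched earlier; EXTRACT is standard SQL, though some products use their own date functions instead.

-- Total revenue per month, the sort of query a point-and-click tool would build.
SELECT EXTRACT(YEAR  FROM sale_date) AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(revenue)                  AS total_revenue
FROM   fact_sales
GROUP  BY EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date)
ORDER  BY sale_year, sale_month;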

ODBC, which stands for Open Database Connectivity, is a standard software application programming interface used for database management systems. Different computer languages can access data in different types and implementations of RDBMS using ODBC.

JDBC, which stands for Java Database Connectivity, is to some degree the same as ODBC but is used with the Java programming language.

ADO.NET is a Microsoft proprietary software component for accessing data and data
services. This is part of the Microsoft .Net framework. ADO stands for ActiveX Data
Object.

XML, which stands for Extensible Markup Language, is primarily a general purpose markup language. It is used to tag data so that structured data can be shared among disparate systems across the internet or any network. This makes data of any format portable among different computer systems, making XML one of the most used technologies in data warehousing.

XML data can be queried using XQuery, which is semantically almost the same as SQL. XML Path Language (XPath) is used to address portions of an XML document or to compute values like strings, Booleans and numbers based on an XML document.

Web services are software components that make possible the interoperability of machine-to-machine interaction over the internet. They are also commonly known as Web APIs that are accessed over the internet and executed on a remote system.

Many software vendors develop applications that have graphical user interface (GUI) tools so that even non-programmers or non-database administrators can build queries by just clicking the mouse. These GUI data access tools give users access via a data access designer and a data access viewer. With the data access designer, an end user can create complex databases even without an intensive background.

Templates that are complete with a design framework and sample data are available ready-made. With the data access viewer, the user can run and enter data, make changes and modifications, and graphically see the results of commands without having to care about the complex processes happening in the background.

Data access tools make the tasks of database administrators a lot easier, especially if the database being managed is a large data warehouse. Having a graphical interface for data access gives the administrator a clearer view of the status of the database, because most programmatic query languages may look cryptic on the command line interface.

What is Common Metadata

In simple but technical terms, metadata is data that describes other data. It can be any item describing an individual datum or a collection of multiple content items.

Metadata is very useful in facilitating the use, management and understanding of data in a large data warehouse. Depending on the type of data and the context in which the data is being used, the metadata required to effectively manage a database or large data warehouse varies.

For instance, in a library system, the metadata will surely include descriptions of book contents, authors, dates of publication and the physical locations of books in the library. If the context of use is photography, the metadata will describe the camera, model, type, photographer, the date and location the photograph was taken, and many other things. In the case of an information system where the data involved is the content of files on a computer, the metadata will describe individual data items with their field names, lengths, etc.
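
As a minimal sketch of this kind of field-level metadata, the table below catalogs the name, type, length and description of fields in other tables; the object names (field_metadata, fact_sales and so on) are illustrative assumptions, not a standard metadata schema.

-- A tiny metadata catalog describing the fields of other tables.
CREATE TABLE field_metadata (
    table_name  VARCHAR(64),
    field_name  VARCHAR(64),
    data_type   VARCHAR(32),
    max_length  INTEGER,
    description VARCHAR(255),
    PRIMARY KEY (table_name, field_name)
);

INSERT INTO field_metadata VALUES
    ('fact_sales', 'revenue', 'DECIMAL', 12, 'Gross sales amount per transaction');
INSERT INTO field_metadata VALUES
    ('fact_sales', 'sale_date', 'DATE', NULL, 'Calendar date of the sale');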

Common business metadata is one of the foundations of an intelligent business system. Common data names, common data definitions and common data integrity rules need to be very consistent. Trustworthiness and common understanding are important prerequisites for integrating business intelligence into operational business processes. Without those prerequisites, integrating business intelligence into business operations can potentially cause untold damage.

Common metadata is the basis for sharing data within an enterprise. It refers to common definitions of data items, common data names and common integrity rules. With these commonalities come common transformations for all master data items including customer, employee, product, location and many others. This also includes common transformations for all business transaction data and all business intelligence system metrics.

In a data warehouse, it is extremely important to have a Common Warehouse Metamodel. This model specifies the modeling aspects of metadata used for relational, non-relational, multi-dimensional and other objects found within the data warehouse, so that the system will have a common metadata structure that adheres to the underlying business data architecture.

Common interfaces that can be useful in enabling the interchange of business and warehouse intelligence metadata are specified within the Common Warehouse Metamodel. This can be used in conjunction with warehouse tools, warehouse metadata repositories and data warehouse platforms in heterogeneous distributed warehouse environments. The Common Warehouse Metamodel is based on three standards: the Unified Modeling Language (UML), the Meta Object Facility (MOF) and XML Metadata Interchange (XMI).

Common Warehouse Metamodels are also useful in enabling users to trace data lineage, as they provide objects that effectively describe where the data came from and how or when the data was created. Instances of the metamodel are exchanged through XML Metadata Interchange documents.

Today's business trends are heading towards the internet as the main highway to
gather
and share data. But the internet is full of all sorts of data. This includes
different data
formats, different applications using and sharing data and different server
systems.
Problems can arise in terms of hardware and software portability.

The use of common metadata tries to melt this boundary down, because the format in which common data is packaged can be read by disparate systems. So, whether the shared data comes from a relational database or an Excel flat file, the processing server within the data warehouse will know how to deal with the data for processing.

What is Data Mapping

Data mapping is a very important aspect of data integration. In fact, it is the first step in the many complex tasks associated with data integration, which include data transformation or data mediation between a data source and its destination; identification of relationships in data, which is vital in the analysis of data lineage; discovery of sensitive data like the last digits of a social security number; and consolidation of many databases into one while identifying redundancy.

Many modern business organizations are striving towards a common goal of uniting business and data applications in order to increase productivity and efficiency. Such goals have been translated into recent trends in data management such as Enterprise Information Integration (EII) and Enterprise Application Integration (EAI).

These technologies try to answer the question of how organizations can integrate data meaningfully from many disparate systems so that companies can better execute and understand the very nature of their business.

In order to interconnect businesses more efficiently, many companies need to map data and translate these data between the many kinds of data types and presentation formats that are in wide use today.

There are several ways to do data mapping: using procedural code, using XSLT transformations or using tools with graphical interfaces. Newer methods of data mapping involve evaluating actual data values in two data sources and automatically discovering complex mappings between the sets at the same time. Semantic data mapping can be achieved by consulting a metadata registry to look up synonyms of data elements.
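
As a minimal sketch of a procedural-style mapping between a source and a target table, the statement below concatenates name parts and translates a coded value into the target's convention; the table and column names (source_customer, target_customer, sex_code) are illustrative assumptions, and the '||' concatenation operator is standard SQL but not supported by every product.

-- Map and translate source rows into the target's conventions during loading.
INSERT INTO target_customer (customer_key, full_name, gender)
SELECT src.cust_id,
       src.first_name || ' ' || src.last_name,   -- concatenate name parts
       CASE src.sex_code                          -- translate coded values
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
            ELSE 'Unknown'
       END
FROM   source_customer AS src;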

Today's enterprise data stored in data warehouses consists of high volumes of data held in relational databases and XML-based applications. Neither of these generally provides an attractive way of presenting data to a company's data consumers, customers and partners.

In order to address these problems, several XML-based single-source strategies have been developed. But some of these supposed solutions still lack multiple transformation stylesheets for each desired final output. In many cases, the need to publish content from relational databases is not met.

There are many software application solutions to address the aforementioned problems. These applications help publish XML and database contents by letting the end user create graphical designs which can simultaneously produce stylesheets with data mapping for XML or for data from relational databases, and come up with nicely laid-out documents in HTML, Word, RTF or PDF formats.

Some software applications also offer HTML-to-XML data mapping. With the rapid use of XML for delivering structured data on the world wide web, HTML and XML have been used together to present data in a professional way. This HTML and XML collaboration allows users to separate content and formatting so that data can be used in a more extensible manner for web applications.

HTML-to-XML mapping software can access data stored in HTML format and convert it to XML without losing the document's style. The conversion results in an XML schema reflecting the content model, while an instance of the XML document contains the actual content and an XSLT stylesheet takes care of the presentation style.

The extensive use of data mapping, together with a company's robust data warehouse architecture and business intelligence system, can definitely result in orderly, efficient and fast day-to-day business operations.

What is Conceptual Schema

In any data warehouse implementation, there are many different considerations which should be in place before the final physical setup. This is to avoid problems related to data quality and consistency in data processes.

A conceptual schema is an abstract definition of the whole project. In the case of a data warehouse and business intelligence system, a conceptual schema represents the map of concepts and their relationships. A data warehouse is built upon a business data architecture, and the business data architecture defines all the common business structures that pertain to the overall activities of the business enterprise.

The conceptual schema describes the semantics of a company. It represents the series of assertions and rules pertaining to the nature of the business processes, entities and events. In particular, the conceptual schema describes the things which are very significant to the company, termed entity classes, and the characteristics of those things, termed attributes. The association between pairs of those things of significance is called a relationship.

A conceptual schema, although it greatly represents the data warehouse and the common structure of data, is not a database design. It exists at different levels of abstraction. These abstractions are the basis for the implementation of a physical database.

A conceptual schema is written in a human-oriented natural language. This natural language is used to define elementary facts. The conceptual schema is totally independent of any implementation, whether a database or a non-IT implementation.

The data model and query design of a business architecture should be performed at the conceptual level and then mapped to other levels. This means that at the conceptual schema level, everything should be gotten right in the first place. Then, as the business grows and evolves, changes can be made later. But many keen data architects usually design the data model for scalability, which means that business growth and evolution are already taken into consideration at the conceptual schema level.

A conceptual schema should meet the following criteria: expressability, clarity, simplicity and orthogonality, semantic stability, semantic relevance, good validation mechanisms, abstraction mechanisms and a formal foundation.

Making the conceptual schema commonly involves close coordination between the domain expert and the data modeler. The domain expert best understands the application domain. He or she understands the scope of the enterprise activities, including the individual roles of the staff and the clients, as well as the scope of the products and services involved. On the other hand, the task of the data modeler is to formalize the informal knowledge of the domain expert.

As the case should be, the communication between the domain expert and the data modeler involves verbalizing fact instances from data use cases, verbalizing fact types in natural language, validating rules in natural language and validating rules using sample populations.

With close coordination between the domain expert and the data modeler, the expected output should be a conceptual schema that has data expressed as elementary facts in plain English sentences (or in any other language appropriate to the users). The facts are also laid out according to how they are grouped into structures.

The conceptual schema is intensively used not just in database implementation but in many other IT systems as well. It is a plain definition of abstract ideas and entities from which all technical specifications are taken. Even in fields not related to IT, having a conceptual schema before the actual implementation of a plan helps a project proceed smoothly and efficiently.

What is Connectivity

Computer networks are the main connectivity mechanism for passing data in an
electronic
environment. A network is composed of several computers connected by a wired or
wireless
medium so data and other resources can pass through for sharing.

A computer network may be as small as two computers connected by a wired or wireless medium, or as big as millions of computers connected throughout the internet. There are generally five classifications of network connectivity: personal area network (PAN), local area network (LAN), campus area network (CAN), metropolitan area network (MAN) and wide area network (WAN).

Computer networks may also be classified according to the hardware technology used in connecting each device. The classifications include Ethernet, wireless LAN, HomePNA and power line communication.

The arrangement of computers in a network can also vary. The network topology refers to the geometric form of network connectivity. It can also describe the way computers see each other in relation to their logical order. Examples of network topologies are mesh, ring, star, bus, star-bus combination, tree and hierarchical topologies. It is good to note that although topology implies form, network topology is really independent of the physical placement or layout of computers. For instance, a star topology does not literally mean that the computers form a star; it means that the computers are connected through a hub which has many points, implying a star form.

Perhaps the biggest aspect of computer connectivity is the use of communications protocols. In a network, different formats of data are shared by different computer systems which may have different hardware and software specifications. Communications protocols try to break down this disparity so that data can be shared and appropriately processed.

Communications protocols are the sets of rules and standards by which data is represented, signaled, authenticated and corrected before or after being sent over the communication channel. For example, in voice communication, as in the case of a radio dispatcher talking to mobile stations, the participants follow a standard set of rules on how to exchange communication.

A communication protocol may be hard to generalize because of the varied purposes and different degrees of sophistication. But most connectivity protocols commonly have the following properties:

. communicating devices detect the physical connection, whether it is wired or wireless

. devices do handshaking, a process of trying to find out if the other one exists

. negotiation about different characteristics of the connection

. determining the start and end of a message

. formatting of a message for compatibility

. determining how to deal with a corrupted message

. detecting unexpected data loss and acting on the appropriate steps

. properly closing the connection

On the internet, the largest arena for computer and data connectivity, the protocols are assigned by the Internet Engineering Task Force (IETF) in close coordination with the W3C and ISO/IEC standards bodies. These bodies deal mainly with standards for TCP/IP and the Internet protocol suite.

The Institute of Electrical and Electronics Engineers (IEEE), an international non-profit professional organization, also sets communication protocols for electronic and electrical devices.

Some of the major protocol stacks include open standards for connectivity such as the Internet protocol suite (TCP/IP), File Transfer Protocol (FTP), Open Systems Interconnection (OSI), iSCSI, Network File System (NFS) and Universal Plug and Play (UPnP).

Proprietary standard protocols include DECnet, AppleTalk, Systems Network Architecture (SNA) and Distributed Systems Architecture (DSA).

What is Data Derivation

Data Derivation refers to the process of creating a data value from one or more
contributing
data values through a data derivation algorithm.

Almost all business organizations in today's environment are becoming more and more dependent on the data produced from data warehouses and information systems in order to support the company's operations. Since data accuracy is important, knowledge of how data is derived is very vital.

As systems evolve, the bulk of data increases too, especially as more people and businesses move to the internet for what used to be offline transactions. With the evolution of information systems, functionalities also grow complex, so associated documentation for data derivation becomes more indispensable.

Data derivation applies to all real-life activities which are represented in the data model and aggregated in processes within the information systems or data warehouse. For instance, in a database that keeps records of wild migratory birds, there are records of data pertaining to a variable called "Population Size". The basic question would be "How was the population size of migratory birds derived?" The answer may be that the data was derived from recorded observations, estimation, inference or a combination of all of these, and then aggregated using an average or some other formula.

It is a known fact that proper data derivation is the key to having an accurate understanding of the core content of any output, as this is the process of making new and more meaningful data from the aggregation of the raw data which had been collected by the database.

A derived data item could be any variable. For example, in a database that computes a person's age when the record only keeps his birthday, the age is computed using a certain formula deriving age from the birthday.

In any data warehouse implementation, it is important to have a data dictionary which details all specifications for derived data. In the case of the person's age above, even if the derived data may be simple, pitfalls could exist because the data from which the person's age is calculated should be used consistently. The example below shows possible inconsistencies, as the same source variable set could give one several choices for the age calculation algorithm:

Person_age = floor ((randdate - dob) / 365.25)

The algorithm above results in an approximate age which accounts for leap years, while the algorithm below, with a slight modification, also takes the century effect into account:

Person_age = floor ((randdate - dob) / 365.23)

As can be seen, the same variable can be derived in several ways. It is therefore extremely important to have a data dictionary so that users can be guided as to which data derivation they are using, and stick to one algorithm if they want consistency.
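
As a minimal sketch of documenting one chosen derivation, the view below applies the first algorithm (the 365.25 divisor) so that every user derives age the same way; the table and column names (person, birth_date) are illustrative assumptions, and date subtraction syntax varies by RDBMS (the form shown follows PostgreSQL, where subtracting two dates yields a number of days).

-- Derive an approximate age in whole years from the stored birth date,
-- using the 365.25 divisor agreed in the data dictionary.
CREATE VIEW person_with_age AS
SELECT person_id,
       birth_date,
       FLOOR((CURRENT_DATE - birth_date) / 365.25) AS person_age
FROM   person;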

Having a data derivation mechanism in a warehouse can improve performance. Most often, because warehouses contain billions of raw data items which get regularly updated, keeping a snapshot of, say, inventory information can result in slow performance because the information is so bulky that many data warehouses do not even include it. This problem, which arises from bulky inventory data, can be overcome by keeping the inventory information at a weekly or even monthly level and then using a data derivation formula to minimize the size of the data sets.

Problems arising from data derivation can be hard to find. Therefore, data derivation formulas should always be carefully planned and documented so that the flow of day-to-day operations will be smooth and very efficient.

What is Data Partitioning

Data Partitioning is the formal process of determining which data subjects, data occurrence groups, and data characteristics are needed at each data site. It is an orderly process for allocating data to data sites that is done within the same common data architecture.

Data Partitioning is also the process of logically and/or physically partitioning data into segments that are more easily maintained or accessed. Current RDBMS products provide this kind of distribution functionality. Partitioning of data helps in performance and utility processing.

Data partitioning can be of great help in facilitating the efficient and effective management of a highly available relational data warehouse. But data partitioning can be a complex process, with several factors that affect partitioning strategies and design, implementation, and management considerations in a data warehousing environment.

A data warehouse which is powered by a relational database management system can provide a comprehensive source of data and an infrastructure for building Business Intelligence (BI) solutions. Typically, an implementation of a relational data warehouse involves the creation and management of dimension tables and fact tables. A dimension table is usually smaller in size than a fact table, but both provide details about the attributes used to describe or explain business facts. Some examples of dimensions include item, store, and time. On the other hand, a fact table represents a business recording, like item sales information for all the stores. All fact tables need to be periodically updated using the data most recently collected from the various data sources.

Since data warehouses need to manage and handle high volumes of regularly updated data, careful long-term planning is beneficial. Some of the factors to be considered in the long-term planning of a data warehouse include data volume, the data loading window, the index maintenance window, workload characteristics, the data aging strategy, the archive and backup strategy and hardware characteristics.

There are two approaches to implementing a relational data warehouse: the monolithic approach and the partitioned approach. The monolithic approach may contain huge fact tables which can be difficult to manage.

There are many benefits to implementing a relational data warehouse using the data partitioning approach. The single biggest benefit of a data partitioning approach is easy yet efficient maintenance. As an organization grows, so will the data in the database. The need for high availability of critical data while accommodating a small database maintenance window becomes indispensable. Data partitioning can answer the need for a small database maintenance window in a very large business organization. With data partitioning, big issues pertaining to supporting large tables can be addressed by having the database decompose large chunks of data into smaller partitions, thereby resulting in better management. Data partitioning also results in faster data loading, easy monitoring of aging data and an efficient data retrieval system.

Data partitioning in a relational data warehouse can be implemented by partitioning objects such as base tables, clustered and non-clustered indexes, and indexed views. Range partitions refer to table partitions which are defined by a customizable range of data. The end user or database administrator can define the partition function with boundary values, a partition scheme with filegroup mappings, and tables which are mapped to the partition scheme.
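
As a minimal sketch of the range partitioning just described, the statements below create a partition function with yearly boundary values, a partition scheme, and a fact table mapped to that scheme; the syntax resembles SQL Server's, and the object names and boundary dates are illustrative assumptions.

-- Partition function: yearly boundary values for the sale_date column.
CREATE PARTITION FUNCTION pf_sales_by_year (DATE)
AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

-- Partition scheme: map every range to a filegroup (all to PRIMARY here for brevity).
CREATE PARTITION SCHEME ps_sales_by_year
AS PARTITION pf_sales_by_year ALL TO ([PRIMARY]);

-- The fact table is created on the partition scheme, partitioned by sale_date.
CREATE TABLE fact_sales_partitioned (
    sale_date  DATE          NOT NULL,
    branch_key INTEGER       NOT NULL,
    revenue    DECIMAL(12,2)
) ON ps_sales_by_year (sale_date);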

There are many ways in which data partitioning can be implemented. Implementation methods vary depending on the database software vendor or developer, and management of the partitioned data can vary as well. But the important thing to note is that regardless of the software application implementing data partitioning, the benefits of separating data into partitions will continue to accrue to data warehouses, which have now become standard requirements for large companies to operate efficiently.

What is Data Repository


Data Repository is a logical (and sometimes physical) partitioning of data where
multiple
databases which apply to specific applications or sets of applications reside. For
example,
several databases (revenues, expenses) which support financial applications (A/R,
A/P)
could reside in a single financial Data Repository.

A data warehouse is one large Data Repository of all business-related information, including all historical data of the business organization implementing the data warehouse. Data warehousing is a complex process of building a data repository in the form of a relational database so that the company can support web or text mining in order to leverage data and transform or aggregate it into useful information.

In all cases, organizations use data warehousing to gain a competitive advantage and to support decision-making processes through comprehensive data analysis.

Some of the key components of data warehousing are Decision Support Systems (DSS)
and
Data Mining (DM).

Data volumes in a data warehouse can grow at an exponential rate, so there should be a way to handle this tremendous growth. With respect to storage requirements, the critical needs that must be seriously considered in a data warehouse are high availability, high data volume, high performance and scalability, simplification and usability, and easy management.

Partitioning of data into logical, or in some cases physical, Data Repositories can greatly help meet the requirements related to the exponential growth of data volumes in the data warehouse. If all the data in the data warehouse were not partitioned into several Data Repositories, there would be profound disadvantages in terms of performance and efficiency.

For one, if the central server fails, the system would come to a halt. This is because the data is located in just one monolithic system, and when the hardware fails, there is no backup of sorts. It may take some time to get the server up, depending on the nature of the problem. But in a business company, even a few minutes of business stoppage can already translate into thousands of potential dollars lost from the business.

When Data Repositories are employed in the data warehouse, the load can be distributed across many databases or even across many servers. For instance, instead of having one computer handle the database related to customers, several databases could handle the different aspects of customers.

In a very large company, such as one that has several branches around the country, instead of keeping all the customers in one database, several databases may handle the different branch customer databases within a data repository. Or, as earlier mentioned, several company departmental databases may be broken down into various Data Repositories, such as one data repository supporting several databases (revenues, expenses) which support financial applications (A/R, A/P).

A Data Repository offers easier and faster access due to the fact that related information is, to some degree, lumped or clustered together. For instance, in the example with the financial Data Repository, anybody from the financial department, or any other data user wanting information related to financials, will not have to dig through the entire volume of data in the data warehouse.

For database administrators, employing Data Repositories means a much easier way to maintain the data warehouse system because of its compartmentalized nature. When there is a problem within the system, it may be easy to trace the cause of the problem without having to use a top-down approach for the whole data warehouse. In most companies, one database manager or administrator is usually assigned to one data repository to ensure data reliability for the whole system.

What is Data Scheme

A Data Scheme is a diagrammatic representation of the structure of data. It represents any set of data that is being captured, manipulated, stored, retrieved, transmitted, or displayed. A Data Scheme can be a complex diagram with all sorts of geometric figures illustrating the data structures and their relationships to one another in the relational database within the data warehouse.

As an example, let us take a generic website and illustrate the Data Scheme.

One of the general data categories in a website Data Scheme is User Accounts and Privileges and Watchlist. The Data Scheme may draw one big box for User Accounts and Privileges and Watchlist. Within this big category are four smaller data category boxes named User, Watchlist, User Group and User New Talk.

The User box contains basic account information about users such as name, password, preferences, settings, email address and others. The Watchlist box contains registered users and the pages each user watches, the namespace number, notification timestamp and others. The User Group box maps users to their groups with defined privileges. The User New Talk box stores notifications of user talk page changes for the display of the "You have new messages" box.

Within each of the four boxes, the data are defined together with their names and data types. The User box may contain the following data, with the corresponding names and types:

user_id: INTEGER (5)
user_fullname: VARCHAR (255)
user_password: TINYBLOB
user_email: TINYTEXT
user_options: BLOB
user_token: CHAR (32)

The same structure goes for the other tables as well. It is very clear that all data structures are defined with names and data types. In a real data scheme diagram, there could be hundreds of boxes, data names and types, and crossing lines connecting one entity to another.
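
As a minimal sketch of how the User box above might be realized as a physical table, the statement below uses the field names and MySQL-style types listed in the scheme; the table name user_account is an illustrative assumption, not the actual schema of any particular website.

-- One possible physical realization of the "User" box from the data scheme above.
CREATE TABLE user_account (
    user_id       INTEGER      PRIMARY KEY,
    user_fullname VARCHAR(255),
    user_password TINYBLOB,    -- stored credential, normally a hash rather than plain text
    user_email    TINYTEXT,
    user_options  BLOB,
    user_token    CHAR(32)
);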

The graphical look of a data scheme has some similarities to a flowchart, which is a schematic representation of an algorithm or a process. But while a flowchart allows business analysts and programmers to locate the responsibility for performing an action or making correct decisions, and shows the relationships between different organizational units with responsibility over a single process, a data scheme is merely a graphical representation of data structure. There is no mention of any process whatsoever.

It may be a good point to note that the Data Scheme may or may not represent the real layout of the database; it is just a structural representation of the physical database. To a certain degree, the data scheme is a graphical representation of the logical schema in data modeling. A logical schema is technically a data model that describes semantics as data are represented by a particular data manipulation technology.

Data Schemes are handy guides for database and data warehouse implementation. They can be compared to an architect's blueprint of a house or a building, wherein it is easy to locate some key points without spending too much time digging deep into minute details. Because of their graphical nature, professionals implementing a data warehouse will not have to strain their eyes on data structures and can focus more on other details, especially in dealing with programmatic code.

Data Schemes are also highly useful in troubleshooting databases. If some parts of the database are faulty, Data Schemes help to pinpoint the cause of the error. Some errors in databases and computer programming languages which are not related to syntax can be very hard to trace. Logic errors and errors related to data can be very hard to pin down, but with the help of a graphical Data Scheme, such errors may be easier to spot.

Data Store

A data store is a very important aspect of a data warehouse in that it supports the company's need for up-to-the-second, operational, integrated, collective information. It is a place where data such as databases and flat files are saved and stored. Data stores are great feeders of data to the data warehouse.

In a broad sense, a data store is a place where data is integrated from a variety of different sources in order to facilitate operations, analysis and reporting. It can be considered an intermediate data warehouse for databases, despite the fact that a data store also includes flat files.

Some data warehouses are designed to have data loaded from a data store which consists of tables from a number of databases which support administrative functions (like financial, human resources, etc.).

In some cases, the data store is contained in one single database, while in other cases the data store is scattered across different databases in order to allow tuning to support many different roles.

Those who prefer not to keep a data store in a single database argue that the tuning choices should be based on the very nature of the data and not on database design, and that access to the large volumes of data would otherwise be negatively affected to a certain degree. It also matters in terms of the politics of getting everyone's concurrence.

A data store is an important link in a data warehouse's staging area. A staging area is a conceptual place in the data warehouse which stands between the analytics system and the legacy systems.

Some people think of the staging area as the "back room" portion of the data warehousing environment. This is where the collective process of the data warehouse known as ETL (extract, transform and load) is done. Whenever data needs to go through the ETL process, the data warehouse gets the data from the data store, which contains all the data at rest.

The data store, being an integral part of the data warehouse architecture, is the first stop for the data on its way to the warehouse. The data store is the place where data is collected and integrated and where its completeness and accuracy are ensured.

In a lot of data warehousing implementations, data transformations cannot be completed without a full set of data being available. So, if data arrives at a high rate, it can be captured without having to constantly change data in the warehouse.

In general, data stores are normalized structures which integrate data based on certain subject areas and not on specific applications. For instance, a business organization may have more than 50 premium applications.

A premium subject area data store collects data using feeds from the different applications, providing near real-time, enterprise-wide data. The data store is constantly refreshed in order to stay current. The history is then sent to the data warehouse.

A data store can be a great tool as a reporting database for line-of-business managers and service representatives who require an integrated picture of the enterprise operational data. Some important aspects of business operation, such as operational-level reports and queries on small amounts of data, can be made more efficient by the data from the data store.

For instance, if one wants specific data for only one calendar quarter, it may be wise to just query the data store. It would be much faster, because querying the data warehouse would involve sifting through data spanning several years.
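
As a minimal sketch of such an operational-level query served from the data store rather than the warehouse, the statement below totals revenue per branch for a single calendar quarter; the table and column names (store_sales, branch_key, sale_date, revenue) are illustrative assumptions.

-- Quarterly report answered from the data store instead of scanning years of history.
SELECT branch_key,
       SUM(revenue) AS quarter_revenue
FROM   store_sales
WHERE  sale_date >= DATE '2024-01-01'
  AND  sale_date <  DATE '2024-04-01'
GROUP  BY branch_key;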

Data Thesaurus

A data thesaurus deals with understanding patterns, trends, and relationships in historical data, and with providing visual information to the decision maker. A data thesaurus helps to identify common business terms and data names. It is useful for locating data in a metadata warehouse.

A data thesaurus really consists of several kinds of metadata. Metadata is any kind of data which describes other data. The literal meaning of thesaurus, according to dictionary.com, is "an index to information stored in a computer, consisting of a comprehensive list of subjects concerning which information may be retrieved by using the proper key terms."

The data thesaurus, as part of the whole data warehouse system, is implemented in line with all business rules and the enterprise data architecture. The terms within it all pertain to business words, because these terms are chosen and assigned as subject keywords for use in data warehouse queries and data results.


Basically, a data thesaurus contains preferred terms, non-preferred terms, specifiers and indicators. Preferred terms are words which should be used to represent a given concept, despite the fact that there may be many other words which seem to fit.

For example, in a medical data thesaurus, the word "infant" may be the preferred
term
instead of the word "baby" despite the fact they are commonly used interchangeably
in the
real world.

Non-preferred terms are of course the opposite of preferred terms, but they have their own considerations too. In the event that there are two or more words which can be used to express the same concept, the data thesaurus specifies which one to use as the preferred term while listing the others as non-preferred terms.

These non-preferred terms can be synonyms, abbreviations or alternative spellings, but one is discouraged from using them. In most data thesaurus implementations, they can be easily recognized because they are written in italics. Non-preferred terms are included simply to make sure that the correct preferred term is used.

Specifiers are used when two or more words are needed to express a concept. An example would be "chief executive officer". The data thesaurus will then cross-reference the specifier against a combination of preferred terms so the system knows how to represent the group of words. In many data thesauri, specifiers are also written in italics but are typically followed by a + sign, e.g. chief executive officer+.

Indicators are similar to non-preferred terms, but they point to a selection of possible preferred terms in case no exact match can be found between the concept and a single preferred term.

ISO 2788 sets the guidelines for the establishment and development of monolingual thesauri. This standard covers all aspects of a data thesaurus, including:

. Scope and field of application

. References

. Definitions

. Abbreviations and symbols

. Vocabulary control; indexing terms (General, Forms of terms, Choice of singular or plural forms, Homographs or polysemes, Choice of terms, Scope notes and definitions)

. Compound terms (General, Terms that should be retained as compounds, Terms that should be syntactically factored, Order of words in compound terms)

. Basic relationships in a thesaurus (General, The equivalence relationship, The hierarchical relationship, The associative relationship)

. Display of terms and their relationships (General, Alphabetical display, Systematic display, Graphic display)

. Management aspects of thesaurus construction (Methods of compilation, Recording of terms, Term verification, Specificity, Admission and deletion of terms, The use of automatic data processing equipment, Form and contents of a thesaurus, Other editorial matters)

Data Warehouse Engines

Data warehouse engines handle the storage, querying and load mechanisms of large databases. It is an undisputable fact that implementing a data warehouse is a very challenging task. This becomes even more challenging and difficult when we take into consideration the diversity of both operational data sources and target data warehouse engines. Target and source data engines may be totally different when it comes to semantics, such as considerations regarding core data models.

They may also be totally different in infrastructure aspects such as the operational details of data extraction and importation. When there are no common, sharable descriptions of the structures of both the data sources and the target data warehouse engines, the result is having to acquire more data warehousing tools.

Research has shown that fifty percent growth is recorded every year in the amount of data that business organizations retain for analytic purposes. In some industries, such as e-commerce, the web, telecommunications, retail and government, the growth rate may be even higher. These increasing trends show that there is a need for more powerful data warehouse engines.

Just a couple of years ago, the data needed to power business intelligence was stored only in central warehouses and a few other data sources within the departments; now there are countless ways to deal with high volumes of data from a multitude of data sources spread across wide geographical locations.

There are many kinds of data warehouse engines. Some of these data engines are specific to relational database implementations, while some are open and can be used by any implementing database software.

The Micro-Kernel Database Engine is used by the Btrieve database developed by Pervasive. This database engine uses a modular method to separate the backend of a database from the interface used by developers. The core operations of the database, such as updating, writing and deleting records, are separated from the Btrieve and Scalable SQL modules. By doing so, programmers can use several methods of accessing the database simultaneously.

Microsoft uses the Jet Database Engine in many of its products. Jet, which stands for Joint Engine Technology, had its first version developed in 1992, consisting of three modules for manipulating a database. Jet is used for databases dealing with lower volumes of data.

For database engines that deal with larger volumes of data processing, Microsoft provided the Microsoft Desktop Engine (MSDE). This was later followed by SQL Server Express Edition and most recently by SQL Server Compact Edition. However, a Jet database can be upgraded to SQL Server.

InnoDB is a storage engine used by MySQL and is included in the current binaries distributed by MySQL AB. It features ACID-compliant support for transactions as well as declarative referential integrity. When Oracle acquired Innobase Oy, InnoDB became a product of Oracle Corporation. But InnoDB is dual-licensed, as it is also distributed under the GNU General Public License.

MyISAM is MySQL's default storage engine and is a non-transactional high


performance
storage engine which as originally developed for data warehouse applications. Based
on an
older ISAM code, the MyISAM today has many new and useful extensions. MyISAM today
is
also one most commonly used data warehouse engines.
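
As a rough illustration of how the choice of engine surfaces to the developer, the sketch below builds MySQL CREATE TABLE statements that select a storage engine per table. The table and column names are hypothetical, and the statements are only printed rather than executed against a live server.

    # Minimal sketch (assumed table/column names) showing how a MySQL table
    # declares its storage engine; the DDL is printed rather than executed.

    def create_table_ddl(table, columns, engine):
        cols = ",\n    ".join(columns)
        return f"CREATE TABLE {table} (\n    {cols}\n) ENGINE={engine};"

    # Transactional fact table on InnoDB, read-mostly summary table on MyISAM.
    print(create_table_ddl("sales_fact",
                           ["sale_id INT PRIMARY KEY", "amount DECIMAL(10,2)"],
                           "InnoDB"))
    print(create_table_ddl("monthly_summary",
                           ["month CHAR(7)", "total DECIMAL(12,2)"],
                           "MyISAM"))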

Data warehouse engines vary depending on the needs of the organization, but it is common today to acquire engines that can handle very large, terabyte-scale business intelligence applications. This gives organizations faster access to information and helps them stay competitive.

Data Warehouse Infrastructure

Data warehouse infrastructure supports a data warehousing environment with the help of a combination of technologies. In its most general definition, a data warehouse is a large repository of all sorts of data the implementing organization needs now and in the future. But the real shape of a data warehouse, and its functions and features, varies depending on the needs of the organization and what it can afford.

The overall design and methodology of a data warehouse therefore depends on the organization's data life cycle management policy. The general data life cycle runs through pre-data warehouse, data cleansing, data repository and front-end analytics stages.

The pre-data warehouse is the stage or area where designers determine which data contains business value and should be brought in. The infrastructure found in this area includes the online transaction processing (OLTP) databases that store operational data.

These OLTP databases may reside in transactional business software solutions such as Supply Chain Management (SCM), Point of Sale, customer service software and Enterprise Resource Planning (ERP) systems. OLTP databases need very fast transactional speeds and pinpoint accuracy.

Metadata application servers can also be found within this area. Metadata, which means data about data, helps ensure that the data entering the lifecycle process are accurate, clean and well defined, and well-maintained metadata can speed up searches later on.

During data cleansing, data undergoes a collective process referred to as ETL, which stands for extract, transform, and load. Data are extracted from outside sources like those mentioned in the pre-warehouse stage. Since these data may arrive in different formats from disparate systems, they are transformed to fit the business needs and requirements before they are loaded into the data warehouse.

Tools at this phase include software applications written in almost any programming language. They can be very complex, and many companies prefer to buy them rather than build them with in-house programmers. One requirement of a good ETL tool is that it can communicate efficiently with many different relational databases. It should also be able to read various file formats from different computer platforms.
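
To make the ETL idea concrete, here is a minimal sketch in Python, assuming a hypothetical CSV export from a point-of-sale system and a local SQLite database standing in for the warehouse; a real ETL tool would handle many more formats, error cases and far larger volumes.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from an operational export (hypothetical file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: unify formats, e.g. normalise names and cast amounts.
        return [(r["sale_id"], r["product"].strip().upper(), float(r["amount"]))
                for r in rows]

    def load(records, db_path="warehouse.db"):
        # Load: write the cleaned records into the warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales "
                    "(sale_id TEXT, product TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
        con.commit()
        con.close()

    # Example usage (assumes the export file exists):
    # load(transform(extract("pos_export.csv")))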

At the data repository phase, data are stored in the corresponding databases. This is also the phase where active data of high business value to the organization are given priority and special treatment. Data repositories may be implemented as data marts or operational data stores (ODS).

A data mart is smaller than a data warehouse and more specific, in that it is built at a departmental level rather than a company-wide level. An ODS is a sort of resting place for data; it holds recent data before they are migrated to the data warehouse. Whether a data warehouse implements both or not, the tools at this stage are all related to databases and database servers.

Front-end analytics may be considered the last and most critical stage of the data warehouse cycle. This is where data consumers interact with the data warehouse to get the information they need. Some of the tools used in this area are data mining applications, which are used to discover meaningful patterns in an otherwise chaotic repository.

Another tool is Online Analytical Processing (OLAP), which is used to analyze the organization's historical data and slice out the required business information. Other tools are generic reporting or data visualization applications, which let end users see the information in visually appealing layouts.

Data Value

Data values are the actual contents that occupy the data variables set aside for data entities and their attributes. They consist of the facts and figures behind data items, data attributes and data characteristics.

A data model has a structural part (the collection of data structures used to create the objects and entities modeled by the database), an integrity part (the rules that govern the constraints placed on those structures) and a manipulation part (the collection of operations that can be applied to the structures to update and query them). Data values are the concrete content behind all of these abstract parts.

For example, a database may have a table called "employee" with attributes such as first name, family name, address, age, email address, marital status, job title, monthly salary and many others. All of these are simply descriptions of the entity, and they are the building blocks of the table structure.

They do not yet have values until somebody inserts real values into them. An end user may then add a record about a new employee, so the following data values might be entered into the table:
JOHN (first name);
SMITH (family name);
15 OAK AVENUE, BRONX, NEW YORK, USA (address - in most cases, the address is broken down into street number, state, zip code, country, etc.);
35 (age);
JS@SMITH.com (email);
SINGLE (marital status);
CEO (job title);
$4000 (monthly salary).

Because every column within a database table is assigned a certain data type, each data attribute must draw its value from the specific set of values allowed by that data type.

For instance, if a column is defined to accept only integer values, it can never accept a letter or a string of letters. In the example above, the age attribute may be defined as an integer type, which for an 8-bit unsigned integer accepts the range 0-255. So no user can enter the value "thirty two" into the age field.
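
As a minimal illustration of this kind of type enforcement, the sketch below uses SQLite (the table and column names are only examples); the CHECK constraint rejects a non-integer age at insert time.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE employee (
            first_name TEXT,
            age INTEGER CHECK (typeof(age) = 'integer')
        )
    """)

    con.execute("INSERT INTO employee VALUES ('JOHN', 35)")      # accepted

    try:
        con.execute("INSERT INTO employee VALUES ('JANE', 'thirty two')")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)   # the CHECK constraint refuses the bad value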

A data value domain is the definition of an explicit set of restrictions on the values allowed within a data type, and it is very useful for data validation. On top of these structural restrictions, semantic rules may further narrow the set of valid values for an attribute to a subset of the structurally allowed values.

For example, take the case of US Social Security numbers. In the database table, the data type for the Social Security number may be a character type such as VARCHAR(11), and on top of the data type there are structural and semantic restrictions. The structural restriction may take the form of 3 digits (0-9), followed by a hyphen (-), followed by 2 digits, another hyphen, then 4 digits. The semantic restrictions, on the other hand, specify rules about the number itself.

In actual implementation, the first 3 digits refer to the state or area. The next 2 digits refer to the group number, which is issued in a given order, such as odd numbers from 01 through 09 followed by even numbers from 10 through 98.
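
A minimal sketch of the structural restriction described above, using a regular expression in Python; the semantic rules about area and group numbers would need additional checks and are not covered here.

    import re

    # Structural check only: 3 digits, hyphen, 2 digits, hyphen, 4 digits.
    SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

    def is_structurally_valid_ssn(value):
        return bool(SSN_PATTERN.match(value))

    print(is_structurally_valid_ssn("123-45-6789"))   # True
    print(is_structurally_valid_ssn("123456789"))     # False, missing hyphens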

All these definitions and restrictions ensure that the data values entered into the database table are always structurally correct and consistent. Any remaining problems can only come from the data entry itself, but the structure will always be correct.

Data values need to be clean at all times, because they are the source of the information that gives an organization a better picture of itself and allows it to make wise decisions and moves.

Database

A database is a collection of logically related data. As used in computer science, a database is a well-defined, structured collection of data stored digitally in a computer system. Databases are designed so that data can be easily stored and retrieved using database queries, sets of statements written in a language the database system can understand. The computer program employed to manage and query a database is called a database management system (DBMS).

There are many different ways to model the data structure inside a database. These models include the Flat model, Hierarchical model, Network model, Relational model, Object database models and Post-relational database models.

The Flat model, also called the table model, is made up of a single two-dimensional array of data elements. All members of a given column are assumed to contain similar values, while all the members of a row are assumed to relate to one another. Flat models are no longer popular today because they are hard to manage as the volume of data rises.

The Hierarchical model organizes data into a tree-like structure, with a single upward link in each record to describe the nesting. It contains a sort field used to keep the records in a particular order within each list at the same level.

The Network model, as the name implies, stores records that link to other records. Pointers, which can be node numbers or disk addresses, are used to track all the associations within the database.

The Relational model is today the most widespread model used in database implementations. It deals with tables, columns and rows. A table contains information about an entity, which represents some thing of interest in real life, and its columns contain the attributes pertaining to that entity.

The term relational refers to the fact that the various tables in the database relate to other tables, and programmatic mechanisms make it easy to insert, update, delete and perform other operations on different tables without sacrificing data quality and integrity. Databases implemented using the relational model are managed by a relational database management system (RDBMS).
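
A minimal sketch of the relational idea in Python with SQLite, using two hypothetical tables related through a key; a full RDBMS adds much more, but the table-column-row structure is the same.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employee   (emp_id  INTEGER PRIMARY KEY, name TEXT,
                                 dept_id INTEGER REFERENCES department(dept_id));
        INSERT INTO department VALUES (1, 'Finance');
        INSERT INTO employee   VALUES (10, 'JOHN SMITH', 1);
    """)

    # A join exploits the relation between the two tables.
    for row in con.execute("""SELECT e.name, d.name
                              FROM employee e JOIN department d
                              ON e.dept_id = d.dept_id"""):
        print(row)   # ('JOHN SMITH', 'Finance')
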
Object database models are newer models under the object-oriented paradigm. They attempt to bring together the database world and the world of application programming.

The object database model tries to avoid computational overhead by creating reusable objects based on a class or template, and it introduces into the database world some of the main concepts of object-oriented programming, such as encapsulation and polymorphism.

The Post-relational database model attempts to incorporate relations while not being constrained by the information principle, which requires all information to be represented by data values in a relation. Some of the products that employ the post-relational model actually use a model that predates the relational model; examples include PICK (also known as MultiValue) and MUMPS.

Internal database considerations include storage and physical database design, indexing, transactions and concurrency, replication and security.

Databases are ubiquitous in the world of computing and are used in applications spanning virtually the entire range of computer software. They are the preferred storage method for large multi-user applications and for environments where large volumes of data are handled. Databases have become an integral part of many web server deployments, and with the fast rise of e-commerce websites they are indispensable tools for internet business.

Large enterprise data warehouses cannot run without databases. Such a sophisticated repository of data needs to be managed effectively by database management software.

Decentralized Warehouse

In a decentralized warehouse, a central gateway provides access to remote data with the help of a logical view. This central gateway processes real-time user queries, so users can access and query the remote data through it.

A data warehouse is a very large repository of a company's historical and current transactional data. For the data warehouse to handle high volumes of data efficiently while servicing a potentially large number of data consumers, certain mechanisms must be considered in its design and implementation so that the whole system runs smoothly.


One of the techniques in data warehousing is having a decentralized warehouse. A major current trend in business is gaining market power through mergers and acquisitions, and through selling off business units that a large company considers no longer efficient or no longer core competencies, so many different business set-ups need to be considered in data warehousing designs.

In the case of mergers and the sale of business units, large organizations often end up restructuring their activities and their data, and such restructuring benefits from implementing a decentralized warehouse.

A decentralized data warehouse separates data management by key areas of the business enterprise. For instance, a warehouse management system can be a stand-alone decentralized system or can be operated alongside a centrally operated enterprise resource planning (ERP) system.

A very large business enterprise may have implementations such as Goods Movement in a Decentralized Warehouse, Goods Receipt in a Decentralized Warehouse, Goods Issue in a Decentralized Warehouse, Stock Transfers in a Decentralized Warehouse and Posting Changes in a Decentralized Warehouse. These are just a few examples of an enterprise warehousing scheme; in a real-life scenario there can be many more areas that need a decentralized data warehouse.

For these decentralized data warehouses to work together, collaborate with the business intelligence system and deliver valuable information to enterprise data consumers, there has to be an integration mechanism. The integration can be done by a decentralized warehouse management system.

A decentralized warehouse management system usually works by defining warehouse numbers, assigning warehouse numbers to combinations of plants and storage locations, activating the warehouse management system, defining and activating the interface, defining output types, defining the logical systems for the decentralized warehouse, generating the distribution model, generating partner profiles and defining the transmission model within the data warehouse.
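
As a rough illustration of the kind of configuration described above, the sketch below models warehouse numbers and their assignment to plant/storage-location combinations as plain Python data; the numbers and names are hypothetical and the structure is only indicative, not any vendor's actual configuration format.

    # Hypothetical warehouse-number assignments for a decentralized setup.
    warehouse_assignments = {
        "W01": {"plant": "P100", "storage_location": "SL01", "logical_system": "WMS_EAST"},
        "W02": {"plant": "P200", "storage_location": "SL04", "logical_system": "WMS_WEST"},
    }

    def logical_system_for(plant, storage_location):
        # Route a goods movement to the decentralized system owning that location.
        for wh, cfg in warehouse_assignments.items():
            if cfg["plant"] == plant and cfg["storage_location"] == storage_location:
                return wh, cfg["logical_system"]
        raise LookupError("no decentralized warehouse assigned")

    print(logical_system_for("P200", "SL04"))   # ('W02', 'WMS_WEST')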

Decentralizing a warehouse can be more efficient than having one monolithic structure, because each of the decentralized databases is controlled by a specific department. Data consumers looking for specific data do not have to wait for a single system to scan through very large volumes of data before getting the desired information.

One of the main considerations when implementing a decentralized warehouse is the network infrastructure. Since data will be coming from different warehouses, huge volumes of data may be shared at regular intervals, and this only increases when more data consumers are active at the same time. The network should be able to handle this traffic efficiently.

Many commercial decentralized warehouse management software applications are available on the market today. Most are classified as business solutions, in the same category as enterprise resource planning. Investing in such solutions may be expensive initially, but the return on investment is usually worthwhile.

Decentralized Database

In a decentralized database, a big database is partitioned according to business requirements in such a way that each smaller database represents a specific data subject.

Today most business organizations, from small and medium-sized firms to large multinational corporations, can hardly operate without relying on information. The term "data-driven" is in wide use and has become all too real in cutting-edge business operations.

With the fast advancement of information technology, database management systems are becoming more and more capable, and many very specific database software solutions are coming onto the market. In the past, it was common to have one central database serve all of an organization's needs.

Many information system designers and architects have long held the belief that central control is better for database management. From this standpoint, a centralized server handling all data in one logical and physical system is good for data integrity and less expensive economically, because it avoids the cost of redundant systems.

But with the arrival of more advanced hardware that is relatively cheap for the speed and efficiency it delivers, a decentralized database has become the better choice in many cases.

From the standpoint of practicality in today's business setting, a decentralized database offers more speed and flexibility. Business organizations already involve many processes, such as wholesale distribution, discrete manufacturing, retail, professional services, financials, human resources and many more, and each of these areas produces its own high volume of data.

If all of the data output from these areas is handled by one central database, the possibility of failure is high, and when a failure occurs the business process stops. Any stoppage, no matter how short, can mean a loss of revenue and income for the company.

Decentralizing the database, by partitioning it according to business- or end-user-defined subject areas and moving ownership to the owners of each subject area, can reduce the risk of database failure and business stoppage.

When the database is decentralized, each partition can be managed by a specific user or group. For instance, one database may be managed by the financial group, another by the administrative group, and the remaining partitions by the sales, human resources, customer relationship, procurement and manufacturing departments and so on, depending on the company's set-up.

With this kind of set-up, data integrity can be maintained more securely because each department has a better sense of responsibility. If something goes wrong, it is easy to pinpoint which department caused the problem, and a specific person or group can take responsibility.

A decentralized database can also significantly boost the access and processing speed of the whole system. In a centralized set-up, when a data consumer wants to view, say, a particular sales report, the database has to scan through the whole central store, which can slow the entire system. With a decentralized database, the system can immediately direct the data consumer's query to the specific departmental database where the sales report is stored.
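
A minimal sketch of that routing idea, assuming each subject area lives in its own SQLite file; the subject names and file paths are hypothetical.

    import sqlite3

    # Hypothetical mapping from subject area to its departmental database file.
    subject_databases = {
        "sales":   "sales.db",
        "finance": "finance.db",
        "hr":      "hr.db",
    }

    def query_subject(subject, sql, params=()):
        # Direct the query straight to the department that owns the subject area.
        with sqlite3.connect(subject_databases[subject]) as con:
            return con.execute(sql, params).fetchall()

    # Example: only the sales database is scanned for a sales report.
    # rows = query_subject("sales", "SELECT * FROM monthly_report WHERE month = ?", ("2009-06",))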

Finally, with a decentralized database, a single failure is far less likely to bring the whole system down and temporarily halt the business, because the data are spread across different departments within the organization. This means less potential for revenue loss.

End User Data

End User Data can either be data provided by a data warehouse or the data created
by end
users for query processing.

The technical world of computing has always been divided into two general realms. In one realm are the "high priests", the knowledgeable people who know the ins and outs of computers down to their most complex details. These people shape the computer code and programs and enable computer behaviors which are rich and valuable.

In the other realm are the novice users, who are at the mercy of the high priests of computing and who can be denied or granted access to knowledge, information or education from computers.

End user data, in its essence, is the data entered or supplied by the other realm, the novices. This is not always the case, because the high priests can be suppliers of end user data too. So, at its core, end user data is the data supplied into any process written or developed by the programmers (the high priests) in order to produce the desired output.

In some cases, end user data can also refer to the data generated for end users as the result of querying a database for specific information. For example, if a user wants to know how many people there are in a specific company department, the answer the database returns for that query could be considered end user data.

While end user data entry may be done in many different ways, there are many graphical software tools available for it. In a graphical interface there are text boxes, radio buttons, combo boxes, list boxes and check boxes through which end user data can be collected easily. The text box accepts any characters or string data into the system. Radio buttons collect data pertaining to a selection, but in most cases a radio button group accepts only one choice from among the many.

The checkbox is similar to the radio button except that it can accept more than one choice. The list box and combo box present the choices as a list. These small components typically make up an end user data entry interface.
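
A minimal sketch of how such widget values might be collected and checked before use, assuming a hypothetical form with one single-choice field (radio-button style) and one multi-choice field (checkbox style).

    # Hypothetical end user form input as it might arrive from a GUI layer.
    form_data = {
        "first_name": "JOHN",                     # text box: free string
        "marital_status": "SINGLE",               # radio buttons: exactly one choice
        "departments": ["SALES", "MARKETING"],    # checkboxes: zero or more choices
    }

    ALLOWED_STATUS = {"SINGLE", "MARRIED", "DIVORCED"}

    def validate(data):
        # Enforce the single-choice and multi-choice semantics described above.
        assert isinstance(data["first_name"], str) and data["first_name"]
        assert data["marital_status"] in ALLOWED_STATUS
        assert isinstance(data["departments"], list)
        return data

    validate(form_data)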

End user data can be said to be the lifeblood of an information system. It is the very data on which processes run to produce the output information that the organization will use.

End user data may also come not just from human end users but from various data sources as well. As is common in an enterprise information system such as a data warehouse implementation, several physical computers each run as servers and do their own computational work.

The output of one data source may serve as end user input to another, and so on. As a system of many data sources grows in complexity, end user data may come from several sources: one piece of user data may feed a process on another data source computer, and that process's output may be sent back onto the network to be used by yet another computer as end user data in turn.

The mixed use and exchange of end user data is an indication of how complex an information system is. The more data sources there are, the more complex the data exchange becomes. The internet is composed of many servers, each communicating not just with one another but with end users through browsers as well. One can hardly imagine the amount of end user data traversing the internet every single second of the day.


Metadata Synchronization

Metadata are data about data; each piece of metadata describes an individual data item, a content item or a collection of data that includes multiple content items. Metadata synchronization consolidates related metadata from different systems and keeps them in step for easier access.

Metadata are very important components of any data warehouse implementation because they greatly facilitate the understanding, use and management of data. The metadata required for efficient data management vary depending on the type of data and the context in which the metadata are used.

For instance, in a library database system the data collection may involve the book titles being stocked, so the metadata would be about each title and would often include a description of the content, the book's author, the date of publication and the physical location.

In the context of a camera, the metadata would describe the photographic image: the date the photograph was taken and other details that pertain to the camera settings. Within the context of an information system, the data pertain to the contents of the computers' files, so the metadata may include the individual data items, the names of the fields and their lengths, and many other aspects of the file.
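
A small sketch of how such context-dependent metadata records might look side by side, and how a simple synchronization step could reconcile two systems' descriptions of the same item; the field names are hypothetical.

    # Two systems describing the same content item with slightly different metadata.
    library_record = {"title": "Data Warehousing", "author": "J. Doe",
                      "published": "2005", "location": "Shelf 4B"}
    catalog_record = {"title": "Data Warehousing", "author": "J. Doe",
                      "published": "2005", "isbn": "0-0000-0000-0"}

    def synchronize(*records):
        # Consolidate related metadata: later records fill in missing fields.
        merged = {}
        for record in records:
            for key, value in record.items():
                merged.setdefault(key, value)
        return merged

    print(synchronize(library_record, catalog_record))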

In a real data warehouse environment there would be millions of metadata records describing the millions of data items coming from different data sources.

For example, a multinational corporation may have several branches spread across wide geographical locations. In each branch there may be data coming from several aspects of the business, such as sales, procurement, manufacturing, human resources, purchasing and the many other departments of a modern business.

Each department produces its own sets of data, and even though these data are primarily needed to run and manage the department itself, the departmental data are also needed by the entire business enterprise to produce the statistical analyses and reports that become the basis for future decisions to move the company forward and compete in its industry.

To efficiently manage the information system in general and the data warehouse in particular, there has to be a way to "iron out" the disparity of all sorts of data, including the metadata itself. Since the very structure of an enterprise information system involves disparate data sources producing disparate data, a standard set of processes called ETL (extract, transform and load) takes in all sorts of data from disparate systems and platforms and transforms them into a unified format that the business enterprise can understand and process efficiently.

The same is true of metadata. They need to be synchronized so that the information system knows which metadata comes from where, which department needs it, and during which period it will be delivered.

Given the different geographical locations of the company's branches, metadata should also be synchronized so that the entire system gives the impression of a barrier-free information system.

Synchronization is not just for metadata; it applies to other kinds of computing processes as well. It ensures that end users get up-to-date, relevant and accurate data so that their decisions are based on facts.

Information Consumer

Information consumers are everywhere, and it has become a fact of life that data and information are driving forces in almost all aspects of our daily operations. With the ubiquity of internet connections, today's information consumers include people of all ages and walks of life, and even non-humans such as artificial intelligence technologies are fast becoming major information consumers.

Some information systems give different access privileges to different kinds of information consumers. For instance, administrative staff may only gain access to information in administration-related databases, while sales staff may only access sales-related data.

In another related aspect of information systems management, it is common to encounter access privileges specifying who has read-only access to files and who has read and write access. This ensures that there are no breaches or unintentional alterations of files that could cause disaster in the entire information system.
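
A minimal sketch of that kind of privilege check, with hypothetical roles and permissions; real systems use far richer access control models.

    # Hypothetical role-based privileges: which roles may read or write which data area.
    privileges = {
        "admin_staff": {"admin_db": {"read", "write"}},
        "sales_staff": {"sales_db": {"read"}},
    }

    def can_access(role, data_area, action):
        # Grant the action only if the role holds that permission for the area.
        return action in privileges.get(role, {}).get(data_area, set())

    print(can_access("sales_staff", "sales_db", "read"))    # True
    print(can_access("sales_staff", "sales_db", "write"))   # False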

The best way to picture information consumers that are not humans is to consider a data warehousing environment. This environment hosts many computer servers running database management systems and other application programs that produce data, whether in flat-file format or other digital formats.

A data warehouse is a repository of all sorts of enterprise historical and current transaction data. As such, it needs to process and store very high volumes of data at very short intervals. Because of the processing load this high volume demands, some of the data processing and storage is distributed across other data stores.

As the whole system runs, each of the data stores, as well as the central data warehouse itself, takes its turn as an information consumer of the others while also taking its turn as a distributor.

Information consumption grows in direct proportion to the exponential growth of data. Every day, as new technologies evolve on the internet, as in the emergence of Web 2.0, more and more information consumers arrive, and they come with bigger information demands.

The emergence of many social networking sites, for instance, has turned even ordinary grade school pupils into information consumers, often without their being aware of it: the moment they try to access their friends' social networking profiles, they are already consuming information.

The ubiquity of e-commerce websites has also produced more information consumers with very high demands. Most customers of e-commerce websites make their transactions on the internet, and most of these transactions involve very sensitive information such as bank account details and credit card numbers.

Information systems processing such sensitive data at the back end need to implement strong security features to keep out the "bad and unwelcome" information consumers who may be lurking in dark corners of the internet, waiting to fish for sensitive information.

There are also many programs and tools that simply sit on a server as information consumers, waiting for data to arrive and be processed accordingly. Some of these tools are called middleware; they act as an interface between an application and the server, so in effect a piece of middleware acts both as an information consumer and as an information distributor.

In an enterprise setting, the most powerful information consumers are those in top positions, such as the chief executive officer and others holding managerial posts. But before information arrives on their desktops, the business intelligence processes have already acted as earlier information consumers, getting most of their data from the enterprise data warehouse.

Intuitive Data Warehouse

A data warehouse is a repository of a business organization's historical data. It is a large part of an enterprise data management system, which consists of several servers running on different kinds of platforms and database management systems.

In common practice, within an enterprise data management system it is the data warehouse that contains static data, while the operational data store contains the dynamic data that is frequently updated during the course of business operations.

To illustrate this further, an enterprise data management environment may contain plenty of servers and database systems which constitute the various data stores, and these servers may run on different platforms with database management systems from different vendors.

Each data store gathers data for the department it serves or for another special function it is designed to perform. During business operations these servers send their data to the operational data store, which acts as the unifying area where disparate data from the various data stores are extracted and transformed into a unified structure based on the enterprise data architecture.

The process of unifying disparate data is referred to as ETL, which stands for extract, transform and load. Extraction and transformation are mostly done in the operational data store before the transformed data is "loaded" into the data warehouse. With this picture, in which the data warehouse only gets the loading part, many people form the impression that the data warehouse is a mere static repository that does little except accept data for storage.

In fact, the concept of a data warehouse is taken from the analogy with real-life warehouses, where goods are stored until the need arises to retrieve them. So it is with data: the operational data store goes to the data warehouse to fetch data and processes it in the operational data store area. Hence the term operational, because it refers to the data currently being operated on or manipulated.

But modern data warehouses are no longer as static as they seem. Data warehouses today are managed by software tools with functionality that allows the warehouse itself to track data and perform all sorts of analysis on the movement of data from the warehouse to the other data stores and back.

Many data warehouses employ a technology known as Online Analytical Processing (OLAP), which helps provide answers to multidimensional analytical queries. Most areas of the business, including sales and marketing reporting, management reporting, business process management (BPM), budgeting and forecasting, and financial reporting, use OLAP to retrieve information from the data warehouse so that the company can spot the trends and patterns that form the basis of corporate decisions.
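
As a rough, simplified illustration of the kind of multidimensional question OLAP answers, the sketch below aggregates a tiny hypothetical sales fact table by two dimensions using SQLite; real OLAP engines work on cubes and far larger volumes.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sales (region TEXT, month TEXT, amount REAL);
        INSERT INTO sales VALUES ('EAST', '2009-01', 120.0),
                                 ('EAST', '2009-02', 150.0),
                                 ('WEST', '2009-01',  90.0);
    """)

    # Slice the data by the region and month dimensions and aggregate the measure.
    for row in con.execute("""SELECT region, month, SUM(amount)
                              FROM sales GROUP BY region, month"""):
        print(row)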

Many companies offer data warehousing software solutions that come with sophisticated, proprietary intuitive functions. Many of these vendors even offer integrated solutions that combine data warehousing with complex features such as data transformation, management, analytics and delivery components.

Having an intuitive data warehouse greatly increases the overall performance of the enterprise data management system, because the warehouse can share some of the load that would otherwise fall on the operational data stores, which handle very labor-intensive processing from ongoing business operations.

Primary Data Source


An enterprise data management system that consists of data stores and a data warehouse may have several data sources. The primary data source is the first site at which the original data is stored after its origination.

Imagine the data warehouse whose database is the repository of all of the company's historical data; the data warehouse is the corporate memory. Then there is Online Analytical Processing (OLAP), which handles all sorts of data so that analysis can be the basis for wise and sound corporate decisions. And there is Online Transaction Processing (OLTP), which handles online, real-time transactions such as those of an automated teller machine or a retail point of sale. In short, enterprise data management handles very high volumes of data every single minute, all year round, for as long as the business is operating.

An enterprise data management information system has a data store that is a dynamic place for data coming from different data sources, which deliver disparate data from different platforms. This is where the disparate data are processed in the series of activities called ETL (extract, transform, load) so that they can be put into a unified format before further processing.

The data that periodically arrive at the data store come from the data sources.

For instance, take the case of the United States Environmental Protection Agency, which implements the Envirofacts Data Warehouse, as an example of where data sources and primary data sources apply. This agency is very large and naturally deals with large volumes of data, so its data handling is broken down into many individual EPA databases administered by program system offices. Sometimes industry is required to report information to the state where it operates, and sometimes the information is collected at the federal level.

So the data sources of the Envirofacts Data Warehouse provide information that makes it easy to trace the origin of the data. Some of these data sources are:

Superfund Data Source - This data source comes from Superfund sites, the uncontrolled hazardous waste sites designated by the federal government to be cleaned up. Information about these sites is stored in the Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS), which has been integrated into Envirofacts.

Safe Drinking Water Information Data Source - This database stores information related to drinking water programs.

Master Chemical Integrator Data Source - This database integrates the various chemical identifications used in four program system components.

Other data sources of the Envirofacts Data Warehouse are Hazardous Waste Data, Toxics Release Inventory, Facility Registry System, Water Discharge Permits, Drinking Water Microbial and Disinfection Byproduct Information and the National Drinking Water Contaminant Occurrence Database.

All these data sources contribute seemingly unrelated data which may come in disparate file formats, and which may also come from different geographical locations and different state and federal agencies within the United States. The data they share finally converge in a central data warehouse, which manages them so they become more meaningful and relevant when redistributed or shared with anybody who needs them.

Each of these sources may or may not act as the primary data source. For example, if the data originating from the Safe Drinking Water Information Data Source actually comes from yet another source, then the Safe Drinking Water Information Data Source is not a primary data source. If the data really comes from the raw activity of the department where the actual paperwork took place, then that department may be a primary data source.

Primary Key

Also known as a primary keyword or a unique identifier, a primary key is a key used in a relational database to uniquely represent each record. It is a set of one or more data characteristics whose value uniquely identifies each data occurrence in a data subject. It can be any unique identifier in a table's records, such as a driver's license number, a social security number or a vehicle identification number. There can be only one primary key per table, and primary keys typically appear as columns in relational database tables.

The administrator chooses the primary key for a relational table, and it is quite possible to change the primary key for a given database when the specific needs of the users change. For instance, in some areas and applications it may be more convenient to uniquely identify people by their telephone numbers than by their driver's license numbers.

There are many kinds of keys in a database implementation, but a primary key is a special case of a unique key. One of the biggest distinctions between a primary key and other unique keys is that an implicit NOT NULL constraint is automatically enforced, so the primary key can never contain a NULL value. Another distinction is that primary keys must be defined using a specific syntax.

The relational model, as expressed through relational calculus and relational algebra, does not itself distinguish between primary keys and other kinds of keys; primary keys were added to the SQL standard mainly as a convenience to programmers, database developers and administrators. Primary keys, like other unique keys, can be referenced from outside their table by foreign keys.

One of the most important steps in designing a good database is choosing the primary key. Each table needs a primary key so that row-level access can be guaranteed. Once an appropriate primary key is chosen, a primary key value can be specified that lets a person query each table row individually and modify each row without altering other rows in the same table. The values in a primary key column are unique, so no two values will ever be the same.

Each database table has one and only one primary key, which can consist of one or many columns. A concatenated (composite) primary key comprises two or more columns. There may be several columns or groups of columns in a single table that could serve as a primary key; these are called candidate keys. A table can have more than one candidate key, but only one candidate key can become the primary key of the table.

There are cases in database design where the natural key that uniquely identifies a tuple in a relation is difficult to use in software development, for example when it involves multiple columns or large text fields. This difficulty can be addressed by employing what is called a surrogate key, which can then serve as the primary key. In other cases there may be more than one candidate key for a relation with no clear preference among them; here too a surrogate key can be used as the primary key, to avoid giving one candidate key artificial primacy over the others.
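
A minimal sketch of these key choices in SQLite (table and column names are just examples): a single-column primary key, a composite primary key, and a surrogate key standing in for an awkward natural key.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Single-column primary key; NOT NULL stated explicitly for SQLite.
        CREATE TABLE vehicle (vin TEXT PRIMARY KEY NOT NULL, model TEXT);

        -- Composite (concatenated) primary key over two columns.
        CREATE TABLE enrollment (
            student_id INTEGER,
            course_id  INTEGER,
            PRIMARY KEY (student_id, course_id)
        );

        -- Surrogate key: an auto-generated id replaces a multi-column natural key.
        CREATE TABLE person (
            person_id INTEGER PRIMARY KEY AUTOINCREMENT,
            full_name TEXT,
            phone     TEXT
        );
    """)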

Data Types

What is Central Data Warehouse

A Central Data Warehouse is a repository of company data in which a database is created from operational data extracts. This database adheres to a single, consistent enterprise data model to ensure consistency in decision-making support across the company.

A Central Data Warehouse is a single physical database containing business data for a specific functional area, department, branch, division or the whole enterprise. The choice of a central data warehouse is commonly based on where the largest common need for informational data exists and where the largest number of end users are already connected to a central computer or network.

A Central Data Warehouse may contain all sorts of data for any given period. Typically, it contains data from multiple operational systems and is built on an advanced relational database management system or some form of multi-dimensional informational database server.

A central data warehouse follows the computing style of having all the information systems located and managed in one physical location, even if there are many data sources spread around the globe.

For a few decades now, most companies' survival has depended on being able to plan, analyze and react to fast and constantly changing business conditions. To keep up with rapid change, business analysts, managers and decision makers in the company need more and more information.

Information technology itself has evolved rapidly with the changing business environment, and today innovative IT solutions spring up like mushrooms on the internet. With these, business executives and other critical decision makers have found ways to make the most of business data.

Every single day, billions of data items are created, moved and extracted from various sources, whether from the company local area network, a wide area network or the internet. These data come in different formats, attributes and contents, but for the most part they may be locked up in disparate computer systems and can be extremely difficult and complicated to make use of.

When developing a central data warehouse, it is essential to have a balanced data warehousing strategy that answers the needs of the company. Warehouse designers should consider the audience, the scope of the service and the type of data warehouse.

Central data warehouses are created by installing a set of data access, data directory and process management facilities. A copy of all operational data should be built from a single operational system so that the data warehouse can support a series of information tools. Perhaps the most effective data warehousing strategy is to select a user population based on enterprise value; from there the company can do issue analysis.

Based on the discovered needs, a data warehouse prototype is built and populated so that end users can experiment and make appropriate modifications. Once agreement on the needs is reached, data can be acquired from current operational systems across the company or from external data sources and loaded into the warehouse.

A central data warehouse is what the company depends on for business analysis and decision making. It should have the following attributes:

Decision Making Attributes


Accuracy - the data should be valid and correctly represented in the underlying schema. It should reflect the real-life activities of the business.

Completeness - the data warehouse should have a data model whose scope includes even the most minute and seemingly trivial details about the company.

Flexibility - the data warehouse should be able to manage all sorts of data from heterogeneous sources and satisfy a wide array of requirements from end users as well as from data sources.

Timeliness - data should be submitted on a scheduled time basis so the company can get the latest updates on trends and patterns in the industry.

What is Active Data Warehouse

An Active Data Warehouse is a repository of captured transactional data of any form, stored so that it can be used to find the trends and patterns that inform future decision making.

According to Bill Inmon, a prominent data warehousing practitioner, a data warehouse is defined as subject-oriented, time-variant, non-volatile and integrated.

Subject-oriented means that the captured data is organized so that similar data are linked together. Time-variant means that data changes are recorded and tracked so that change patterns can be determined over time. Non-volatile means that once data is stored and committed, it is read-only and never deleted, so it remains available for comparison with newer data.

An active data warehouse can integrate data changes as they happen while still maintaining batch or scheduled cycle refreshes.

Companies use an active data warehouse to draw a statistical picture of the company. For example, a company may determine in which months of the year employees have the most absences, and which branch has the most and the fewest absences in a given period. It can also spot sales patterns, such as which particular products sell the most during a certain month and which countries account for the most sales. With these patterns found, the company can formulate strategies on how best to optimize sales and generate revenue.

Data warehousing emerged in the late 1980s as a type of computer database. It was developed to overcome the pattern-spotting limitations of operational systems, which could not handle the intensive processing load of company-wide reporting.

The early data warehouses were stored in separate databases designed specifically for management information and analysis. Data came from several sources, including mainframe computers, minicomputers and personal computers. These data were integrated in one place for faster processing, and user-friendly software applications were developed to present statistical reports from the integrated data.

As technology evolved, data warehousing methods improved alongside growing demands from company users, and data warehouses went through several stages of evolution. At the early stage, data is copied from an operational system database into an offline database server, where processing requirements do not affect the performance of the operational system. The offline data warehouse regularly pulls updates from the operational systems and stores the data in an integrated data structure.

A real-time data warehouse updates data at actual transaction time in the operational system. An integrated data warehouse generates transaction events which are passed back to the operational systems for workers' daily use.

Online transaction processing (OLTP) is the storage approach often used for active data warehousing. OLTP is a relational database design that breaks complex information down into simple data tables. It is very efficient at capturing billions of transactional records and presenting them for analysis and reporting in a user-friendly format. It can also be tuned to maximize computing power, although data warehousing professionals recommend keeping a separate reporting database on another computer, given that millions of records may be processed by the OLTP database every second.

Active data warehouse professionals are often called data warehouse architects. They are primarily top-notch database administrators tasked with handling huge amounts of complex data from different sources, sometimes coming from different countries around the world.

An active data warehouse is often associated with business intelligence systems. In the past, such systems were also referred to as Decision Support Systems (DSS) and Management Information Systems (MIS).

What is Active Metadata Warehouse

An Active Metadata Warehouse is a repository of metadata that helps speed up data reporting and analysis from an active data warehouse. In its simplest definition, metadata is data describing data.

Many companies have spent years and billions of dollars trying to cleanse, profile, extract, transform, load and aggregate many different types of data into a data warehouse so they could use it in any way they want, such as generating very accurate reports and sharing common business views.

Money has also been spent on many tools, such as user-friendly query applications and near-real-time updating. To some degree, these companies have data which are accurate, accessible and timely, yet according to one survey, fewer than 40 percent say they produce accurate automated reports. This is often because there never was a metadata warehouse.

Most companies use an active data warehouse to capture transaction data from many different sources. Since millions of transaction records may be processed in any given second, data storage is commonly kept on a separate computer from the company's operational system. This ensures optimal resource management on the data warehouse server. Having a separate active metadata warehouse also significantly speeds up searching, analyzing and reporting on data from the data warehouse.

An example of metadata would be a description of a data item called "rnt3263". Looking at that value alone does not make much sense, so the metadata for "rnt3263" might be "the user identification of a client".
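
A small sketch of that idea, assuming a hypothetical metadata dictionary that maps cryptic field names to human-readable descriptions which a query tool could look up.

    # Hypothetical metadata dictionary describing otherwise cryptic field names.
    metadata = {
        "rnt3263": {"description": "user identification of a client",
                    "data_type": "VARCHAR(10)", "source_system": "CRM"},
    }

    def describe(field_name):
        # Look up what a raw field actually means before it is used in a report.
        entry = metadata.get(field_name)
        return entry["description"] if entry else "no metadata recorded"

    print(describe("rnt3263"))   # user identification of a client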

Metadata has various advantages. Most widely, it is useful for speeding up searches: search queries over metadata expedite the process, especially for very complex filter operations. Many web applications cache metadata locally by downloading it automatically, which improves the speed of file access and searches. Locally, metadata can be associated with files, as in the case of scanned documents: when the files are stored digitally, the user may open a file with a viewer application that reads the document's key values and stores them in a metadata warehouse or similar repository.

Bridging a semantic gap is another notable use of metadata. For example, a search engine may understand that Edison was an American scientist, so when a user queries "American scientists" it can provide hyperlinks to pages on Edison even though the query never mentioned "Edison". This approach is associated with the semantic web and artificial intelligence.

Metadata is also very useful for multimedia files. For example, it can be used to optimize lossy compression algorithms, as when a video uses metadata to tell the computer the foreground from the background in order to achieve a better compression rate without losing much quality to the lossy compression.

Metadata can be stored internally, meaning within the file itself, or externally, in a separate file that points to the file it describes. Storing metadata externally is more efficient for searching, as in database queries.

There are two general types of metadata: structural (or control) metadata and guide metadata. Structural metadata is generally used in database systems and describes such things as columns, tables and indexes. Guide metadata, on the other hand, is used to help users look for specific things, as in natural language searches.

What is Enterprise Data Warehouse

An Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse is in essence a large repository of an organization's historical and current transaction data. An Enterprise Data Warehouse is a specialized data warehouse, and the term has several interpretations.

Many terms in information technology have been used by so many different vendors, IT workers and marketing campaigns that people are left confused about what the term Enterprise Data Warehouse really means and what makes it different from a general data warehouse.

The Enterprise Data Warehouse has emerged from the convergence of opportunity, capability, infrastructure and the need for data, which has increased exponentially during the last few years as technology has advanced and business enterprises have tried their best to keep up and stay on top of industry competition.

To give a clear picture of an Enterprise Data Warehouse and how it differs from an ordinary data warehouse, five attributes are usually considered. The list is not exclusive, but it brings people closer to a focused meaning of the term among its many interpretations. These attributes mainly pertain to the overall philosophy as well as the underlying infrastructure of an Enterprise Data Warehouse.

The first attribute of an Enterprise Data Warehouse is that it should present a single version of the truth: the entire goal of the warehouse's design is to come up with a definitive representation of the organization's business data and the corresponding rules. Given the number and variety of systems and silos of company data that exist within any business organization, many business warehouses may not qualify as an Enterprise Data Warehouse.

The second attribute is that an Enterprise Data Warehouse should have multiple subject areas. In order to provide a unified version of the truth for an organization, it should contain all subject areas related to the enterprise, such as marketing, sales, finance and human resources.

The third attribute is that an Enterprise Data Warehouse should have a normalized design. This attribute is arguable, as both normalized and denormalized databases have their own advantages for a data warehouse. In fact, many data warehouse designers have used denormalized models such as star or snowflake schemas for implementing data marts. But many also go for normalized databases for an Enterprise Data Warehouse, putting flexibility first and performance second.

The fourth attribute is that an Enterprise Data Warehouse should be implemented as a mission-critical environment. The entire underlying infrastructure should be able to handle any unforeseen critical conditions, because failure of the data warehouse means stoppage of business operations and loss of income and revenue. An Enterprise Data Warehouse should have high-availability features such as online parameter or database structural changes, business-continuance features such as failover and disaster recovery, and security features.

Finally, an Enterprise Data Warehouse should be scalable across several dimensions. It should assume that a company's main objective is to grow, and the warehouse should be able to handle the growth of data as well as the growing complexity of processes that comes with the evolution of the business enterprise.

Because of the fast evolution of information technology, many business rules have been changed or broken to make way for rules which are data driven. Processes may fluctuate from simple to complex, and data may shrink or grow in the constantly changing enterprise environment. Hence, a real Enterprise Data Warehouse should scale with these changes.

What is Functional Data Warehouse

Today's business environment is very data driven, and more companies hope to create a competitive advantage over their competitors by building a system with which they can assess the current status of their operations at any given moment and, at the same time, analyze trends and patterns within the company's operations and their relation to the trends and patterns of the industry, in a truly up-to-date fashion.

The establishment of one large data warehouse addresses this demand for up-to-date information reflecting the trends and patterns of the business operations and their relation to the larger industry in which the company does business. A data warehouse is not just a repository of the historical and current transactional data of a business enterprise; it also serves as an analytical tool (in conjunction with a business intelligence system) to give a fairly accurate picture of the company.
Business companies vary in structure. Some are composed of only a few departments focused on the core business functions such as finance, administration and human resources. Others are big, with operations of very wide scope that may include manufacturing, raw materials purchasing and many other activities.

As the company grows, so will its need for data. While a data warehouse is itself an expensive investment, it is not uncommon to see one big organization implementing several different functional data warehouses working together in a large information system to function as an Enterprise Data Warehouse.

In large business organizations with several departmental divisions, the Enterprise Data Warehouse is broken down into Functional Data Warehouses. Depending on the size of the company and its financial capability, a Functional Data Warehouse may serve one department or several. There are also companies with branches in many different geographic locations around the globe, whose Enterprise Data Warehouse may be set up differently, with different clusterings of Functional Data Warehouses.

Despite the breaking down of the Enterprise Data Warehouse into several Functional Data Warehouses, each of these warehouses is basically the same. Each is still defined as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process", or "a collection of decision support technologies aimed at enabling the knowledge worker to make better and faster decisions".

Breaking down the Enterprise Data Warehouse into several Functional Data Warehouses can have big benefits. Since a data-driven organization deals with very high volumes of data, having separate Functional Data Warehouses distributes the load and compartmentalizes the processes. With this setup the whole information system will not break down: if there is a glitch in one of the functional data warehouses, only that part has to be temporarily halted while it is being fixed. In a monolithic data warehouse setup, by contrast, if the central database breaks down, the whole system suffers.

Having Functional Data Warehouses also helps ensure that data integrity and security are maintained, because each department or group of departments represented by a Functional Data Warehouse has a sense of ownership and responsibility. This also means that if there is a problem with a Functional Data Warehouse, it is easy to pinpoint the responsible department or the individual maintaining it.

Operational Metadata
Operational metadata are metadata about operational data. Metadata is basically a kind of data that describes other data, whether a single content item or a collection of multiple content items. Its main purpose is to facilitate better understanding, use and management of data.

The use and requirements of metadata vary depending on the context in which it is used. For example, when metadata is employed in a library information system, the metadata used would describe book contents, title, date of publication, the location of the book on the shelf and other related information. If metadata is employed in a photography system, it would involve information about cameras, camera brands, camera models and so on.

When used with an information system, the metadata to be used would involve data
files,
name of the field, length, date of creation, owner of the file and other related
information
about the data.
Operational metadata describe operational data, which are a subject-oriented, integrated, time-current, volatile collection of data that support an organization's daily business activities and the outputs of its operational data stores.

They are just as important as the operational data itself, because an enterprise information and data management system can be greatly enhanced in efficiency when operational metadata are employed.

Let us take an example with an enterprise resource planning (ERP). Metadata greatly
helps
in building a data warehouse in an ERP environment. An enterprise data management
system involves Decision Support Systems (DSS) metadata, operational metadata and
data
warehouse metadata. The DSS metadata is primarily used by data end users. The data
warehouse metadata is primarily used for archiving data in the data warehouse. The
operational metadata is primarily for use by developers and programmers.

Since operational metadata describe operational data, they are also very dynamic in nature. Operational data are data currently in use by the business, so they are constantly changing as long as transactions are happening, and even beyond that, such as during inventories. New transactional data are added and removed all the time, and the operational metadata need to keep up with these changes.

For example, in a banking environment, large banks handle thousands of individual accounts, and at any given moment some of these accounts may change to some degree. In order to manage these changes, a complex array of data needs to be handled and processed in the operational data store, and this management is made simpler and more efficient with the help of operational metadata.

The dynamic nature of operational data requires special mechanisms to handle data quickly, such as finding certain objects, entities and resources while ignoring others, using metadata to optimize compression algorithms, or performing additional computations on the data. Operational metadata include the operational database's table names, columns, programs and other related items. Operational metadata describe all aspects of the current operation: data, activities, the people and organizations involved, locations of data and processes, access methods, limitations, timing and events, as well as motivation and rules.
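
As a minimal sketch of that kind of structural and operational metadata, assuming a SQLite database is at hand (the file name below is hypothetical), the snippet reads table and column names from the database's own catalog.

import sqlite3

# Hypothetical operational database file; any SQLite database would do.
conn = sqlite3.connect("operations.db")

# Table names are stored in SQLite's built-in catalog table.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

# Column-level metadata (name and declared type) for each table.
for table in tables:
    for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        print(table, name, col_type)

conn.close()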

Today, many vendors offer implementations of operational metadata in relation to a data warehouse as well as the general setup of an enterprise data management system. Many software implementations of operational metadata provide business as well as technical users with better control when accessing and exploring metadata in all aspects of the business operation and its IT implementation. Some software applications can even depict visually the interrelationships of data sources and users, and provide data consumers with data linkage back to the source system.

Centralized Data Warehouse

A Centralized Data Warehouse is a data warehousing implementation wherein a single data warehouse serves the needs of several separate business units simultaneously, using a single data model that spans the needs of multiple business divisions.

Today, having information means having power. In many aspects of daily living, having relevant information makes daily activities easier. This is made manifest by the use of the internet: because of the information that can be obtained every day, the number of internet users grows daily, and the time people spend online keeps getting longer as web services become more sophisticated, with applications that can gather and aggregate billions of disparate data points into useful information.
Businesses are the biggest users of data, whether on the internet or within the corporate IT infrastructure. Many companies implement a business data repository called a data warehouse.

A data warehouse is a database of information stored according to a defined corporate data model. Data architects and data modelers typically work together to come up with an efficient data warehouse.

In a typical business setting that wants a data warehouse, data architects try to express real-world business activities in information technology terms. Real business activities, persons, transactions and events are defined as entities represented in a database system. Once the entities are defined, IT professionals develop programmatic algorithms to represent business rules, policies, best practices and other undertakings within the company.

These data and algorithms are then synthesized into one system, the data warehouse, so that the whole system can simulate real-world activities with much greater speed, better efficiency and fewer errors.

It is not uncommon these days for a company to have a presence in different geographic locations. Such a setup takes advantage of advances in information technology, which have broken down boundaries and made communication very easy and fast.

Companies can choose to have several data warehouses in different locations. These data warehouses communicate with each other and extract, transform and load data for statistical analysis. Each warehouse typically has a database administrator to manage the data and overcome compatibility problems.

Data security is a critical issue when data warehouses in several locations send and receive data and constantly make contact with each other. Communication lines can be open to sniffers, and malicious hackers and crackers may be able to steal important information and breach privacy. Securing a network is an expensive activity, so companies have to spend more on appropriate technology measures.

Having a centralized data warehouse has its own advantages. The company has to invest in only one central IT team, which is responsible for defining and publishing corporate dimensions. This is especially true if the company has multiple lines of business to be combined in one robust framework. The team is also responsible for providing cross-divisional applications. Software and database tools need to be purchased only for the central data warehouse, and it can be fairly easy to implement cross-divisional applications.

The main disadvantage of a centralized data warehouse is that if the warehouse breaks, it may temporarily mean stoppage of operations. Of course, this can be overcome by investing in additional computers and other hardware for backup. These computers have to be very powerful because they will be dealing with billions of complicated processes in the central location of the organization.

Metadata Warehouse
Metadata Warehouse is a database that contains the common metadata and client-
friendly
search routines to help people fully understand and utilize the data resource. It
contains
common metadata about the data resource in a single organization or an integrated
data
resource that crosses multiple disciplines and multiple jurisdictions. It contains
a history of
the data resource, what the data initially represented, and what they represent
now.

A metadata warehouse is just like any data warehouse in that it stores all kinds of
metadata
to be used by the information system. Since today's data driven business
environments are
relying heavily on data, there needs to be separate storage for both data and
metadata in
order for the enterprise data management system to function efficiently.

In the not-so-distant past, metadata was treated as a "second-class citizen" in the database and data warehouse world, perhaps because the primary purpose of metadata is simply to describe the data. But with the evolution of information technology, the emphasis on metadata in the world of data warehouses and software repositories has risen to new heights of prominence. Most business organizations now need metadata tools for efficient integration and change management.

When implemented and used properly, a metadata warehouse can provide the business organization with tremendous value, so companies need to understand what a metadata warehouse can and cannot do.

A lot of large business organizations nowadays have had some experience with data warehousing implementations. Today, data warehouses often take the form of data mart-style implementations in many different departmental focus areas, such as financial analysis or customer-focused systems that assist business units.

Many business enterprises have various data warehousing initiatives underway simultaneously, and these systems are most likely based on products from various data warehouse vendors, in the typically decentralized approach of many companies. To date this approach has mostly worked, in that it has allowed reasonably rapid implementations and has shown companies that there are benefits to be derived, and that the potential of data warehousing as a business tool can be had at a fraction of the cost of the enterprise data warehouse model.

But this approach has led many companies to a legacy-data Tower of Babel, and some areas of the business have begun showing signs of stress in the implementation. Both data and metadata in this approach are spread across multiple data warehouse systems, and administrators are becoming stressed coordinating and managing the dispersed metadata.

There needs to be consistency in the business rules when they change as a result of corporate reorganizations, regulatory changes, or other changes in business practices. Likewise, there should be a way to handle the case when an application wants to change a technical definition.

One of the significant steps in addressing the needs stated above is coordinating metadata across multiple data warehouses, and the way to achieve this is to have a metadata warehouse.

In an ideal corporate setting, a company should adopt a repository as a metadata integration platform in order to make metadata available across the entire organization. Doing this serves to manage the key metadata across all of the data warehouse and data mart implementations within the business enterprise.

This also allows all data users to share common data structures, data definitions and business-rule definitions from one system to another across the business organization. The metadata warehouse can efficiently facilitate consistency and maintainability, as it provides a common understanding across warehouse efforts, promoting sharing and reuse. This can result in better exchange of key information between business decision makers and reduced effort in maintaining the information system as a whole.

Aggregate Data

Aggregate data is data that results from applying a process to combine data elements from different sources; it is usually taken collectively or in summary form. In relational database technology, aggregate data refers to the values returned when one issues an aggregate function.

An aggregate function examines the data in a table to reflect the properties of groups of rows rather than an individual row. For example, one might want to find the average amount of money that customers pay for something, or how many professors are employed in a certain university department.
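
As a minimal sketch of the idea, assuming a small in-memory SQLite table of hypothetical sales rows, the aggregate query below returns one summary value per group instead of the individual rows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Paris", "Rice", 20.0), ("Paris", "Tea", 5.5), ("Lyon", "Rice", 12.0)],
)

# Aggregate functions (SUM, AVG, COUNT) summarize groups of rows.
for branch, total, average in conn.execute(
    "SELECT branch, SUM(amount), AVG(amount) FROM sales GROUP BY branch"
):
    print(branch, total, round(average, 2))

conn.close()
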
On a larger scale, as in a data warehouse, aggregate data gathers information from a wide variety of sources, such as databases around the world, to generate useful reports that can spot patterns and trends. A company may have separate tables for customer information, products, prices, sales, employees and branches. Each of these tables may generate reports based on its own records; for example, the products table may generate a report of all products, or the sales table may generate a sales report.

But managers and decision makers need more than that. They may need sales reports from different branches, so data from two sources, sales and branches, may be queried to obtain aggregate data. In the same manner, a manager may want a report on which particular product is the top seller for a particular employee in a particular branch in, say, France. In this case, several tables may have to be queried.

Another example where aggregate data is used is an accounting data warehouse. In this data warehouse, the historical general ledger transactions are stored; to summarize the transactions by account, an aggregate data table would be used.

Aggregate data is used intensively not just in business but in all forms of statistics as well, whether in governance, biodiversity sampling, pharmaceutical laboratories or weather watching. Many governments rely on aggregate data taken from statistical surveys and empirical data to assess the economy and give assistance to less privileged areas. Weather stations share aggregate data to spot patterns in the constantly changing weather.

In global business, where the internet has become the main conduit, a data warehouse for a company is becoming ubiquitous. These data warehouses aggregate diverse data from different sources, and when used with an electronic tool for analysis, the results can give amazing insights into corporate operations and the behavior of the buying public.

Business intelligence, a form of artificial intelligence, relies heavily on aggregate data to be really intelligent. Complex relationships between tables within the database are programmatically queried using complex structured query language (SQL) statements. The report, which consists of aggregate data, can then be electronically passed to decision makers however wide the scope of the organization is. Business intelligence can sense trends and patterns often without human intervention. Because of the database's capability to sort, filter, group and rank data, whether aggregate or not, business intelligence alone can determine what is good for the company and give recommendations to decision makers.
In the field of robotics, aggregate data is also very useful. With a top-notch algorithm, a robot can take individual data and relate them with other data to make new meanings. As words mix and match, new "learnings" can be acquired by a robot from aggregate data.

Gathering and aggregating data is very labor intensive for the computer, especially if the data warehouse is updated very frequently. High-speed computers are employed as stand-alone servers just for the purpose of aggregating data with the use of a relational database management system.

What is Atomic Data

Atomic data are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items sold would be atomic data, while rollups such as invoice and summary totals from invoices are aggregate data.

The word atomic is based on the atom, which in chemistry and physics is the smallest particle that can characterize a chemical element, and which in natural philosophy is the indestructible building block of the universe. In the same light, atomic data is the smallest unit of data whose details form a complete meaning.

In computer science, specifically in computer programming, atomic data refers to a data type, whether an action or an object, that can no longer be broken down into smaller units. In other words, the data type is no longer divisible or changeable, and is always whole.

In general, a data type is a classification of a specific type of information that has properties. The details vary depending on the programming language being used, whether C, Java or assembly language. Generally, there are three basic types of data.

They are the integer data type, which is a whole number without a fractional component; the floating-point data type, which can contain a decimal point; and the character, which refers to any readable text. Another atomic data type is the Boolean, which holds only two values: on or off, yes or no, or true or false.

Atomic data types have a common set of properties, which include the class name; total data size; byte order, referring to how the bits are arranged in memory; precision, which refers to the significant part of the data; offset, the location of the significant data within the entire data item; and padding, which identifies the data that is not significant.

In different programming languages, atomic data types can have different manifestations. For instance, in structured query language (SQL), an atomic operation either completes or returns entirely to the original state if an interruption such as a power failure occurs. In some systems based on the Unix operating system, an atomic data type cannot be changed. In another language, Lisp, an atom refers to the basic unit of code that executes.
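
As a minimal sketch of that all-or-nothing behavior, assuming an in-memory SQLite database, the transaction below either applies both updates or, on error, rolls back so that neither is applied.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure both updates are rolled back together

print(conn.execute("SELECT name, balance FROM accounts").fetchall())
conn.close()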

Relational databases are a good example of how atomic data are stored and retrieved to form a larger set called aggregate data. In all computer systems there is a need to manage and access data that describe or represent properties or objects, whether real or imaginary.

A database, which is in essence a record-keeping system, is one example where objects are referred to in terms of item information. An object could be a client or a corporation having many characteristics. Data inside the database are structured into separate, unassociated atomic data items, each containing relevant information. The database has a structure of relationships, and a query is executed to combine atomic data into aggregate data; reports are generated for statistical analysis so that an organization can draw a profile of many different aspects.

Atomic data can come from several sources. It can come from the same table or from different tables within the same database. The internet is teeming with atomic data traversing the information superhighway every single second. Search engines use special programs called crawlers or spiders to index these atomic data, which are later used to rank pages when a user types keywords into a search engine.

What is Data Source

A data source, as the name implies, provides data via a data site. A data site in turn stores an organization's databases and data files, including non-automated data. Companies implement a data warehouse because they want a repository of all enterprise-related data as well as a main repository of the business organization's historical data.

But such data warehouses need to process high volumes of data with complex queries and analysis, so a mechanism has to be applied to the data warehouse system in order to prevent it from slowing down the operational system.

A data warehouse is designed to periodically get all sorts of data from various
data sources.
One of the main objectives of the data warehouse is to bring together data from
many
different data sources and databases in order to support reporting needs and
management
decisions.

Take, for example, the United States Environmental Protection Agency, which implements the Envirofacts Data Warehouse. Because the EPA is such a large agency dealing with large volumes of data, the Envirofacts database is designed as a system composed of many individual EPA databases, each administered by a program system office.

Sometimes industry is required to report information to the state where it operates, and sometimes the information is collected at the federal level.

The data sources of the Envirofacts Data Warehouse provide information that makes it easy to trace the origin of the information. Some of these data sources are:

Superfund Data Source - This data source comes from Superfund sites, the uncontrolled hazardous waste sites designated by the federal government to be cleaned up. Information about these sites is stored in the Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS), which has been integrated into Envirofacts.

Safe Drinking Water Information Data Source - This database stores information related to drinking water programs.

Master Chemical Integrator Data Source - This database integrates the various chemical identifications used in four program system components.

Other data sources of the Envirofacts Data Warehouse are Hazardous Waste Data, the Toxics Release Inventory, the Facility Registry System, Water Discharge Permits, Drinking Water Microbial and Disinfection Byproduct Information, and the National Drinking Water Contaminant Occurrence Database.

All these data sources contribute seemingly unrelated data which may come in disparate file formats, and which may also come from different geographical locations and different government agencies within the United States.

The data they share finally converge in a central data warehouse, which manages them so they become more meaningful and relevant and can be redistributed or shared with anybody who needs them.

In a similar manner, business organizations implementing a huge data warehouse may have data sources coming from different departments. There may be a data source from the human resources department, another from finance and accounting, and yet others from manufacturing, inventory, sales and many other departments.

For really big companies operating in various geographical locations around the country or around the world, there may be many more data sources, and they may be divided in a hierarchical fashion.

For instance, the data source for one geographical branch may be broken down into various data sources coming from the different departments within the branch. In the overall global data warehouse system, the data sources from these atomic departments become like twigs in the global data warehouse tree structure.

A data warehouse system with different data sources makes the whole system easier to manage, because when one of the data sources breaks down, the whole system does not halt its operations.

Data Type

Data type describes how data should be represented and interpreted, how values should be structured, and how objects are stored in the memory of the computer. It refers to the form of a data value and the constraints placed on its interpretation. The form of a data value varies and can take different shapes such as date, number, string, float, packed and double precision.

The type system uses data type information so that it can check the correctness of computer programs that try to access or manipulate the data.

If one thinks of a computer from a physical perspective, the data storage system is comprised mainly of the hard disk and the random access memory (RAM), and all data are actually stored as individual bits on the surface of these storage areas. In digital electronics, data are represented at the lowest level as bits (the alternatives 0 and 1). This system is called the binary system. On the surface, it is just a simple indicator of on or off.

A group of eight of these "on or off" (0 or 1) values is called a byte and is the smallest addressable unit on the storage device. A "word" is the unit processed by machine code and is typically composed of 32 or 64 bits. The binary system can represent both signed and unsigned integer values (that is, values that may or may not carry a negative sign).

For instance, a 32-bit word can be used to represent unsigned integer values from 0 to 2^32 - 1, or signed integer values from -2^31 to 2^31 - 1. A specific set of arithmetic instructions is used for interpreting a different kind of data type called a floating-point number.
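
As a quick check of those ranges, the small sketch below (plain Python, no external libraries) computes the bounds for any word size.

def integer_range(bits, signed):
    """Return the (minimum, maximum) value representable with the given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(integer_range(32, signed=False))  # (0, 4294967295)           i.e. 0 .. 2^32 - 1
print(integer_range(32, signed=True))   # (-2147483648, 2147483647) i.e. -2^31 .. 2^31 - 1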

Different languages may give different representations of the same data type. For example, an integer may have a slightly different range in the C language compared to Visual Basic. In another instance, a string data type may have a different maximum length in Access than in Oracle relational databases.

This is just for the sake of example, and the products mentioned may not be the exact applications with differing data type interpretations, but this is a common occurrence in information systems. It does not erase the essence of the data type. For this purpose, the primitive data types are described below.

The primitive data types are as follows:

Integer data types - An integer can hold a whole number but not fractions. As a signed integer it can also hold negative values, while an unsigned integer holds only non-negative values. The typical sizes of integers are:

. Byte (composed of 8 bits, with a range of -128 to +127 for signed and 0 to 255 for unsigned)

. Word or short int (composed of 16 bits, with a range of -32,768 to +32,767 for signed and 0 to 65,535 for unsigned)

. Double word or long int (composed of 32 bits, with a range of -2,147,483,648 to +2,147,483,647 for signed and 0 to 4,294,967,295 for unsigned)

. Long (composed of 64 bits, with a range of -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 for signed and 0 to 18,446,744,073,709,551,615 for unsigned).

Booleans - Data types that conceptually hold one bit only, to signify true (1) or false (0).

Floating-point - This data type represents a real number which may contain a fractional part. Floating-point values are internally stored in a form of scientific notation.

Characters and strings - A character is typically denoted as "char" and can contain a single letter, digit, punctuation mark, or control character. A group of characters is called a string.

There are many other data types, such as composite, abstract and pointer data types, but they are very specific to the implementing software.

Demographic Data

Demographic data are the data output of demography, the study of human populations. Like geographic data, demographic data can be related to the Earth: they usually represent geographical location, identification, or describe populations.

This field of science and research can be applied to anything about the dynamic nature of the human population, including how it changes over time and what factors are driving the changes. It also covers aspects of human population such as size, structure, distribution, and spatial and temporal changes in response to birth, death, aging and migration.

Demographic data which are most commonly used include crude birth rate, general
fertility
rate, age-specific fertility rates, crude death rate, infant mortality rate, life
expectancy, total
fertility rate, gross reproduction rate and net reproduction ratio.
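
As a minimal worked example of one of these measures, the crude birth rate is conventionally expressed per 1,000 people per year; the population figures below are made up purely for illustration.

def crude_birth_rate(live_births, mid_year_population):
    """Crude birth rate: live births per 1,000 people in a year."""
    return live_births / mid_year_population * 1000

# Hypothetical figures: 14,000 births in a population of 1,000,000.
print(crude_birth_rate(14_000, 1_000_000))  # -> 14.0 births per 1,000 people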

Demographic data can be used in analyzing certain patterns and trends related to
human
religion, nationality, education and ethnicity. These data are also the basis for
certain
branches of studies like sociology and economics.

Collection of demographic data can be broadly categorized into two methods: direct
and
indirect. Direct demographic data collection is the process of collecting data
straight from
statistics registries which are responsible for tracking all birth and death
records and also
records pertaining to marital status and migration.

Perhaps the most common and popular methods of direct collection of demographic
data is
the census. The census is commonly performed by a government agency and the
methodology used is the individual or household enumeration.

The interval between two census surveys varies depending on the government conducting them. In some countries a census is conducted once a year or once every two years, while others conduct a census once every 10 years. Once all the collected data are in place, information can be derived about individuals and households.
The indirect method of demographic data collection may involve only certain people or informants in trying to estimate figures for the entire population. For instance, one of the indirect methods is the sisterhood method, in which researchers ask women how many of their sisters have died, whether those sisters had children, and at what age they died.

From the collected data, the researchers draw their analysis and conclusions based on indirect estimates of birth and death rates, and then apply mathematical formulas to estimate trends representing the whole population. Other indirect methods of demographic data collection may collect existing data from various organizations that have done research surveys and collate these data sources in order to determine trends and patterns.

There are many demographic methods for modeling population processes. Some of these models are population projections (Lee-Carter, the Leslie matrix), population momentum (Keyfitz), fertility (the Hernes model, Coale-Trussell models, parity progression ratios), marriage (the singulate mean age at marriage, the Page model) and disability (Sullivan's method, multistate life tables).

With today's advances in information technology and the mass production of computers, which has caused a dramatic decrease in prices, many agencies now employ computer information systems to process demographic data into meaningful, useful and relevant information that can serve as the basis for wise decisions.

In fact, it is now a lot easier to get demographic data that can cover the whole
planet while
data users can drill down deep into the database to get more demographic data
pertaining
to very specific geographical area. With the popularity of the internet, looking
for
demographic data with corresponding analyses has become a lot easier and faster.

Legacy Data

Legacy data comes from virtually everywhere within the information system and from supporting legacy systems. The many sources of legacy data include databases, often relational, but also hierarchical, network, object, XML and object/relational databases. Legacy data is another term used for disparate data.

Some files such as XML documents or "flat files" such as configuration files and comma-delimited text files may also be sources of legacy data. But the biggest sources of legacy data are old, outdated and antiquated legacy systems.

A legacy system refers to an existing group of computers or application programs which have grown old and outdated but which companies refuse to give up because they still serve their purpose well.

These systems are usually large, and companies invested so much money in implementing them in the past that, despite potential problems identified by IT professionals, many still want to keep them for several reasons.

One of the main problems with legacy systems is that they often run on very slow and obsolete hardware whose parts, when broken, are very difficult to replace. Because of the general lack of understanding of these old technologies, they are often very hard to maintain, improve and expand. And because they are old and obsolete, chances are the operations manuals and other documentation have been lost over the years.


Despite the emergence of newer technologies with relatively cheaper individual parts, many companies still have compelling reasons for keeping such old and antiquated systems, whose data adds to the disparity in data warehouse systems.

One of the biggest reasons is that legacy systems were implemented to be large and monolithic in nature, and a one-time redesign and reimplementation would be very costly and complicated. If legacy systems were taken out at a single moment, the whole business process would be halted for some time because of the monolithic and centralized nature of these systems.

Most companies cannot afford any business stoppage, especially in today's fast-paced, data-driven business environment. What worsens the situation even more is that legacy systems are not well understood by younger IT professionals, so redesigning them to adopt newer technologies would take very long and require intensive planning.

That is why it is very common to see data warehouses nowadays which are a combination of new and legacy systems. The effect is legacy data which is very incompatible with the data coming from data sources using newer technologies.

In fact, different new technology vendors encounter different data disparity problems when dealing with legacy systems. IBM alone has enumerated some typical legacy data problems, which include among others:

. Incorrect data values

. Inconsistent/incorrect data formatting

. Missing data
. Missing columns

. Additional columns

. Multiple sources for the same data

. A single column being used for several purposes

. The purpose of a column is determined by the value of one or more other columns

. Important entities, attributes, and relationships hidden and floating in text fields

. Data values that stray from their field descriptions and business rules

. Various key strategies for the same type of entity


. Unrealized relationships between data records

. One attribute is stored in several fields

. Inconsistent use of special characters

. Different data types for similar columns

. Different levels of detail

. Different modes of operation

. Varying timeliness of data

. Varying default values and other various representations.

Legacy data, and the data disparity problems they bring to a data warehouse, can be addressed by the process of ETL (extract, transform, load). ETL is a mechanism for converting disparate data, not just from legacy systems but from all other disparate data sources as well, before they are loaded into the data warehouse.
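
As a minimal sketch of the idea, assuming a hypothetical comma-delimited legacy export called legacy_customers.csv (with assumed columns cust_id, name and country), the steps below extract the rows, transform them into a consistent format, and load them into a SQLite staging table.

import csv
import sqlite3

# Extract: read rows from a hypothetical legacy flat-file export.
with open("legacy_customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize inconsistent legacy formatting (whitespace, case, missing values).
cleaned = [
    (row["cust_id"].strip(), row["name"].strip().title(), row.get("country") or "UNKNOWN")
    for row in rows
]

# Load: write the cleaned rows into a warehouse staging table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS stg_customer (cust_id TEXT, name TEXT, country TEXT)")
conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", cleaned)
conn.commit()
conn.close()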

Foredata

Foredata is a very new term developed from "fore", meaning beforehand, up front, at or near the front. In fact, not many are aware of the existence of such a word, but the underlying function of foredata has always been there and has existed for as long as database systems have. Foredata are all data about objects and events, including both praedata and paradata.

In a data warehouse implementation, every piece of data that a data consumer interacts with, regardless of whether that person is a high-ranking official or a rank-and-file employee, is foredata.

Foredata are the up-front data used for describing a data architecture's objects and events. They are also used for tracking or managing those objects and events in the real world, since they really represent those objects and events.

But foredata are no different from the data inside the data warehouse or from its various data sources. Foredata is only a term for the way data are being used; structurally they are the very same data circulating from one data source to another, or the same data stored in the data warehouse until someone queries them for specific information.

To some degree, foredata could be considered a kind of replica of the data being used in the backend processing of the system. For instance, following the definition that "foredata are the upfront data that an organization sees about objects and events", any report generated by a company's data consumer is foredata, in that its momentary purpose is to present the data and not to serve as input for backend processing.

Since the term foredata is really very new, some in the IT profession differentiate foredata from other kinds of data in that foredata refers to data which is current, revolving and active, in contrast to data warehouse data which is dormant or in some sort of archived state.

Foredata in general refer to data after all the disparate data coming from various data sources have already gone through the process of extract, transform and load (ETL). The raw data before ETL may have come from other sources and have not yet been stripped of their attached formatting and other information.

Once the data enter the first ETL stage, the extract, they are stripped to the core. After that they are transformed, for example by adding XML tags and other attributes which make them fit the business rules and data architecture in which they are intended to be used.

If we go back to the definition above, "foredata are all data about the objects and events, including both praedata and paradata, and they are the upfront data that an organization sees about objects and events", these extracted and transformed data fit well. This also distinguishes foredata from all other data within the data warehouse and the enterprise information system, such as flat files, data associated with networking protocols, and multimedia data.

Foredata constitute the elements for reporting, which is the most essential purpose of a data warehouse. Companies need to make sure that they see the latest trends and patterns so that they can evaluate the efficiency of their business operation strategy, and reformulate policies or revise product management in order to gain a competitive advantage.

Integrated Historical Data


An integrated data resource is composed of many different data sources to which several tools have been applied to overcome disparities. For instance, without the aid of integration tools, a business enterprise may have several database systems in each of the departments within the organization.

These database systems may be relational in one department and non-relational in another. The relational database management systems may come from one vendor in some departments, while other departments use RDBMSs from other vendors. In some cases, departments may be using flat files or legacy data. With the help of an integration software tool, the problems arising from data disparity will at least be minimized and at best eliminated.

An operational data store is a place or logical area, basically a database, that handles the integration of disparate data from multiple sources so that business operations, analysis and reporting can be facilitated while the business operation is in progress.

Since data come from various disparate sources, data arriving at this store are cleaned, resolved for redundancy and checked against the corresponding business rules. This is where most of the data used in current operations are located before they are transferred to the data warehouse for temporary or long-term storage and archiving.

The operational data store is a very busy area. Every single minute, data comes in from various sources which are concurrently handling other transactions, and goes out to other databases and information systems that need the transformed data. Since this place hosts labor-intensive applications, a mechanism is needed so the operational data store is not overloaded and does not break down from handling and concurrently processing large quantities of data.

Every once in a while, at a regular interval, data that are not currently used for the operation should be moved elsewhere. For the sake of clarity, the area is called an operational data store because it is the repository of the currently operational data used for current operations.

So, when the operational data store sees that data is not needed at the moment, and it has already transformed the data by adding the corresponding format in line with the data architecture, the data will then be moved to the data warehouse, or more specifically to the integrated historical data portion of the data warehouse.

Integrated historical data are just like any other enterprise data which have generally passed through the process of extract, transform, load (ETL). Since they are basically the same data, they are also contained in a database inside the data warehouse. The only distinct thing that differentiates them from operational data is that they are more at rest. The term historical data is very apt because these data really belong to the relative past, in that they have already served their intended purpose. They are placed in one area, or maybe in different areas connected by application tools, so that they can be easily retrieved when a need arises.

Historical data are very important, especially in the area of statistical analysis. For instance, if the company wants to know the sales trend over the last few months, operational data alone cannot address this need, and the business intelligence system will have to get data from the integrated historical database.

Redundant Data

Redundant data, as the name suggests, is data duplication. It means the same data of a single organization is stored at multiple data sites.

Dealing with redundant data means that a company has to spend a lot of time, money and energy. When redundant data are unknown to the organization, they can crawl into the system and give unwanted and unexpected results, such as slowing down the entire system, producing inaccurate output and negatively affecting data integrity. Redundant data can also create a risk to information quality if the different databases are not updated concurrently.

Data redundancy is costly to address, as it requires additional storage, synchronization between databases, and design work to align the information represented by different presentations of the same data.

The problems associated with redundant data can be addressed by data normalization. Normalized tables generally contain no redundant data because each attribute appears in only one table. Also, normalized tables do not contain derived data; instead, derived values can be computed from existing attributes as expressions based on those attributes.

Having normalized tables can also greatly minimize the amount of disk space used in the implementation while making updating very easy to do. But with normalized tables one can be forced to use joins and aggregate functions, which can sometimes be time consuming to process. An alternative to database table normalization is to add new columns containing redundant data, as long as the trade-offs involved are fully understood.

A correctly designed data model can avoid data redundancy by keeping any attribute only in the table for the entity which it describes. In case the attribute data is needed from a different perspective, a join can be used, although a join may take time. If the join really affects performance in a greatly negative way, it can be eliminated by duplicating the joined data in another table.
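
As a minimal sketch of that design choice, the hypothetical schema below keeps the customer name in one place only and uses a join to reach it from the orders table; the table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized design: customer_name is stored once, in the customer table only.
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ramesh');
    INSERT INTO orders   VALUES (100, 1, 20.0), (101, 1, 5.5);
""")

# A join recovers the redundant-looking view without actually duplicating data.
for row in conn.execute("""
    SELECT o.order_id, c.customer_name, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
"""):
    print(row)

conn.close()
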
But despite all the negative effects and impressions associated with redundant data, there is also some positive impact that redundant data may bring. Redundant data can be useful and may even be required in order to satisfy service-level goals for performance, availability and accessibility.

The different representations of the same data in data warehouses, operational data stores and business intelligence systems show that redundant data can be essential in providing new information. The important thing to know is that in some cases, when redundant data is managed well, it can give some benefits to the entire information system.

In fact, data redundancy is also a standard computer term referring to the storage property wherein disk arrays, typically RAID arrays, provide fault tolerance such that when some of the disks in the system fail, all or part of the data stored on the array can be recovered. The cost often associated with this kind of redundancy is a reduction of disk capacity; implementations require either a duplication of the entire data set or an error-correcting code to be stored on the array.

There is much specialized hardware available in the market today designed for handling redundant data. Redundant data storage hardware can help decrease system downtime by removing some points of failure. Some storage arrays can provide a system with redundant power supplies, cooling units, drives, data paths and controllers, while servers attached to the redundant data storage can include multiple connections, providing path failover capabilities.

Integrated Operational Data

Integrated Operational Data, also known as operational data stores, is a volatile collection of data that supports an organization's daily business activities.

Integrated operational data are the output of the operational data store. An operational data store is an architectural construct which is part of the larger enterprise data management system. It is subject-oriented, integrated and time-current.

In an integrated enterprise data management system, there are several computer servers hosting database systems, and these computers are connected by a network so they can work together to achieve efficient operation. Although a network of computer servers has more benefits than implementing an enterprise data management system as one monolithic structure, there are several great challenges that need to be tackled.

One of the biggest problems with a networked enterprise data management system is in the area of integration. First and foremost, the environment is networked, so the system has to deal with different kinds of network protocols. But the problems associated with networking pale in comparison to problems arising from system and data disparity. Different computer servers on the network may be running on different platforms. Within each computer there could be different database systems. And some departments may still be using flat files and legacy data, which are equally important and need to be processed by the system.

An integrated operational data store answers these problems related to system disparity. For the sake of clarity and simplicity, the term operational data store literally means a store for all data currently used in operations. Perhaps the easiest term we can find as the opposite of "operational" is "archived".

In technical terms, an integrated operational data store is a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information.

An example of how an integrated operational data store operates can be illustrated by a banking environment wherein a large bank may be managing thousands of individual accounts. At any given moment hundreds of these accounts may be changing status, and there are also large customers of the bank which have many accounts of their own. Managing these status changes involves a complex array of customer data, and the operational data store handles the processing.

An integrated operational data store works closely with the data warehouse; in fact, the data warehouse itself is a data store. But while the operational data store deals with current data, the data warehouse usually holds data for long-term storage and archiving. They are both database systems, with the operational data store acting as an interim area for the data warehouse. The operational data store contains dynamic data constantly updated through the course of business operations, while the data warehouse generally contains relatively static data.

In a large business enterprise, data demand is very high, so it is not uncommon to find an information system with several operational data stores, each designed for very quick performance of relatively simple queries on small amounts of data, such as finding the tracking status of shipments. Several operational data stores share the load of enterprise data processing, and an integrated operational data store connects all these scattered operational data stores with a software tool so that they work as one large efficient system.

Operational data stores need to be implemented with top-of-the-line, robust computer hardware and sophisticated software tools, because their processes involve complex computation over very high quantities of data coming from various sources. They need to work very fast because they exist to give very up-to-date information to data consumers.

Spatial Data

Spatial data is information that has several dimensions; it is sometimes contrasted with
aspatial (non-spatial) data. It includes both geospatial data and structo-spatial data. It
concerns information in which location is of some benefit or importance, although the
location is not always on the planet's surface; it can be within other entities as well,
such as the body.

In short, spatial data can relate to any multidimensional frame of reference. This frame
may include engineering drawings referenced to a mechanical object, medical images
referenced to the human body, or architectural drawings referenced to a building.

But spatial data is most widely used in geographical databases. In fact, geoinformation,
short for geographical information, is specialized data with its own field of study.
Geographic information is created by manipulating spatial data in a computerized system,
which may include computers and networks as well as standards and protocols for data use
and exchange between users across a range of different applications.

Many fields of science use spatial data, including land registration, hydrology, cadastral
surveying, land evaluation, planning and environmental observation. Spatial data, also
called geodata in the geographic information field, comes in many different formats such
as coordinates in text form, maps, or images taken from the air or from space such as
remote sensing data.

Spatial data are stored in a spatial database, a special kind of database extended so that
it can store, handle and manipulate spatial data. The geoinformation output is processed
with a special kind of computer program called a geographic information system (GIS),
which has become very popular with the rising ubiquity of services such as Google Maps. A
spatial information system is the environment in which a GIS is operated, along with
machines, computers, network peripherals and people.

A spatial database describes the location and shape of geographic features and their
spatial relationships to other features. The information in the spatial database consists
of data in the form of digital coordinates which describe the spatial features. The
information can pertain to points (for instance, locations of museums), lines
(representing roads) or polygons (which may represent district boundaries). The
information is typically organized into separate layers of data sets which can later be
combined in various ways for analysis or for the production of maps.
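
As a minimal sketch of how point and polygon layers might be represented and combined
outside a full GIS, the following Python fragment stores a hypothetical "museums" point
layer and a "districts" polygon layer as plain coordinate tuples and answers the question
"which district is this museum in?" with a simple ray-casting test. The layer names and
coordinates are invented for illustration, not taken from any real GIS.

# Hypothetical point and polygon layers stored as plain coordinate tuples.
museum_layer = {
    "City Museum": (2.0, 1.5),
    "Rail Museum": (6.5, 3.0),
}
district_layer = {
    "North District": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "South District": [(4.0, 0.0), (9.0, 0.0), (9.0, 4.0), (4.0, 4.0)],
}

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a ray from the point crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):
            x_cross = (xj - xi) * (y - yi) / (yj - yi) + xi
            if x < x_cross:
                inside = not inside
        j = i
    return inside

# Combine the two layers: which district contains each museum?
for name, (x, y) in museum_layer.items():
    for district, boundary in district_layer.items():
        if point_in_polygon(x, y, boundary):
            print(name, "lies in", district)
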
Spatial analysis is not a new technique; it may have arisen during early attempts at
surveying and cartography. Many other scientific fields have also contributed to the
development of spatial analysis. Biology has contributed through botanical studies of
global plant distributions and local plant locations, studies of the movement of animals,
studies of vegetation blocks, studies of spatial population dynamics, and the study of
bio-geography.

Epidemiology also contributed to the development of spatial data, most famously through
the mapping of a past cholera outbreak. Statistics has contributed as well, especially in
the field of spatial statistics, and the same can be said of economics through spatial
econometrics.

Today, computer science and mathematics are among the biggest users and developers of
spatial data. Many of today's business organizations use spatial data to map out the
progress of their operations. Maps and GIS images are becoming ubiquitous on the internet,
and mashup applications are so readily available that even personal websites can
incorporate their useful and fancy functionality.

Data Structure

What is Data Cluster

Clustering, in the computer science world, is the classification of data or objects into
different groups. It can also be described as the partitioning of a data set into
different subsets, where the data in each subset ideally share some common traits.
Data clusters are created to meet specific requirements that cannot be met using any of
the existing categorical levels. One can combine data subjects as a temporary group to get
a data cluster.

Data clusters are the products of unsupervised classification of patterns, whether data
items, observations or feature vectors. Clustering has broad appeal and usefulness as one
of the first steps in exploratory data analysis, and it is applied in many contexts by
researchers from various disciplines. Clustering is widely used even though the underlying
problem is combinatorial in nature and therefore difficult.

Data clustering is used in many exploratory processes, including pattern analysis,
decision making and grouping. It is also heavily used in machine learning tasks such as
data mining, image segmentation, document retrieval and the classification of data
patterns.

There are two general categories of data clustering algorithms: hierarchical and
partitional. Hierarchical algorithms work by finding successive clusters using previously
established clusters, and can be further subcategorized as agglomerative ("bottom-up") or
divisive ("top-down"). Partitional algorithms, on the other hand, determine all clusters
at once by partitioning the data set directly.
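
As a minimal sketch of the two families, and assuming nothing beyond plain Python, the
following fragment clusters a small list of one-dimensional values: first with a bottom-up
(agglomerative, single-linkage) pass that repeatedly merges the two closest clusters, and
then with a simple partitional k-means-style pass. The sample values and cluster counts
are invented for illustration.

# Minimal illustration of hierarchical (agglomerative) vs. partitional clustering
# on one-dimensional data. Values and parameters are made up for the example.
values = [1.0, 1.2, 5.0, 5.3, 9.8, 10.1]

def agglomerative(points, k):
    """Bottom-up: start with singletons, merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members of the two clusters.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def kmeans(points, k, iterations=10):
    """Partitional: pick k centers, then alternately assign points and recompute centers."""
    centers = points[:k]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
    return groups

print(agglomerative(values, 3))   # [[1.0, 1.2], [5.0, 5.3], [9.8, 10.1]]
print(kmeans(values, 3))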

Within the data clustering taxonomy, the following issues exist:

Agglomerative vs. Divisive: This issue refers to the algorithmic structure and operation
of data clustering. The agglomerative approach starts with each pattern in a distinct
cluster (a singleton) and then successively merges clusters until a certain condition is
met. The divisive approach starts with all patterns in a single cluster and then splits it
until a condition is satisfied.

Monothetic vs. Polythetic: This issue refers to the sequential or simultaneous use of
features in the clustering process. Most data clustering algorithms are polythetic,
meaning that all features are used in the computation of distances between patterns. The
monothetic approach is simpler: it considers the features sequentially and divides the
given set of patterns at each step.
Hard vs. Fuzzy: Hard clustering allocates each pattern to exactly one cluster during the
clustering operation and in the final output. Fuzzy clustering, on the other hand, assigns
each input pattern degrees of membership in several clusters. A fuzzy clustering can be
converted to a hard clustering by assigning each pattern to the cluster in which it has
the largest measure of membership.

Deterministic vs. Stochastic: This issue is most relevant to partitional approaches
designed to optimize a squared-error function. The optimization can be carried out either
with traditional deterministic techniques or with a random (stochastic) search of the
space containing all possible labelings.
Incremental vs. Non-incremental: This issue arises when the pattern set to be clustered is
very large and constraints on memory space or execution time affect the algorithm's
architecture.

The advent of data mining, where relevant data need to be extracted from billions of
disparate records within one or more repositories, has furthered the development of
clustering algorithms designed to minimize the number of scans and therefore reduce the
load on servers. Incremental clustering is based on the assumption that patterns can be
considered one at a time and assigned to existing clusters as they arrive.

The process of data clustering is sometimes closely associated with such terms as
cluster
analysis, automatic classification and numerical taxonomy.

What is Data Attribute

In the realm of computer science, a logical data model is an accurate representation of a
company's data. The data need to be represented logically because they will later be the
basis for data modeling. Data modeling, in turn, is the basis for database implementation,
since the computer needs to understand business entities and activities from a digital
perspective.

In the logical data model, a data modeler needs to describe all data in the most detailed
way possible, regardless of how the physical database will be implemented. The logical
data model includes the identification of all entities and the relationships among them.
It also lists all the attributes for each entity being specified.

In fact, the steps for designing a logical data model are as follows (a small illustrative
sketch follows the list):

1. Identifying all the entities based on business activities
2. Specifying the primary keys for all identified entities
3. Finding and defining all relationships between different entities
4. Finding attributes for each entity
5. Resolving many-to-many relationships
6. Normalization
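
As a minimal sketch of steps 1 to 5, assuming a hypothetical Customer/Order subject area
that is not taken from the text, the following Python fragment declares two entities with
primary keys, attributes, and a one-to-many relationship expressed through a foreign key.

# Hypothetical sketch of a tiny logical model: entities, primary keys,
# attributes, and a one-to-many Customer-to-Order relationship.
from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:
    customer_id: int          # primary key (step 2)
    first_name: str           # attributes (step 4)
    last_name: str
    email: str

@dataclass
class Order:
    order_id: int             # primary key
    customer_id: int          # foreign key resolving the relationship (steps 3 and 5)
    order_date: date
    total_amount: float

# One customer may place many orders; each order refers back to exactly one customer.
alice = Customer(1, "Alice", "Smith", "alice@example.com")
first_order = Order(100, alice.customer_id, date(2024, 1, 15), 59.90)
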
A data attribute is an instance or occurrence of an attribute type. A data attribute value
is a characteristic of, or any fact describing, an occurrence of an entity. For instance,
an entity's color may be "red", "blue" or any other color that correctly describes the
entity.

Each entity type will have one or more data attributes. In logical data modeling, data
attributes should always be cohesive from the perspective of the domain; this is often a
judgment call for the data modeler or data architect. Getting the level of detail right
can have a significant impact on development and maintenance effort over the life of the
implementation.

Data attributes will always exist for an entity, regardless of what the entity represents
in the real business situation. For instance, a logical data model may have a Customer
entity. The data attributes of the Customer entity may include, but are not limited to,
first name, middle name, last name, address, age, gender and profession.

Data processing is about data attribute values. These values represent the most tangible,
least abstract level of data processing, and they are the core of any information
management system.

In relational database management systems, data attributes should be managed well so that
redundancy does not affect the whole database system in a negative way. Determining the
proper use of data attributes is very important in database normalization. Normalization
is the process in which the data attributes within the data model are organized cohesively
with the entity types so as to increase the performance of the database, reducing
processing time by eliminating redundant data.

As a general rule, there are three commonly applied normal forms.

The first normal form (1NF) states that an entity type is in the first normal form when it
contains no repeating or redundant groups of data.

The second normal form (2NF) states that an entity type is in the second normal form when
it is in 1NF and all of its non-key attributes are fully dependent on the whole primary
key.

The third normal form (3NF) states that an entity type is in the third normal form when it
is in 2NF and all of its non-key attributes are directly (non-transitively) dependent on
the primary key.
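
A minimal sketch of what these forms mean in practice, using an invented order-line
example rather than anything from the text: the flat record below repeats product details
for every order line, and splitting it into two structures keyed on product_id removes the
redundancy.

# Hypothetical denormalized order lines: product name and category are repeated
# for every line that mentions the same product (redundant and update-prone).
order_lines = [
    {"order_id": 1, "product_id": 10, "product_name": "Rice 20kg", "category": "Grocery", "qty": 2},
    {"order_id": 2, "product_id": 10, "product_name": "Rice 20kg", "category": "Grocery", "qty": 1},
]

# After normalization: non-key facts about the product live once, keyed by product_id,
# and the order line keeps only the foreign key plus its own attributes.
products = {
    10: {"product_name": "Rice 20kg", "category": "Grocery"},
}
normalized_order_lines = [
    {"order_id": 1, "product_id": 10, "qty": 2},
    {"order_id": 2, "product_id": 10, "qty": 1},
]

# Reconstructing the original view is a simple join on the key.
joined = [{**line, **products[line["product_id"]]} for line in normalized_order_lines]
print(joined)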

As can be seen here, knowing the correct data attributes of an entity, arranging them
properly in tables and defining the correct relationships can greatly improve database
performance.

What is Data Attribute Group


In data modeling, a logical data model is the representation of business data in a form
that can be the basis for the physical database implementation. It identifies a data
"periodic table" which underpins the business organization's functions, processes and
tasks. Data modelers design a logical data model in order to establish a data processing
environment where the basic data is captured only once, stored, and then shared with data
consumers authorized by the company to generate statistical reports, or with the public
who may want information.

At its most basic level, a logical data model defines the things about which data is kept,
such as people, places and events. These are technically known in database terms as
entities. The word relationship is another database term used in a logical data model to
mean the connection between entities. Finally, the term attribute refers to the
characteristics of the entities.

There are certain rules to follow when using attributes in logical data modeling:

1. An attribute should possess a unique name, and the same meaning must be consistently
applied to that name.

2. An entity may have one or more attributes. Every attribute is owned by exactly one
entity in a key-based or fully attributed model. This is also referred to as the Single
Owner Rule.

3. Any number of migrated attributes may be owned by an entity as long as the migrated
attribute is part of the primary key of a related parent entity. In principle, the primary
key must always migrate, while a non-key attribute must never migrate.

4. An entity instance cannot have more than one value for an attribute associated with
that entity. This is also known as the No-Repeat Rule or the First Normal Form Rule.

5. An attribute which is not part of a primary key may be null, meaning its value is not
known. In the past this was forbidden by the so-called No-Null Rule, which is no longer
required; earlier data modelers simply refused to allow a non-key attribute that could be
set to null.

6. Models should not contain two distinctly named attributes whose names are synonymous.
Two names are synonymous if each is an alias for the other, whether directly or
indirectly, or if there is a third name for which both are aliases.
Attributes may be multi-valued, composite or derived. A multi-valued attribute can have
more than one value for at least one instance of the entity. For example, a software
entity called Application may have a multi-valued attribute called Platform, because
different instances of the same application may run on different platforms; a document
processor, for instance, may run on Microsoft, Apple and Unix platforms.

A composite attribute may contain two or more attributes. An address can be a composite
attribute consisting of street address, city, region, state and so on.

A derived attribute is an attribute whose value is computed from other data, possibly by a
formula. A person's age can be an attribute derived from another attribute, the birthday.
Derived attributes are very common in business data warehouses, where atomic data are
heavily aggregated to form reports about the company.
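
As a minimal sketch of a derived attribute, with the Person entity and its field names
invented for illustration, the following Python fragment stores only the birthday and
computes the age on demand rather than persisting it.

# Hypothetical Person entity: age is derived from the stored birthday attribute.
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    name: str
    birthday: date

    @property
    def age(self) -> int:
        """Derived attribute: computed from birthday, never stored redundantly."""
        today = date.today()
        years = today.year - self.birthday.year
        # Subtract one year if the birthday has not yet occurred this year.
        if (today.month, today.day) < (self.birthday.month, self.birthday.day):
            years -= 1
        return years

p = Person("Ramesh", date(1990, 4, 5))
print(p.name, "is", p.age, "years old")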

What is Data Cardinality

In the implementation of the structured query language (SQL), the term data cardinality is
used to mean the uniqueness of the data values contained in a particular column
(attribute) of a database table.

There are actually three types of data cardinality, each dealing with columnar value sets:
high cardinality, normal cardinality and low cardinality.

High data cardinality refers to the case where the values of a data column are very
uncommon. For example, a data column holding social security numbers should always be
unique for each person; this is an example of very high cardinality. The same goes for
email addresses and user names. Automatically generated numbers also have very high data
cardinality: a column named USER_ID, for instance, would contain values generated from a
counter that automatically increments every time a new user is added.

Normal data cardinality refers to the case where the values of a data column are somewhat
uncommon but not unique. For example, a CLIENT table with a LAST_NAME column can be said
to have normal data cardinality, as there may be several entries of the same last name,
such as Jones, alongside many other varied names in one column. On close inspection of the
LAST_NAME column, one can see clumps of repeated last names side by side with unique last
names.

Low data cardinality refers to the case where the values of a data column are not unusual
at all. Some table columns take very limited values: Boolean columns can only take 0 or 1,
yes or no, true or false. Other low-cardinality columns are status flags. Yet another
example is a gender attribute which takes only two values, male or female.
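
A minimal sketch of how one might measure column cardinality, using an invented list of
rows rather than a real table: the ratio of distinct values to total rows is close to 1
for high-cardinality columns and closer to 0 for low-cardinality ones.

# Hypothetical rows; the ratio of distinct values to total rows approximates
# how "high" or "low" the cardinality of each column is.
rows = [
    {"user_id": 1, "last_name": "Jones", "gender": "F"},
    {"user_id": 2, "last_name": "Patel", "gender": "M"},
    {"user_id": 3, "last_name": "Jones", "gender": "F"},
    {"user_id": 4, "last_name": "Smith", "gender": "M"},
]

def cardinality_ratio(rows, column):
    distinct = {row[column] for row in rows}
    return len(distinct) / len(rows)

for column in ("user_id", "last_name", "gender"):
    print(column, cardinality_ratio(rows, column))
# user_id -> 1.0 (high), last_name -> 0.75 (normal), gender -> 0.5 (low)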

Determining data cardinality is a substantial aspect of data modeling, because it is used
to determine the relationships between entities.

Several types of cardinality exist, defining the relationships between occurrences of
entities on the two sides of a relationship line.

The Link Cardinality is a 0:0 relationship, defined as one in which neither side needs the
other to exist.

The Sub-type Cardinality is a 1:0 relationship, defined as having one optional side only.

The Physical Segment Cardinality is a 1:1 relationship in which both sides of the
relationship are mandatory.

The Possession Cardinality is a 0:M (zero-to-many) relationship on both sides.

The Child Cardinality is a 1:M mandatory relationship and is one of the most common
relationships used in databases.

The Characteristic Cardinality is a 0:M relationship which is mandatory on both sides.

The Paradox Cardinality is a 1:M relationship which is mandatory on one side. An example
would be the relationship between a person table and a citizenship table.

The Association Cardinality is an M:M (many-to-many) relationship which may be optional on
both sides.

A data table's cardinality with respect to another data table is one of the most critical
aspects of database design. For instance, a hospital database may have separate data
tables to keep track of patients and doctors, so a many-to-one relationship should be
considered by the database designer. If data cardinality and relationships are not
designed well, the performance of the database will greatly suffer.

What is Data Characteristics


Data characteristics are defined during data modeling, a process where a data model
is
created by applying a data model theory in order to create a data model instance. A
data
model theory is a formal description of a data model.

Data modeling is also a process of structuring and organizing data so that the resulting
data structures can be the basis for the implementation of a database management system.
In addition to organizing and defining data, the data modeling process also implicitly or
explicitly imposes limitations and constraints on the data within the structure.

Data models are based on business rules. Business rules are abstract and intangible
concepts that a database management system, which is basically a computer system, cannot
understand directly. Converting business rules into data models puts them into a format
that the computer can understand and therefore implement.

Here is an example of a draft business rule that will be the basis of a data model:

A. The aspect being modeled is a Product Line

B. The things of interest, herein referred to as "Things", include
(1) Products
(2) Product Categories
(3) Product Characteristics

C. These "Things" are related as follows:
(1) A product can be in one and only one Product_Category;
(2) A product can have zero, one or many Product_Characteristics.

D. The other characteristics of the "Things" include
(1) A product can have either zero or one typical_buying_price and
(2) A product can have either zero or one typical_selling_price.

E. Sample data may include products to be determined by the company

F. Typical inquiries may include the typical selling price for a certain number of
products

It is apparent from the draft business rule that data characteristics are everywhere. As
mentioned earlier, data characteristics can be developed either directly through
measurement or indirectly through derivation, from a feature of an object or event.
For instance, if the product is a T-shirt, the data characteristics developed by direct
measurement are
(1) the material composition of the T-shirt,
(2) the size range of available T-shirts,
(3) the style of the T-shirt and
(4) the supplier of the T-shirts.

On the other hand, some of the data characteristics that can be developed through
derivation include
(1) the bulk price of the T-shirts, which can be derived from the number of orders, and
(2) the shipment price of the T-shirts, which also depends on the number of orders.
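
As a minimal sketch of how the draft business rule above could be expressed as structures,
with all class and field names invented for illustration, the following Python fragment
captures the one-and-only-one category rule and the zero-to-many characteristics rule.

# Hypothetical translation of the Product Line business rule into structures:
# one mandatory category per product, zero-to-many characteristics, optional prices.
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class ProductCategory:
    category_id: int
    name: str

@dataclass
class ProductCharacteristic:
    name: str
    value: str

@dataclass
class Product:
    product_id: int
    name: str
    category: ProductCategory                     # exactly one category (rule C1)
    characteristics: List[ProductCharacteristic] = field(default_factory=list)  # rule C2
    typical_buying_price: Optional[float] = None  # zero or one (rule D1)
    typical_selling_price: Optional[float] = None # zero or one (rule D2)

apparel = ProductCategory(1, "Apparel")
tshirt = Product(10, "T-shirt", apparel,
                 characteristics=[ProductCharacteristic("material", "cotton")],
                 typical_selling_price=9.99)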

Data characteristics are very important in an area of data modeling called the
entity-relationship model (ERM). The ERM is a representation of structured data whose
final product is an entity-relationship diagram (ERD).

From the data models where data characteristics are defined, the relationships among the
data, more technically known as "entities", are then defined. Data characteristics are
known as "attributes" in data modeling jargon. A relationship could be as simple as "an
employee entity has an attribute which is a social security number".

There are various types of relationships in ERM: many-to-one, one-to-many, many-to-many
and one-to-one. Database implementers need to give ERM design very careful consideration,
because any slight failure can result in weak data integrity and the resulting flaw can be
hard to trace. As database experts recommend, it is always good to draft a database plan
in plain English, and data characteristics should be defined in the same way.

What is Data File

In a logical data model, the conceptual data model, which is based on the business
semantics, is refined. Entities and relationships, the corresponding table and column
design, object-oriented classes and XML tags, among other things, are laid out regardless
of how the database will be physically implemented.

A data file, by contrast, is a physical file: a file that is represented as real bits on
the storage hardware of the computer.

Dealing with data files in a large data warehouse is not as simple as dealing with them on
a standalone computer. Large data warehouses are managed by relational database management
systems. In relational databases, entities refer to any data that can be of interest, and
these entities have attributes.
For example, a CUSTOMER is an entity in the database. The customer could have attributes
such as First Name, Middle Name, Last Name, Customer Number, SSS Number and many more.
When an entity is entered into the database, the database management system connects the
entity with its attributes in a structure called a relation.

An entity may have multiple values for an attribute, such as the number of places a
customer has lived in throughout his or her life. All this information is saved as data
files by the database management system.

Today's data warehouses also make intensive use of the extensible markup language (XML), a
general-purpose markup language. XML is primarily used to facilitate the sharing of
structured data across several information systems, which may run on disparate servers
across the internet. XML is also used to encode documents and to serialize data so that
they are easy to process.

XML has in fact been used by many as an alternative to relational databases. In a
distributed system, or in data warehouses getting data from several sources, using
relational database files could mean that a file from one server is not compatible with
the other servers. Using XML overcomes the portability problem because XML files are
actually standard text files, so different servers reading them can understand the files.

Since XML allows user-defined markup, data warehouses can utilize an XML data file to
store information about an entity and use the information later. XML data files may reside
anywhere within the computer's storage. When information about an entity is needed from an
XML data file, the XML must be processed using a programming language in conjunction with
either a SAX API or a DOM API.

A transformation engine or a filter can also be used. Newer techniques for XML processing
include push parsing, data binding and non-extractive XML processing APIs.
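
A minimal sketch of reading entity information back out of an XML data file, using
Python's standard-library ElementTree (a DOM-style API); the element and attribute names
are invented for illustration.

# Hypothetical customer record stored as XML and read back with ElementTree.
import xml.etree.ElementTree as ET

xml_text = """
<customers>
  <customer id="1">
    <first_name>Ramesh</first_name>
    <last_name>Kumar</last_name>
  </customer>
</customers>
"""

root = ET.fromstring(xml_text)
for customer in root.findall("customer"):
    print(customer.get("id"),
          customer.findtext("first_name"),
          customer.findtext("last_name"))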

An entity can also be represented by manual data files. In fact, there are many instances
where manual data filing is used instead of a database system or XML. For example,
document files such as a last will and testament or long contracts have to be stored
separately as manual data files.

Large video or photo files pertaining to a person also need to be stored as data files.
But there has to be a mechanism to reference these manual files so that their ownership
relates back to a data entity. Both the database management system and the XML technique
can be used to do the referencing.

What is Data Element


In information technology, a data element is any named unit of data which may or may not
consist of other data items. In many electronic record-keeping applications, a data
element is a combination of bytes or characters which refers to a separate item of
information such as name, address or gender.

A data element is defined in metadata. The definition may be a human-readable phrase or a
sentence associated with the data element within the data dictionary. A good data element
definition brings significant benefits to processes such as mapping one set of data onto
another.

Data is the main component of a data warehouse. Most businesses today are heavily
reliant
on information from a data warehouse. The term data-driven business is very much in
use
today with the ubiquity of the internet.

Data warehousing is a complex undertaking which requires very careful planning through
several stages of implementation. An active data warehouse has the capacity to search for
trends and patterns in the data so that a company can strategize to gain a competitive
edge over its competitors. To gain relevant information from a data warehouse, the data
must be well structured.

A data warehouse should be based on a common data architecture, a formal and comprehensive
data plan that provides the common context within which data resources are integrated. The
data within the architecture are in turn based on a common data model.

Data modeling is the process of turning data into representations of the real-life
entities, events and objects that are of interest to the organization. So that the data
warehouse can produce consistent data, a data dictionary should also be set up.

A data dictionary, in technical terms, is a set of metadata (metadata being information
about data) which contains definitions and representations of data elements. From the
perspective of a relational database management system, the data dictionary is a set of
tables and views which can be read but never directly altered, as it holds the definition
of all data elements used in the data architecture as well as the physical implementation
of the database.

The data dictionary, aside from containing the definitions of all data elements, also
contains usernames and their corresponding roles and privileges, schema objects, integrity
constraints, stored procedures and triggers, and information about the general database
structure and space allocations.

The way data elements are stored within the database may vary depending on the database
design and the relational database management software. But data elements are always alike
in that they are atomic units of data carrying identification such as a data element name.
A data element should have a clear definition and a representation consisting of one or
more terms.
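
A minimal sketch of a data dictionary entry for a single data element, with the field
names and sample values invented for illustration; real database data dictionaries expose
this kind of information through their own system tables and views.

# Hypothetical data dictionary entry describing one data element.
customer_email_element = {
    "name": "CUSTOMER_EMAIL",
    "definition": "Primary e-mail address used to contact the customer.",
    "representation": {"datatype": "VARCHAR", "max_length": 254},
    "owning_entity": "CUSTOMER",
    "nullable": False,
}

def describe(element):
    """Render a human-readable line from the dictionary entry."""
    rep = element["representation"]
    return (f"{element['name']} ({rep['datatype']}({rep['max_length']})): "
            f"{element['definition']}")

print(describe(customer_email_element))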

How data elements are used depends on the application employing them, but their usage can
be discovered by inspecting the applications, or the data files of those applications,
through Application Discovery and Understanding, which can be done manually or
automatically.

The process of Application Discovery and Understanding (ADU) involves analyzing the
artifacts of a software application and determining the structure of the associated
metadata in the form of lists of business rules and data elements. After the data elements
are discovered, they can be registered in a metadata registry.
What is Data Key

While implementing a Data Warehouse, there is a multitude of complex things to consider.
Data Management is one of the most important aspects, alongside other concerns such as the
physical servers and network components.

In Data Management, the design of the Data Structure determines the smooth and successful
implementation of the database that will power the Data Warehouse.

Since a Data Warehouse is a rich repository of all sorts of data, from the company's
historical data to data from outside sources, it is always a good idea to classify these
data in order to get at the relevant information and generate statistical reports for the
company.

A subject area is a summary of the things a business enterprise is interested in. It
pertains to a collection of data which relates to a high-level function of the
organization. A conceptual model defines a subject area as the boundary between systems or
areas of interest within the organization. An example of a data subject area would be an
Employee Subject Area, which contains all entities and data attributes pertaining to
employees.

Despite the segregation of the high volume of disparate data coming from various sources
into a Data Warehouse, there is still no assurance of finding the most relevant data
effectively without help from tools and other retrieval techniques. Using Data Keys for
data subjects can greatly help in the retrieval of important and relevant data from the
database within the Data Warehouse.

Since Data Keys uniquely identify data occurrences in each data subject within the data
resource, a data consumer looking for data is no longer forced to sift through the heavy
volume; the search is narrowed down by the use of a key.
For example, without the help of a Data Key, a data consumer who wants to look at the
buying trend of, say, people within the age range of 20-30 years old may be confronted
with data from all kinds of customers within the database.

But when the data consumer uses a Data Key to identify the people within the 20-30 age
range, the search is definitely narrowed down. As a result, the search is a lot faster and
the computer server is not burdened with an intensive processing load.
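
A minimal sketch of the effect described above, with invented purchase records: grouping
rows under a key (here an age band) ahead of time means a later lookup touches only the
matching bucket instead of scanning every row.

# Hypothetical purchases keyed by an age-band Data Key so lookups avoid a full scan.
purchases = [
    {"customer": "A", "age": 24, "item": "phone"},
    {"customer": "B", "age": 45, "item": "laptop"},
    {"customer": "C", "age": 28, "item": "headphones"},
]

def age_band(age):
    """Derive the Data Key: a coarse age band such as '20-30'."""
    low = (age // 10) * 10
    return f"{low}-{low + 10}"

# Build the keyed index once, when data is loaded.
by_band = {}
for row in purchases:
    by_band.setdefault(age_band(row["age"]), []).append(row)

# A later query reads only the relevant bucket.
print(by_band.get("20-30", []))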

A Data Warehouse itself is a subject-oriented information system designed to support the
business. It is an environment that organizes data and provides reports in a manner that
business people can understand.

There are several layers in a Data Warehouse architecture, and one of them covers the
extraction, cleansing and transformation of source information. This is the layer where
Data Keys are attached to certain data so that it becomes easy to find only the relevant
data in the warehouse.

The Data Key is one of the important aspects of data structures used generally in the
information technology field. In cryptography, a data key is a variable value that is
applied to a block of text or a string; the encrypted data can only be opened with the
key.

A data key can also be an actual physical object that stores digital information and is
required to gain access to data, with an associated key analyzer program or mechanism that
enables a computer to process the data.

What is Cardinality

Cardinality is the term used in database relations to denote the number of occurrences of
data on either side of a relation. In the common data architecture, cardinality is
documented with data integrity, not with data structure.

There are several types of cardinality defining the relationships between occurrences of
entities on the two sides of a relationship line.

The Link Cardinality is a 0:0 relationship, defined as one in which neither side needs the
other to exist. For example, in a person and parking space relationship, I do not need a
person to have a parking space and I do not need a parking space to have a person. It also
denotes that a person can occupy only one parking space. This relation needs one entity
nominated as the dominant table, with programs or triggers used to limit the number of
related records stored in the other table of the relation.
The Sub-type Cardinality is a 1:0 relationship, defined as having one optional side only.
An example would be a person and programmer relation: a person may or may not be a
programmer, but a programmer must always be a person. The mandatory side of the relation,
in this case the programmer side, is dominant in the relationship. Triggers and programs
are again used to control the database.

The Physical Segment Cardinality is a 1:1 relationship in which both sides of the
relationship are mandatory. An example is a person and a DNA pattern. This relationship
shows that a person has exactly one set of DNA patterns, while a DNA pattern, as dictated
by nature, can apply to only one person.

The Possession Cardinality is a 0:M (zero-to-many) relationship on both sides. For
example, a person may own no phone or many phones, while a phone may have no owner but has
the potential to be owned by a person. In a database implementation, a nullable foreign
key column in the phone table is used to reference the person in its table.

The Child Cardinality is a 1:M mandatory relationship and is one of the most common
relationships used in databases. An example would be the relationship between a person
table and a membership table. A person may or may not be a member, and a person can also
be a member of many organizations. The foreign key in the membership table has to be
mandatory and not null.

The Characteristic Cardinality is a 0:M relationship which is mandatory on both sides. An
example would be the relationship between a person table and a name table: a person should
have at least one name but may also have many names. The database implementation for this
cardinality involves a nullable foreign key in the name table referencing the person
table.

The Paradox Cardinality is a 1:M relationship which is mandatory on one side. An example
would be the relationship between a person table and a citizenship table. The Paradox is
similar to the Physical Segment Cardinality: a person must have a citizenship and a
citizenship must belong to a person, but in this case a person may have multiple
citizenships.

The Association Cardinality is an M:M (many-to-many) relationship which may be optional on
both sides. An example would be the relationship between a person table and an employer
table, where a person may work for several employers or for no employer at all, while an
employer may have no employees or several employees. A database implementation for this is
to create a third, associative entity.
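
A minimal sketch of the associative-entity approach, using Python's built-in sqlite3
module; the table and column names are invented for illustration, and the third table
(employment) carries one row per person-employer pair.

# Hypothetical many-to-many person/employer schema resolved by an associative table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person   (person_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employer (employer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employment (                              -- the associative entity
    person_id   INTEGER NOT NULL REFERENCES person(person_id),
    employer_id INTEGER NOT NULL REFERENCES employer(employer_id),
    PRIMARY KEY (person_id, employer_id)
);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Ramesh"), (2, "Mahima")])
conn.executemany("INSERT INTO employer VALUES (?, ?)", [(10, "Acme"), (20, "Globex")])
conn.executemany("INSERT INTO employment VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])

# One person can have many employers and one employer many persons.
for row in conn.execute("""
    SELECT p.name, e.name
    FROM employment m
    JOIN person p   ON p.person_id   = m.person_id
    JOIN employer e ON e.employer_id = m.employer_id
    ORDER BY p.name, e.name"""):
    print(row)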

What is Common Data Structure


In big data warehouses, such as those used by business organizations with many branches
around the world and diversified products and services, different kinds of data flood the
warehouse every single day.

These data may come from other data warehouse sources, may be freshly entered by staff in
various departments, or may arrive through company data subscriptions.

These data are highly likely to come in different formats, but the purpose of having a
data warehouse is to give the company a clear statistical picture of industry trends and
patterns, so data warehouses need a mechanism for producing analysis and reports.

For a business to have an intelligent system which relies on the data supplied by the data
warehouse, a well-defined business data architecture is a very important consideration.
Just as when building a house, a good plan or blueprint is essential so that construction
flows smoothly, all the materials, interior setup, design and other specifications are
accounted for, and carpenters, masons, electricians and other building professionals with
different areas of specialization can agree on one standard; otherwise the house will fall
into disarray.

The same is true within a business organization. Different companies can have different
perspectives on business transactions. For instance, the word transaction means something
entirely different to an organization offering flower shop services than to an
organization offering computer services.

Even in a homogeneous company, disparities in the interpretation of the same business
event can still exist in the absence of a written, well-defined data architecture. In
non-automated systems, one clerk may record a transaction of the same nature differently
from another clerk. A software application can overcome this problem, but software
applications cannot function without data. So a Common Data Structure is important for
software applications to function properly.

The fact is that in the design of software programs, the choice of Data Structures is the
top consideration. The experience of many IT professionals in building large systems has
shown that the degree of difficulty of software implementation, and the performance and
quality of the final output, depend heavily on choosing the best Data Structure. So, as
early as the planning stage, the definition of the Data Structure is already given much
attention.

The Data Structure is the technical interpretation of real-life business activities. In a
real-life scenario, a business deals with entities such as persons, products and kinds of
services. These are translated into a Data Structure so that the database and other
software applications know how to store them.
For instance, a person's name may be stored as a string of characters while the age may be
stored as an integer. If very specific Data Structures are followed, the system can save
on disk storage space while the processing algorithms can be optimized for speed, placing
less load on the computer.

In today's data warehouses, where distributed systems are common, a Common Data Structure
can make it easy to share information between servers. Distributed systems are composed of
many computer servers, each processing business events and sharing the results to be
aggregated and used in statistical reports for the company.

If a Common Data Structure exists, problems with cross-compatibility and portability are
greatly reduced, as disparate systems share the same view of the data by following a
common structure.
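
A minimal sketch of the idea, with the record layout invented for illustration: if every
server agrees on the same field names and types, a record produced on one platform can be
serialized to a neutral text form (JSON here) and read back unchanged on another.

# Hypothetical common record layout shared by all servers: same field names,
# same types, serialized to a neutral text format for exchange.
import json

COMMON_FIELDS = {"person_id": int, "name": str, "age": int}

def validate(record):
    """Check a record against the agreed common structure before sending it."""
    for field, expected_type in COMMON_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    return record

# Server A produces a record and serializes it.
outgoing = json.dumps(validate({"person_id": 7, "name": "Mahima", "age": 31}))

# Server B, possibly on a different platform, parses the same structure back.
incoming = validate(json.loads(outgoing))
print(incoming["name"], incoming["age"])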

What is Data Entity

A Data Entity represents a data subject from the common data model as it is used in the
logical data model.

A Data Model has three theoretical components. The first is the structural component, a
collection of data structures which will be used to represent entities or objects in the
database. The second is the integrity component, a collection of governing rules and
constraints on the data structures. The third is the manipulation component, a collection
of operators that can be applied to the data.

The Data Entity is one of the components defined in a logical data model. A logical data
model is a representation of all of the organization's data, organized in terms of a data
management technology. In today's database technology there are several choices of logical
data model: relational, object-oriented or XML. In the relational case, the Data Entity is
described in terms of tables and columns in the database. An object-oriented Data Entity
is described in terms of classes, attributes and associations. An XML Data Entity is
described in terms of tags, similar to the web's HTML.

But whether it is relational, object-oriented or XML, a Data Entity is the computational
equivalent of a real-life person, object, event or activity. A Data Entity should strictly
adhere to the structures identified and defined in the conceptual data model, because that
model describes the semantics of the business.

An entity defined in the conceptual data model may be any kind of thing about which a
company wants to keep information, together with the attributes pertaining to that
information and the relationships among the entities. For example, a company may have a
person entity stored as a "CUSTOMER" Data Entity in its database. Or it could be the other
way around, with an abstract entity "PERSON" representing different real-life entities
such as customers, vendors, managers and suppliers, all defined in the conceptual data
model. The conceptual data model is about the definition of abstract entities, and these
definitions are expressed in natural language.

The Data Entity is actually defined in the logical data model, which is the underlying
layer for the physical implementation of a database. Based on the abstract entities
defined in natural language in the conceptual data model, the placement of data entities
in columns is specified. The logical data model allows an end user to access and
manipulate a relational database without having to know the structure of the relational
database itself. As such, the first step in creating a logical data model is specifying
which tables are available and then defining the relationships between the columns of
those tables. It is in the logical data model where the structure of the Data Entity's
data fields, with a master level and a plurality of detail levels, is defined.

A good illustration of a Data Entity is the relational model. From the abstract entity
defined in the conceptual data model, a logical Data Entity is laid out in the columns of
a relational table.

The table is the basic structure of the relational model, and this is where information
about the Data Entity, for instance an employee, is represented in columns and rows. The
values of a named column are called attributes of the Data Entity, and the term relation
refers to such a table within the database. For example, the columns of the employee table
enumerate all the attributes pertaining to the employee, such as name, gender, age,
address and marital status, while each row is an instance of the entity represented in the
relation.

What is Data Restructuring

Data Restructuring is the process of restructuring the source data into the target data
during data transformation. Data Restructuring is an integral part of data warehousing. A
very common set of processes used to run large data warehouses is called Extract,
Transform, and Load (ETL).

The general flow of ETL involves extracting data from outside sources, then transforming
it based on business rules and requirements so that the data fit the business needs, and
finally loading the data into the data warehouse.

If one looks closely at the process, the data restructuring part comes before the loading,
and this is extremely necessary. In a data warehouse environment, high volumes of data
come into the data warehouse at very short intervals, and in most cases the data come from
disparate sources. This means that the servers the data come from may run on different
software platforms, so the data may be in different formats, or the sources may be based
on data architectures which are not compatible with the data architecture of the receiving
data warehouse.
When all the data arrive from the different sources, they need to be restructured so that
they comply with the business rules as well as the overall data architecture of the data
warehouse. Data restructuring makes the data structures more sensible to the database
behind the data warehouse.

Data structure analysis includes making sure that all the components of a data structure
are closely related, that closely related data are not split across separate structures,
and that the best type of data structure is being used. Data may be a lot easier to manage
and understand when represented in a way that abstracts their relevant similarities.

Often, in data warehouses, data restructuring involves changing some aspect of the way the
database is logically or physically arranged. There are many reasons to perform data
restructuring: for instance, to improve the database's performance and storage
utilization, or to make an application more useful in supporting decision making or data
processing.

There are generally four types of data restructuring operations namely:

. Trimming

. Flattening

. Stretching
. Grafting

In trimming, the data extracted from the input are placed in the output without any change
to the hierarchical relationships, but with some unwanted components of the data removed.

In flattening, the operation produces a flat form from a structured branch of the input by
extracting all information at the level of the values of the basic attributes of the
branch.

The stretching operation can produce a data structure output which has more hierarchical
levels than the input.

Finally, a grafting operation combines two hierarchies horizontally to form a wider
hierarchy by matching common values.
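
A minimal sketch of two of these operations on an invented nested record: trimming drops
unwanted components while keeping the shape, and flattening pulls the leaf attributes of a
branch up into flat rows.

# Hypothetical nested source record used to illustrate trimming and flattening.
source = {
    "customer": "Ramesh",
    "internal_notes": "do not load",          # unwanted component
    "orders": [
        {"order_id": 100, "total": 59.90},
        {"order_id": 101, "total": 12.50},
    ],
}

def trim(record, unwanted):
    """Trimming: same hierarchy, minus the unwanted components."""
    return {k: v for k, v in record.items() if k not in unwanted}

def flatten(record):
    """Flattening: one flat row per leaf of the 'orders' branch."""
    return [
        {"customer": record["customer"], **order}
        for order in record["orders"]
    ]

trimmed = trim(source, {"internal_notes"})
rows = flatten(trimmed)
print(rows)
# [{'customer': 'Ramesh', 'order_id': 100, 'total': 59.9},
#  {'customer': 'Ramesh', 'order_id': 101, 'total': 12.5}]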

One of the most important roles that data restructuring plays is in information processing
applications. When data are extracted from the data sources and new fields are created and
placed in the output, the data structure of the resulting output sometimes does not
resemble that of the input.

Sometimes, query facilities designed for simple retrievals are not adequate to handle many
real-world scenarios, so some programming may be required. But programming may not be for
everyone, not even for every database administrator. Making the most of data restructuring
can eliminate some of the need for programming: with properly restructured data in a
relational database, simple queries may be enough even for retrieving relatively complex
and aggregated data structures.

Data Structure and Components

Data Structure represents both the physical and the logical data contained in the common
data architecture. It includes data entities and data subjects, their contents,
relationships and arrangement.

A data component refers to a component of the metadata warehouse that contains the
structure of data within the common data architecture.

In general, data structures define:

. what data will be stored in the database or data warehouse, and how

. how long the data will be stored

. how the data will be disposed of or archived once no longer needed

. who will be responsible for collecting the data and ensuring its quality

. who will have access to the data

A real enterprise-wide data warehouse has a very complex data architecture. A data
warehouse is a repository of all enterprise-related data coming from various data sources,
such as those from different departments (e.g. Finance, Administration, Human Resources).

Some of the high-volume data will be stored in large legacy or packaged systems where the
data structure may be unknown. Other enterprise data may be contained in spreadsheets and
smaller personal databases such as Microsoft Access, and these data may not even be known
to the IT department.
Some of the key information may be residing in some external information systems
which
are maintained by third party service providers or business partners.

Without a well-defined data architecture and data structure, there can be very little
control over the realization of high-level business data concepts, and data will likely be
highly dispersed and of poor quality.

Another negative effect is that much of the data will be redundant across systems, which
may result in conflicts in organizational and business processes.

Data structures can be defined with high-level data models which describe the business
data from a logical perspective, independent of any actual system. Such a model may
comprise a canonical class model of the business entities and their relationships,
together with the semantics, syntax and constraints of a superset of business attributes.
These high-level data models defining data structures often exclude class methods, but in
most cases the methods can be summarized into one business data object which is
responsible for managing the structure.

Data structures also refer to the way that data are stored in terms of relational tables
and columns, as well as how they can be converted into objects in object-oriented classes
and how they can be structured with XML tags.

It is important to be very clear about the data structure because, with today's advances
in information technology, there can be a dozen ways to implement a single set of database
tables, such as an architecture where some rows reside on a computer in the United States
while other rows reside in New Zealand.
Data structures also pertain to the relationships between conceptual entities and
the real
data objects of the information system.

Metadata is basically data about other data, and it is very useful in facilitating better
understanding, management and use of data. It is therefore common practice in many data
warehouse implementations to include a metadata warehouse to enhance the performance of
the whole system.

A metadata warehouse usually acts as the interface for data exchange between the data
warehouse and the business intelligence layer. Since the metadata warehouse does not
contain the full data but just descriptions, the data structure is held in the data
component. Data components are very useful in data warehouses implemented in distributed,
heterogeneous environments.

Data Structure Integrity

Data integrity in general is the measure of how well data is maintained within the data
resource after it has been created or captured. Data Structure Integrity is the subset of
data integrity that governs data relations. To say that data have high integrity means
that the data function in the way they were intended to.

A data structure integrity rule specifies the data cardinality for a data relation in
circumstances where no exceptions apply. Such rules make the data structure a lot easier
to understand.

A conditional data structure integrity rule is slightly different in that it applies to a
data relation only when certain conditions or exceptions hold. This kind of rule can also
express options for coded data values, which are typically difficult to show on an
entity-relationship diagram.

To better illustrate the use and benefits of data structure integrity, let us say that we
have two tables within the database. The first table, call it "Persons", contains a list
of names. The second table, call it "PhoneNumbers", contains telephone numbers.

In the real world, people may have one telephone, more than one, or none at all. In
database terms, the two tables would then have three kinds of relationships: one-to-one,
one-to-zero and one-to-many. This means that every person in the "Persons" table may have
one, zero or many phone numbers within the "PhoneNumbers" table.
It is worth noting that every phone number within the "PhoneNumbers" table belongs to one
and only one person within the "Persons" table. Without applying a data structure
integrity rule, the two tables may end up with data mixed up in complicated relationships,
which can lead to data redundancy that significantly slows down the whole system and
produces data inconsistencies.
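
A minimal sketch of enforcing that one-to-many rule with a declared constraint, again
using Python's built-in sqlite3 module; the table and column names are invented for
illustration, and the final insert is rejected because it references a person who does not
exist.

# Hypothetical Persons/PhoneNumbers schema: the NOT NULL foreign key guarantees
# every phone number belongs to exactly one existing person.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # sqlite3 enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Persons (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE PhoneNumbers (
    phone_id  INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES Persons(person_id),
    number    TEXT NOT NULL
);
""")
conn.execute("INSERT INTO Persons VALUES (1, 'Ramesh')")
conn.execute("INSERT INTO PhoneNumbers VALUES (10, 1, '555-0100')")       # accepted

try:
    conn.execute("INSERT INTO PhoneNumbers VALUES (11, 99, '555-0199')")  # no person 99
except sqlite3.IntegrityError as err:
    print("rejected by the integrity rule:", err)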

Data structure integrity can be achieved by designing a database which is consistent,
logical and stable. This can be done by including declared constraints in the overall
design of the database, often referred to as the logical schema.

Database normalization is also one of the biggest factors that can help an implementation
achieve data structure integrity. In database normalization, tables are checked to make
sure they contain no redundant data, and desirable table designs are selected from a
logically equivalent set of alternatives.
Referential integrity rules in relational databases make sure that data is always valid.
There can be any number of referential integrity rules, depending on the needs and
requirements of the data model as derived from the business rules. The important thing is
that, in the end, data structure integrity is always maintained while meeting all the
requirements.

The use of very precise rules for data integrity greatly solves a lot of problems
pertaining to
data quality which are very prominent in a lot of data warehousing implementations
in both
public and private sector organizations.

Precise rules for data integrity reduce the impact of bad information and allow many
organizations to devote their limited resources to more value-added undertakings. They
also help business organizations establish accountability in the area of data warehouse
management, which is just as important as other areas like human resources, finance and
sales.

With careful planning, from the data architecture to the physical implementation, data
structure integrity can surely be achieved, giving the company quality data as a basis for
sound decisions.

Data Storage

Enterprise Storage

Enterprise Storage is basically computer storage on a grand scale. Before tackling the
concept of enterprise storage, let us first go through an overview of basic computer
storage.

Computer storage is a component of a computer which records data on some medium and stores
it digitally to be used for computing at a later time. Aside from storing data, enterprise
storage is also responsible for online random-access storage and data protection, backup
to offline sequential-access storage, archiving of offline storage contents, and disaster
recovery.

Enterprise storage within a business organization is typically not implemented on one
storage computer only. In fact, it is scattered across several high-end computers, and
data may even be stored redundantly, because business data is always critical for everyday
operations.
Having more than one storage location ensures that if one data store fails, there is
always a ready backup. Without a backup, any failure could mean that the whole system
fails and the whole business operation temporarily comes to a halt. In big businesses
especially, a temporary halt can mean the loss of large amounts of revenue.

Because of the scattered nature of data storage within the whole enterprise storage
environment, a network is installed to connect the different storage computers. A network
administrator works in close collaboration with the enterprise storage administrator to
cope with the pressure of providing secure and resilient storage to users, groups and
computer resources within a multi-platform, heterogeneous environment. Some of the network
protocols used in enterprise storage are CIFS, NFS, HTTP/DAV, FTP and iSCSI.

Because of the complexity of setting up and managing enterprise storage, there are many
software applications specifically designed for enterprise storage management.

A storage archive manager offers features such as automatically backing up data from work
in progress. While backing up, it offers partial or complete transparency to both users
and applications. It takes care of archive files and complete file systems, which can be
written on multiple server platforms in a heterogeneous environment. It can even operate
with servers at off-site locations, with added security for protection.

Enterprise backup software functions as an automation tool for easy backup, recovery and
storage management services across many systems within the network. It also integrates
various disparate platforms, application software and operating systems such as UNIX,
NetWare, Microsoft Windows NT, or Apple Macintosh systems.
An enterprise storage operations manager offers integrated, heterogeneous and open storage area management. An enterprise storage administrator does not have to juggle various kinds of storage software tools because everything is already integrated and comes with an easy-to-use graphical interface. It becomes easy for the administrator to monitor everything in one place and quickly spot any problem within the many nodes of the storage network.

Enterprise storage can work very well in conjunction with a data warehouse. Whereas a data warehouse is by nature very dynamic, dealing with both historical and current transactional data, the enterprise storage system may be tasked to hold data that is not actively used at any given moment, with the data warehouse fetching data only as the need arises. Nevertheless, companies should invest in very stable and robust data storage hardware to prevent problems with data integrity, security and availability.
Enterprise System Connection Architecture (ESCON)

Enterprise System Connection Architecture (ESCON) is an IBM mainframe channel architecture commonly used to attach storage devices. In particular, it is a serial optical interface connecting IBM mainframe computers to peripheral devices such as tape drives and other storage devices. This architecture is the very railroad system for data in a large enterprise storage system.

ESCON offers a communication rate of about 17 MB/second over distances of up to 43 kilometers using a half-duplex medium. The architecture was introduced around 1990 by IBM to replace the much older and slower copper-based Bus & Tag channel technology of 1960s-1990s era mainframes.

The copper-based Bus & Tag channel technology was unwieldy: shielded copper cable allowing a throughput of only about 4.5 MB/s, roughly equivalent to a T-3 connection, had to be installed all over the data storage and processing center. ESCON has itself since been supplanted by Fibre Connection (FICON), which is substantially faster because it runs over Fibre Channel.

ESCON uses lightweight fiber optic cables: multimode 62.5 micron fiber supporting distances of up to 3 kilometers, and single mode 9 micron fiber supporting up to 20 kilometers. It also uses signal regenerators such as the General Signal Networks CD/9000 ESCON Director or CX Converter.

The ESCON system provides the structure of the high-speed backbone network used in a data storage and processing center. The center also serves as a gateway to other attached networks of lower speed. The essential configurable elements of the ESCON network are the fiber optic links, the ESCON channels, the ESCON Director, and the ESCON control units.

On the software side, support functions include the ESCON Manager program for the ESCON Director and the ESCON Dynamic Reconfiguration Management configuration control. With switching through the ESCON Directors, customers can create a high-speed, switched, multi-point topology for dynamic connectivity of inter-data-center applications.

ESCON was built to address major concerns related to the interconnection of systems, control units and channels: system disruption, complexity, cable bulk, cable distance, and the need for increased data rates. The general advantages of using ESCON are reduced cable bulk and weight, greater distance separation between devices, more efficient use of channels and adapters, higher availability, and a basis for growth in I/O performance.
Hence, ESCON permits operating systems and applications to run unchanged, and allows additional control units and systems to be inserted into running configurations without powering down, thus avoiding scheduled and unscheduled outages for installation and maintenance.

ESCON also improves interconnection capability within data processing centers, which includes intersystem connections and device sharing between systems. It allows an increased number of devices to be accessible by channels, which is very useful for large companies with large enterprise storage needs and a big data warehouse.

Since today's companies tend to be efficient when they are data-driven, many data stores are placed strategically in different locations to complement the business intelligence system. ESCON allows the distance for direct attachment of control units and direct system-to-system interconnection in the enterprise storage environment to be extended. It also provides significantly higher instantaneous data rates for serving simultaneous data consumers as well as gathering from data sources.

As the data warehouse or enterprise storage system grows, so does the need for computing capacity. Since ESCON uses an optical interface, it can significantly reduce the bulk and number of cables required to interconnect the system elements of a data storage and processing complex.

Historical Database

In a data warehouse implementation, millions and millions of records may be processed every single minute, and millions more are shared from one data source to another.

Dealing only with the current values of data may benefit a company by reducing spending on additional software and hardware, but more often, investing in the historical perspective of the data within the data warehouse brings more benefit than not having a historical database at all.

True, a historical database means that a business organization has to buy additional servers with more computing power, more memory and larger storage, because processing the historical perspective of data can be daunting and labor-intensive work. But in the long run, the return on investment can be surprisingly big.

In the world of information technology, time moves fast. It is not surprising that a new gadget, say a cellular phone, which costs a thousand dollars today may depreciate to a few hundred dollars within the next six months, because technology innovation brings in new models at very short intervals. The same is true of data: the volume of data being dealt with today will surely become distant history by the next day.
Keeping a historical database has many advantages. One of the biggest is that the operation of the entire data warehouse can be monitored for performance, so the system can be given early treatment once a problem is detected and a corresponding diagnosis has been identified.

The performance monitor is a very valuable tool which can help eliminate
troublesome profit
destroyers such as oscillations and swings.

Because a historical database provides a historical perspective on the data, it is much easier to troubleshoot problems that occur in the system. For instance, if output data no longer reflects the business rules, there are two ways to verify the cause of the problem.

The first is to check the code logic for any bug. If there is no problem with the code, the other method is to check the values of the data entered.

In any database or programming implementation, a piece of data may not cause an immediate problem, so a problem emerging now may have been caused by data entered in the past. With a problem like this, the solution will often come from the historical database.

A historical database is also very useful for spotting trends and patterns in the operations of the company and for seeing how well the company is performing compared to other players in the same industry.

A data warehouse is generally a repository of two things: historical data and current transactional data. Of course, if a company wants to see a trend, for instance in the sales performance of a particular item, it does not try to spot the pattern over a very short period only.

In fact, most companies want a "panoramic" view of performance, not just of one item but of all items and all business events and transactions, including trends in sales, income, human resources, manufacturing and all other facets of the business. Going back to the trend in the sales of a particular item, the data to be considered may include sales from as far back as two years. Data from the past are already stored in the historical database so as not to overload the current transactional database.

A historical database is usually kept in compressed format because, unlike the current transactional database, it is accessed less frequently.
Data Replication

What is Data Replication

A data replicate is a set of data copied from one data site and placed at another data site. It may also be a set of data characteristics from a single data subject or data occurrence group that is copied from the official data source and placed at another data site. Data replicates are not the same as redundant data.

Data Replication refers to the formal process of creating exact copies of a set of data from the data site containing the official data source and placing those copies at other data sites. Another aspect of Data Replication is copying a portion of a database from one environment to another and keeping the subsequent copies of the data in sync with the original source; changes made to the original source are propagated to the copies of the data in the other environments.

Data Replication is a common occurrence in large data warehouses to help the system

function efficiently and guard against entire system failure. Many data warehouse
systems
use Data Replication to share information in order to ensure consistency among
redundant
resources like hardware and software components.

In some cases it is data replication when the same data is stored on multiple storage devices, or computation replication when the same computing task is executed many times. In general, a computational task can be replicated in space, such as being executed on separate devices, or replicated in time, such as being executed repeatedly on one device.

Data Replication is transparent to the end user. A data consumer does not really know which data source the data he or she is using comes from, because he or she only gets the impression of one monolithic data warehouse. Access to replicated data is usually uniform with access to a single, non-replicated entity.

There are in general two types of Data Replication: active and passive replication. Active Data Replication is the process wherein the same request is performed at every data replica. Passive Data Replication, on the other hand, processes each request on one replica and then transfers the resulting state to the other replicas. If at any given time one master replica is designated to handle the processing of all requests, this is referred to as the primary-backup scheme (master-slave scheme), which is predominantly used in high-availability clusters.

On the other hand, if any replica may process a request and then distribute a new state, this is referred to as the multi-primary scheme (called multi-master in the database field). In this scheme, some form of distributed concurrency control, such as a distributed lock manager, needs to be employed.
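
As a rough illustration of the primary-backup (passive) scheme described above, the following Python sketch, in which all class and method names are invented rather than taken from any real replication product, shows a primary replica applying each request and shipping the resulting state to its backups:

    # Minimal sketch of passive (primary-backup) replication.
    # Real systems add logging, acknowledgements, fail-over election
    # and conflict handling; none of that is modeled here.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.state = {}          # key -> value store held by this replica

        def apply_state(self, state):
            # Backups do not execute requests; they simply install the
            # state shipped to them by the primary.
            self.state = dict(state)


    class PrimaryReplica(Replica):
        def __init__(self, name, backups):
            super().__init__(name)
            self.backups = backups   # list of Replica objects

        def handle_request(self, key, value):
            # Only the designated primary processes the request...
            self.state[key] = value
            # ...and then moves the new state to every backup.
            for backup in self.backups:
                backup.apply_state(self.state)


    if __name__ == "__main__":
        backups = [Replica("backup-1"), Replica("backup-2")]
        primary = PrimaryReplica("primary", backups)
        primary.handle_request("customer:42", {"name": "Ramesh", "city": "Pune"})
        # Every replica now holds an identical copy of the data.
        print(all(b.state == primary.state for b in backups))   # True

In a multi-primary scheme, by contrast, every replica would accept writes directly, which is why some form of distributed concurrency control becomes necessary.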

In the area of distributed systems, Data Replication is one of the oldest and most
important
aspects. Some of the best known replication models in distributed systems include:

Transactional Replication - This model is used for replicating transactional data held in relational databases or other kinds of transactional storage structures. Typically, this type of replication employs the one-copy serializability model, which defines the legal outcomes of a transaction involving replicated data.

State Machine Replication - This model assumes a deterministic finite state machine and the possibility of atomically broadcasting every event. In a lot of ways it is similar to transactional replication, but it is based on a distributed computing problem referred to as distributed consensus. People sometimes mistake this model for active replication.

Virtual Synchrony Replication - This is a computation model employed when a group of processes work together in order to replicate in-memory data or to coordinate actions.
What is Chained Data Replication

Chained Data Replication is a process of replication in which non-official data is replicated into another non-official data set. If data are replicated from non-official data, they are considered duplicated data, not replicated data.

In statistics, the term official data refers to data collected in different kinds
of surveys
commissioned by an organization. It also refers to administrative sources and
registers.
These data are primarily used for purposes like making policy decisions,
facilitating industry
standards to be followed, outlining business rules and best practices among many
other
things.

Non-official data, on the other hand, are data coming from external sources. In business, these non-official data may come from other data sources that are randomly selected by an organization. Many new markets today, like e-commerce and mobile technologies, commonly use more detailed data from non-official sources.
It is also common for business organizations to publish business data on their official websites and to offer the data as freely accessible to anybody. When another company takes these data, it is integrating non-official data into its data warehouse. Reliability problems can arise in such cases because data management expertise varies from one organization to another.

As a company's warehouse grows in bulk, with a variety of data sources contributing both official and non-official data, it is extremely important to employ a mechanism for getting the relevant data the company needs. Also, because greater bulk means more work for the hardware within the data warehouse, there should be a mechanism to manage all the data so that the data warehouse does not break down and disrupt the business operations of the company.

Chain Data Replication involves distributing the non-official data set among many disks, which provides load balancing among the servers within the data warehouse. Blocks of data are spread across many clusters, and each cluster can contain a complete set of the replicated data. Each data block in each cluster is a unique permutation of the data in the other clusters.

When a disk fails in one of the servers, any data access directed at the failed server is automatically redirected to the other servers whose disks contain an exact replica of the non-official data.

In some chain replication implementations, replicas and disks can be added online without having to move the data in the existing copies or affecting the arm movement of the disks. For instance, in an unforeseen event where the number of replicas needs to be increased because of a surge in demand, additional replicas can be loaded from tape and stored automatically on the newly installed disks.

During disk installation and replica loading, the services provided by the existing array of disks are not affected, as no additional I/O requests hit the existing array and the replicas are generated by the loading process itself. When the loading is done, the new replica can start servicing data requests from various sources.

In terms of load balancing, Chain Data Replication works by having multiple servers within the data warehouse share the processing of data requests, since the data already have replicas on each server's disks.
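
The load balancing and automatic redirection described above can be sketched roughly as follows; the round-robin dispatcher, server names and block layout are illustrative assumptions, not a description of any particular chain replication product:

    # Illustrative sketch: data blocks replicated across several servers,
    # with reads balanced round-robin and redirected away from failed disks.
    import itertools

    class StorageServer:
        def __init__(self, name, blocks):
            self.name = name
            self.blocks = dict(blocks)   # block_id -> data
            self.healthy = True

        def read(self, block_id):
            if not self.healthy:
                raise IOError(f"{self.name} is down")
            return self.blocks[block_id]


    class ReplicatedStore:
        def __init__(self, servers):
            self.servers = servers
            self._cycle = itertools.cycle(servers)   # simple load balancing

        def read(self, block_id):
            # Try servers in round-robin order; skip any failed replica.
            for _ in range(len(self.servers)):
                server = next(self._cycle)
                if server.healthy and block_id in server.blocks:
                    return server.read(block_id)
            raise IOError("no healthy replica holds this block")


    if __name__ == "__main__":
        blocks = {"b1": "sales Q1", "b2": "sales Q2"}
        store = ReplicatedStore([StorageServer("srv-a", blocks),
                                 StorageServer("srv-b", blocks)])
        print(store.read("b1"))        # served by srv-a
        store.servers[0].healthy = False
        print(store.read("b1"))        # automatically redirected to srv-b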

Every day, data warehouses are constantly extracting, cleansing and transforming business data and loading it into the warehouse. Today's businesses are powered not just by company intranets but by the internet as well. It is extremely important to have a powerful data warehouse infrastructure because, with the internet, the company is exposed to billions of users every single minute. Chain Data Replication is a beneficial method for making use of all these data.

Automatic Data Replication

Automatic Data Replication is the process wherein created data and metadata are automatically replicated based on the request of a client at a specific data site. A data site may maintain many computers working as one system to manage one or more data warehouses. These warehouses are repositories of millions upon millions of data items, and more are gathered, aggregated, distributed and updated every second.

Data replication makes use of redundant resources like hardware or software so that
the
whole data site system can have improved reliability and performance and become
tolerant
to unexpected problems arising from load intensive processes.

Data replication can be done manually, or automatically as programmed by the database administrator. Replicas can be stored on the same storage devices, spread across multiple storage devices within the same data site, or kept at other data sites in different geographic locations.

Data replication is commonly used in distributed systems. A distributed system is composed of many computers processing different parts of a program. These computers constantly communicate with each other over a network so that their processing can be synchronized and collated to come up with the desired output.

Data replication in distributed systems comes in three forms. The transactional replication model is used to automatically replicate data used in transactions. State machine replication is mainly used to achieve fault tolerance by having copies of some deterministic task executed on multiple nodes. Virtual synchrony replication works by having a group of processes cooperate so they can replicate in-memory data.
Data replication in many database management systems uses a master-slave relationship between the original and the replicated copies. When the master's logs are updated, all the slaves follow; when a slave receives an update, it sends a message back to the master so that it can receive subsequent updates.
In multi-master replication, updates can be submitted to any node and then spread out to the other servers. This can result in faster updates but may be impractical in some situations because of its complexity and the potential for conflicts.

Active storage replication is done by having updates from data block devices
distributed to
many separate physical hard disks. The file system can be replicated without any
modification and the process is implemented either in a disk array controller in
the hardware
or in the device driver software.
Data replication is also employed in distributed shared memory systems. In this
system,
many nodes share the same page of the memory which means data is being replicated
in
different nodes. This is used to boost speed performance in large data warehouses.

Search engines, whose enormous warehouses of data and metadata are indexed every second, make the most intensive use of automatic data replication as they serve public internet users around the world.

Load balancing, despite being different from data replication, is often associated with it; load balancing only distributes the load of different computations across many machines.

Backup, while it also involves making copies of data, is different from data replication in that the saved data remains unchanged for a long period, whereas replicas are constantly updated.

Both load balancing and backup are important processes in large data warehouses. Many companies invest in data warehouses with automatic data replication mainly to take advantage of the enhanced availability of specific and general data and to have disaster recovery protection. Other benefits of a data warehouse with automatic data replication include disaster tolerance, ease of use and management, and a more robust system.

Data Quality
What is Data Denormalization

Data Denormalization is a process in which the internal schema is developed from the conceptual schema.

Data denormalization, although done by adding redundant data, is actually a process of optimizing a relational database's performance. It is often needed where a relational database management system is poorly implemented: at the logical level, a true relational DBMS would allow a fully normalized database while providing physical storage of data designed for very high performance.
Database normalization is very important for designing relational database tables
so that
duplication of data will be prevented and the database can be guarded against
logical
inconsistencies. But data denormalization, although the term sounds the opposite,
actually
complements normalization in the database optimization process.

A normalized database commonly stores different but related data in separate logical tables, called relations. In big data warehouses, some of these relations are physically contained in separate disk files. Thus, issuing a query that gathers information from different relations stored in separate disk files can be slow, and even slower if many relations are being joined.

To overcome this problem, it is good to keep the logical design normalized while allowing the database management system to store additional redundant data on disk so that query responses can be optimized. In doing so, the DBMS is responsible for ensuring that the redundant copies are kept consistent at all times. In some SQL products this feature is called indexed views, in others materialized views: a view is the information laid out in a format convenient for querying, with the index ensuring that queries against the view are optimized.
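
As a rough, hedged illustration of this idea (the table and column names are invented, and SQLite is used only because it ships with Python), the sketch below keeps the logical tables normalized but maintains a redundant, denormalized reporting table that frequent queries can read without joining:

    # Sketch: normalized source tables plus a denormalized reporting table.
    # SQLite has no materialized views, so the summary is rebuilt explicitly;
    # a production DBMS would keep the redundant copy consistent for us.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sale (
            sale_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            amount REAL
        );
        -- Denormalized copy: customer name stored redundantly with the total.
        CREATE TABLE sales_report (customer_name TEXT, total_amount REAL);
    """)

    conn.executemany("INSERT INTO customer VALUES (?, ?)",
                     [(1, "Ramesh"), (2, "Mahima")])
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?)",
                     [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0)])

    def refresh_sales_report(conn):
        """Rebuild the redundant summary so it stays consistent with the source."""
        conn.execute("DELETE FROM sales_report")
        conn.execute("""
            INSERT INTO sales_report (customer_name, total_amount)
            SELECT c.name, SUM(s.amount)
            FROM customer c JOIN sale s ON s.customer_id = c.customer_id
            GROUP BY c.name
        """)

    refresh_sales_report(conn)
    # Reporting queries now read one flat table instead of joining every time.
    print(conn.execute("SELECT * FROM sales_report ORDER BY customer_name").fetchall())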

Data denormalization is an important aspect of data modeling, the process of creating and exploring data-oriented structures drawn from the real-life activities of an organization.

There are in general three categories of data models: the conceptual data model, used to explore domain concepts with project stakeholders; the logical data model, used to explore the relationships among domain concepts; and the physical data model, used to design the internal schema of the database, focusing on the data columns of tables and the relationships between tables.

Data denormalization is a substantial part of the physical data modeling process. The rules of data normalization aim to minimize redundant data, not to improve the performance of data access. Denormalizing certain parts of the data schema can improve database access speed.

Denormalizing the logical data with extreme care can result in an improvement in query response, but this comes at a cost. It is the responsibility of the data designer to ensure that the denormalized database does not become inconsistent. This is achieved by creating database rules called constraints, which specify how redundant copies of data are kept synchronized. The real cost of this process is the increase in the logical complexity of the database design and in the complexity of the additional constraints. The key to denormalizing logical data is exerting extreme care, as constraints can create overhead on updates, inserts and deletes which may cause worse performance than the functionally normalized counterpart.


It should be noted that a denormalized data model is not the same as an unnormalized data model, which is a model that has not been normalized at all. Denormalization should be done only after a satisfactory level of normalization has been reached and after any required rules and constraints have been created to prevent anomalies in the overall design.

What is Data Cleansing

Data warehouses, where a rich repository of company data may be found, are run by database management systems that need to see homogeneous data in order to flow smoothly, process the data and come up with statistical reports about company trends and patterns.

The problem arises because data warehouses gather, extract and transform data from a variety of sources. Data may come from a server with a totally different hardware structure, and the software behind that server may format data differently. When this data arrives at the data warehouse, it is mixed with data from yet other servers belonging to disparate systems.

This is where data cleansing comes in. Data cleansing, also referred to as data scrubbing, is the act of detecting and then either removing or correcting a database's dirty data. Dirty data refers to data which are out of date, incorrect, incomplete, redundant or formatted differently. The goal of the data cleansing process is not just to clean up the data within the database but also to bring consistency to the different sets of data that have been lumped together from separate databases.

Although many people treat them as the same, data cleansing and data scrubbing differ: in data cleansing, validation almost invariably means that bad data is rejected from the system right at entry time, whereas data scrubbing is performed in batches on data already stored.

After the data has been cleansed, the data set is consistent and can be used with similar data in the system, so the database can have a standard process for utilizing the data. It is a common experience in data warehouse implementations to detect and remove inconsistent data supplied from different data dictionary definitions of similar entities in different stores. Other data problems may be due to errors during end-user entry or to corruption during data transmission or storage after receipt from the source.

A good way to guarantee that data is correct and consistent is to have a pre-processing stage in conjunction with data cleansing. This helps ensure that data is not ambiguous, incorrect or incomplete.

In practice, the data cleansing process involves removing typographical errors and validating and correcting values by comparing data against a known list of entities. The validation may be very strict, as when an address is rejected because its Zip or Postal Code is not on the list. It may also be fuzzy, as when a record is corrected because it partially matches an existing known record.
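
A minimal sketch of that strict-versus-fuzzy distinction, with an invented reference list and matching threshold, might look like this:

    # Sketch: strict validation rejects unknown postal codes outright,
    # while fuzzy matching corrects a value that nearly matches a known entry.
    import difflib

    VALID_POSTAL_CODES = {"400001", "110001", "560001"}      # illustrative list
    KNOWN_CITIES = ["Mumbai", "New Delhi", "Bengaluru"]       # illustrative list

    def cleanse_record(record):
        cleaned = dict(record)
        # Strict rule: an invalid postal code causes the record to be rejected.
        if cleaned["postal_code"] not in VALID_POSTAL_CODES:
            return None, "rejected: unknown postal code"
        # Fuzzy rule: correct the city if it closely matches a known city name.
        match = difflib.get_close_matches(cleaned["city"], KNOWN_CITIES,
                                          n=1, cutoff=0.8)
        if match:
            cleaned["city"] = match[0]
        return cleaned, "accepted"

    if __name__ == "__main__":
        print(cleanse_record({"city": "Mumbay", "postal_code": "400001"}))
        # -> ({'city': 'Mumbai', 'postal_code': '400001'}, 'accepted')
        print(cleanse_record({"city": "Mumbai", "postal_code": "999999"}))
        # -> (None, 'rejected: unknown postal code')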

Data cleansing is an important aspect of achieving quality data in data warehouses. Data are said to be of high quality if they fit the purposes of correct operations, decision making and planning of the organization implementing the data warehouse. In other words, the quality of data is gauged by how realistically it represents the real-world constructs to which it is designed to refer.

What is Data Accuracy

The data warehouses of an organization are filled with data reflecting all the activities within the group. Data may come from various sources and be gathered through routine business processes. It is imperative that the processes in the data warehouse be precise and accurate, because the usefulness of data goes far beyond the software applications that generate it.

Companies depend heavily on data from the business data warehouse for decision support. Data are frequently integrated with many other applications and connected with external applications over the internet, so data continually expands at tremendous proportions.

Data quality has been a persistent problem for many data warehouses. Data managers and administrators have found it cumbersome to fix erroneous data or to change processes to ensure accuracy, and less important data have been overlooked. Business companies have taken great efforts to set data quality requirements for their data warehouses, and they make intensive assessment an integral part of any data project. In order to achieve data accuracy and good quality, data professionals should understand the fundamentals of data, which are quite simple.

IT professionals unanimously agree that data accuracy is the foundation of the data quality dimensions. If there is wrong data in the warehouse, a wave of negative effects flows through the whole system.

The quality of data has many dimensions. These include accuracy, timeliness,
completeness,
relevance, easily understood by end users and trusted by end users.

Of these dimensions, data accuracy is the most important, as it underpins the representation of all business activities, entities and events. Two requirements should be met for data to be accurate. First, it has to be the right value. Second, it has to represent the value precisely and in a consistent form, in accordance with the business data model and architecture.
There are several sources and causes of data inaccuracy. The most common is initial data entry: in simple terms, the user entered the wrong value or committed typographical errors. This can be reduced by having skilled and trained people do the data entry. And since mistakes can happen to anybody, inaccuracy from data entry can also be reduced by having programmatic components in the application detect typing errors. For instance, many applications have spell checks, and some web form controls, like combo boxes, offer a list of possible values so there can be no mistake in typing.

Data decay can also lead to inaccurate data. Many data values which are accurate when captured become inaccurate through time; this is data decay. For example, people's addresses, telephone numbers, numbers of dependents and marital status can change, and if not updated, the data decays into inaccuracy.
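
A simple, illustrative way to watch for data decay is to flag records whose last verification date is older than a chosen threshold; the field names and the one-year threshold below are assumptions for the sketch, not a standard:

    # Sketch: flag customer records whose details have not been re-verified
    # recently, since addresses and phone numbers decay into inaccuracy.
    from datetime import date, timedelta

    MAX_AGE = timedelta(days=365)     # illustrative freshness threshold

    def find_decayed(records, today=None):
        today = today or date.today()
        return [r for r in records if today - r["last_verified"] > MAX_AGE]

    customers = [
        {"name": "Ramesh", "last_verified": date(2024, 1, 10)},
        {"name": "Mahima", "last_verified": date(2020, 6, 1)},
    ]
    for r in find_decayed(customers, today=date(2024, 6, 1)):
        print(f"{r['name']}: details may have decayed, schedule re-verification")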

Data movement is another cause of inaccurate data. Data warehouses extract, transform and load data very frequently within a short period. As data moves from one disparate system to another, it may be altered to some degree, especially if the software running the database is not very robust.

Data accuracy is a very important aspect of data warehousing. While the problem may still persist, companies can take measures to minimize, if not eliminate, data inaccuracy. Investing in high-powered computer systems and top-of-the-line database systems can bring long-term benefits to the company.

What is Consistent Data Quality

Consistent Data Quality refers to the state of a data resource where the quality of
existing
data is thoroughly understood and the desired quality of the data resource is
known. It is a
state where disparate data quality is known, and the existing data quality is being
adjusted
to the level desired to meet the current and future business information demand.

Data is the most important component of a computer system. A common concept in computer science, Garbage In, Garbage Out (GIGO), refers to the fact that no matter how sophisticated a software application or computer system is, if the data entered is not correct or not of good quality, the output will always be garbage. In programming, poor data quality may cause a bug that is hard to trace.

Data are said to be of high quality, according to JM Juran, "if they are fit for
their intended
uses in operations, decision making and planning". In business intelligence, data
are of high
quality if they accurately represent the real life construct that they represent.

Data warehouses are the main repositories of company business data, including all current and historical data. Business intelligence relies mainly on these warehouses to identify industry trends. With the information recommended by business intelligence, a company can strategize to gain a competitive edge. For instance, if a competitor's products or services are gaining strong acceptance among customers and this effect shows up in the analysis produced by the company's business intelligence, the decision makers can try to come up with innovations to cope with the competitor.

Companies should therefore place a strong emphasis on consistent data quality so they do not get garbage information from the data warehouse. Marketing efforts typically focus on name, address and client buying habits, but data quality is important in all other aspects as well. The principle behind quality data encompasses other important aspects of enterprise management, such as supply chain data and transactional data.

The difficult part of dealing with data is that it may sometimes be very difficult, or in extreme cases impossible, to tell which data is of good quality and which is bad; both could be reported as identical through the same application interface. But there are some guidelines for improving and maintaining consistent data quality within the business organization.

. It is important to involve the users. People do the data entry, so they can serve as the first line of defense. At the other end, people are also the final consumers of the data, so they can act as the last line of defense for consistent data quality.

. Having a group of skilled and dedicated staff monitor the business processes is a good move for the company. Data may start out good but turn bad over time as it decays. For example, a list of project prospects will eventually get out of date. Decayed data, data which become irrelevant through time, are hard to detect and can cause damage and large monetary losses. Good business process monitoring ensures timely and accurate updates. It is also important to streamline processes where possible so that the number of hands touching the data is minimized and the chances of corrupting data are greatly reduced.

. The use of good software can help maintain consistent data quality. There are many credible software vendors from which a company can buy applications.

What is Data Generalization

Data Generalization is the process of creating successive layers of summary data in an evaluational database. It is a process of zooming out to get a broader view of a problem, trend or situation. It is also known as rolling up data.

There are millions and millions of data items stored in the database, and this number continues to increase every day as a company grows. In fact, a group of processes called extract, transform, load (ETL) is performed periodically in order to manage data within the data warehouse.

A data warehouse is a rich repository of data, most of which is a company's historical data. In modern data warehouses, however, data can also come from other sources. Having data from several sources greatly helps the overall business intelligence system of a company. With diverse data sources, the company can have a broader perspective, not just on the trends and patterns within the organization but on global industry trends as well.

Getting a view of trends and patterns based on the analytical outputs of the business intelligence system can be a daunting task. With those millions of data items, many of them disparate (though of course ironed out by the ETL process), it may be difficult to generate reports.

Dealing with big volumes of data for the consistent delivery of business-critical applications can by itself strain a company's network management tools. Many companies have found that their existing network management tools can hardly cope with the bulk of data required to monitor network and application usage.

Existing tools can hardly capture, store and report on traffic with the speed and granularity required for real network improvements. To keep the volume down and speed up network performance for effective delivery, some network tools discard the details: they convert detailed data into hourly, daily or weekly summaries. This is the process called data generalization or, as some database professionals call it, rolling up data. Ensuring network manageability is just one of the benefits of data generalization.
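
As a rough illustration of rolling up data, the following sketch (with invented sample measurements) generalizes minute-level traffic samples into hourly summaries:

    # Sketch: rolling up (generalizing) detailed traffic samples into
    # hourly summaries so the volume of stored detail is reduced.
    from collections import defaultdict
    from datetime import datetime

    samples = [                               # illustrative detailed data
        ("2024-04-05 09:01:13", 120),         # (timestamp, bytes transferred)
        ("2024-04-05 09:17:40", 310),
        ("2024-04-05 10:02:05", 95),
    ]

    hourly = defaultdict(int)
    for ts, nbytes in samples:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
        hourly[hour] += nbytes                # detail is generalized per hour

    for hour, total in sorted(hourly.items()):
        print(hour, total)
    # 2024-04-05 09:00 430
    # 2024-04-05 10:00 95

The same grouping idea scales up to daily or weekly summaries simply by coarsening the key used to group the detailed records.
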
Data generalization is a great help in Online Analytical Processing (OLAP) technology. OLAP is used to provide quick answers to analytical queries, which are by nature multidimensional, and it is commonly used as part of the broader category of business intelligence. Since OLAP is used mostly for business reporting, such as reporting for sales, marketing, management, business process management and related areas, having a better view of trends and patterns greatly speeds up these reports.

Data generalization is also especially beneficial in the implementation of Online Transaction Processing (OLTP). OLTP refers to a class of systems designed for managing and facilitating transaction-oriented applications, especially those involving data entry and retrieval. OLAP was created later than OLTP and was derived from it with some modifications.
Many companies that have been using the relatively older OLTP cannot abandon OLTP's requirements and re-engineer for OLAP. In order to "upgrade" OLTP to some degree, the information systems department needs to create, manage and support a dual database system. The two databases are the operational database and the evaluational database. The operational database supplies the data used to support OLTP.

The evaluational database, on the other hand, supplies the data used to support OLAP. By creating these two databases, the company can maximize the effectiveness of both OLAP and OLTP. The two databases differ in the characteristics of the data they contain and in how the data is used. For instance, in terms of currency, the operational data is current while the evaluational data is historical.

What is Data Naming Convention

A Data Naming Convention is a convention established to resolve problems with traditional data names. Many such conventions are in use today, such as the Of Language, including:

. entity-attribute-class

. role-type-class

. prime-descriptor-class

. entity-adjective-class

. entity-attribute-class word

. entity-description-class

. entity keyword-minor keyword-type keyword

. entity keyword-descriptor-domain

Having a data naming convention is important because it is a collection of rules which, when applied to data, results in a set of data elements described in a standardized and logical fashion.

In the general area of computer programming, a naming convention refers to the set of rules followed in choosing the sequence of characters to be used as identifiers in source code and documentation. Following a naming convention, in contrast to having programmers choose any random names, makes source code easy to read and understand and improves its appearance for easy tracing of bugs.
The data naming conventions discussed in this article focus on the database implementation which powers a data warehouse.

The rules for developing a naming convention are described by the international standard ISO 11179, Information Technology - Specification and Standardization of Data Elements. These rules include standards for data classification, attribution, definition and registration.

Data elements are the product of a development process involving many levels of abstraction, running from the most general to the most specific (conceptual to physical). The objects within each level are called data element components, or simply components. In terms of the Zachman Framework, the highest levels of definition are contained within the business view, and development progresses down to the implemented system level.

At each level, components are defined and combined. Each component contributes its name, or part of its name, to the final output based on the naming convention.
Three kinds of rules exist for a data naming convention. Semantic rules describe the data element components. Syntax rules prescribe the arrangement of components within a given name. Lexical rules cover the language-related aspects of names.

The semantic rules state that:

1. The terms for object classes should be based on the names of object classes found in entity and object models.
2. The terms used for properties should be based on property names found in attributes and object model properties.
3. When the need arises, qualifiers may be added to describe data elements.
4. The representation of the data element's value domain may be described using a representation term.
5. Only one representation term may be present.

The syntax rules state that:

1. Unless it is the subject of a qualifier term, the object class term occupies the leftmost position in the name.
2. A qualifier term precedes the component it qualifies.
3. The property term always follows the object class term.
4. The representation term is always placed in the rightmost position.
5. Redundant terms are deleted.
The lexical rules state that:

1. Nouns should be in singular form while verbs should be in the present tense.
2. No special characters should be used.
3. Words should be separated by spaces.
4. Words may be in mixed case.
5. Abbreviations, acronyms and initialisms may be allowed.

These are general rules, although many software developers and database management software vendors also set their own. In general, though, these three kinds of rules are used in a wide array of software implementations.
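
To make the rules above more concrete, here is a small, hedged sketch that checks a data element name against a simplified subset of the syntax and lexical rules; the list of representation terms and the checks themselves are illustrative simplifications, not the full ISO 11179 specification:

    # Sketch: check a data element name of the form
    #   <object class> <optional qualifiers> <property> <representation term>
    # against a simplified subset of the syntax and lexical rules above.
    import re

    REPRESENTATION_TERMS = {"amount", "code", "date", "name", "number", "text"}

    def check_name(name):
        problems = []
        words = name.split(" ")
        # Lexical rules: space-separated words, no special characters.
        if not all(re.fullmatch(r"[A-Za-z]+", w) for w in words):
            problems.append("words must be separated by spaces, no special characters")
        # Syntax rule: the representation term occupies the rightmost position.
        if words[-1].lower() not in REPRESENTATION_TERMS:
            problems.append("name must end with a representation term")
        # Syntax rule: redundant terms are removed.
        if len(set(w.lower() for w in words)) != len(words):
            problems.append("name contains a redundant term")
        return problems or ["ok"]

    print(check_name("Customer Billing Address Text"))   # ['ok']
    print(check_name("Customer Code Code"))              # ['name contains a redundant term']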

What is Data Optimization

Data Optimization is a process that prepares the logical schema from the data view schema; it is the counterpart of data de-optimization. Data optimization is an important aspect of database management in particular and of data warehouse management in general. Data optimization is most commonly known as a non-specific technique used by several applications for fetching data from data sources so that the data can be used in data view tools and applications, such as those used in statistical reporting.

A logical schema is a method of defining a data model for a specific domain in terms of a particular data management technology, without depending on the physical level or being specific to a particular database vendor. In simpler terms, the logical schema consists of the semantics describing a particular data manipulation technology, and these descriptions may be in terms of tables and columns, XML tags or object-oriented classes.
Data views are tools for creating effective reports based on accurate queries. To produce a data view, the database management system needs to retrieve the desired data and display the expected output. Since a database, especially one dealing with the high volumes found in data warehouses, must retrieve large bulks of data, getting a data view can be a slow and complex process. Employing data optimization can reduce the complexity of the process while economizing on the needed resources by reducing physical processing needs.

In some database applications, the database management system itself is loaded with features that make querying data views easy by directly executing the query and immediately generating the views. Some database applications have their own flexible language for mediating between peer schemas, extending from known integration formalisms to more complex architectures.
Data optimization can be achieved through data mapping, an essential aspect of data integration. This process includes data transformation or data mediation between a data source and its destination; in this case, the source is the logical schema and the destination is the data view schema. Data mapping as a means of data optimization can translate data between various data types and presentation formats into a unified format used by different reporting tools.
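
A minimal sketch of such a mapping, with field names invented for the example, might translate records from two differently structured sources into one unified reporting format:

    # Sketch: map records from two differently structured sources into one
    # unified format that downstream data-view and reporting tools expect.
    CRM_MAPPING = {"cust_name": "customer", "sale_amt": "amount"}
    WEBSHOP_MAPPING = {"buyer": "customer", "total_price": "amount"}

    def map_record(record, mapping):
        # Rename each source field to the unified field defined by the mapping.
        return {unified: record[source] for source, unified in mapping.items()}

    crm_row = {"cust_name": "Ramesh", "sale_amt": 55.5}
    web_row = {"buyer": "Mahima", "total_price": 12.0}

    unified = [map_record(crm_row, CRM_MAPPING),
               map_record(web_row, WEBSHOP_MAPPING)]
    print(unified)
    # [{'customer': 'Ramesh', 'amount': 55.5}, {'customer': 'Mahima', 'amount': 12.0}]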

Some software applications offer a graphical user interface (GUI) based tool for designing and generating XML-based queries and data views. Since data can come from a variety of sources or from a heterogeneous data source, running queries with such a tool can be an effective means of generating a data view. Using a graphical data view can free a data consumer from having to focus on the intricate nature of query languages, as the tool provides a pictorial, drag-and-drop mapping approach.

Being free from the intricacies of query languages means that one can focus more on information design and the conceptual synthesis of information which may come from many disparate sources. Since high-level tools need to shield end users from the back-end intricacies, they need to manage the data on the back end efficiently.

A graphical tool has its benefits, but its downside is that the graphics can add load to computer memory. So graphical tools need substantial data optimization in order to balance the load imposed by the graphical components.

There are several modules available that are designed for data optimization. These modules can easily be "plugged" into existing software, and the integration can be seamless. Having these pluggable data optimization modules lets database-related applications focus more on developing graphical reporting tools for non-technical data consumers.

What is Data Normalization

Data Normalization is a process that develops the conceptual schema from the external schema. In essence, data normalization is the process of organizing data inside the database in order to remove data redundancy. The presence of much redundant data can have very undesirable results, including significant slowing of the entire processing system as well as negative effects on data integrity and data quality.

The process of normalization includes creating tables and establishing relationships between them. The relationships should be based on rules designed to protect the data and to keep the database flexible by eliminating redundancy and inconsistent dependency.
Data normalization follows a few rules, and each rule is called a normal form.

The first normal form, denoted 1NF, is geared towards eliminating repeating groups in individual tables. This is done by creating a separate table for each set of related data and attributes and giving each set of related data a primary key. In this normal form, multiple fields should not be used in a single table to store similar data.

For instance, when tracking an inventory item that may possibly come from two different sources, an inventory record might contain separate fields for Vendor Code 1 and Vendor Code 2. This is not good practice, because when another vendor appears it is not good to simply add a Vendor Code 3. Instead, a separate table should be created for all vendors and linked to the inventory table using an item number key.

The second normal form, denoted 2NF, is geared towards eliminating redundant data. If an attribute depends on only part of a multi-valued key, it has to be moved to a separate table. Second normal form is achieved by creating separate tables for sets of values which apply to many records and then relating these tables with a foreign key.

As an example, take a customer's address in an accounting system. Many tables use this address, such as the orders, shipping, invoice, collections and accounts receivable tables. Instead of storing the customer's address as separate data in each of those tables, the normalized form stores it in one place and has the other tables link to it.

The third normal form, denoted 3NF, is geared towards eliminating columns that are not dependent on the key. If an attribute does not contribute to a description of the key, it has to be moved to a separate table. A value in a record that is not part of the record's key does not belong in the table. Generally, whenever the contents of a group of fields may apply to more than one record in a table, it is better to place those fields in a separate table.

Take the case of an employee recruitment table in which a candidate's university name and address are recorded, but a complete list of universities is also needed for group mailings. It is better to create a separate table for universities and link it to the candidates table with a university code key, because if the university information were stored only in the candidates table, there would be no way to list universities that have no current candidates.
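
To make the idea concrete, here is a small sketch, using Python's sqlite3 module, of the vendor example above in normalized form; the table and column names are invented for illustration:

    # Sketch of the vendor example in normalized form: instead of
    # "vendor_code_1" and "vendor_code_2" columns, vendors live in their own
    # table and a link table relates any number of vendors to an item.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item   (item_id INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE vendor (vendor_id INTEGER PRIMARY KEY, vendor_name TEXT);
        CREATE TABLE item_vendor (                 -- resolves the many-to-many link
            item_id   INTEGER REFERENCES item(item_id),
            vendor_id INTEGER REFERENCES vendor(vendor_id),
            PRIMARY KEY (item_id, vendor_id)
        );
    """)
    conn.execute("INSERT INTO item VALUES (1, 'Basmati rice 20kg')")
    conn.executemany("INSERT INTO vendor VALUES (?, ?)",
                     [(10, 'Agro Traders'), (11, 'Valley Mills'), (12, 'New Vendor')])
    # Adding a third vendor needs no schema change, only another link row.
    conn.executemany("INSERT INTO item_vendor VALUES (?, ?)",
                     [(1, 10), (1, 11), (1, 12)])

    rows = conn.execute("""
        SELECT v.vendor_name FROM vendor v
        JOIN item_vendor iv ON iv.vendor_id = v.vendor_id
        WHERE iv.item_id = 1
        ORDER BY v.vendor_id
    """).fetchall()
    print([r[0] for r in rows])   # ['Agro Traders', 'Valley Mills', 'New Vendor']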

Other data normalization forms include Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF, isolating independent multiple relationships), Fifth Normal Form (5NF, isolating semantically related multiple relationships), Optimal Normal Form (ONF) and Domain-Key Normal Form (DKNF).
What is Data Quality

Data Quality indicates how well data in the data resource meet the business
information
demand. Data Quality includes data integrity, data accuracy, and data completeness.

Today's business organization cannot function at its optimum without relying on information. Data sources supplying information, such as data warehouses and data marts, are fast becoming ubiquitous in business environments around the globe. These data warehouses and data marts work together with business intelligence systems so that companies get a picture of industry trends and their relation to the performance of business operations.

Inaccurate and inconsistent data is a great hindrance to a company's understanding of its current and future business situation. No matter how advanced the business intelligence system is, if there is no way to guarantee high data quality, the resulting information can lead to disastrous company decisions as well as a variety of other negative effects such as lost profits, operational delays and customer dissatisfaction. As they say, garbage in, garbage out: no matter how advanced the business intelligence algorithm is, if the input data is not accurate, the final output will not be accurate either.

An effective strategy for producing quality data should be integrated into data management. In fact, the primary goal of the data manager should be to ensure that the data source infrastructure can efficiently transform data from its raw state into consistent, timely, accurate and reliable information that the business organization can utilize. The foundation of data management can be generally categorized into five aspects: data profiling, data quality, data integration, data augmentation and data monitoring.

Data Profiling is about inspecting data for errors, determining inconsistencies, checking for data redundancy and completing incomplete information. At this point, the database manager can already form an overview of the data based on its profiles.

Data Quality is the process wherein data is corrected, standardized and verified. This process needs very meticulous inspection because any mistake at this point can send waves of errors further along the way.

Data Integration is the process of matching, merging and linking data from a wide variety of sources, which usually come from disparate platforms.

Data Augmentation is the process of enhancing data with information from internal and external data sources.
Finally, Data Monitoring is making sure that data integrity is checked and controlled over time.
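
As a small, hedged illustration of the data profiling step (the sample rows are invented), the sketch below counts missing values per column and detects duplicate records:

    # Sketch of a tiny data profiling pass: count missing values per column
    # and detect duplicate records before the data moves further downstream.
    from collections import Counter

    rows = [                                   # illustrative extract
        {"customer": "Ramesh", "city": "Pune", "amount": 55.5},
        {"customer": "Mahima", "city": None,   "amount": 12.0},
        {"customer": "Ramesh", "city": "Pune", "amount": 55.5},   # duplicate
    ]

    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1

    duplicates = len(rows) - len({tuple(sorted(r.items())) for r in rows})

    print("missing values per column:", dict(missing))   # {'city': 1}
    print("duplicate rows:", duplicates)                 # 1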

In real-life implementations, data quality is a concern for professionals involved with a wide range of information systems. These professionals know the technical ins and outs of a variety of business solution systems, from data warehousing and business intelligence to customer relationship management and supply chain management.

According to a 2002 study, in the United States alone the total cost of dealing with problems related to achieving and maintaining high data quality was estimated at about US$600 billion every year. This figure shows that concern over data quality is so serious that many companies have begun to set up data governance teams solely dedicated to maintaining data quality.

Aside from the formation of dedicated data quality teams in many companies, several software developers and vendors have also come up with tools. Many software vendors today market tools for analyzing and repairing poor quality data, there are service providers specializing in contract data cleaning, and data consultancy firms offer advice on avoiding poor quality data.

Data Quality Activity

A Data Quality Activity is an activity in the data architecture component that ensures the maintenance of high-quality data in an integrated data resource. The Data Quality Process is a process that documents and improves data quality using both deductive and inductive techniques. It is a systematic process of examining the data resource to determine its level of data quality and ensuring that the data quality is adjusted to the level necessary to support the business information demand.
Implementing a data warehouse may be done in a variety of ways, depending on the technical people behind the implementation: data architects, data modelers, database developers, programmers and others.

Whatever the implementation approach, data quality tasks should always be integrated into the project design and planning. This aspect of data warehouse implementation is one of the most crucial and important.

When data is poor to begin with, a chain of errors can follow down the line, and this can cause tremendous damage not just to the data warehouse but to the whole business enterprise in general. Careful planning and design at the beginning of each project can guarantee that the right data quality activities are integrated into the entire project.
There are various types of data quality activities. Some of the most common are the following:

Data Justification - This activity involves determining issues related to the impact of data quality during the presentation of the business problem and the corresponding solution.

Data Planning - In this activity, data quality is considered when setting the project scope, schedule and deliverables, and data quality deliverables are included in the project charter. It is in this activity that data quality control throughout the project is planned, covering data creation, data transmission from one database or application to another, and data setup, configuration and reference.

Data Designing - This activity uses data quality profiling tools in creating or validating data information models. At this point, the technical people make sure that those analyzing the data have access to the data specifications and business rules which define the quality.

Development and Testing - This is an iterative process involving data assessment, whose results feed the data cleansing and transformation rules.

Data Deployment - After an intensive data quality assessment, and once the data extracts are confirmed to be accurate and consistent, the data is finally loaded into the warehouse and other data stores.

Post Production - This activity involves constant monitoring and the implementation of metrics in order to regularly check data quality and take quick action whenever the need arises.
The Data Quality Process is more systematic and orderly and complements the activities above in order to achieve high-quality data. The data quality process helps the data warehouse give the company quality data through a comprehensive, structured approach that takes data from the source through some key steps of the data cleansing process until the data finally arrives at its target destination and functions in the way it was intended to.

In general, a data quality process involves the following generic stages: data access, data
interpretation, data standardization, validation, matching and identification, consolidation,
data enhancement, and data delivery or deployment.

Of course there are other stages in the data quality process, but the inclusion or revision of
some stages depends on the software application developer. Whatever kind of software
application is used to power a data warehouse, the basic thing to remember is that good
information is highly dependent on the quality of the data input.
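
To make the generic stages above a little more concrete, the following is a minimal Python sketch of a data quality pass over incoming records. The field names, standardization rules and validation checks are illustrative assumptions only, not features of any particular tool.

    # Minimal data quality sketch: standardize, validate, and consolidate records.
    # Field names and rules are illustrative assumptions.
    raw_records = [
        {"name": " alice   brown ", "state": "ca", "zip": "90210"},
        {"name": "ALICE BROWN",     "state": "CA", "zip": "90210"},
        {"name": "Carl Diaz",       "state": "NY", "zip": ""},
    ]

    def standardize(rec):
        # Trim whitespace and normalize casing so equivalent values compare equal.
        return {
            "name": " ".join(rec["name"].split()).title(),
            "state": rec["state"].strip().upper(),
            "zip": rec["zip"].strip(),
        }

    def is_valid(rec):
        # A record is valid only if every required field is populated.
        return all(rec[field] for field in ("name", "state", "zip"))

    standardized = [standardize(r) for r in raw_records]
    valid = [r for r in standardized if is_valid(r)]

    # Consolidate: keep one record per (name, state, zip) combination.
    consolidated = {(r["name"], r["state"], r["zip"]): r for r in valid}
    print(len(raw_records), "raw records ->", len(consolidated), "clean records")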

What is Data Refining

Data Refining is a process that refines disparate data within a common context to increase
the awareness and understanding of the data, remove data variability and redundancy, and
develop an integrated data resource. Disparate data are the raw material and an integrated
data resource is the final product.

The data refining process may be composed of many different subsets depending on the
database or data warehousing implementation. The process of data refining is one of the
most important aspects of data warehousing because unrefined data can severely damage
the final statistical output of the data warehouse, which will then be used by a company's
business intelligence.

In a data warehouse, there is a collective process called Extract, Transform, and Load (ETL).
Data extraction is the process of gathering data from various data sources. The data is then
transformed in order to fit business needs. Finally, once the data has been made to abide by
the business rules and the data architecture framework, it is loaded into the data
warehouse.
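
As a rough illustration of the extract, transform and load steps just described, here is a minimal Python sketch. The source rows, the transformation rule and the in-memory target are assumptions made purely for illustration; a real implementation would read from actual source systems and write to the warehouse.

    # Minimal ETL sketch; source data, rules and target are illustrative assumptions.
    source_rows = [
        {"item": "widget", "qty": "20", "sold_on": "2024-04-05"},
        {"item": "GADGET", "qty": "15", "sold_on": "2024-04-06"},
    ]

    def extract():
        # Extract: gather the raw rows from the source system.
        return list(source_rows)

    def transform(rows):
        # Transform: enforce the business rules (consistent casing, numeric quantities).
        return [{"item": r["item"].lower(),
                 "qty": int(r["qty"]),
                 "sold_on": r["sold_on"]} for r in rows]

    warehouse = []  # stand-in for the warehouse table

    def load(rows):
        # Load: append the conformed rows to the warehouse.
        warehouse.extend(rows)

    load(transform(extract()))
    print(warehouse)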

Data refining does not apply to one particular aspect of the data warehouse implementation.
In fact it applies to many stages, from planning to data modeling to the final integration of
systems in the data warehouse to the functioning of the entire business intelligence system.

Beginning with data modeling, data refining occurs at conceptual schema development,
where the semantics of the organization are described. All abstract entity classes and
relationships are identified, and care is taken that the entities are based on real-life events
and activities of the company. In this case, data refining goes into action by eliminating
elements that are not of interest. The same holds true during logical schema development,
where the tables and columns, XML tags and object-oriented classes are described and data
refining makes sure that the structures to hold data are well defined.

An Entity-Relationship Model (ERM) is a data modeling technique in which a representation
of structured data is defined, and data refining is very important here. This is a stage in
information system design where models are used to describe information needs or the type
of information that is to be stored in a database during requirements analysis. It makes sure
that data are not redundant and relationship integrity is maintained so that any insert,
delete or update processes can be easily managed without sacrificing the final data quality
through broken integrity. In this aspect, data is refined by making sure that all relationships
between entities and their corresponding attributes are secure and accurate.
Data refining also takes place during database normalization, a technique used in designing
relational database tables so that duplication of information is minimized. As a result, the
database is safeguarded from certain types of logical inconsistency.

In data mining, there is a process called data massaging. This process is used to extract
value from the numbers, statistics and information found within a database and to predict
what a customer will do next. Data mining works in several stages, starting with the
collection of data, then data refining, and finally taking action. Data collection may be the
gathering of information from website databases and logs. Data refining involves comparing
user profiles with recorded behavior and dividing the users into groups so that behaviors can
be predicted. The final stage is the appropriate action taken by the data mining process or
the data source, such as answering a question on the fly or sending targeted online
advertisements to a browser or any other software application being used.

What is Data Refreshing

Data Refreshing is the process of updating active data replicates on a regular, known
schedule. The frequency and timing of data refreshing must be established to match
business needs and must be known to clients.

Today, companies operate in an information-centric and fast-paced world. As such, data and
information are plentiful and readily available. An increasing number of people demand the
most current data, especially where the data is sensitive or changes very quickly. This is
very much applicable in situations where companies or individuals use data in order to
control, or to some degree affect, the behavior of some object such as a piece of equipment
located at a remote site.

Given such circumstances, it is a common necessity to refresh the data being used by a
data-using entity. Such data-using entities include a data display, a database, a dynamically
generated web page, or the like, at an interval that is appropriate for the data. Setting the
interval for data refreshing should be done carefully: a very short interval will typically
result in inefficient allocation of network bandwidth and processor resources, while setting
the interval too long might result in stale data.

Some applications have data which are based on a central data warehouse. For example,
many web applications today, such as weather news and financial and stock market
services, depend heavily on periodic refreshes of their web documents so that the end user
can be offered the latest complete information.

Too often, the interval between data refreshes is very small, and this places a heavy burden
on the server as well as on network resources, especially if the data includes multimedia
files. At other times, updated web documents can encounter inordinate delays, making it
difficult to retrieve them in time. In some implementations of data refreshing for web
applications and those involving browsers and websites, small scripts are embedded within
the internet browser. These scripts allow a user to find out whether the refresh of a
multimedia web document will be received in time.

In another implementation method usually employed in enterprise data warehouses, data
refreshing is done by adaptively refreshing a data-using system. This data refreshing system
involves a data source and a data-using device that utilizes data from the data source. An
initial refresh interval is set on the data-using device, and a stable communication link is
established between the data source and the data-using device.

The system is installed with a criteria monitoring tool for monitoring at least one criterion
related to the refresh interval. A special processor is provided to generate an updated data
refresh interval based, at least in part, on the monitored criteria. Data refreshing with the
system is then based on that refresh interval.
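
The following is a minimal Python sketch of the adaptive idea described above: the refresh interval is widened when the data rarely changes and tightened when it changes often. The limits and adjustment factors are assumptions chosen only to illustrate the mechanism.

    # Adaptive refresh interval sketch; limits and factors are illustrative assumptions.
    MIN_INTERVAL_S = 60      # never refresh more often than once a minute
    MAX_INTERVAL_S = 3600    # never wait longer than an hour

    def next_interval(current_s, changes_since_last_refresh):
        # Monitored criterion: how many changes arrived since the last refresh.
        if changes_since_last_refresh == 0:
            candidate = current_s * 2      # data looks stable: back off, save bandwidth
        else:
            candidate = current_s / 2      # data is changing: tighten to avoid stale data
        return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, candidate))

    interval = 300                          # start at five minutes
    for changes in [0, 0, 7, 3, 0]:         # simulated monitoring results
        interval = next_interval(interval, changes)
        print("next refresh in", int(interval), "seconds")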

It is common practice in data warehousing to distribute the data across several data source
servers in order to optimize computing. A data site could maintain many computers working
as a system to manage one or more data warehouses.

Since the load is distributed but the whole data warehouse system needs to be synchronized
in order to produce information which reflects the real picture of the company, the servers
need to constantly communicate with each other over the network and give regular updates.
Part of this communication deals with data refreshing. Data refreshing should be done in
order to synchronize data and make sure that the final output is always correct, consistent
and timely.

What is Data Scrubbing


Data Scrubbing is a technique for correcting errors, originally by using a background task to
periodically inspect memory for errors. In a data warehouse context, the technique helps to
decode, merge, filter and even translate the source data so that the data for the data
warehouse remains valid.

The error correction used often relies on ECC memory or another copy of the data.
Employing data scrubbing greatly reduces the possibility that single correctable errors will
accumulate.

To illustrate the need for data scrubbing, consider this example. If a person is asked, "Are
Joseph Smith of 32 Mark St, Buenavista, CA and Josef Smith of 32 Clarke St., Buenvenida,
Canada the same person?", the person would probably answer that the two are most
probably the same. But to a computer without the aid of specialized software, the two are
entirely different people.

Our human eyes and minds would spot that the two sets of records are really the same and
that a mistake or inconsistency occurred during data entry. But since, in the end, it is the
computer that handles all the data, there should be a way to make things clear to the
computer. Data scrubbing weeds out, fixes or discards incorrect, inconsistent, or incomplete
data.
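
To show how software can approximate the human judgment above, here is a minimal Python sketch using the standard library's difflib module to score the similarity of the two records. The 0.8 threshold is an assumption for illustration; real scrubbing tools use far more sophisticated parsing, standardization and matching algorithms.

    # Minimal record-matching sketch using difflib; the 0.8 threshold is an assumption.
    from difflib import SequenceMatcher

    def similarity(a, b):
        # Ratio of matching characters between the two lowercased strings (0.0 to 1.0).
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    record_a = "Joseph Smith, 32 Mark St, Buenavista, CA"
    record_b = "Josef Smith, 32 Clarke St., Buenvenida, Canada"

    score = similarity(record_a, record_b)
    if score >= 0.8:
        print("Likely the same person (score %.2f): flag for merge or review" % score)
    else:
        print("Probably different people (score %.2f)" % score)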

Computers cannot reason. They operate along the principle of "garbage in, garbage out", so
no matter how sophisticated a software application is, if the data input is not of high quality,
the output data will not be of high quality either.

With the widespread popularity of data warehouse, enterprise resource planning (ERP) and
customer relationship management (CRM) implementations nowadays, the issue of data
hygiene has become increasingly important. Without data scrubbing, the staff in a company
may face the sad prospect of merging corrupt or incomplete data from multiple databases.
A single small piece of dirty data may seem trivial, but multiplied by thousands or millions of
pieces of erroneous, duplicated or inconsistent data it can turn into a huge problem. And a
tiny bit of dirty data is highly likely to multiply.

There are many sources of dirty data. Some of the most common sources include:

. Poor data entry, which includes misspellings, typos and transpositions, and variations in
spelling or naming;

. Lack of companywide or industry-wide data coding standards;

. Data missing from database fields;

. Multiple databases scattered throughout different departments or organizations, with the
data in each structured according to the idiosyncratic rules of that particular database;

. Older systems that contain poorly documented or obsolete data.

Today, there are hundreds of specialized software applications developed for data
scrubbing. These tools have complex and sophisticated algorithms capable of parsing,
standardizing, correcting, matching and consolidating data. Most of them offer a wide
variety of functions, ranging from simple data cleansing to consolidation of high volumes of
data in the database.

Many of these specialized data cleansing and data scrubbing tools also have the capacity to
reference comprehensive data sets used to correct and enhance data. For instance,
customer data for a CRM software solution could be referenced and matched against
additional customer information like household income and other related information.

Although data hygiene is very important for getting useful results from any application, it
should not be confused with data quality. Data quality is about good (valid) versus bad
(invalid) data; in other words, validity is the measure of the data's relevance to the analysis
at hand. Data scrubbing is also often confused with data cleansing, although the two do
have similarities to a certain degree.

Disparate Data

Disparate Data are heterogeneous data. They are neither similar to one another nor easily
integrated with an organization's database management system, and they differ in one or
more aspects of an information system.

Disparate data may be characterized by these basic problems:

1. In an organization implementing a database system, there is no single, complete,
integrated inventory of all its data.
2. The real substance, meaning and content of all the data within the organizational data
resource is not readily known or well defined.
3. There is very high data redundancy throughout the organization.
4. There is very high variability in data formats and contents.

A data warehouse is a prime example of a place where disparate data come together. The
goal of a data warehouse is to facilitate bringing together data from a wide variety of
existing databases, such as data marts and other data warehouses, so that the data
warehouse can support management and reporting needs.

Now, the reality is that databases and other data sources are not implemented in the same
way. Some databases may be managed by, say, Microsoft SQL Server, while others may be
managed by Oracle or MySQL.

While the underlying technology of these commercial relational database management
systems may be basically the same, they can differ in their final data outputs because each
has its own specific and often proprietary formatting. So when each of these relational
database management systems sends its data to the data warehouse, they may be sending
disparate data that converges at the warehouse staging area.


Different data sources may also be implemented on different platforms. Relational
databases are not the only sources of data feeding a data warehouse; there may be other
data such as the output of programs running on computer servers. Different data sources
may be powered by different operating systems. Some may be running on Unix or one of
the many distributions of Linux; some may be on MacOS, others on Windows and many
other platforms.

Still another cause of disparate data is the different requirements and different data
available through the stages of the lifecycle: there could be less at the start and more at
the end. Different users within the company may have different needs for data, such as
suppliers versus customers, operator versus planner, commercial versus government.

In a data warehouse, there is a process known as ETL, which stands for extract, transform
and load. The transform part is the part that takes care of managing disparate data.

Data transformation is very important and needs to be executed with precision. During the
earlier part, when the data needs to be extracted from the various data sources on different
platforms, data identification for the transformation process begins. During this stage, the
system identifies the data needed at the target location, such as an operational data store
or a data warehouse, and the source data needed to produce the target data.

Once everything is in place and the data has been identified, data extraction takes place by
taking the desired data from the data sources and placing it in a data depot for refining. The
data depot refers to a working place or staging area where disparate data can be refined
before being loaded into the database.

The process of data extraction technically includes any conversion between database
management systems. Data refining is the actual work of transforming disparate data before
they are finally integrated into the data warehouse under a common data architecture.
When disparate data are transformed into the data defined by the architecture, real
integration begins.
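
As a rough sketch of the transformation work just described, the Python snippet below maps records from two hypothetical sources with different field names and formats into one common schema in the staging area. The source layouts and the target schema are assumptions for illustration only.

    # Sketch: conform records from two disparate sources to one target schema.
    # Source layouts and target field names are illustrative assumptions.
    source_a = [{"cust_name": "alice brown", "sale_dt": "05/04/2024", "units": 20}]
    source_b = [{"customer": "CARL DIAZ", "date": "2024-04-06", "quantity": "15"}]

    def from_source_a(row):
        d, m, y = row["sale_dt"].split("/")           # dd/mm/yyyy -> ISO date
        return {"customer": row["cust_name"].title(),
                "sold_on": "%s-%s-%s" % (y, m, d),
                "units": int(row["units"])}

    def from_source_b(row):
        return {"customer": row["customer"].title(),  # date is already in ISO format
                "sold_on": row["date"],
                "units": int(row["quantity"])}

    staging = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
    print(staging)   # uniform records, ready to be loaded into the warehouse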

Disparate Databases

Disparate Databases are heterogeneous databases; such database systems are neither
electronically nor operationally compatible.

Technically speaking, a database is a repository of data that can provide a centralized and
homogeneous view of data to be used by multiple applications. The data in a database are
not just randomly placed there; they follow a certain structure according to the definition of
the database. This structure is called a schema, which is specified in a data definition
language and can be manipulated using operations specified in a data manipulation
language.

Both the definition and manipulation algorithms employed in a database are based on a
data model, which defines the semantics of the constructs and operations supported by
these languages.

Comprehensive studies of molecular biology data often involve exploring multiple molecular
biology databases, which entails coping with the distribution of data among these
databases, the heterogeneity of the systems underlying these databases, and the semantic
(schema representation) heterogeneity of these databases.

Early attempts to manage heterogeneous databases were based on resolving heterogeneity
by consolidating these databases either physically, through integration into a single
homogeneous database, or virtually, by imposing a common data definition language, data
model, or even DBMS upon the heterogeneous databases. These attempts failed because
they required a degree of cooperation that was very difficult to attain and a costly
replacement of applications that were already based on the existing databases.

A data warehouse is a prime example of where disparate databases may work together to
form an information system. By its very nature, a data warehouse is a repository of a
company's present and historical data, and these data will be used by business intelligence
systems. But a central database alone may not be able to handle the enormous
requirements posed by high volumes of data, so data may come from different data
sources. These data sources may be managed by different kinds of database
implementations, including relational database management systems, and some data can
even come from flat files in legacy systems.

From these different kinds of data sources come differences in structure, data semantics,
supported constraints and query language. This can be because different databases are
implemented on different data models, which provide different primitives, such as
object-oriented (OO) models that support specialization and inheritance and relational
models that do not. For instance, a database may use the set type in a CODASYL schema,
which supports insertion and retention constraints that are not captured by referential
integrity alone.

Disparate databases result in disagreement about the meaning, interpretation or intended
use of data. As such, conflicts may arise related to naming conventions; data
representation, where databases use different values to represent the same concept;
precision conflicts, where databases use data values from domains with different
cardinalities for the same data; metadata conflicts, wherein the same concept may be
represented at schema level and at instance level; missing attributes; and many more.

Problems arising from the data produced by disparate databases can be remedied today
with many kinds of software tools that manage data transformation. A popular process
known as ETL (extract, transform, load) has become a standard in data warehouse
applications to manage disparate data from various sources and transform them into a
uniform format that the data warehouse can understand and work with.

Aside from enterprise data warehouses, the internet is the biggest example of disparate
databases being handled to give relevant information. With dynamic websites, when one
looks at a webpage in a browser, the viewer may have no idea that the data he is looking at
comes from hundreds of disparate databases.

Disparate Metadata Cycle

The Disparate Metadata Cycle is a cyclic process in which disparate metadata gets produced
rapidly. With data warehouse implementations becoming more ubiquitous these days,
because businesses can no longer function without being supplied with relevant data, the
use of metadata has become relevant too. As a short backgrounder, metadata is data
describing other data, with the goal of facilitating the use, management and understanding
of data in a large data warehouse.

For instance, in a library system, the metadata used will surely include descriptions of book
contents, authors, date of publication and the physical location of books in the library. If the
context of use is photography, the metadata used will describe the camera, model, type,
photographer, date the photograph was taken, location where the photograph was taken
and many other things.

But managing metadata is not as simple as creating it. Management becomes all the more
complex when dealing with large data warehouses which get data from various data sources
powered by disparate databases.

The problem of disparate metadata has been a big challenge in many areas of information
technology. Technologies such as data warehousing, enterprise resource planning (ERP),
supply chain management and other applications dealing with transactional systems all
contend with disparate data as well as duplication and redundancy. In the case of disparate
metadata, the most common problems are missing metadata relationships, costly
implementation and maintenance, and poor choice of technology platforms.

A metadata management project is more of a systems-integration problem than an
application-development effort. Although it may be possible for an organization to simply
build its own metadata management tools, many opt to buy a collection of vendor-provided
products that work together to move the data from source applications into the warehouse
environment.

Each of these metadata management tools can support several sets of data warehouse
functions which use or create subsets of the data warehouse metadata taken from various
disparate databases.


In general practice, many tools require access to the same metadata, but in many instances
each product is self-contained and provides its own metadata management facilities.

Consequently, metadata definitions may be redundant across the data warehouse tool suite,
and this can result in disparate metadata. But as part of the cycle, the metadata must be
integrated for it to be useful to the user of the data warehouse metadata.

Integrating metadata is in fact similar to general data warehouse data integration. In the
same manner that the data warehouse system integrates data from disparate databases,
the warehouse metadata management environment must integrate disparate metadata.

The system is simply configured to collect the disparate metadata from the source tools;
then, after integration, the metadata is disbursed back to any tools that use it. It is wise,
though, to determine which tool is the most appropriate source for each metadata object,
as various data warehouse tools may be capturing the same metadata at any given time.

Managing a disparate metadata cycle can pose an incredible challenge. This is mainly
because the only common attribute for identifying different versions of the same metadata
object is its name.

If different tools have been used to record the same metadata attribute, certain rules must
be established regarding which tool maintains the master version in order to preserve
consistency. So despite all these tools, handling disparate metadata remains a repetitive
process until such integration takes place. And as with disparate data, the cycle will
continue as long as the data warehouse is in use.
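
As a small illustration of the master-version rule just described, the Python sketch below reconciles the same metadata object reported by several tools by trusting a predefined tool priority. The tool names, the priority order and the metadata fields are assumptions made only for this example.

    # Sketch: pick the master version of each metadata object by tool priority.
    # Tool names, priority order and fields are illustrative assumptions.
    TOOL_PRIORITY = ["modeling_tool", "etl_tool", "reporting_tool"]  # most trusted first

    captured = [
        {"object": "customer.name", "tool": "etl_tool",       "type": "varchar(50)"},
        {"object": "customer.name", "tool": "modeling_tool",  "type": "varchar(60)"},
        {"object": "customer.name", "tool": "reporting_tool", "type": "string"},
    ]

    def master_versions(records):
        best = {}
        for rec in records:
            name = rec["object"]
            # Keep the record coming from the highest-priority tool seen so far.
            if name not in best or \
               TOOL_PRIORITY.index(rec["tool"]) < TOOL_PRIORITY.index(best[name]["tool"]):
                best[name] = rec
        return best

    print(master_versions(captured))   # the modeling tool's definition wins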

Disparate Operational Data

Disparate Operational Data is generally used to support day-to-day business transactions. It
reflects the current status of an organization's operational data.

Dealing with disparate operational data is a very common everyday process in a data
warehouse implementation. Since a data warehouse is built to be the repository of a
business organization's current and historical data, and since one of its main goals is to
supply the needed input for a business intelligence system, a data warehouse needs to
handle very large volumes of data at frequent intervals.

A single central database alone cannot handle an intensive data warehousing load, so many
other physical servers are needed to share the load as well as to store other data. For
instance, in a large business organization, some data may come from the financial
department, others from the human resources department, the manufacturing department,
the sales departments (for example, from point-of-sale systems in department stores) and
many other data sources.

Still, despite the shared load, it would be too heavy for the data warehouse system to
handle all the data simultaneously. So it needs a "temporary" area where current data
values can be worked on. This area is what is commonly referred to as the operational data
store (ODS).

An operational data store is actually just another set of relational databases which contain
data extracted on a regular basis, say nightly, from different sections of the business
organization. For example, in the case of a school operational system, the data extracted
may come from students, personnel, financial aid, admissions, and the Billing and Accounts
Receivable System.

The operational data store is designed for integrating data from multiple sources so that
operations, analysis and reporting can be efficiently facilitated. Since the data, as
mentioned, comes from a variety of sources, the integration involves cleaning, redundancy
resolution and business rule enforcement. This data store is usually designed to handle
low-level, atomic and indivisible data such as transactions and prices, in contrast to
aggregated or summarized data such as net contributions. The aggregated data are usually
stored in the data warehouse.

It is very common for the data sources to be disparate in nature. The reality of a data
warehousing system is that different data sources are powered by different database
systems. For instance, some databases may be run on Oracle, others on MySQL or Microsoft
SQL Server and many other commercial relational database products.

Even if the underlying frameworks of these relational databases are basically similar, each
of them has still implemented its own exclusive or proprietary formatting. So the different
outputs of these databases, once they get to the operational data store, could still be
disparate operational data.

Another cause of disparate operational data does not lie in the relational database
management system itself but in the overall design of the enterprise information system.
For instance, a company implementing a database system may not have defined a single,
complete, integrated inventory of all its data. Or perhaps the real substance, meaning and
content of all the data within the organizational data resource is not readily known or well
defined. And still there may exist very high variability of data formats and contents in the
company's information system.

The handling of disparate operational data is managed by many commercial software tools.
A common data warehouse process called ETL, which stands for extract, transform, load, is
used to ensure that all types of disparate data and metadata are transformed before they
are loaded into the data warehouse.

Denormalized Data

In any database implementation, it is always advised to normalize the database for optimal
performance by reducing redundancy and maintaining data integrity. The process of
normalization involves putting each fact in its appropriate place, which optimizes updates at
the expense of data retrievals. As opposed to having one fact in different places, a
normalized database keeps one fact in one place, which makes updates simpler and more
consistent.

A normalized database has one fact in one place, and all related facts about a single entity
are placed together so that each column of an entity refers non-transitively only to the
unique identifier of that entity. Normalized data should generally have undergone at least
the first three normal forms.

However, there are cases where performance can be significantly enhanced by a
denormalized physical database implementation. The process of denormalization involves
putting one fact in many places in order to speed up data retrieval at the expense of the
data modification process.

Although denormalization is not always recommended, in real-world implementations it is
sometimes very necessary. But before jumping into the process of denormalization, certain
questions need to be answered.

Can the database perform well without having to depend on denormalized data?
Will the database still perform poorly even when denormalized data is used?
Will the system become less reliable due to the presence of denormalized data?

The answers will obviously guide your decision on whether to denormalize or not.

If the answer to any of these questions is "yes," then you should avoid denormalization,
because any benefit accrued will not exceed the cost. If, after considering these issues, you
decide to denormalize, be sure to adhere to the general guidelines that follow.

If you have enough time and resources, you may start actual testing by comparing a set of
normalized tables with another set of denormalized ones. Load the denormalized set of
tables by querying the data in the normalized set and then inserting or loading it into the
denormalized set. Keep the denormalized set of tables read-only and note the performance
achieved. The population process should be strictly controlled and scheduled in order to
keep the denormalized and normalized tables synchronized.

Based on the exercise above, it is obvious that the main reason for denormalizing is when it
is more important to have speedy retrieval than fast data updates. So if the database you
are implementing caters to an organization with more frequent data updates, say a very
large data warehouse where data constantly moves from one data source to another, so
that there are clearly more updates than retrievals by data consumers, then normalization
should be implemented without having to deal with denormalized data. But when retrieval
by data consumers is more frequent, say in a database that maintains less dynamic
accounts but sees more frequent access because of a service exposed on the internet, then
having denormalized data populated in the database should be the way to go.

Other reasons for denormalization include situations where there are many repeating
groups which need to be processed as a group instead of individually; where many
calculations are involved on one or several columns before a query can be answered; where
tables need to be accessed in many ways during the same period; and where some columns
are queried a large percentage of the time. Denormalized data, although many have the
impression that it causes a slow system, is actually useful in certain conditions.
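
As a rough illustration of the retrieval trade-off discussed above, the Python sketch below contrasts a normalized layout, where a read must resolve a key against a second table, with a denormalized layout, where the customer name is duplicated in every order row. The table contents and field names are assumptions for illustration only.

    # Normalized vs denormalized retrieval sketch; tables are illustrative assumptions.
    customers = {1: {"name": "Alice Brown"}, 2: {"name": "Carl Diaz"}}  # one fact, one place
    orders_normalized = [{"order_id": 10, "customer_id": 1, "amount": 120.0}]

    # Normalized read: resolve the customer name through a join-like lookup.
    for o in orders_normalized:
        print(o["order_id"], customers[o["customer_id"]]["name"], o["amount"])

    # Denormalized read: the name is stored redundantly, so no lookup is needed,
    # but any change to a customer's name must be applied to many order rows.
    orders_denormalized = [{"order_id": 10, "customer_name": "Alice Brown", "amount": 120.0}]
    for o in orders_denormalized:
        print(o["order_id"], o["customer_name"], o["amount"])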

Existing Data Quality Criteria

In the most general sense, data quality is an indication of how good the data is in terms of
integrity and accuracy as it is stored in the data resource to meet the demand for business
information. Other indicators of good-quality data pertain to completeness, timeliness and
format (though the latter may matter less with today's advanced data cleansing and
transformation tools).

In a large data implementation, there may be various components working together as a
system in order to smoothly and efficiently handle the processes of extracting, transforming
and loading data from various data sources into the data warehouse, and to share the
transformed data with the business intelligence system so that data consumers,
high-ranking company officials and decision makers can draw decision support from the
information.

But no matter how sophisticated the data warehousing system or data resources are,
high-quality data can never be achieved when the data input is not accurate in the first
place. As it is aptly put in the information technology world: "Garbage in, garbage out". To
avoid this, there must be a means of ensuring that the input data is clean and of high
quality, and so there should be an existing set of data quality criteria.

Different organizations have different needs. In fact, these days it is common to find a
single company operating in several industries. Moreover, business organizations operate
not just in one location but in locations spread throughout the planet. So criteria for data
quality can really vary depending on the business needs.

For instance, a business organization operating solely in the pharmaceutical industry needs
data pertaining to chemicals, raw materials for medicines and the labeling of pharmaceutical
products with long, scientific-sounding names. These data needs are in addition to general
data needs such as the names of staff, customers and business partners.

Data documentation would define all of the data associated with the pharmaceutical
organization's business rules, entities and corresponding attributes. It should also very
specifically define the conventions for the name labeling of the products.

In comparison, a company operating in the mining industry which is implementing a data
warehouse or any other kind of data resource would have a different set of existing data
quality criteria. Perhaps some of its criteria would cover the names of different compositions
of mineral deposits, the geographic locations where specific minerals and other deposits can
be found and how rich those deposits are, what the legislation is from one country to
another in terms of mining contracts, and many other data needs relevant to the mining
industry in general and the mining company in particular.

Having a set of existing data quality criteria is similar to keeping a data dictionary. But while
a data dictionary is very broad, as it tries to define all data and all of its general aspects, the
existing data quality criteria are very specific. They may define how the data will be
structured and how it will be dealt with in terms of physical storage and network sharing.

But like the data dictionary, the existing data quality criteria can partially overcome the
problems associated with data disparity arising from the sharing of different data formats
from different data source platforms. This essentially complements the process of
extracting, transforming and loading, and its final effect is clean, uniform and high-quality
data output in the data resource.

Reverse Data Denormalization

Database reverse engineering is an important process for improving the understanding of
data semantics. Many aspects of database evolution, especially those pertaining to old and
legacy databases where the semantics of the data have been lost through the years, need a
database reverse engineering process to be understood in detail.

Today, many processes are undertaken to re-engineer legacy systems or federations of
distributed databases. Much work has been done in which a conceptual schema, often based
on an extension of the Entity-Relationship (ER) model, is derived from a hierarchical
database, a network database, or a relational database.

Reverse data denormalization is just one of the broad aspects of database reverse
engineering, which has two major steps. The first step involves eliciting the data semantics
from the existing system. In this step, various sources of information can be relevant for
tackling the task, e.g., the physical schema, the database extension, the application
programs, and especially expert users.

The second step involves expressing the extracted semantics with a high-level data model.
This task consists of a schema translation activity and gives rise to several difficulties, since
the concepts of the original model do not overlap those of the target model.

Most methods used in database reverse engineering within the context of relational
databases mainly focus on schema translation, since they assume that constraints such as
functional dependencies or foreign keys are available at the beginning of the process. But to
be more realistic, those strong assumptions may not apply in all cases, as there are also old
versions of database management systems which do not support such declarations.

Many recent reverse data denormalization works have independently proposed relaxing the
aforementioned assumptions. For a Third Normal Form schema, the key idea is to fetch the
needed information from the data manipulation statements embedded in application
programs, without constraining the relational schema to a consistent naming of key
attributes.

The Third Normal Form requirement has remained one of the major limits of current
methods in database reverse engineering. As has been shown, during a database design
process the relational schemas are sometimes produced directly in First Normal Form or
Second Normal Form, or denormalized at the end of the design process.

In some cases, the denormalization occurs during the implementation of the physical
database or during the maintenance phase when the attributes are added to the database.

Data denormalization is a process wherein an attempt is made to optimize database
performance by adding redundant data, despite the fact that it is generally recognized that
all relational database design should be based on a normalized logical data model, since
normalization proposes a way to develop an optimum design from a logical perspective.

There are cases when it is really necessary to add redundant data because current database
management systems that implement the relational model do so imperfectly.

A common approach to data denormalization is to denormalize the logical data design, and
when done with care this process can achieve significant improvement in query response.
Reverse data denormalization, on the other hand, is mostly used to understand the
structure of a database, especially when there is a need to fix or troubleshoot a very
complex problem.

Normalized Data

Normalized Data is the data in the data view schema and the external schema which
have
gone through data normalization.

Maintaining a data warehouse means dealing with millions of data items, as the data
warehouse itself is the main repository of a company's historical data, or its corporate
memory. Thus data should be well managed, and one of the ways to manage a data
warehouse effectively is by reducing data redundancy.

One of the techniques commonly employed for reducing data redundancy is database
normalization. This technique is used mainly for relational database tables in order to
minimize the duplication of data and, in doing so, safeguard the database against some
types of structural or logical problems such as data anomalies.

Let us take the case where a certain piece of information has multiple instances in a table.
The instances may not be kept consistent when an update is done on the table, thus leading
to a loss of data integrity. When a table is normalized, it becomes less prone to such data
integrity problems.

When a database is normalized to a high degree, more tables are created to avoid data
redundancy in any one table, but a larger number of joins is also needed, and this can result
in reduced query performance.

As a general rule, database applications that involve a lot of isolated transactions, such as
those in an automated teller machine, need to be more highly normalized, while database
applications that do not need to map complex relationships employ less normalized tables.

The degrees of normalization of database tables are described in database theory in terms
of normal forms. The normal forms include first normal form, second normal form, third
normal form, Boyce-Codd normal form, fourth normal form, fifth normal form, domain/key
normal form and sixth normal form.

As mentioned earlier, normalized data are the data used for the data view schema. A data
view schema is a logical or virtual table composed of the results of a data query on a
database. Data views are not like ordinary tables in a relational database in that a data view
is not part of the physical schema; it is instead a dynamic, virtual table whose contents
come from collated or computed data.

A data view can be a subset of the data contained in a table and can join and simplify
various tables into one virtual table. The contents may be aggregated from different tables,
resulting from computation operations such as averages, products and sums.

As also mentioned earlier, normalized data are used for the external schema. An external
schema is designed to support user views of data and to provide programmers with easy
access to data from a database.

The data that users see are in terms of an external view, which is defined by an external
schema of the database. The external schema basically consists of descriptions of each of
the various types of external record in that external view, as well as a definition of the
mapping between the external schema and the underlying conceptual schema.

Because of the very nature of normalized data, wherein redundancies are greatly reduced or
eliminated, the database as well as the entire information system greatly benefits: the
systems become much easier to manage and maintain. If there were too much data
redundancy scattered throughout the entire system, additional overhead cost through the
purchase of extra hardware and software would be needed to make sure that data
consistency and integrity are attained.

Optimized Data

Optimized data are essential to the efficient running, maintenance and management of a
data warehouse in particular and a company's information system in general. Optimized
Data are data in the logical schema and conceptual schema that have been through data
optimization.
been through data optimization.

Although optimized data may come from different IT considerations, they are
primarily the
result of a general data optimization process which prepares the logical schema
from the
data view schema.

Data optimization, in the context of a data warehouse, means optimizing the database being
used. Most data optimization in this respect is a non-specific technique used by several
applications when fetching data from data sources so that the data can be used in data view
tools and applications such as those used in statistical reporting.

Since optimized data are data in the logical schema and conceptual schema, let us first look
at what a logical schema is. A logical schema is a method, independent of physical
implementation, of defining a data model of a specific domain in terms of a particular data
management technology without being specific to a particular database management
vendor. It contains the semantics that describe a certain technology for data manipulation,
and the descriptions may be in terms of tables, columns, XML tags and object-oriented
classes.

On the other hand, a conceptual schema is a mapping of concepts and their relationships. It
is also non-physical, describing an organization's semantics and representing assertions
about its nature. This schema describes the things that are of significance to the
organization in terms of entity classes, attributes and relationships.

Optimized data adhere to the semantics described in these two schemas. They follow the
rules and specifications and do not violate either of them. These data can be said to have
been mapped to the semantics of both the conceptual and logical schemas.

In a business enterprise, optimized data help address excessive use of exchange feeds and
high-end solutions as well as vendor penetration in an organization. They also help solve
problems related to the high costs of data licensing and technology, fragmented
sub-optimized processes, and regulatory and audit requirements.

Data centers need optimized data in order to be efficient, secure, flexible and agile
service-driven environments. For a business to have immediate paybacks and long-term
transformation, its enterprise management system must treat IT operations as a strategic
business driver working on optimized data to ensure a smooth flow of operations.

Enterprise data optimization can be achieved through careful planning and identification of
the key issues and initiatives needed to optimize the data center. The IT professionals
behind the data center need to establish a longer-term roadmap and outlook of projects and
integrate tactical solutions into the overall data management strategy.

There are many vendors that manage data centers to work on optimized data. These
vendors make sure that data across the entire organization are optimized while addressing
issues of scalability for the future. They can also manage various data sources in a reliable,
integral and consistent manner while delivering high data volumes with low latency to
multiple applications within the enterprise.

While having a robust infrastructure that ensures data are optimized may entail a high cost,
the benefits may be tremendous and long term. Developing and executing a good data
management strategy must be done with a focus on eliminating the risks associated with
change. By optimizing data, a company can also optimize supply and demand management
as well as improve data distribution. An information system using optimized data benefits
not just the database system but all running programs as well.

Data Profiling

What is Data Pivot

Data Pivot is a process of rotating the view of data. In databases with a high volume of
data, it is often very difficult to get a view of a particular piece of data or a report. A pivot
table helps overcome this problem by displaying the data contained in the database by
means of automated calculations defined in a separate column, side by side with the data
column of the requested data view.

There are several advantages to using pivot tables. One advantage is that a pivot
table
summarizes the data contained in a long list into a compact format. A pivot table
can also
help one find relationships within the data that are otherwise hard to see because
of the
amount of detail. Yet another advantage is that a pivot table organizes the data
into a
format that is very easy to convert into charts.

A pivot table also includes many functions, such as automatically sorting, counting, and
totaling the data stored in a spreadsheet and creating a second table displaying the
summarized data. A user can change the summary structure by graphically dragging and
dropping fields; this pivoting or rotating is what gave the concept its name.

Typically, a pivot table contains rows, columns and data or fact fields. Most applications
offering data pivot features can invoke the feature easily with a pivot table wizard in a few
steps. Commonly, the first step involves specifying where the data is located and whether a
chart as well as a table should be displayed.

The second step is simply about identifying the list range. It is common to have an
insertion
point so that the wizard can just define the list range in an automated way. And
then the
final step is just specifying the graphical layout of the final data pivot view.

There are generally two kinds of variables used in a pivot table layout: discrete and
continuous variables. Commonly, a discrete variable has a relatively small number of unique
values, and these unique values are distinct names.

For example, discrete variables include values such as department names, model names, or
customer names in a company database. Discrete values are more suitably used as row and
column variables in a data pivot table, but they can of course also be used as data fields.
However, if discrete values are used as the data field, the data pivot table can only display a
summary by count.

On the other hand, a continuous variable can take on a large range of values. Some
examples of continuous variables include units sold, profit margin and daily
precipitation. In
general, it may not be a good idea to use a continuous variable as a row or column
variable
in a data pivot table because the result would be an impossibly large table.

For instance, to analyze income for, say, 500 firms, using firm income as a column variable
in the data pivot table could make the software application run out of display space. A
continuous variable is commonly used as the data field in a data pivot table in cases where
one wants to see the sum, average or other summary calculation of its values for different
levels of discrete variables.
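
To illustrate the discrete/continuous distinction above, here is a small example using the pandas library's pivot_table function, assuming pandas is available. The column names and figures are assumptions for illustration; "region" and "product" act as discrete row and column variables and "units_sold" as the continuous data field.

    # Pivot table sketch with pandas; column names and figures are illustrative assumptions.
    import pandas as pd

    sales = pd.DataFrame({
        "region":     ["North", "North", "South", "South", "South"],
        "product":    ["Laptop", "Monitor", "Laptop", "Laptop", "Monitor"],
        "units_sold": [20, 15, 30, 25, 10],
    })

    # Discrete variables become the row and column headers,
    # while the continuous variable is summarized as the data field.
    pivot = pd.pivot_table(sales,
                           values="units_sold",
                           index="region",      # rows
                           columns="product",   # columns
                           aggfunc="sum")
    print(pivot)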


There is a method available to significantly reduce the amount of data to be displayed. This
method is called "grouping". For example, dates used as row or column headers can be
grouped (for example, by month or year). The data pivot table is an indispensable tool in
data warehouses, as high volumes of disparate data need to be formatted in order to
produce industry and business operations reports that reflect trends and patterns.

Lightly Summarized Data

In Lightly Summarized Data, the evaluational data is summarized by removing one, or a
few, data characteristics from the primary key of the data focus.

Any company implementing a data warehouse is investing a large amount of money in the
hope of getting relevant information that will help the company come up with sound
decisions and give it a competitive edge in today's data-driven business environment.

But a data warehouse, as can be expected of a system that handles very large volumes of
data, is often implemented with many other databases hosted on various computer
systems, which are called data sources, data stores or data marts.

All these systems give disparate data to the warehouse to be processed according to the
business data architecture and business rules. As such, the data warehouse handles very
intensive loads, so it needs a mechanism whereby it can serve its very purpose, which is to
give relevant information from among the millions and millions of data items inside it.

Aside from data archives and all the systems of record and integration and transformation
programs, the data warehouse also contains current detail and summarized data.

The heart of the data warehouse is its current detail. This is where the biggest bulk of the
data resides, and the current detail is supplied directly from the operational systems; it may
be kept either as raw data or as aggregated raw data.

The current detail is often categorized into different subject areas which correspond to
representations of the entire enterprise rather than a given application. The current detail
has the lowest level of data granularity among the data in the warehouse.

The period represented by the current detail depends on the company's data architecture,
but it is common to set the current detail to cover about two to five years. The refreshing of
current detail occurs as frequently as necessary to support the requirements of the
enterprise.

One of the most distinct aspects of current detail in particular and the data warehouse in
general is lightly summarized data. Enterprise elements such as regions, functions and
departments do not all have the same information requirements, so an effectively designed
and implemented data warehouse can supply customized lightly summarized data for every
enterprise element. Enterprise elements can access both detailed and summarized data.

A data warehouse is designed and implemented such that data is stored and generated at
many levels of granularity. To illustrate the different levels, let us imagine a cellular phone
company that wants to implement a data warehouse in order to analyze user behavior.

At the finest granularity level are the records kept for each customer about every call
description record during a 30-day period. At the next level, which is the lightly summarized
data history, statistical information by month for that customer is stored, such as calls by
hour of day, day of week, area codes of numbers called, average duration of calls and other
related information.

Finally, at the highly summarized level, which is the next level of granularity, the records
contained may include the number of calls made from a zip code by all customers, roaming
call activity and customer churn rate, and these can be used for other statistical analyses.
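
The three granularity levels in the cellular phone example can be sketched in a few lines of Python. The call records and field names below are assumptions for illustration; the point is only how raw detail rolls up into lightly and then highly summarized data.

    # Granularity sketch: raw call detail -> lightly summarized -> highly summarized.
    # Records and field names are illustrative assumptions.
    from collections import defaultdict

    calls = [   # finest granularity: one row per call description record
        {"customer": "A", "zip": "90210", "minutes": 5},
        {"customer": "A", "zip": "90210", "minutes": 12},
        {"customer": "B", "zip": "10001", "minutes": 3},
    ]

    # Lightly summarized: per-customer statistics for the period.
    per_customer = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for c in calls:
        per_customer[c["customer"]]["calls"] += 1
        per_customer[c["customer"]]["minutes"] += c["minutes"]

    # Highly summarized: total calls per zip code across all customers.
    per_zip = defaultdict(int)
    for c in calls:
        per_zip[c["zip"]] += 1

    print(dict(per_customer))
    print(dict(per_zip))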

Data warehouses are typically implemented with different databases handling the different
levels of data granularity, such as raw data, lightly summarized data and highly summarized
data, in a large information system with federated databases. Lightly summarized data have
relatively fine granularity. For maximum efficiency, a stable network infrastructure should be
implemented.

Data Migration

What is Data Propagation

Data Propagation is the distribution of data from one or more source data warehouses to
one or more local access databases, according to propagation rules. Data warehouses need
to manage large bulks of data every day. A data warehouse may start with little data and
grow day by day through constant sharing and receiving of data from various data sources.

As data sharing continues, data warehouse management becomes a big issue. Database
administrators need to manage the corporate data more efficiently and in different subsets,
groupings and time frames. As a company grows further, it may implement more and more
data sources, especially if the company expands outside its current geographical location.

Data warehouses, data marts and operational data stores are becoming indispensable tools
in today's businesses. These data resources need to be constantly updated, and the process
of updating involves moving large volumes of data from one system to another and back
and forth to a business intelligence system. It is common for high-volume data movement
to be performed in batches within a brief period without sacrificing the performance or
availability of operational applications or of data from the warehouse.

The higher the volume of data to be moved, the more challenging and complex the
process
becomes. As such, it becomes the responsibility of the data warehouse administrator
to find
means of moving bulk data more quickly and identifying and moving only the data
which
has changed since the last data warehouse update.

From these challenges, several new data propagation methods have been developed in
business enterprises resulting in data warehouses and operational data stores
evolving into
mission-critical, real-time decision support systems. Below are some of the most
common
technological methods developed to address the problems related to data sharing
through
data propagation.

Bulk Extract - In this method of data propagation, copy management tools or unload
utilities are used to extract all or a subset of the operational relational database. Typically,
the extracted data is then transported to the target database using file transfer protocol
(FTP) or other similar methods. The extracted data may be transformed to the format used
by the target on the host or target server. Database management system load products are
then used to refresh the target database. This approach is most efficient for small source
files or files with a high percentage of changes, because it does not distinguish changed
from unchanged records. Conversely, it is least efficient for large files where only a few
records have changed.

File Compare - This method is a variation of the bulk move approach. The process
compares the newly extracted operational data to the previous version. After that, a set of
incremental change records is created. The processing of incremental changes is similar to
the techniques used in bulk extract, except that the incremental changes are applied as
updates to the target server within the scheduled process. This approach is recommended
for smaller files where there are only a few record changes.
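
As a rough illustration of the file compare approach, here is a minimal Python sketch, assuming hypothetical extracts keyed by an 'id' column, that diffs a new extract against the previous one and emits incremental insert, update and delete records.

def compare_extracts(previous, current, key="id"):
    """Compare two extracts (lists of dicts) and return incremental change records."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    changes = []
    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            changes.append({"op": "INSERT", "row": row})
        elif row != prev_by_key[k]:
            changes.append({"op": "UPDATE", "row": row})
    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            changes.append({"op": "DELETE", "row": row})
    return changes

# Example usage with hypothetical customer extracts.
old_extract = [{"id": 1, "name": "Ramesh"}, {"id": 2, "name": "Mahima"}]
new_extract = [{"id": 1, "name": "Ramesh K."}, {"id": 3, "name": "Arun"}]
for change in compare_extracts(old_extract, new_extract):
    print(change)
# Only the UPDATE for id=1, the DELETE for id=2 and the INSERT for id=3 are applied to the target.
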
Change Data Propagation - This method captures and records the changes to the file as
part of the application change process. There are many techniques that can be used to
implement Change Data Propagation, such as triggers, log exits, log post-processing or
DBMS extensions. A file of incremental changes is created to contain the captured changes.
After completion of the source transaction, the change records may already be transformed
and moved to the target database. This type of data propagation is sometimes called near
real-time or continuous propagation and is used to keep the target database synchronized
with the source system within a very brief period.

Data Mart

What is Data Mart

Data Mart is a subset of the data resource, usually oriented to a specific purpose
or major
data subject, that may be distributed to support business needs. The concept of a
data mart
can apply to any data whether they are operational data, evaluational data, spatial
data, or
metadata.

A data mart is a repository of a business organization's data implemented to answer very
specific questions for a specific group of data consumers such as organizational divisions of
marketing, sales, operations, collections and others. A data mart is typically established as
a dimensional model or star schema composed of a fact table and multiple dimension tables.

In comparison, a data warehouse is also a repository of organizational data, implemented as
a single repository serving enterprise-wide data across many if not all subject areas. The
data warehouse is the authoritative repository, at the atomic level, of all fact and
dimensional data.
Despite some arguments on the similarity or difference between a data mart and a data
warehouse, many still consider a data mart a specialized version of a data warehouse.

The data mart, like the data warehouse, can also provide a picture of a business
organization's data and help the organizational staff in formulating strategies based on the
aggregated data, statistical analysis of industry trends and patterns, and past business
experiences.

The most notable difference of a data mart from a data warehouse is that the data
mart is
created based on a very specific and predefined purpose and need for a grouping of
certain
data. A data mart is configured such that it makes access to relevant information
in a
specific area very easy and fast.
Within a single business organization, there can be more than one data mart. Each of these
data marts is relevant or connected in some way to one or more business units that its
design was intended for. The relationship among many data marts within a single company
may or may not involve interdependency.

They may be related to other data marts if they were designed using conformed facts and
dimensions. If one department has a data mart implementation, that department is
considered to be the owner of the data mart and it owns all aspects of the data mart,
including the software, hardware and the data itself. This can help manage data in a huge
company through modularization: a department should only manipulate and develop its own
data as it sees fit without having to alter data in other departments' data marts. When
other departments need data from the data mart owned by a certain department, proper
permission should be asked for first.

In other data mart implementations where strictly conformed dimensions are enforced, some
shared dimensions such as customers and products exist, and exclusive business ownership
no longer applies.

Data marts can be designed with a star schema, snowflake schema or starflake schema. The
star schema is the simplest of all the styles related to data marts and data warehousing.
It consists of one or a few fact tables referencing any number of dimension tables.

The snowflake schema is a variation of the star schema in which the dimension tables are
normalized into multiple related tables. The starflake schema is a hybrid mixture of both
the star and snowflake schemas.
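
The following is a minimal sketch of a star schema for a hypothetical sales data mart, written in Python against an in-memory SQLite database; the table and column names are illustrative assumptions, not a prescribed design. In a snowflake variant, the product category would be split out into its own table referenced from dim_product.

import sqlite3

# One fact table whose foreign keys point at denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    amount       REAL
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Rice 20kg', 'Grocery')")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ramesh', 'Pune')")
conn.execute("INSERT INTO dim_date VALUES (20020405, '2002-04-05', 'April', 2002)")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20020405, 1, 450.0)")

# A typical data mart query: sales amount by product category and year.
for row in conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key    = d.date_key
    GROUP BY p.category, d.year"""):
    print(row)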

Data marts are especially useful for making access to specific, frequently accessed data
very easy. They can give a collective picture of a certain aspect of the business for a
specific group of users. Since data marts are smaller than a full data warehouse, response
time is shorter and the cost of implementation is lower.

Mini Marts

A data warehouse is a business organization's corporate memory, or the main repository of
historical data, and it typically contains all the raw materials for the management's
decision support system.

The data from the data warehouse is used by the business intelligence system as one of the
main bases for company decisions, and one of the critical factors that leads to the use of
a data warehouse is that the company's data analysts can perform queries and varying
degrees of analysis, such as data mining, on the data and other related information without
negatively affecting or slowing down the operational system.
Given the tremendous requirements for a data warehouse, some IT professionals
recommend breaking down the huge data warehouse into data mini marts.

A data mini-mart is actually a small (mini) and specialized version of a data warehouse and,
as such, a mini-mart can contain a snapshot of operational data which is very useful for
business people so that they can strategize based on analysis of past trends, patterns and
experiences.
The main difference between a full-blown data warehouse and a mini mart is that the
creation of the mini mart is predicated on a predefined and specific need for a certain
configuration and grouping of select data, and such a configuration emphasizes easy access
to relevant data and information.
In a business organization implementing a data warehouse, there may be many data mini
marts and each one is relevant to one or more business units that the mini mart has been
designed to serve.

In a business organization's data warehouse, the mini marts may or may not be related or
dependent on each other. In cases where the mini marts are designed with the use of
conformed facts and dimensions, then they will be related to each other.

In some data warehousing implementations, each business unit or departmental division is
assigned as the owner of its mini mart, and this ownership includes taking care of the
hardware, software and the data itself.
Implementing mini marts in a data warehouse has tremendous benefits. One of the
biggest
benefits is the modularization of the entire big system. In the ownership scenario,
the data
and the physical as well as software infrastructure will be better controlled
because it would
be easy to pinpoint responsible people or business unit.

With mini marts, each department can use, manipulate and develop as well as
maintain
their data in any way that they see fit and without having to alter the information
inside the
other department's mini marts or the data warehouse.
Other benefits associated with implementing mini marts include very easy and fast access
to frequently needed data. If there were no mini mart, a data consumer would have to go
through the vast repository inside a central data warehouse, and with the high volumes of
data involved, querying may take a very long time and may even slow down the entire
system.

Having mini marts can also create a collective view for a group of users, and this view can
be a good way to know how the company is performing at the business unit level as well as
in its entirety. Because a mini mart is obviously smaller than an entire data warehouse,
response time is greatly improved, while creation and manipulation of new data is very easy
and fast as well.

Data Management

What is Data Redistribution

Data Redistribution is the process of moving data replicates from one data site to
another to
meet business needs. It is a process that constantly balances data needs, data
volumes,
data usage, and the physical operating environment.

It is not uncommon to have a data warehouse serving a company but the data
warehouse
also constantly interacts with other data sources. In many cases when a company is
so
large that it not only has several departments spread out in several floors of an
office
building but also has several branches spread out across different locations as
well, it is a
good idea to break up the data within the warehouse.

This data can consist of data replicates which are moved from a data site representing a
branch or department to another site. The advantage of this setup, which is the very
essence of data redistribution, is that specific data can sit near the department that uses
it, so data travel is greatly reduced along with the need for additional networking
resources.
Also, since the processing is spread across many servers, the load is balanced and the
system can be sure that no central server is taking such a heavy toll that it breaks down
and halts the whole business operation which relies on data and information.

Data Redistribution is often associated with load balancing. In computer networking, load
balancing is a technique, usually performed by load balancer computers, used to spread
processing jobs among many computers, processes, hard disks or other IT resources so that
the whole system gets optimal resource utilization while decreasing computing time.
With data redistribution coupled with a load balancing technique, distributed processes and
communication-related activities across the network can be managed so that no single
device within the system is overwhelmed by a task.
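
As a rough illustration of spreading data replicates across local sites, here is a minimal Python sketch, assuming hypothetical site names and record keys, that uses a hash of each record key to assign records evenly so no single site carries the full load.

import hashlib

# Hypothetical local access sites that receive redistributed data replicates.
sites = ["site_finance", "site_hr", "site_sales"]

def assign_site(record_key: str) -> str:
    """Deterministically spread records across sites (a simple hash-based balance)."""
    digest = int(hashlib.md5(record_key.encode()).hexdigest(), 16)
    return sites[digest % len(sites)]

records = [f"customer_{i}" for i in range(9)]
placement = {}
for key in records:
    placement.setdefault(assign_site(key), []).append(key)

for site, keys in placement.items():
    print(site, len(keys), "records")   # the load is spread roughly evenly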

Data redistribution has become a very common process in enterprise data warehousing.
There are several very broad types of data redistribution methods used in the business
sector today:

Redistribution within a financial institution - This type of data redistribution is more
focused on the management and control of data usage within an entity (in this case, any
business organization) which is authorized to receive and distribute business-related data
and where the company has full intention of complying with data policies.

Redistribution through web sites - This type of data redistribution is focused on the
control and distribution of business-related data among different entities which may either
be looking to avoid paying redistribution license fees (data pirating) or where the business
data is a secondary component of an application.

Redistribution through electronic trading engines - This type of data redistribution is
focused on the delivery of business-related data to electronic trading software vendors.
This applies in cases where exchanges have tended to waive fees to encourage the
penetration of such systems.

Redistribution via derived data - This type of data redistribution is focused on defining
the fine line separating core business data from data whose value is derived through
mathematical formulas.
Data warehouses which implement data redistributions should make sure that the
existing
network infrastructure can handle the very large volume of data that travels across
the
network on a regular basis. A data redistribution software manager should also be
installed
to monitor the constant sharing of data replicates and make sure that data
integrity is
maintained all the time.

This software regularly communicates with each data site, following the activities of the
data and making sure that everything works smoothly as different servers process data
replicates before they are aggregated into a reporting function for final use by the
business organization.

What is Data Resource


Data Resource is a component of information technology infrastructure that represents all
the data available to an organization, whether they are automated or non-automated.
Different business organizations may have different needs. Today, it is not uncommon to
find a single company operating in different industries. It is also not uncommon to find
companies branching out to different geographical locations within the country it is
operating in, or even outside the country of its origin.

Setting up an information technology infrastructure can be very complex. As a rule, the
more complex the company's setup is, the more complicated its IT needs become and the
more difficult it is to set up the final IT infrastructure.

The IT infrastructure has many aspects. There is the network part, where IT architects need
to consider networking peripherals including routers, switches, cabling, wireless
connections and other products. Then there is the server aspect, where certain powerful
computers are assigned separate tasks such as database server, web server, FTP server and
others.

The data resource is a separate component of this infrastructure. Like the network
and
server components, and many other components not mentioned, it is important to
carefully
plan the data resource of the IT infrastructure.

The data resource encompasses every representation of each and every piece of data
available to an organization. This means that even non-automated data, such as the bulk of
paper files on individual staff desks, confidential paper data hidden in steel cabinets,
sales receipts, invoices and all other paper transaction documents, constitutes the Data
Resource. It cannot be denied that despite the digitization of business processes, paper
still plays a large part in business operations.

A digital Data Resource provides a faster and more efficient means of managing data for the
company. In today's world, Data Resource implementation does not end with the digitization
aspect.

Today, companies are finding out that the most efficient way to support the Data Resource
is constantly changing and evolving with technology. In fact, the changes have been very
significant over the last 20 years.

In the field of the digital Data Resource, the infrastructure has moved in the not so
distant past from large centralized facilities operating mainframes to a distributed
collection of client-server systems, and back to recentralized arrays of commodity
hardware.

Today, the physical nature of Data Resource may have changed dramatically but the
same
need for scalable and reliable infrastructure has not.
It is common today to have Data Resource scattered and connected via network. This
makes so much sense today as the need for information becomes more and more
pervasive
and data consumers are now doing mobile computing: the web, mobile applications,
national or global branch offices, worldwide partner sites, or subsidiaries.

It is no longer practical to create an infrastructure based on a particular location or
piece of hardware, but rather on the one other pervasive entity in every organization: the
network. The network has so many inherent qualities that make it ideal as the basis for
managing a hosted data resource infrastructure.

Data Resources may be scattered everywhere in the company. There may be data
resource
from finance, sales, marketing, HR, manufacturing and other company departments.
Some
data resource may come from several other departments from other geographical
branches.

In short, Data Resources may come from everywhere; they may converge on one data
warehouse or criss-cross from one department to another. In either case, everything is
moving towards a networked infrastructure. The best example of a networked Data Resource
is the internet, where millions of web servers around the world serve as the Data
Resources.

Data Synchronization

Data Synchronization is also known as data version concurrency or data version
synchronization. It ensures the concurrency of replicated data and makes sure that every
replicated data value matches the official data version.

There are many data synchronization technologies available to synchronize a single set of
data between two or more electronic devices such as computers, cellular phones and other
personal digital assistants. These technologies can automatically copy changes back and
forth. For instance, a contact list on a user's mobile phone can be synchronized with a
similar contact list on another mobile phone or on a computer.

Data synchronization can be broadly categorized as local synchronization or remote
synchronization. In local data synchronization, the devices trying to synchronize data may
be located near each other or side by side.
The transfer of data to be synchronized may employ connection technologies such as
infrared, network cable or Bluetooth. In remote synchronization, the devices trying to
synchronize data are located very far from each other and data transfer employs network
technology with the help of networking protocols such as file transfer protocol (FTP).

In the past, data management used to be a scenario where data is either consistent
or
highly available but could never be both at the same time. This was what was
referred to as
the Heisenbergian dilemma. But with today's fast advancement in information
technology
especially in the field of real time processing, data synchronization is much more
efficient
than ever before.

During the time when Usenet was very popular on the internet, it made more sense to
replicate contents across a federation of news servers. The authors of RFC 977, which
specifies the Standard for the Stream-Based Transmission of News, wrote:

"There are few (if any) shared or networked file systems that can offer the generality of
service that stream connections using Internet TCP provide."

But today, there are already thousands of shared file systems that offer generality of
service. Generality of service is what web servers provide as they serve today's dynamic
webpages, including forums, blogs and wikis. These developments have led to better data
synchronization techniques cast from the model of internet infrastructures and implemented
in organizational data warehouses.

Data synchronization helps greatly in a large warehouse which also maintains various data
marts and other data sources in order to distribute the processing load for efficiency.
Since the data warehouse is expected to feed updated, clean, relevant, meaningful and
timely information to business intelligence systems, it has to make sure that data is
synchronized throughout the whole system of data sources.

Since data synchronization requires frequent communication between the data warehouse
and the other data sources, problems related to network traffic management could spring
up. This has to be managed carefully by a set of standard network protocols as well as
proprietary protocols employed by network application solution developers.

Data synchronization is not just about overcoming network problems. It works closely with
the whole IT and business data architecture. One aspect of data synchronization is
isolating multiple data views from the underlying model so that updates to the data model
not only alter the data views but also propagate to the synchronized instances of that data
model.
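
As a loose illustration of propagating model updates to synchronized views, here is a minimal Python sketch of an observer-style pattern; the class and attribute names are hypothetical and not tied to any particular synchronization product.

class DataModel:
    """A tiny model that notifies registered views whenever a value changes."""
    def __init__(self):
        self._data = {}
        self._views = []

    def register(self, view):
        self._views.append(view)

    def update(self, key, value):
        self._data[key] = value
        for view in self._views:          # propagate the change to every synchronized view
            view.refresh(key, value)

class ReportView:
    def __init__(self, name):
        self.name = name
        self.snapshot = {}

    def refresh(self, key, value):
        self.snapshot[key] = value
        print(f"{self.name} now sees {key} = {value}")

model = DataModel()
model.register(ReportView("sales_dashboard"))
model.register(ReportView("finance_report"))
model.update("monthly_revenue", 125000)   # both views receive the same official version
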
There are actually hundreds of techniques for data synchronization, and different software
solution vendors have different implementations of these techniques. There can never be an
all-in-one solution, as needs differ from one data warehouse to another.

Data Transformation

Data Transformation is also known as data scrubbing or data cleansing. It is the formal
process of transforming data in the data resource within a common data architecture. In
this process the production data is decoded and records are merged from multiple DBMS
formats in order to create meaningful information.

It includes transforming operational, historical, disparate and evaluational data within a
common data architecture to an integrated data resource. It also includes transforming
data within the integrated data resource, and transforming disparate data.

The IT environment in business organizations today is very diverse. Many organizations
store their enterprise data in multiple relational database management systems (RDBMS);
among the most popular RDBMS are Microsoft SQL Server, Oracle, Sybase, and DB2.

Aside from the fact that business organizations store data in relational databases, many
others store some of their data in non-relational formats such as mainframes,
spreadsheets, and email systems. Still others have individual smaller databases, such as
Microsoft Access, for each staff member.

When all these scenarios are taken together in a single company (which is still highly
possible today), the organization needs to find an efficient way to operate as a single
entity, where all the disparate data and systems can relate to each other and interchange
data among the various data stores.
Data transformation is part of the collective process known as ETL (extract, transform,
load), which is one of the most important processes in a data warehouse implementation: it
is the way that data actually gets loaded into the warehouse from different data sources.
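
To make the transform step concrete, here is a minimal ETL sketch in Python, under the assumption of a hypothetical CSV source and an in-memory SQLite target; it extracts raw rows, cleanses and standardizes them, and loads the result.

import csv, io, sqlite3

# Extract: a hypothetical raw source file with inconsistent formatting.
source = io.StringIO("customer,amount,sale_date\n ramesh , 450.00 ,05/04/2002\nMAHIMA,120.5,06/04/2002\n")
rows = list(csv.DictReader(source))

# Transform: trim whitespace, standardize case and the date format.
def transform(row):
    day, month, year = row["sale_date"].strip().split("/")
    return (row["customer"].strip().title(),
            float(row["amount"]),
            f"{year}-{month}-{day}")

clean_rows = [transform(r) for r in rows]

# Load: insert the cleansed rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
print(warehouse.execute("SELECT * FROM sales").fetchall())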

There are many tools to help data warehouses with data transformations by setting
objects
or certain utilities to automate the processes of extract, transform and load
operations to or
from a database. With these tools, data can be transformed and loaded from
heterogeneous
sources into any supported database. These tools also allow the automation of data
import
or transformation on a scheduled basis with such features as file transfer protocol
(FTP).

One notable and widely used tool for data transformation is Data Transformation Services
(DTS), which contains DTS object packages and many components. This tool is packaged with
Microsoft SQL Server but is also commonly used with other independent databases. When
used with Microsoft products, DTS allows data to be transformed and loaded from
heterogeneous sources using OLE DB, ODBC, or text-only files, into any supported database.

The DTS packages are created with DTS tools such as the DTS wizards, DTS Designer, and
the DTS programming interfaces.

The DTS Wizards, like other program wizards, automate things by offering simple clicks to
accomplish complex tasks so that even non-programmers can do such complex jobs. They
mostly deal with common and simple DTS tasks and include the Import/Export Wizard and
the Copy Database Wizard.

DTS offers a graphical tool for building very sophisticated and complex DTS packages. The
DTS Designer offers an easy way to build DTS packages with workflows and event-driven
logic. It can be used to customize and edit packages which were created using the DTS
wizards.

Other functionalities include the DTS Package execution utilities, the DTS Query Designer
and the DTS Run Utility.

Other than unifying and transforming data into the desired format for data warehouse
loading, data transformation is also responsible for correcting errors by using a background
task that periodically inspects the data for errors, thereby reducing them. It also
minimizes or totally eliminates data redundancy.

Data Warehouse Management Tools


Data Warehouse Management Tools are software applications that extract and transform
data from operational systems and load it into the data warehouse.

The area of data warehouse management is very complex, as data captured from operational
data sources, such as data coming from transactional business software solutions like
Supply Chain Management (SCM), Point of Sale, Customer Service and Enterprise Resource
Planning (ERP) software, has to undergo the ETL (extract, transform, load) process.
To move data around the data warehouse, efficient ETL tools should be employed.
Companies may either buy third-party tools or develop their own ETL tools by assigning
their in-house programmers to do the job. In general, the rule of thumb is that the more
complex the data transformation requirements are, the more advantageous it is to purchase
third-party ETL tools.

When deciding to buy a commercial data warehouse management tool, it is always good
to
consider the following aspects:

Functional capability - The function to be considered is the way the tool handles both the
"transformation" piece and the "cleansing" piece. When the tool has strong capability for
both pieces, then by all means buy it, because in general a typical data warehouse
management tool is strong in only one of the two.

Ability to read directly from your data source - As mentioned earlier, a data warehouse
gets its data from various data sources, and the ability to read directly from your data
source makes processing faster and more efficient.

Metadata support - A data warehouse management tool should be able to handle metadata,
which is a very important aspect of data warehousing because metadata is used to map
source data to its destination.

Some of the data warehouse management tools developers are:

Business Objects is a French company that develops enterprise software. It is the developer
of Data Integrator, an integration and ETL tool that was previously known as Acta. The Data
Integrator product features the Data Integrator Job Server and the Data Integrator
Designer.

The IBM WebSphere DataStage is an ETL tool and part of the IBM WebSphere
Information
Integration suite and the IBM Information Server. Formerly known as Ardent
DataStage and
Ascential DataStage, this tool is very popular for its ease of use and visual
interface. It is
available in many versions including the Server Edition and the Enterprise Edition.

The Ab Initio software was developed by Ab Initio Software Corporation and is a
fourth-generation data analysis, data manipulation, batch processing, GUI-based parallel
processing ETL tool. It comes as a suite of products which include: Co-Operating System,
The Component Library, Graphical Development Environment, Enterprise Meta Environment
and Data Profiler.
SQL Server Integration Services (SSIS) provides a good platform for building data
integration and workflow applications and for data warehouse management. In fact, this
tool has been designed primarily for data warehousing and management, and it features a
fast and flexible engine for data extraction, transformation, and loading (ETL). This tool
is a component of Microsoft SQL Server, having replaced Data Transformation Services.

Informatica Corporation is one of the most popular providers of data integration software
tools and services for a wide variety of industries. Some of its products include

. Informatica PowerCenter,

. Informatica PowerExchange,

. Informatica Unstructured Data (for extracting and transforming data from unstructured
and semi-structured documents like pdf, excel, doc, etc.),

. Informatica PowerChannel (for secure and encrypted data transfer over WAN),

. Informatica Metadata Manager (Impact analysis and metadata reporting) and

. Informatica Data Explorer.

Other data warehouse management tools and developers include


. Data Profiler,

. DMExpress,

. Data Transformation Services,

. ETL Integrator,

. Informatica,

. Pentaho,

. Pervasive Data Integrator,

. Scriptella,
. Sunopsis and

. Talend Open Studio.

Decision Support Systems


Decision Support helps management make decisions. Decision Support Systems give decision
makers access to relevant data, and these software applications let users search for
critical information that is crucial for management.

Decision Support Systems fall into a wide spectrum of interpretations because of the wide
range of domains in which decisions are made. In general, though, Decision Support Systems
are a specific group or collection of computerized information systems which can support
business and organizational decision-making activities.

Any properly designed Decision Support System features an interactive software-based
system geared towards helping decision makers compile useful information from raw data,
documents, personal knowledge, and business models to identify and solve problems and
make decisions.

A typical Decision Support System is composed of a database management system, a
model-base management system, and a dialog generation and management system.

The database management system is the system which stores all kinds of data and
information. This data may come from various sources, but mainly from an organization's
data repositories such as databases, data marts or the data warehouse. Other sources of
data may include the internet in general as well as individual users' personal insights
and experiences.

The model-base management system refers to the component that takes care of all facts,
situations and events which have been modeled with different techniques such as
optimization models and goal-seeking models.
The dialog generation and management system refers to the interface management, such as
the graphical windows and buttons in the software, which end users use to interact with
the whole system.

There are many uses for a Decision Support System, and one of the most common is in
business enterprises across all kinds of industries. The information that can be derived
from the system for decision support may include current information assets (such as
legacy and relational data sources, cubes, data warehouses, and data marts), sales figures
over any given interval, new product sales assumptions which will be used as a basis for
revenue projections, and any experiences related to a given context which can be the basis
for different organizational decisions.
In another aspect which is not directly related to business enterprise, a Decision
Support
System may be used for hospitals as in the case of a clinical decision support
system used
for fast and accurate medical diagnosis.

The system could also be used by government agencies so they can have support for
decisions such as how to effectively deliver basic services to whoever needs them most, or
how best to run the economy based on a certain trend in any given period.

It could also be used in the agricultural sector so that wise decisions can be made
on what
crops best suit a particular area or which fertilizers are best for certain types
of crops.

In the education sector, the system can be used to aid in deciding which subject
areas
many students fail and how the school can revise the curriculum to make it more
relevant
and effective.

On the internet, a Decision Support System can be used by website administrators to
determine which webpages are most visited and what other functionalities and services
need to be developed so that more visitors will be enticed to access the site.

Virtually all aspects of operations, whether business enterprise or non-business, can use
and greatly benefit from a Decision Support System. Investing in Decision Support System
software can be a wise move for any organization.

Denormalized Data Store

In a real-world scenario, a data warehouse is implemented with different kinds of data
stores supporting the whole system as it handles a high volume of data. These three kinds
of data stores are the Historical Data Store, the Operational Data Store and the Analytical
Data Store.
All three kinds of data stores, although they serve different purposes, are basically
databases. As such, they all obey the general standards of database structuring, rules and
constraints.

In general, it is recommended that databases be normalized. This means they should follow
at least the first three normal forms, which are designed to eliminate redundant data. By
eliminating redundant data, the database will generally execute a lot faster because data
is inserted into well-laid-out columns in one place only, and if several descriptions or
attributes of an entity are also inserted, they are laid out neatly without causing
redundancy because the relationships are well defined.

There are certain cases, however, where having redundant data as an effect of database
denormalization gives the database better performance.
For instance, when implementing a database which is retrieval-heavy, as in the case of a
website which receives new information less frequently but serves many views of data to
internet users, it is wise to have denormalized tables.

As a general rule, when updates should be optimized at the expense of retrieval, then by
all means the database should not be denormalized. But when retrieval should be optimized
at the expense of updates, then by all means denormalization should be employed.
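
As a small illustration of this trade-off, here is a Python sketch using an in-memory SQLite database with hypothetical customer and order tables: the normalized form keeps the city in one place, while the denormalized reporting table performs the join once, up front, so retrieval-heavy queries never need to join.

import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized form: the city name lives only in the customer table.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ramesh', 'Pune');
INSERT INTO orders   VALUES (101, 1, 450.0), (102, 1, 120.5);
""")

# Denormalized form for retrieval-heavy workloads: reads never join again,
# at the cost of redundant city values that must be kept in sync on update.
conn.executescript("""
CREATE TABLE orders_denorm AS
SELECT o.order_id, c.name AS customer_name, c.city, o.amount
FROM orders o JOIN customer c ON o.customer_id = c.customer_id;
""")
print(conn.execute("SELECT * FROM orders_denorm").fetchall())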

Returning to the data stores, the historical data store is a place for holding cleansed and
integrated historical data. Since this data acts like an archive and retrieval happens less
frequently, the data store should be fully normalized. This means it has to be normalized
up to the third normal form.

The operational data store is designed for holding current data which is used for
the day to
day operation. This means that accesses and insertions happen every minute due to
the
progressive activities of business transactions.

Master and transactional data are mixed together, and the operational data store needs to
efficiently facilitate the generation of operational reports. Hence, the operational data
store should be implemented as a denormalized data store.

The analytical data store is designed to store both current and historical
information about
the business organization and is implemented with a dimensional structure for the
facilitation of the generation of analytical reports. In this case, denormalization
should be
employed as the current information is also fast getting updated.
The decision whether or not to denormalize a data store should not be taken lightly, as it
involves administrative dedication. This dedication should be manifested in the form of
documenting business processes and rules to ensure that data is valid, that data migration
is scheduled, and that data consumers are kept updated about the state of the data stores.
If denormalized data exists for a certain application, the whole system should be reviewed
periodically and progressively.

As a general practice, periodically test whether the extra cost of processing with a
normalized database justifies the positive effect of denormalization. This cost should be
measured in terms of input/output and CPU processing time saved and the complexity of the
update programming minimized.

Data stores are an integral part of the data warehouse, and when they are not optimized,
they can significantly reduce the entire system's efficiency.
Massive Parallel Processing (MPP)

Massive Parallel Processing (MPP) is the "shared nothing" approach to parallel computing. It
is a type of computing wherein the processing is done by many CPUs working in parallel to
execute a single program.

One of the most significant differences between Symmetric Multi-Processing (SMP) and
Massive Parallel Processing is that with MPP, each of the many CPUs has its own memory,
which helps prevent the hold-up a user may experience with SMP when all of the CPUs
attempt to access the memory simultaneously.

The idea behind MPP is really just that of general parallel computing, wherein some
combination of multiple instances of programmed instructions and data is executed
simultaneously on multiple processors so that the result can be obtained more efficiently
and quickly.

The idea is further based on the fact that dividing a bigger problem into smaller tasks
makes it easier to carry them out simultaneously with some coordination. The technique of
parallel computing was first put to practical use by the ILLIAC IV in 1976, fully a decade
after it was conceived.
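
As a toy illustration of dividing a bigger problem into smaller tasks that run in parallel and are then coordinated, here is a minimal Python sketch using the standard multiprocessing module; the worker count and workload are arbitrary assumptions.

from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker ('node') aggregates its own partition of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # Divide the problem into smaller tasks, one partition per worker.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)   # executed in parallel
    print(sum(partials))                           # coordinate: combine the partial results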

In the not so distant past of information technology, before client-server computing was on
the rise, distributed massive parallel processing was a holy grail of computer science.
Under this architecture, various different types of computers, regardless of the operating
system being used, would be able to work on the same task by sharing the data involved
over a network connection.
Although it has fast become possible to do MPP in many laboratory settings such as
the one
at MIT, there was yet a short supply of practical commercial applications for
distributed
massive parallel processing solutions. As a matter of fact, the only interest at
that time
came from the academics who at the time could hardly find enough grant money in
order to
be able to afford time on a supercomputer. This scenario resulted in MPP becoming
known
as the poor man's approach to supercomputing.

Moving to today's information technology setting, massive parallel processing is once again
reaching new and greater heights with the popularity of e-business and data warehousing.
Today's business environment literally cannot function without relying heavily on all sorts
of data, and it is not uncommon these days to find companies investing millions of dollars
in servers that work in concert as digital marketplaces become increasingly complex.

It is a very real scenario to find a digital marketplace that is not only processing
transactions but also coordinating information from all participating systems so that all
the buyers and sellers in the system have access to all relevant information in real time.

Many software giants in the industry envision loosely coupled servers connected over the
internet, such as Microsoft's .NET strategy. Another giant, Hewlett-Packard, has leveraged
its core e-speak integration engine by creating its e-services architecture.

There are various companies today that have actually started delivering products
which
incorporate massive parallel processing such as Plumtree Software, a developer of
corporate
portal software whose version 4.0 release added an MPP engine which can process
requests
for data from multiple servers that run the Plumtree portal software in parallel.

MPP cuts to the core of e-business in that instead of requiring users to request
information from multiple individual systems one at a time, they can collect data from all
the systems simultaneously.

With today's multinational corporations operating in vast geographical locations and
employing data warehouses to store data, massive parallel processing will definitely
continue to rise and evolve.
Executive Information Systems (EIS)

An Executive Information System (EIS), as a management information system, is generally
designed with an emphasis on graphical displays and very easy-to-use, appealing interfaces,
since it is meant to support and facilitate the information and decision-making needs of
senior executives.

An EIS offers strong ad-hoc querying, analysing, reporting and drill-down capabilities
without the user having to worry about the complexities of the algorithms involved in the
system. Senior executives can have easy access to both internal and external information
relevant to meeting the strategic goals of the organization.

Many IT professionals consider the Executive Information System (EIS) a specialized form of
Decision Support System (DSS) with a very strong emphasis on reporting and drill-down data
mining capabilities.
In fact, most data consumers, such as senior position holders in a company, generate
reports from many layers of data, and the reports automatically highlight trends and
patterns in the operations within the business enterprise and their relation to the
business trends of the industry in which the business operates.

An Executive Information System (EIS) works closely with data warehouses in monitoring
business performance as well as identifying problems and strong points of the company.

Today's Business Intelligence, with its high end sub areas including analytics,
reporting and
digital dashboards, could be considered by many to be the evolved form of the
Executive
Information System.

In the past, huge mainframe computers were running the executive information
systems so
that the enterprise data can be unified and utilized for analyzing the sales
performance or
market research statistics for decision makers such as financial officers,
marketing directors,
and chief executive officers who were then not very well versed with computers.

But today's executive information systems are no longer confined to mainframes, as these
types of computers are slowly being phased out due to their bulk and the increasing power
of smaller computers.

Many executive information systems now run on personal computers and laptops that
high-ranking company officials and CEOs can bring with them anywhere to meet the demands
of a data-driven company and industry.

In fact, executive information systems are now well integrated into large data warehouses,
with many computers internetworking with each other and constantly aggregating data for
informative reporting.
An Executive Information System is a typical software system composed of hardware,
software, and a telecommunications network. The hardware may be any computer capable of
high-speed processing, but since this is a rather large system dealing with a high volume
of data, the hardware should also have very large random access memory and high storage
capacity.

The software component literally controls the flow and logic of the whole system.
The
software takes care of all the algorithms which translate business rules and data
models
into digital representations for the hardware to understand.

The software may have both a text base and a graphics base. It also contains the database,
which manages all the data involved and closely collaborates with the algorithms for
processing the data. These algorithms may specify how to do routine and special
statistical, financial, and other quantitative analysis.

The telecommunications network takes care of the cables and other media which will be
used in data transmission. It also takes care of the traffic within the network and manages
the system's communications with outside networks.
What is a surrogate key? Where do we use it? Explain with examples.

A surrogate key is a unique identification key; it is like an artificial or alternative key
to the production key, because the production key may be an alphanumeric or composite key
while the surrogate key is always a single numeric key. Assume the production key is an
alphanumeric field: if you create an index on this field, it will occupy more space, so it
is not advisable to join or index on it, because generally all data warehousing fact tables
hold historical data. These fact tables are linked with many dimension tables, and if the
key is a numeric field, the performance is higher.

Surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the
primary key to the table.
The only requirement for a surrogate primary key is that it is unique for each row
in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key)
for the dimension tables' primary keys. They can use an Informatica sequence generator, an
Oracle sequence, or SQL Server identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer
table) can change and
this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the
primary keys
(according to the business users) but, not only can these change, indexing on a
numerical value is
probably better and you could consider creating a surrogate key called, say,
AIRPORT_ID. This would be
internal to the system and as far as the client is concerned you may display only
the AIRPORT_NAME.

Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would
be in your Employee Dimension). This employee has turnover allocated to him on Business
Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to
Business Unit 'BU2'. All new turnover has to belong to the new Business Unit 'BU2', but the
old turnover should still belong to Business Unit 'BU1'.
If you used the natural business key 'E1' for your employee within your data warehouse,
everything would be allocated to Business Unit 'BU2', even what actually belongs to 'BU1'.

If you use surrogate keys, you could create on the 2nd of June a new record for the
Employee 'E1' in
your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee
'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee
'E1' + 'BU2.'

You could consider the Slowly Changing Dimension as an enlargement of your natural key: the
natural key of the Employee was Employee Code 'E1', but for you it becomes

Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from the
natural key enlargement process is that you might not have every part of your new key
within your fact table, so you might not be able to do the join on the new enlarged key;
that is why you need another ID.
A surrogate key is a simple concept.
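
For illustration, here is a minimal Python sketch of the Slowly Changing Dimension Type 2 pattern described above, using a hypothetical in-memory employee dimension; real implementations would of course do this in the ETL layer or the database.

from datetime import date

# A hypothetical employee dimension keyed by a surrogate key (SID), tracked as SCD Type 2.
employee_dim = [
    {"sid": 1, "emp_code": "E1", "business_unit": "BU1",
     "valid_from": date(2002, 1, 1), "valid_to": None, "current": True},
]
next_sid = 2

def move_employee(emp_code, new_bu, effective_date):
    """Close the current row and insert a new row with a new surrogate key."""
    global next_sid
    for row in employee_dim:
        if row["emp_code"] == emp_code and row["current"]:
            row["valid_to"], row["current"] = effective_date, False
    employee_dim.append({"sid": next_sid, "emp_code": emp_code, "business_unit": new_bu,
                         "valid_from": effective_date, "valid_to": None, "current": True})
    next_sid += 1

move_employee("E1", "BU2", date(2002, 6, 2))
for row in employee_dim:
    print(row)
# Fact rows loaded before 2 June carry SID 1 (BU1); new fact rows carry SID 2 (BU2).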

A correct and exact answer for the surrogate key is below:

Definition of Surrogate Key:

An alternative to the primary key that allows duplication of records.

Need, Where & Why we use a Surrogate Key:

OLTP is in normalized form, whereas OLAP (i.e. the data warehouse) is in denormalized form.

Actually, the DWH concept is to maintain historic data for analysis, so it should be in
denormalized form.

To be in denormalized form, duplication should be allowed in the DWH. When data enters the
DWH, a surrogate key, a new column such as a serial number, is introduced to allow
duplication in OLAP systems so that historic data can be maintained.

You all know that a single mobile number can be assigned to another person if it has not
been in use for more than one year. This is possible just because of the surrogate key.

This Data Warehousing site aims to help people get a good high-level understanding
of what it takes to
implement a successful data warehouse project. A lot of the information is from my
personal experience
as a business intelligence professional, both as a client and as a vendor.

This site is divided into six main areas.

- Tools: The selection of business intelligence tools and the selection of the data
warehousing team.
Tools covered are:

. Database, Hardware
. ETL (Extraction, Transformation, and Loading)
. OLAP
. Reporting
. Metadata

- Steps: This section contains the typical milestones for a data warehousing project, from requirement
gathering and query optimization to production rollout and beyond. I also offer my observations on the data
warehousing field.

- Business Intelligence: Business intelligence is closely related to data warehousing. This section
discusses business intelligence, as well as the relationship between business intelligence and data
warehousing.

- Concepts: This section discusses several concepts particular to the data warehousing field. Topics
include:

. Dimensional Data Model
. Star Schema
. Snowflake Schema
. Slowly Changing Dimension
. Conceptual Data Model
. Logical Data Model
. Physical Data Model
. Conceptual, Logical, and Physical Data Model
. Data Integrity
. What is OLAP
. MOLAP, ROLAP, and HOLAP
. Bill Inmon vs. Ralph Kimball

- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data
warehousing industry.

- Glossary: A glossary of common data warehousing terms.

Database/Hardware Selection

Buy vs. Build

The only choices here are what type of hardware and database to purchase, as there
is basically no way
that one can build hardware/database systems from scratch.

Database/Hardware Selections

In making selection for the database/hardware platform, there are several items
that need to be carefully
considered:

. Scalability: How can the system grow as your data storage needs grow? Which RDBMS
and hardware
platform can handle large sets of data most efficiently? To get an idea of this,
one needs to determine the
approximate amount of data that is to be kept in the data warehouse system once
it's mature, and base
any testing numbers from there.

. Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU
are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can
perform a part of the task, all at the same time. When I first started working with massively parallel
computers in 1993, I had thought that it would be the best way for any large computations to be done
within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had
originally thought.
. RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware
platform, there are going to be certain parts of the code that are hardware platform-dependent. As
a result, bugs and bug fixes are often hardware dependent.

True Case: One of the projects I have worked on was with a major RDBMS provider
paired with a hardware platform that was not so popular (at least not in the data
warehousing world). The DBA constantly complained about the bug not being fixed
because the support level for the particular type of hardware that client had
chosen was
Level 3, which basically meant that no one in the RDBMS support organization will
fix
any bug particular to that hardware platform.

Popular Relational Databases

. Oracle
. Microsoft SQL Server
. IBM DB2
. Teradata
. Sybase
. MySQL

Popular OS Platforms

. Linux
. FreeBSD
. Microsoft

Buy vs. Build

When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This
determination largely depends on three things:

. Complexity of the data transformation: The more complex the data transformation
is, the more
suitable it is to purchase an ETL tool.
. Data cleansing needs: Does the data need to go through a thorough cleansing
exercise before it
is suitable to be stored in the data warehouse? If so, it is best to purchase a
tool with strong data
cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL
routine from
scratch.
. Data volume. Available commercial tools typically have features that can speed up
data
movement. Therefore, buying a commercial product is a better approach if the volume
of data
transferred is large.

ETL Tool Functionalities

While the selection of a database and a hardware platform is a must, the selection
of an ETL tool is highly
recommended, but it's not a must. When you evaluate ETL tools, it pays to look for
the following
characteristics:

. Functional capability: This includes both the 'transformation' piece and the
'cleansing' piece. In
general, the typical ETL tools are either geared towards having strong
transformation capabilities or
having strong cleansing capabilities, but they are seldom very strong in both. As a
result, if you know your
data is going to be dirty coming in, make sure your ETL tool has strong cleansing
capabilities. If you know
there are going to be a lot of different data transformations, it then makes sense
to pick a tool that is
strong in transformation.

. Ability to read directly from your data source: For each organization, there is a
different set of data
sources. Make sure the ETL tool you select can connect directly to your source
data.

. Metadata support: The ETL tool plays a key role in your metadata because it maps
the source data to
the destination, which is an important piece of the metadata. In fact, some
organizations have come to
rely on the documentation of their ETL tool as their metadata source. As a result,
it is very important to
select an ETL tool that works with your overall metadata strategy.

Popular Tools

. IBM WebSphere Information Integration (Ascential DataStage)
. Ab Initio
. Informatica
. Talend

OLAP Tool Selection

Buy vs. Build

OLAP tools are geared towards slicing and dicing of the data. As such, they require
a
strong metadata layer, as well as front-end flexibility. Those are typically
difficult
features for any home-built systems to achieve. Therefore, my recommendation is
that if
OLAP analysis is part of your charter for building a data warehouse, it is best to
purchase an existing OLAP tool rather than creating one from scratch.

OLAP Tool Functionalities

Before we speak about OLAP tool selection criterion, we must first distinguish
between
the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational
OLAP).

1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source
(data warehouse). When a user generates a report request, the MOLAP tool can generate
the result quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube,
the
ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically
comes with a 'Designer' piece, where the data warehouse administrator can specify
the
relationship between the relational tables, as well as how dimensions, attributes,
and
hierarchies map to the underlying database tables.
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors.
ROLAP vendors recognize that users want their reports fast, so they are implementing
MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is
necessary to drill down to the most detailed level of information, levels that the traditional
cubes do not reach for performance and size reasons.
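
To contrast the two styles at a toy scale, here is a minimal Python sketch over an in-memory SQLite sales table (the table and values are invented for illustration): the MOLAP-like path pre-aggregates every cell into a cube up front, while the ROLAP-like path acts as a simple SQL generator that aggregates at query time.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES ('East','Rice',450),('East','Wheat',200),('West','Rice',300);
""")

# MOLAP-style: pre-aggregate every (region, product) cell into a cube once,
# so report requests are answered from the cube without touching the detail rows.
cube = {(r, p): total for r, p, total in conn.execute(
    "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product")}
print(cube[("East", "Rice")])          # instant lookup from the pre-aggregated cube

# ROLAP-style: act as a smart SQL generator and aggregate at query time
# (string formatting is for illustration only, not production SQL handling).
def rolap_query(dimension):
    sql = f"SELECT {dimension}, SUM(amount) FROM sales GROUP BY {dimension}"
    return conn.execute(sql).fetchall()

print(rolap_query("region"))           # SQL generated and run against the relational source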

So what are the criteria for evaluating OLAP vendors? Here they are:

. Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the
tool's performance, and help load the data into the cubes as quickly as possible.

. Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of
loading the data into the cube and reading the data from the cube.

. Customization efforts: More and more, OLAP tools are used as an advanced
reporting tool. This is
because in many cases, especially for ROLAP implementations, OLAP tools often can
be used as a
reporting tool. In such cases, the ease of front-end customization becomes an
important factor in the
tool selection process.

. Security Features: Because OLAP tools are geared towards a number of users,
making sure people see
only what they are supposed to see is important. By and large, all established OLAP
tools have a security
layer that can interact with the common corporate login protocols. There are,
however, cases where
large corporations have developed their own user authentication mechanism and have
a "single sign-
on" policy. For these cases, having a seamless integration between the tool and the
in-house
authentication can require some work. I would recommend that you have the tool
vendor team come in
and make sure that the two are compatible.

. Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as
the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.

Popular Tools

. Business Objects
. IBM Cognos
. SQL Server Analysis Services
. MicroStrategy
. Palo OLAP Server

Reporting Tool Selection


Buy vs. Build
There is a wide variety of reporting requirements, and whether to buy or build a
reporting tool for your business intelligence needs is also heavily dependent on
the type
of requirements. Typically, the determination is based on the following:

. Number of reports: The higher the number of reports, the more likely that buying
a
reporting tool is a good idea. This is not only because reporting tools typically
make
creating new reports easier (by offering re-usable components), but they also
already
have report management systems to make maintenance and support functions easier.
. Desired Report Distribution Mode: If the reports will only be distributed in a
single mode
(for example, email only, or over the browser only), we should then strongly
consider the
possibility of building the reporting tool from scratch. However, if users will
access the
reports through a variety of different channels, it would make sense to invest in a
third-
party reporting tool that already comes packaged with these distribution modes.
. Ad Hoc Report Creation: Will the users be able to create their own ad hoc
reports? If so,
it is a good idea to purchase a reporting tool. These tool vendors have accumulated

extensive experience and know the features that are important to users who are
creating
ad hoc reports. A second reason is that the ability to allow for ad hoc report
creation
necessarily relies on a strong metadata layer, and it is simply difficult to come
up with a
metadata model when building a reporting tool from scratch.

Reporting Tool Functionalities

Data is useless if all it does is sit in the data warehouse. As a result, the
presentation
layer is of very high importance.

Most of the OLAP vendors already have a front-end presentation layer that allows
users
to call up pre-defined reports or create ad hoc reports. There are also several
report tool
vendors. Either way, pay attention to the following points when evaluating
reporting
tools:

. Data source connection capabilities

In general there are two types of data sources: one is the relational database, the other is
the OLAP multidimensional data source. Nowadays, chances are good that you might want
to have both. Many tool vendors will tell you that they offer both options, but upon closer
inspection, it is possible that the tool is especially good for one type, while connecting to
the other type of data source becomes a difficult exercise in programming.

. Scheduling and distribution capabilities

In a realistic data warehousing usage scenario by senior executives, all they have
time for is to
come in on Monday morning, look at the most important weekly numbers from the
previous
week (say the sales numbers), and that's how they satisfy their business
intelligence needs. All
the fancy ad hoc and drilling capabilities will not interest them, because they do
not touch these
features.
Based on the above scenario, the reporting tool must have scheduling and
distribution
capabilities. Weekly reports are scheduled to run on Monday morning, and the
resulting reports
are distributed to the senior executives either by email or web publishing. There
are claims by
various vendors that they can distribute reports through various interfaces, but
based on my
experience, the only ones that really matter are delivery via email and publishing
over the
intranet.

. Security Features: Because reporting tools, similar to OLAP tools, are geared
towards a number of
users, making sure people see only what they are supposed to see is important.
Security can reside at
the report level, folder level, column level, row level, or even individual cell
level. By and large, all
established reporting tools have these capabilities. Furthermore, they have a
security layer that can
interact with the common corporate login protocols. There are, however, cases where
large
corporations have developed their own user authentication mechanism and have a
"single sign-on"
policy. For these cases, having a seamless integration between the tool and the in-
house authentication
can require some work. I would recommend that you have the tool vendor team come in
and make sure
that the two are compatible.

. Customization

Every one of us has had the frustration over spending an inordinate amount of time
tinkering
with some office productivity tool only to make the report/presentation look good.
This is
definitely a waste of time, but unfortunately it is a necessary evil. In fact, a
lot of times, analysts
will wish to take a report directly out of the reporting tool and place it in their
presentations or
reports to their bosses. If the reporting tool offers them an easy way to pre-set
the reports to look
exactly the way that adheres to the corporate standard, it makes the analysts' jobs much
easier, and the time savings are tremendous.

. Export capabilities

The most common export needs are to Excel, to a flat file, and to PDF, and a good
report tool
must be able to export to all three formats. For Excel, if the situation warrants
it, you will want to
verify that the reporting format, not just the data itself, will be exported out to
Excel. This can
often be a time-saver.

. Integration with the Microsoft Office environment


Most people are used to working with Microsoft Office products, especially Excel, for
manipulating data. Before, people used to export the reports into Excel, and then perform
additional formatting
/ calculation tasks. Some reporting tools now offer a Microsoft Office-like editing
environment
for users, so all formatting can be done within the reporting tool itself, with no
need to export the
report into Excel. This is a nice convenience to the users.

Popular Tools
. SAP Crystal Reports
. MicroStrategy
. IBM Cognos
. Actuate
. Jaspersoft
. Pentaho

Metadata Tool Selection

Buy vs. Build

Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is
because doing so requires resources that are intimately familiar with the operational, technical,
and business aspects of the data warehouse system, and such resources are difficult to come by.
Even when such resources are available, there are often other tasks that can provide more value
to the organization than building a metadata tool from scratch.

In fact, the question is often whether any type of metadata tool is needed at all. Although
metadata plays an extremely important role in a successful data warehousing implementation,
this does not always mean that a tool is needed to keep all the "data about data." It is possible to,
say, keep such information in the repository of other tools used, in text documentation, or even
in a presentation or a spreadsheet.

Having said the above, though, it is the author's belief that having a solid metadata foundation is
one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool
is not selected at the beginning of the project, it is essential to have a metadata strategy; that is,
a plan for how metadata in the data warehousing system will be stored.

Metadata Tool Functionalities

This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be
better to call this the selection of a metadata strategy. Traditionally, people have put the data
modeling information into a tool such as ERWin or Oracle Designer, but it is difficult to extract
information out of such data modeling tools. For example, one of the goals for your metadata
selection is to provide information to the end users. Clearly this is a difficult task with a data
modeling tool.

So typically what happens is that additional effort is spent to create a layer of metadata that is
aimed at the end users. While this allows the end users to gain the required insight into what the
data and reports they are looking at mean, it is clearly inefficient, because all that information
already resides somewhere in the data warehouse system, whether it be the ETL tool, the data
modeling tool, the OLAP tool, or the reporting tool.

There are efforts among data warehousing tool vendors to unify on a metadata model. In June of
2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel),
and some of the vendors such as Oracle have claimed to have implemented it. This standard
incorporates technologies such as XML, UML, and SOAP, and, if accepted widely, would truly be
the best thing that could happen to the data warehousing industry. As of right now, though, the
author has not really seen that many tools leveraging this standard, so clearly it has not quite
caught on yet.

So what does this mean about your metadata efforts? In the absence of everything else, I would
recommend that whatever tool you choose for your metadata support supports XML, and that
whatever other tool needs to leverage the metadata also supports XML. Then it is a matter of
defining your DTD across your data warehousing system. At the same time, there is no need to
worry about criteria that are typically important for the other tools, such as performance and
support for parallelism, because the size of the metadata is typically small relative to the size of
the data warehouse.

Open Source Business Intelligence

What is open source business intelligence?

Open source BI refers to BI software that can be distributed for free and that permits users
to modify the source code. Open source software is available in all BI tool categories, from
data modeling to reporting to OLAP to ETL.

Because open source software is community driven, it relies on the community for
improvement. As such, new feature sets typically come from community contribution
rather than as a result of dedicated R&D efforts.

Advantages of open source BI tools

Easy to get started

With traditional BI software, the business model typically involves a hefty startup cost,
and then there is an annual fee for support and maintenance that is calculated as a
percentage of the initial purchase price. In this model, a company needs to spend a
substantial amount of money before any benefit is realized. With the substantial cost
also comes the need to go through a sales cycle, from the RFP process to evaluation to
negotiation, and multiple teams within the organization typically get involved. These
factors mean that it's not only costly to get started with traditional BI software, but the
amount of time it takes is also long.

With open source BI, the beginning of the project typically involves a free
download of
the software. Given this, bureaucracy can be kept to a minimum and it is very easy
and
inexpensive to get started.
Lower cost

Because of its low startup cost and the typically lower ongoing maintenance/support
cost, the cost of open source BI software is lower (sometimes much lower) than that of
traditional BI software.

Easy to customize

By definition, open source software means that users can access and modify the
source
code directly. That means it is possible for developers to get under the hood of
the open
source BI tool and add their own features. In contrast, it is much more difficult
to do this
with traditional BI software because there is no way to access the source code.

Disadvantages of open source BI tools

Features are not as robust

Traditional BI software vendors put a lot of money and resources into R&D, and the
result is that the product has a rich feature set. Open source BI tools, on the
other hand,
rely on community support, and hence do not have as strong a feature set.

Consulting help not as readily available

Most of the traditional BI software vendors - MicroStrategy, Business Objects, Cognos,
Oracle, and so on - have been around for a long time. As a result, there are a lot of people with
with
experience with those tools, and finding consulting help to implement these
solutions is
usually not very difficult. Open source BI tools, on the other hand, are a fairly
recent
development, and there are relatively few people with implementation experience.
So, it
is more difficult to find consulting help if you go with open source BI.

Open source BI tool vendors

. JasperSoft
. Eclipse BIRT Project
. Pentaho
. SpagoBI
. OpenI

Data Warehouse Team Personnel Selection


There are two areas of discussion: The first is whether to use external consultants or hire
permanent employees. The second is what type of personnel is recommended for a
data warehousing project.

The pros of hiring external consultants are:

1. They are usually more experienced in data warehousing implementations. The fact
of
the matter is, even today, people with extensive data warehousing backgrounds are
difficult to find. With that, when there is a need to ramp up a team quickly, the
easiest
route to go is to hire external consultants.

The pros of hiring permanent employees are:

1. They are less expensive. With hourly rates for experienced data warehousing
professionals running from $100/hr and up, and even more for Big-5 or vendor
consultants, hiring permanent employees is a much more economical option.

2. They are less likely to leave. With consultants, whether they are on contract,
via a
Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's
notice.
This makes knowledge transfer very important. Of course, the flip side is that
these
consultants are much easier to get rid of, too.

The following roles are typical for a data warehouse project:

. Project Manager: This person will oversee the progress and be responsible for the
success of the data
warehousing project.

. DBA: This role is responsible for keeping the database running smoothly. Additional
tasks for this role may be to plan and execute a backup/recovery plan, as well as
performance tuning.

. Technical Architect: This role is responsible for developing and implementing the
overall technical
architecture of the data warehouse, from the backend hardware/software to the
client desktop
configurations.

. ETL Developer: This role is responsible for planning, developing, and deploying
the extraction,
transformation, and loading routine for the data warehouse.

. Front End Developer: This person is responsible for developing the front-end,
whether it be client-
server or over the web.

. OLAP Developer: This role is responsible for the development of OLAP cubes.

. Trainer: A significant role is the trainer. After the data warehouse is implemented, a
person on the data warehouse team needs to work with the end users to get them
familiar with how the front end is set up so that the end users can get the most benefit
out of the data warehouse system.
. Data Modeler: This role is responsible for taking the data structure that exists in the
enterprise and modeling it into a schema that is suitable for OLAP analysis.

. QA Group: This role is responsible for ensuring the correctness of the data in
the data warehouse.
This role is more important than it appears, because bad data quality turns away
users more than any
other reason, and often is the start of the downfall for the data warehousing
project.

The above is a list of roles, and one person does not necessarily correspond to only one
role. In fact, it is very common in a data warehousing team for one person to take on
multiple roles. For a
typical project, it is common to see teams of 5-8 people. Any data warehousing team
that
contains more than 10 people is definitely bloated.

Data Warehouse Design

After the tools and team personnel selections are made, the data warehouse design
can begin. The
following are the typical steps involved in the data warehousing project cycle.

. Requirement Gathering
. Physical Environment Setup
. Data Modeling
. ETL
. OLAP Cube Design
. Front End Development
. Report Development
. Performance Tuning
. Query Optimization
. Quality Assurance
. Rolling out to Production
. Production Maintenance
. Incremental Enhancements

Each step listed above represents a typical data warehouse design phase, and has
several sections:

. Task Description: This section describes what typically needs to be accomplished during
this particular data warehouse design phase.
. Time Requirement: A rough estimate of the amount of time this particular data
warehouse task
takes.
. Deliverables: Typically at the end of each data warehouse task, one or more
documents are
produced that fully describe the steps and results of that particular task. This is
especially
important for consultants to communicate their results to the clients.
. Possible Pitfalls: Things to watch out for. Some of them obvious, some of them
not so obvious.
All of them are real.
The Additional Observations section contains my own observations on data warehouse
processes not
included in any of the design steps.
Requirement Gathering

Task Description

The first thing that the project team should engage in is gathering requirements
from
end users. Because end users are typically not familiar with the data warehousing
process or concept, the help of the business sponsor is essential. Requirement
gathering can happen as one-to-one meetings or as Joint Application Development
(JAD) sessions, where multiple people are talking about the project scope in the
same
meeting.

The primary goal of this phase is to identify what constitutes a success for this
particular phase of the data warehouse project. In particular, end user reporting /
analysis requirements are identified, and the project team will spend the remaining
period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of
other details such as hardware sizing information, training requirements, data source
identification, and most importantly, a concrete project plan indicating the finishing date
of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be
developed so that the data warehousing system can recover from accidents that
disable
the system. Without an effective backup and restore strategy, the system will only
last
until the first major disaster, and, as many data warehousing DBA's will attest,
this can
happen very quickly after the project goes live.

Time Requirement

2 - 8 weeks.

Deliverables

. A list of reports / cubes to be delivered to the end users by the end of this
current
phase.
. An updated project plan that clearly identifies resource loads and milestone
delivery dates.

Possible Pitfalls

This phase often turns out to be the most tricky phase of the data warehousing
implementation. The reason is that because data warehousing by definition includes
data from multiple sources spanning many different departments within the
enterprise,
there are often political battles that center on the willingness to share information.
Even though a successful data warehouse benefits the enterprise, there are
occasions
where departments may not feel the same way. As a result of unwillingness of
certain
groups to release data or to participate in the data warehousing requirement
definition,
the data warehouse effort either never gets off the ground, or cannot start in the
direction originally defined.

When this happens, it would be ideal to have a strong business sponsor. If the
sponsor
is at the CXO level, she can often exert enough influence to make sure everyone
cooperates.

Physical Environment Setup

Task Description

Once the requirements are somewhat clear, it is necessary to set up the physical
servers and databases. At a minimum, it is necessary to set up a development
environment and a production environment. There are also many data warehousing
projects where there are three environments: Development, Testing, and Production.

It is not enough to simply have different physical environments set up. The
different
processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly
for
each environment.

It is best for the different environments to use distinct application and database
servers.
In other words, the development environment will have its own application server
and
database servers, and the production environment will have its own set of
application
and database servers.

Having different environments is very important for the following reasons:

. All changes can be tested and QA'd first without affecting the production
environment.
. Development and QA can occur during the time users are accessing the data
warehouse.
. When there is any question about the data, having separate environment(s) will
allow the data warehousing team to examine the data without impacting the
production environment.

Time Requirement

Getting the servers and databases ready should take less than 1 week.

Deliverables
. Hardware / Software setup document for all of the environments, including
hardware specifications, and scripts / settings for the software.

Possible Pitfalls

To save on capital, often data warehousing teams will decide to use only a single
database and a single server for the different environments. Environment separation
is
achieved by either a directory structure or setting up distinct instances of the
database.
This is problematic for the following reasons:

1. Sometimes it is possible that the server needs to be rebooted for the development
environment. Having a separate development environment will prevent the production
environment from being impacted by this.

2. There may be interference when having different database environments on a single
box. For example, having multiple long queries running on the development database
could affect the performance on the production database.

Data Modeling

Task Description

This is a very important step in the data warehousing project. Indeed, it is fair
to say that
the foundation of the data warehousing system is the data model. A good data model
will allow the data warehousing system to grow easily, as well as allowing for good

performance.

In a data warehousing project, the logical data model is built based on user requirements,
and then it is translated into the physical data model. The detailed steps can be found in
the Conceptual, Logical, and Physical Data Modeling section.

Part of the data modeling exercise is often the identification of data sources. Sometimes
this step is deferred until the ETL step. However, my feeling is that it is better to find out
where the data exists, or, better yet, whether it even exists anywhere in the enterprise
at all. Should the data not be available, this is a good time to raise the alarm. If this is
delayed until the ETL phase, rectifying it will become a much tougher and more
complex process.

Time Requirement

2 - 6 weeks.
Deliverables

. Identification of data sources.


. Logical data model.
. Physical data model.

Possible Pitfalls

It is essential to have a subject-matter expert as part of the data modeling team. This
person can be an outside consultant or can be someone in-house who has extensive
experience in the industry. Without this person, it becomes difficult to get a definitive
answer on many of the questions, and the entire project gets dragged out.

ETL

Task Description

The ETL (Extraction, Transformation, Loading) process typically takes the longest to
develop, and this can easily take up to 50% of the data warehouse implementation
cycle or longer. The reason for this is that it takes time to get the source data,
understand the necessary columns, understand the business rules, and understand the
logical and physical data models.
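
To give a concrete feel for what this phase produces, a single load step often boils down
to SQL of the following shape (a simplified sketch with hypothetical staging and
warehouse table names; a real ETL package adds surrogate key handling, error logging,
and incremental logic):

    -- Load daily sales rows from a staging area into the sales fact table,
    -- translating source system codes into dimension keys along the way.
    INSERT INTO f_sales (date_key, store_key, product_key, sales_amount)
    SELECT d_date.date_key,
           d_store.store_key,
           d_product.product_key,
           stg.sales_amount
    FROM   stg_sales stg
           JOIN d_date    ON d_date.calendar_date   = stg.sales_date
           JOIN d_store   ON d_store.store_code     = stg.store_code
           JOIN d_product ON d_product.product_code = stg.product_code;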

Time Requirement

1 - 6 weeks.

Deliverables

. Data Mapping Document


. ETL Script / ETL Package in the ETL tool

Possible Pitfalls

There is a tendency to give this particular phase too little development time. This
can
prove suicidal to the project because end users will usually tolerate less
formatting,
longer time to run reports, less functionality (slicing and dicing), or fewer
delivered
reports; one thing that they will not tolerate is wrong information.

A second common problem is that some people make the ETL process more
complicated than necessary. In ETL design, the primary goal should be to optimize
load
speed without sacrificing quality. This is, however, sometimes not followed.
There
are cases where the design goal is to cover all possible future uses, whether they
are
practical or just a figment of someone's imagination. When this happens, ETL
performance suffers, and often so does the performance of the entire data
warehousing
system.

OLAP Cube Design

Task Description

Usually the design of the OLAP cube can be derived from the Requirement
Gathering phase. More often than not, however, users have some idea on what they
want, but it is difficult for them to specify the exact report / analysis they want
to see.
When this is the case, it is usually a good idea to include enough information so
that
they feel like they have gained something through the data warehouse, but not so
much
that it stretches the data warehouse scope by a mile. Remember that data
warehousing
is an iterative process - no one can ever meet all the requirements all at once.

Time Requirement

1 - 2 weeks.

Deliverables

. Documentation specifying the OLAP cube dimensions and measures.


. Actual OLAP cube / report.

Possible Pitfalls

Make sure your OLAP cube-building process is optimized. It is common for the data
warehouse to be at the bottom of the nightly batch load, and after the loading of the
data warehouse, there usually isn't much time remaining for the OLAP cube to be
refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths
to ensure optimal performance.

Front End Development

Task Description

Regardless of the strength of the OLAP engine and the integrity of the data, if the
users
cannot visualize the reports, the data warehouse brings zero value to them. Hence
front
end development is an important part of a data warehousing initiative.

So what are the things to look out for in selecting a front-end deployment methodology?
The most important thing is that the reports need to be delivered over the web, so that
the only thing the user needs is a standard browser. These days it is neither desirable
nor feasible to have the IT department do program installations on end users' desktops
just so that they can view reports. So, whatever strategy one pursues, make sure the
ability to deliver over the web is included.

The front-end options range from internal front-end development using scripting
languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal
Reports, to higher-level products such as Actuate. In addition, many OLAP vendors offer
a front-end of their own. When choosing a vendor tool, make sure it can be easily
customized to suit the enterprise, especially with regard to possible changes to the
reporting requirements of the enterprise. Possible changes include not just differences
in report layout and report content, but also possible changes in the back-end structure.
For example, if the enterprise decides to change from Solaris/Oracle to Microsoft
2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes
without much modification?

Another area to be concerned with is the complexity of the reporting tool. For
example,
do the reports need to be published on a regular interval? Are there very specific
formatting requirements? Is there a need for a GUI interface so that each user can
customize her reports?

Time Requirement

1 - 4 weeks.

Deliverables

Front End Deployment Documentation

Possible Pitfalls

Just remember that the end users do not care how complex or how technologically
advanced your front end infrastructure is. All they care about is that they receive their
information in a timely manner and in the way they specified.

Report Development
Task Description

Report specification typically comes directly from the requirements phase. To the
end
user, the only direct touchpoint he or she has with the data warehousing system is
the
reports they see. So, report development, although not as time consuming as some of

the other steps such as ETL and data modeling, nevertheless plays a very important
role in determining the success of the data warehousing project.

One would think that report development is an easy task. How hard can it be to just

follow instructions to build the report? Unfortunately, this is not true. There are
several
points the data warehousing team needs to pay attention to before releasing the
report.

User customization: Do users need to be able to select their own metrics? And how
do
users need to be able to filter the information? The report development process
needs
to take those factors into consideration so that users can get the information they
need
in the shortest amount of time possible.

Report delivery: What report delivery methods are needed? In addition to delivering

the report to the web front end, other possibilities include delivery via email,
via text
messaging, or in some form of spreadsheet. There are reporting solutions in the
marketplace that support report delivery as a flash file. Such a flash file essentially acts as
a mini-cube, and allows end users to slice and dice the data on the report without
having to pull data from an external source.

Access privileges: Special attention needs to be paid to who has what access to
what
information. A sales report can show 8 metrics covering the entire company to the
company CEO, while the same report may only show 5 of the metrics covering only a
single district to a District Sales Director.
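
One common way to enforce this kind of row-level restriction, sketched below with
hypothetical table and column names (actual reporting tools each have their own security
layer for this), is to join the report query to a table that maps users to the districts they
may see:

    SELECT f.district_id,
           SUM(f.sales_amount) AS sales_amount
    FROM   f_sales f
           JOIN user_district_access a
             ON a.district_id = f.district_id
    WHERE  a.user_name = :current_user   -- supplied by the reporting tool at run time
    GROUP BY f.district_id;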

Report development does not happen only during the implementation phase. After the
system goes into production, there will certainly be requests for additional
reports.
These types of requests generally fall into two broad categories:

1. Data is already available in the data warehouse. In this case, it should be fairly
straightforward to develop the new report into the front end. There is no need to wait for
a major production push before making new reports available.

2. Data is not yet available in the data warehouse. This means that the request
needs to
be prioritized and put into a future data warehousing development cycle.
Time Requirement

1 - 2 weeks.
Deliverables

. Report Specification Documentation.


. Reports set up in the front end / reports delivered to user's preferred channel.

Possible Pitfalls

Make sure the exact definitions of the report are communicated to the users. Otherwise,
user interpretation of the report can be erroneous.

Performance Tuning

Task Description

There are three major areas where a data warehousing system can use a little
performance tuning:

. ETL - Given that the data load is usually a very time-consuming process (and hence
it is typically relegated to a nightly load job) and that data warehousing-related
batch jobs are typically of lower priority, that means that the window for data loading is
not very long. A data warehousing system that has its ETL process finishing right on-
time is going to have a lot of problems simply because often the jobs do not get started
on-time due to factors that are beyond the control of the data warehousing team. As a
result, it is always an excellent idea for the data warehousing group to tune the ETL
process as much as possible.
. Query Processing - Sometimes, especially in a ROLAP environment or in a system
where the reports are run directly against the relational database, query performance
can be an issue. A study has shown that users typically lose interest after 30 seconds of
waiting for a report to return. My experience has been that ROLAP reports or reports that
run directly against the RDBMS often exceed this time limit, and it is hence ideal for the
data warehousing team to invest some time to tune the queries, especially the most
popular ones. A number of query optimization ideas are presented in the Query
Optimization section.
. Report Delivery - It is also possible that end users are experiencing significant
delays in
receiving their reports due to factors other than the query performance. For
example,
network traffic, server setup, and even the way that the front-end was built
sometimes
play significant roles. It is important for the data warehouse team to look into
these areas
for performance tuning.

Time Requirement
3 - 5 days.

Deliverables

. Performance tuning document - Goal and Result

Possible Pitfalls

Make sure the development environment mimics the production environment as much
as possible - Performance enhancements seen on less powerful machines sometimes
do not materialize on the larger, production-level machines.

Query Optimization

For any production database, SQL query performance becomes an issue sooner or
later. Having long-running queries not only consumes system resources that make the
server and application run slowly, but may also lead to table locking and data corruption
issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Nowadays all databases have their own query optimizer, and offer a way for users to
understand how a query is executed. For example, which index from which table is
being used to execute the query? The first step to query optimization is understanding
what the database is doing. Different databases have different commands for this. For
example, in MySQL, one can use the "EXPLAIN [SQL Query]" keyword to see the query
plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
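
For instance, in MySQL the plan for a report query can be inspected like this (the table,
column, and index names are made up for illustration); the output shows which tables
are scanned and which indexes, if any, are used:

    EXPLAIN
    SELECT store_id, SUM(sales_amount)
    FROM   f_sales
    WHERE  sales_date = '2006-01-15'
    GROUP BY store_id;

    -- In Oracle, the equivalent first step is:
    --   EXPLAIN PLAN FOR <the same query>;
    -- after which the plan can be read back from the plan table.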

2. Retrieve as little data as possible

The more data returned from the query, the more resources the database needs to
expend to process and store this data. So, for example, if you only need to retrieve
one column from a table, do not use 'SELECT *'.
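
A minimal before-and-after sketch (hypothetical table and column names):

    -- Wasteful: returns every column even though only one is needed
    SELECT * FROM f_sales WHERE sales_date = '2006-01-15';

    -- Better: retrieve only the column actually required
    SELECT sales_amount FROM f_sales WHERE sales_date = '2006-01-15';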

3. Store intermediate results

Sometimes logic for a query can be quite complex. Often, it is possible to achieve
the
desired result through the use of subqueries, inline views, and UNION-type
statements.
For those cases, the intermediate results are not stored in the database, but are
immediately used within the query. This can lead to performance issues, especially
when the intermediate results have a large number of rows.

The way to increase query performance in those cases is to store the intermediate
results in a temporary table, and break up the initial SQL statement into several
SQL
statements. In many cases, you can even build an index on the temporary table to
speed up the query performance even more. Granted, this adds a little complexity in

query management (i.e., the need to manage temporary tables), but the speedup in
query performance is often worth the trouble.
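
A sketch of this approach (hypothetical table names; the exact temporary-table syntax
varies from database to database):

    -- Step 1: materialize the intermediate result once
    CREATE TEMPORARY TABLE tmp_daily_sales AS
    SELECT store_id, sales_date, SUM(sales_amount) AS daily_sales
    FROM   f_sales
    GROUP BY store_id, sales_date;

    -- Optionally index the temporary table to speed up the next step
    CREATE INDEX idx_tmp_store ON tmp_daily_sales (store_id);

    -- Step 2: run the rest of the logic against the much smaller intermediate table
    SELECT store_id, AVG(daily_sales) AS avg_daily_sales
    FROM   tmp_daily_sales
    GROUP BY store_id;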

Below are several specific query optimization strategies.

. Use Index
Using an index is the first strategy one should use to speed up a query. In fact,
this strategy is so important that index optimization is also discussed.
. Aggregate Table
Pre-populating tables at higher levels so that less data needs to be parsed (a sketch of
this idea follows this list).
. Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a
SQL query needs to process.
. Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the
amount of data a SQL query needs to process.
. Denormalization
The process of denormalization combines multiple tables into a single table. This
speeds up query performance because fewer table joins are needed.
. Server Tuning
Each server has its own parameters, and often tuning server parameters so that
it can fully take advantage of the hardware resources can significantly speed up
query performance.
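
As an example of the aggregate table strategy mentioned above (a sketch with
hypothetical names), a daily, store-level fact table can be pre-summarized to the monthly
level so that month-level reports no longer scan the detailed rows:

    -- Build the aggregate once (typically refreshed as part of the nightly load)
    CREATE TABLE f_sales_monthly AS
    SELECT d_date.year_number,
           d_date.month_number,
           f_sales.store_key,
           SUM(f_sales.sales_amount) AS sales_amount
    FROM   f_sales
           JOIN d_date ON f_sales.date_key = d_date.date_key
    GROUP BY d_date.year_number, d_date.month_number, f_sales.store_key;

    -- Month-level reports now read the small aggregate table
    SELECT year_number, month_number, SUM(sales_amount)
    FROM   f_sales_monthly
    GROUP BY year_number, month_number;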

Quality Assurance

Task Description

Once the development team declares that everything is ready for further testing,
the QA
team takes over. The QA team is always from the client. Usually the QA team members

will know little about data warehousing, and some of them may even resent the need
to
have to learn another tool or tools. This makes the QA process a tricky one.

Sometimes the QA process is overlooked. On my very first data warehousing project,
the project team worked very hard to get everything ready for Phase 1, and everyone
thought that we had met the deadline. There was one mistake, though: the project
managers failed to recognize that it was necessary to go through the client QA process
before the project could go into production. As a result, it took five extra months to bring
the project to production (the original development time had been only 2 1/2 months).

Time Requirement

1 - 4 weeks.

Deliverables

. QA Test Plan
. QA verification that the data warehousing system is ready to go to production

Possible Pitfalls

As mentioned above, usually the QA team members know little about data
warehousing, and some of them may even resent the need to have to learn another
tool
or tools. Make sure the QA team members get enough education so that they can
complete the testing themselves.

Production Maintenance

Task Description

Once the data warehouse goes into production, it needs to be maintained. Tasks such as
regular backup and crisis management become important and should be planned out.
In addition, it is very important to consistently monitor end user usage. This
serves two
purposes: 1. To capture any runaway requests so that they can be fixed before
slowing
the entire system down, and 2. To understand how much users are utilizing the data
warehouse for return-on-investment calculations and future enhancement
considerations.

Time Requirement

Ongoing.

Deliverables

Consistent availability of the data warehousing system to the end users.


Possible Pitfalls

Usually by this time most, if not all, of the developers will have left the
project, so it is
essential that proper documentation is left for those who are handling production
maintenance. There is nothing more frustrating than staring at something another
person did, yet unable to figure it out due to the lack of proper documentation.

Another pitfall is that the maintenance phase is usually boring. So, if there is
another
phase of the data warehouse planned, start on that as soon as possible.

Incremental Enhancements
Task Description

Once the data warehousing system goes live, there are often needs for incremental
enhancements. I am not talking about a new data warehousing phase, but simply small
changes that follow the business itself. For example, the original geographical
designations may change: the company may originally have had 4 sales regions, but
because sales are going so well, it now has 10 sales regions.

Deliverables

. Change management documentation


. Actual change to the data warehousing system

Possible Pitfalls

Because a lot of times the changes are simple to make, it is very tempting to just
go
ahead and make the change in production. This is a definite no-no. Many unexpected
problems will pop up if this is done. I would very strongly recommend that the
typical
cycle of development --> QA --> Production be followed, regardless of how simple
the
change may seem.

Observations
This section lists the trends I have seen based on my experience in the data
warehousing field:

Quick implementation time

Lack of collaboration with data mining efforts

Industry consolidation

How to measure success

Recipes for data warehousing project failure

Quick Implementation Time

If you add up the total time required to complete the tasks from Requirement
Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to
complete
each phase of the data warehousing efforts. The 9 weeks may sound too quick, but I
have been personally involved in a turnkey data warehousing implementation that
took
40 business days, so that is entirely possible. Furthermore, some of the tasks may
proceed in parallel, so as a rule of thumb it is reasonable to say that it
generally takes 2
- 6 months for each phase of the data warehousing implementation.

Why is this important? The main reason is that in today's business world, the
business
environment changes quickly, which means that what is important now may not be
important 6 months from now. For example, even the traditionally static financial
industry is coming up with new products and new ways to generate revenue at a rapid
pace. Therefore, a time-consuming data warehousing effort will very likely become
obsolete by the time it is in production. It is best to finish a project quickly.
The focus on
quick delivery time does mean, however, that the scope for each phase of the data
warehousing project will necessarily be limited. In this case, the 80-20 rule
applies, and
our goal is to do the 20% of the work that will satisfy 80% of the user needs. The
rest
can come later.

Lack Of Collaboration With Data Mining Efforts

Usually data mining is viewed as the final manifestation of the data warehouse. The
ideal is that, now that information from all over the enterprise is conformed and stored in a
central location, data mining techniques can be applied to find relationships that are
otherwise not possible to find. Unfortunately, this has not quite happened, due to the
following reasons:

1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently
they are more likely to have isolated data marts. At the data mart level, it is difficult to
come up with relationships that cannot be answered by a good OLAP tool.

2. The ROI for data mining companies is inherently lower because by definition,
data
mining will only be performed by a few users (generally no more than 5) in the
entire
enterprise. As a result, it is hard to charge a lot of money due to the low number
of
users. In addition, developing data mining algorithms is an inherently complex
process
and requires a lot of up front investment. Finally, it is difficult for the vendor
to put a
value proposition in front of the client because quantifying the returns on a data
mining
project is next to impossible.

This is not to say, however, that data mining is not being utilized by enterprises.
In fact,
many enterprises have made excellent discoveries using data mining techniques. What

I am saying, though, is that data mining is typically not associated with a data
warehousing initiative. It seems like successful data mining projects are usually
stand-
alone projects.

Industry Consolidation

In the last several years, we have seen rapid industry consolidation, as the weaker
competitors
are gobbled up by stronger players. The most significant transactions are below
(note that the
dollar amount quoted is the value of the deal when initially announced):

. IBM purchased Cognos for $5 billion in 2007.


. SAP purchased Business Objects for $6.8 billion in 2007.
. Oracle purchased Hyperion for $3.3 billion in 2007.
. Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million
in
2006.
. Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million
in 2006.
. IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in
2005.
. Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million
in
2003.
. Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.
. GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.

For the majority of the deals, the purchase represents an effort by the buyer to
expand into other
areas of data warehousing (Hyperion's purchase of Brio also falls into this
category because,
even though both are OLAP vendors, their product lines do not overlap). This
clearly shows
vendors' strong push to be the one-stop shop, from reporting, OLAP, to ETL.

There are two levels of one-stop shop. The first level is at the corporate level.
In this case, the
vendor is essentially still selling two entirely separate products. But instead of
dealing with two
sets of sales and technology support groups, the customers only interact with one
such group.
The second level is at the product level. In this case, different products are
integrated. In data
warehousing, this essentially means that they share the same metadata layer. This
is actually a
rather difficult task, and therefore not commonly accomplished. When there is
metadata
integration, the customers not only get the benefit of only having to deal with one
vendor instead
of two (or more), but the customer will be using a single product, rather than
multiple products.
This is where the real value of industry consolidation is shown.

How To Measure Success

Given the significant amount of resources usually invested in a data warehousing


project, a very important question is how success can be measured. This is a
question
that many project managers do not think about, and for good reason: Many project
managers are brought in to build the data warehousing system, and then turn it over
to
in-house staff for ongoing maintenance. The job of the project manager is to build
the
system, not to justify its existence.

Just because this is often not done does not mean this is not important. Just like
a data
warehousing system aims to measure the pulse of the company, the success of the
data warehousing system itself needs to be measured. Without some type of measure
on the return on investment (ROI), how does the company know whether it made the
right choice? Whether it should continue with the data warehousing investment?

There are a number of papers out there that provide formula on how to calculate the

return on a data warehousing investment. Some of the calculations become quite


cumbersome, with a number of assumptions and even more variables. Although they
are all valid methods, I believe the success of the data warehousing system can
simply
be measured by looking at one criteria:

How often the system is being used.

If the system is satisfying user needs, users will naturally use the system. If
not, users
will abandon the system, and a data warehousing system with no users is actually a
detriment to the company (since resources that can be deployed elsewhere are
required
to maintain the system). Therefore, it is very important to have a tracking
mechanism to
figure out how much are the users accessing the data warehouse. This should not be
a
problem if third-party reporting/OLAP tools are used, since they all contain this
component. If the reporting tool is built from scratch, this feature needs to be
included in
the tool. Once the system goes into production, the data warehousing team needs to
periodically check to make sure users are using the system. If usage starts to dip,
find
out why and address the reason as soon as possible. Is the data quality lacking?
Are
the reports not satisfying current needs? Is the response time slow? Whatever the
reason, take steps to address it as soon as possible, so that the data warehousing
system is serving its purpose successfully.
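
If the usage data is accessible as a database table, a periodic check can be as simple as
the query below (the table and column names are hypothetical; third-party reporting and
OLAP tools expose this information through their own administration consoles):

    -- Weekly count of distinct users and report runs over roughly the last quarter,
    -- used to spot a dip in data warehouse usage early.
    SELECT YEAR(run_timestamp)       AS run_year,
           WEEK(run_timestamp)       AS run_week,
           COUNT(DISTINCT user_name) AS active_users,
           COUNT(*)                  AS report_runs
    FROM   report_usage_log
    WHERE  run_timestamp >= CURRENT_DATE - INTERVAL 90 DAY
    GROUP BY YEAR(run_timestamp), WEEK(run_timestamp)
    ORDER BY run_year, run_week;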

Business Intelligence

Business intelligence is a term commonly associated with data warehousing. In fact,
many of the tool vendors position their products as business intelligence software rather
than data warehousing software. There are other occasions where the two terms are
used interchangeably. So, exactly what is business intelligence?

Business intelligence usually refers to the information that is available for the
enterprise to make decisions on. A data warehousing (or data mart) system is the
backend, or the infrastructural, component for achieving business intelligence. Business
intelligence also includes the insight gained from doing data mining analysis, as well as
unstructured data (thus the need for content management systems). For our purposes
here, we will discuss business intelligence in the context of using a data warehouse
infrastructure.

This section includes the following:

Business intelligence tools: Tools commonly used for business intelligence.

Business intelligence uses: Different forms of business intelligence.

Business intelligence news: News in the business intelligence area.

Tools

The most common tools used for business intelligence are as follows. They are
listed in
the following order: Increasing cost, increasing functionality, increasing business

intelligence complexity, and decreasing number of total users.

Excel
Take a guess what's the most common business intelligence tool? You might be
surprised to find out it's Microsoft Excel. There are several reasons for this:

1. It's relatively cheap.

2. It's commonly used. You can easily send an Excel sheet to another person without

worrying whether the recipient knows how to read the numbers.

3. It has most of the functionalities users need to display data.

In fact, it is still so popular that all third-party reporting / OLAP tools have an
"export to
Excel" functionality. Even for home-built solutions, the ability to export numbers
to Excel
usually needs to be built.

Excel is best used for business operations reporting and goals tracking.

Reporting tool

In this discussion, I am including both custom-built reporting tools and commercial
reporting tools. They provide some flexibility in terms of the ability for each user
to create, schedule, and run their own reports. The Reporting Tool Selection section
discusses how one should select a reporting tool.

Business operations reporting and dashboards are the most common applications for a
reporting tool.

OLAP tool

OLAP tools are usually used by advanced users. They make it easy for users to look at
the data from multiple dimensions. The OLAP Tool Selection section discusses how
one should select an OLAP tool.

OLAP tools are used for multidimensional analysis.

Data mining tool

Data mining tools are usually used only by very specialized users, and in an organization,
even a large one, there are usually only a handful of users using data mining tools.

Data mining tools are used for finding correlation among different factors.
Concepts

Several concepts are of particular importance to data warehousing. They are discussed
in detail in this section.

Dimensional Data Model: Dimensional data model is commonly used in data
warehousing systems. This section describes this modeling technique, and the two
common schema types, star schema and snowflake schema.

Slowly Changing Dimension: This is a common issue facing data warehousing
practitioners. This section explains the problem, and describes the three ways of
handling this problem with examples.

Conceptual Data Model: What is a conceptual data model, its features, and an
example of this type of data model.

Logical Data Model: What is a logical data model, its features, and an example of this
type of data model.

Physical Data Model: What is a physical data model, its features, and an example of
this type of data model.

Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a
data model. This section compares and contrasts the three different types of data
models.

Data Integrity: What is data integrity and how it is enforced in data warehousing.

What is OLAP: Definition of OLAP.

MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology?
This section discusses how they are different from the other, and the advantages
and
disadvantages of each.

Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a
different view of the role between data warehouse and data mart.

Factless Fact Table: A fact table without any fact may sound silly, but there are
real life
instances when a factless fact table is useful in data warehousing.

Junk Dimension: Discusses the concept of a junk dimension: When to use it and why
it is useful.

Conformed Dimension: Discusses the concept of a conformed dimension: What it is
and why it is important.

Dimensional Data Model

Dimensional data model is most often used in data warehousing systems. This is
different from the 3rd normal form, commonly used for transactional (OLTP) type
systems. As you can imagine, the same data would then be stored differently in a
dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly
used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the
Time Dimension.

Hierarchy: The specification of levels that represents the relationship between different
attributes within a dimension. For example, one possible hierarchy in the Time
dimension is Year > Quarter > Month > Day.

Fact Table: A fact table is a table that contains the measures of interest. For
example,
sales amount would be such a measure. This measure is stored in the fact table with

the appropriate granularity. For example, it can be sales amount by store by day.
In this
case, the fact table would contain three columns: A date column, a store column,
and a
sales amount column.

Lookup Table: The lookup table provides the detailed information about the
attributes.
For example, the lookup table for the Quarter attribute would include a list of all
of the
quarters available in the data warehouse. Each row (each quarter) may have several
fields, one for the unique ID that identifies the quarter, and one or more
additional fields
that specify how that particular quarter is represented on a report (for example, first
first
quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to
one
or more lookup tables, but fact tables do not have direct relationships to one
another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the
non-
key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used
schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and
business needs. Personally, I am partial to snowflakes, when there is a business
case
to analyze the information at that particular level.

Fact Table Granularity

Granularity

The first step in designing a fact table is to determine the granularity of the
fact table.
By granularity, we mean the lowest level of information that will be stored in the
fact
table. This constitutes two steps:

1. Determine which dimensions will be included.


2. Determine where along the hierarchy of each dimension the information will be
kept.

The determining factors usually go back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because
business processes will often clearly dictate what the relevant dimensions are.

For example, in an off-line retail world, the dimensions for a sales fact table are
usually
time, geography, and product. This list, however, is by no means a complete list
for all
off-line retailers. A supermarket with a Rewards Card program, where customers
provide some personal information in exchange for a rewards card, and the
supermarket would offer lower prices for certain items for customers who present a
rewards card at checkout, will also have the ability to track the customer
dimension.
Whether the data warehousing system includes the customer dimension will then be a
decision that needs to be made.

What Level Within Each Dimension To Include

Determining at which level of the hierarchy the information is stored along each dimension
is a bit trickier. This is where user requirements (both stated and possibly future) play a
major role.

In the above example, will the supermarket want to do analysis at the hourly
level? (i.e., looking at how certain products may sell by different hours of the day.) If so,
it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If
daily analysis is sufficient, then 'day' can be used as the lowest level of granularity.
Since the lower the level of detail, the larger the data amount in the fact table,
the
granularity exercise is in essence figuring out the sweet spot in the tradeoff
between
detailed level of analysis and data storage.
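
The asymmetry behind this tradeoff is easy to see in SQL (hypothetical table names): an
hourly-grain fact table can always be rolled up to daily totals, but a daily-grain table can
never be broken back down into hours:

    -- Rolling an hourly-grain fact table up to the daily level is straightforward...
    SELECT sales_date, store_key, SUM(sales_amount) AS daily_sales
    FROM   f_sales_hourly
    GROUP BY sales_date, store_key;

    -- ...but no query can recover hourly figures from a fact table
    -- that only stores daily totals.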

Note that sometimes the users will not specify certain requirements, but based on
industry knowledge, the data warehousing team may foresee that certain requirements
will be forthcoming that may result in the need for additional details. In such cases, it is
prudent for the data warehousing team to design the fact table such that lower-level
information is included. This will avoid possibly needing to re-design the fact table in the
future. On the other hand, trying to anticipate all future requirements is an impossible
and hence futile exercise, and the data warehousing team needs to fight the urge of the
"dumping the lowest level of detail into the data warehouse" symptom, and only
include what is practically needed. Sometimes this can be more of an art than a science,
and prior experience will become invaluable here.

Fact And Fact Table Types

Types of Facts

There are three types of facts:

. Additive: Additive facts are facts that can be summed up through all of the
dimensions
in the fact table.
. Semi-Additive: Semi-additive facts are facts that can be summed up for some of
the
dimensions in the fact table, but not the others.
. Non-Additive: Non-additive facts are facts that cannot be summed up for any of
the
dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first
example
assumes that we are a retailer, and we have a fact table with the following
columns:

. Date
. Store
. Product
. Sales_Amount

The purpose of this table is to record the sales amount for each product in each
store
on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an
additive
fact, because you can sum up this fact along any of the three dimensions present in
the
fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7
days in a week represents the total sales amount for that week.
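
As a small illustration, summing an additive fact is a plain GROUP BY. The SQL below is
a hedged sketch only: it assumes the columns above live in a table named SALES_FACT,
which is not named in the original example, and that the Date column is usable as-is
(some databases require quoting a column named Date).

-- Weekly sales per store: the additive fact Sales_Amount can simply be summed.
SELECT Store,
       SUM(Sales_Amount) AS Weekly_Sales
FROM   SALES_FACT
WHERE  Date BETWEEN DATE '2003-01-06' AND DATE '2003-01-12'  -- any 7-day window
GROUP  BY Store;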

Say we are a bank with the following fact table:

. Date
. Account
. Current_Balance
. Profit_Margin

The purpose of this table is to record the current balance for each account at the
end of
each day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-
additive fact, as it makes sense to add it up across all accounts (what's the total current
balance for all accounts in the bank?), but it does not make sense to add it up
through time (adding up the current balance of a given account for each day of the
month does not give us any useful information). Profit_Margin is a non-additive fact,
for it does not make sense to add it up at either the account level or the day level.
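
A hedged sketch of how the semi-additive balance is usually handled (BANK_FACT is an
assumed table name, not one given in the example): sum it across accounts for a single
day, but average it (or take the period-end value) rather than sum it across days.

-- Valid: total balance across all accounts for one day.
SELECT SUM(Current_Balance) AS Total_Balance
FROM   BANK_FACT
WHERE  Date = DATE '2003-01-15';

-- Across time, summing the balance is meaningless, so an average daily balance
-- (or the month-end balance) is typically reported instead.
SELECT Account,
       AVG(Current_Balance) AS Avg_Daily_Balance
FROM   BANK_FACT
WHERE  Date BETWEEN DATE '2003-01-01' AND DATE '2003-01-31'
GROUP  BY Account;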

Types of Fact Tables

Based on the above classifications, there are two types of fact tables:

. Cumulative: This type of fact table describes what has happened over a period of
time.
For example, this fact table may describe the total sales by product by store by
day. The
facts for this type of fact table are mostly additive facts. The first example
presented
here is a cumulative fact table.
. Snapshot: This type of fact table describes the state of things in a particular
instance of
time, and usually includes more semi-additive and non-additive facts. The second
example presented here is a snapshot fact table.

Star Schema

In the star schema design, a single object (the fact table) sits in the middle and
is
radially connected to other surrounding objects (dimension lookup tables) like a
star.
Each dimension is represented as a single table. The primary key in each dimension
table is related to a foreign key in the fact table.

Sample star schema

All measures in the fact table are related to all the dimensions that fact table is
related
to. In other words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a

complex star can have more than one fact table.

Let's look at an example: Assume our data warehouse keeps store sales data, and the

different dimensions are time, store, product, and customer. In this case, the figure on
the left represents our star schema. The lines between two tables indicate that
there is a
primary key / foreign key relationship between the two tables. Note that different
dimensions are not related to one another.
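
A minimal DDL sketch of this star schema follows. All table and column names are assumed
for illustration only; a real design would carry more attributes and usually surrogate
keys generated by the ETL process. Note that, as described above, the dimension tables
reference nothing; only the fact table carries foreign keys.

-- Dimension lookup tables: one table per dimension, each with its own primary key.
CREATE TABLE DIM_DATE     (DATE_KEY INTEGER PRIMARY KEY, FULL_DATE DATE, CAL_YEAR INTEGER, CAL_MONTH INTEGER);
CREATE TABLE DIM_STORE    (STORE_KEY INTEGER PRIMARY KEY, STORE_NAME VARCHAR(50), REGION VARCHAR(50));
CREATE TABLE DIM_PRODUCT  (PRODUCT_KEY INTEGER PRIMARY KEY, PRODUCT_NAME VARCHAR(50), CATEGORY VARCHAR(50));
CREATE TABLE DIM_CUSTOMER (CUSTOMER_KEY INTEGER PRIMARY KEY, CUSTOMER_NAME VARCHAR(50), STATE VARCHAR(30));

-- Fact table in the middle: a foreign key to every dimension plus the measures,
-- all stored at the same level of granularity.
CREATE TABLE SALES_FACT (
    DATE_KEY     INTEGER NOT NULL REFERENCES DIM_DATE(DATE_KEY),
    STORE_KEY    INTEGER NOT NULL REFERENCES DIM_STORE(STORE_KEY),
    PRODUCT_KEY  INTEGER NOT NULL REFERENCES DIM_PRODUCT(PRODUCT_KEY),
    CUSTOMER_KEY INTEGER NOT NULL REFERENCES DIM_CUSTOMER(CUSTOMER_KEY),
    SALES_AMOUNT DECIMAL(12,2),
    PRIMARY KEY (DATE_KEY, STORE_KEY, PRODUCT_KEY, CUSTOMER_KEY)
);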

Slowly Changing Dimensions

The "Slowly Changing Dimension" problem is a common one particular to data


warehousing. In a nutshell, this applies to cases where the attribute for a record
varies
over time. We give an example below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the
original
entry in the customer lookup table has the following record:

Customer Key   Name        State
1001           Christina   Illinois

At a later date, she moved to Los Angeles, California in January 2003. How should
ABC Inc. now modify its customer table to reflect this change? This is the "Slowly
Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are
categorized
as follows:

Type 1: The new record replaces the original record. No trace of the old record
exists.

Type 2: A new record is added into the customer dimension table. Therefore, the
customer is treated essentially as two people.

Type 3: The original record is modified to reflect the change.

We next take a look at each of the scenarios and how the data model and the data
looks like for each of them. Finally, we compare and contrast among the three
alternatives.

Type 1 Slowly Changing Dimension

In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the
original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California
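
In SQL terms, a Type 1 change is a single in-place UPDATE. This is only a sketch;
CUSTOMER_DIM is an assumed name for the customer lookup table shown above.

-- Type 1: overwrite the attribute in place; the Illinois value is lost.
UPDATE CUSTOMER_DIM
SET    State = 'California'
WHERE  Customer_Key = 1001;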

Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since
there
is no need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that Christina
lived in Illinois before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the
data
warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent
the new information. Therefore, both the original and the new record will be present.
The new record gets its own primary key.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a
new
row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California
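
A sketch of the same change handled as Type 2 (CUSTOMER_DIM is again an assumed table
name). Many real designs also carry effective/expiry dates or a current-row flag; that
refinement is shown here only as a commented-out assumption.

-- Type 2: keep the old row and insert a new row under a new surrogate key.
INSERT INTO CUSTOMER_DIM (Customer_Key, Name, State)
VALUES (1005, 'Christina', 'California');

-- Optional refinement if the dimension carries a current-row flag:
-- UPDATE CUSTOMER_DIM SET Current_Flag = 'N' WHERE Customer_Key = 1001;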

Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of
rows
for the table is very high to start with, storage and performance can become a
concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.

Type 3 Slowly Changing Dimension

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one
indicating the
current value. There will also be a column that indicates when the current value
becomes active.

In our example, recall we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:

. Customer Key
. Name
. Original State
. Current State
. Effective Date

After Christina moved from Illinois to California, the original information gets
updated,
and we have the following table (assuming the effective date of change is January
15,
2003):
Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

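A Type 3 sketch, again against an assumed CUSTOMER_DIM table (exact ALTER TABLE syntax
varies slightly by database): the extra columns are added once, and each subsequent
change is just an UPDATE.

-- One-time structural change: add the Type 3 columns.
ALTER TABLE CUSTOMER_DIM ADD COLUMN Original_State VARCHAR(30);
ALTER TABLE CUSTOMER_DIM ADD COLUMN Current_State  VARCHAR(30);
ALTER TABLE CUSTOMER_DIM ADD COLUMN Effective_Date DATE;

-- Record the move: the original value is preserved, only the current value changes.
UPDATE CUSTOMER_DIM
SET    Original_State = 'Illinois',
       Current_State  = 'California',
       Effective_Date = DATE '2003-01-15'
WHERE  Customer_Key = 1001;
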
Advantages:

- This does not increase the size of the table, since new information is updated.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more
than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the
data warehouse to track historical changes, and when such changes will only occur
a finite number of times.

Conceptual, Logical, And Physical Data Models

The three levels of data modeling, the conceptual data model, the logical data model,
and the physical data model, were discussed in prior sections. Here we compare these
three types of data models. The table below compares the different features:

Feature                 Conceptual   Logical   Physical
Entity Names                X           X
Entity Relationships        X           X
Attributes                              X
Primary Keys                            X           X
Foreign Keys                            X           X
Table Names                                         X
Column Names                                        X
Column Data Types                                   X

Below we show the conceptual, logical, and physical versions of a single data
model.

Conceptual Model Design

Logical Model Design

Physical Model Design

We can see that the complexity increases from conceptual to logical to physical. This is
why we always start with the conceptual data model (so we understand at a high level
what the different entities in our data are and how they relate to one another), then
move on to the logical data model (so we understand the details of our data without
worrying about how they will actually be implemented), and finally the physical data
model (so we know exactly how to implement our data model in the database of choice). In
a data warehousing project, sometimes the conceptual data model and the logical data
model are considered as a single deliverable.

Data Integrity

Data integrity refers to the validity of data, meaning data is consistent and
correct. In the
data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If
there is no data integrity in the data warehouse, any resulting report and analysis
will
not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs
to be enforced:

Database level

We can enforce data integrity at the database level. Common ways of enforcing data
integrity include:

Referential integrity

The relationship between the primary key of one table and the foreign key of
another
table must always be maintained. For example, a primary key cannot be deleted if
there
is still a foreign key that refers to this primary key.

Primary key / Unique constraint

Primary keys and the UNIQUE constraint are used to make sure every row in a table
can be uniquely identified.

Not NULL vs NULL-able

Columns identified as NOT NULL may not have a NULL value.

Valid Values

Only allowed values are permitted in the database. For example, if a column can
only
have positive integers, a value of '-1' cannot be allowed.
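
The rules above map directly onto standard SQL constraints. The sketch below is hedged:
ORDER_FACT, CUSTOMER_DIM and the columns shown are assumed names used only to
illustrate each constraint type.

CREATE TABLE ORDER_FACT (
    ORDER_KEY    INTEGER PRIMARY KEY,                      -- primary key: every row uniquely identified
    CUSTOMER_KEY INTEGER NOT NULL                          -- NOT NULL: a value is mandatory
                 REFERENCES CUSTOMER_DIM (Customer_Key),   -- referential integrity to the dimension
    QUANTITY     INTEGER CHECK (QUANTITY > 0)              -- valid values: only positive integers allowed
);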

ETL process

For each step of the ETL process, data integrity checks should be put in place to
ensure
that source data is the same as the data in the destination. Most common checks
include record counts or record sums.
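
A typical record-count and record-sum reconciliation can be sketched as one query;
STG_SALES (a staged copy of the source data) and SALES_FACT are assumed table names.

-- Compare row counts and a control total between the staged source data and the
-- loaded fact table; any mismatch flags an ETL data integrity problem.
SELECT 'STG_SALES'  AS table_name, COUNT(*) AS row_count, SUM(SALES_AMOUNT) AS total_amount FROM STG_SALES
UNION ALL
SELECT 'SALES_FACT' AS table_name, COUNT(*) AS row_count, SUM(SALES_AMOUNT) AS total_amount FROM SALES_FACT;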

Access level

We need to ensure that data is not altered by any unauthorized means either during
the
ETL process or in the data warehouse. To do this, there needs to be safeguards
against
unauthorized access to data (including physical access to the servers), as well as
logging of all data access history. Data integrity can only be ensured if there is no
unauthorized access to the data.
What Is OLAP

OLAP stands for On-Line Analytical Processing. The first attempt to provide a
definition
to OLAP was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered
that this particular white paper was sponsored by one of the OLAP tool vendors,
thus
causing it to lose objectivity. The OLAP Report has proposed the FASMI
test: Fast Analysis of Shared Multidimensional Information. For a more detailed
description of both Dr. Codd's rules and the FASMI test, please visit The OLAP
Report.

For people on the business side, the key feature out of the above list is
"Multidimensional." In other words, the ability to analyze metrics in different
dimensions
such as time, geography, gender, product, etc. For example, sales for the company are
up. What region is most responsible for this increase? Which store in this region
is most
responsible for the increase? What particular product category or categories
contributed
the most to the increase? Answering these types of questions in order means that
you
are performing an OLAP analysis.

Depending on the underlying technology used, OLAP can be broadly divided into two
different camps: MOLAP and ROLAP. A discussion of the different OLAP types can be
found in the MOLAP, ROLAP, and HOLAP section.

MOLAP, ROLAP, And HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP
(MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies
that combine MOLAP and ROLAP.

MOLAP

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in
proprietary
formats.

Advantages:

. Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
. Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only doable, but
they return quickly.
Disadvantages:

. Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of
data in the cube itself. This is not to say that the data in the cube cannot be
derived from a large amount of data. Indeed, this is possible. But in this case,
only summary-level information will be included in the cube itself.
. Requires additional investment: Cube technology is often proprietary and does
not already exist in the organization. Therefore, to adopt MOLAP technology,
chances are additional investments in human and capital resources are needed.

ROLAP

This methodology relies on manipulating the data stored in the relational database
to
give the appearance of traditional OLAP's slicing and dicing functionality. In
essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL
statement.
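
For example, reusing the assumed star schema tables sketched earlier, slicing the data
down to one region and one month and dicing it by product category is nothing more than
a join with a WHERE clause:

-- ROLAP "slice and dice": the slicing conditions are ordinary WHERE predicates.
SELECT p.CATEGORY,
       SUM(f.SALES_AMOUNT) AS SALES_AMOUNT
FROM   SALES_FACT  f
JOIN   DIM_STORE   s ON s.STORE_KEY   = f.STORE_KEY
JOIN   DIM_PRODUCT p ON p.PRODUCT_KEY = f.PRODUCT_KEY
JOIN   DIM_DATE    d ON d.DATE_KEY    = f.DATE_KEY
WHERE  s.REGION    = 'East'       -- slice: one region
AND    d.CAL_YEAR  = 2003         -- slice: one month
AND    d.CAL_MONTH = 1
GROUP  BY p.CATEGORY;             -- dice: by product category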

Advantages:

. Can handle large amounts of data: The data size limitation of ROLAP technology
is the limitation on data size of the underlying relational database. In other
words,
ROLAP itself places no limitation on data amount.
. Can leverage functionalities inherent in the relational database: Often,
relational
database already comes with a host of functionalities. ROLAP technologies,
since they sit on top of the relational database, can therefore leverage these
functionalities.

Disadvantages:

. Performance can be slow: Because each ROLAP report is essentially a SQL


query (or multiple SQL queries) in the relational database, the query time can be
long if the underlying data size is large.
. Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL
statements do not fit all needs (for example, it is difficult to perform complex
calculations using SQL), ROLAP technologies are therefore traditionally limited
by what SQL can do. ROLAP vendors have mitigated this risk by building into the
tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.

HOLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the

underlying relational data.

Bill Inmon vs. Ralph Kimball

In the data warehousing field, we often hear discussions on whether a person's or an
organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We
describe below the difference between the two.

Bill Inmon's paradigm: Data warehouse is one part of the overall business
intelligence
system. An enterprise has one data warehouse, and data marts source their
information
from the data warehouse. In the data warehouse, information is stored in 3rd normal

form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts
within the enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data
warehousing philosophies. In reality, the data warehouses in most enterprises are closer
to Ralph Kimball's idea. This is because most data warehouses started out as a
departmental effort, and hence they originated as data marts. Only when more data
marts are built later do they evolve into a data warehouse.

Factless Fact Table


A factless fact table is a fact table that does not have any measures. It is
essentially an
intersection of dimensions. On the surface, a factless fact table does not make
sense, since a
fact table is, after all, about facts. However, there are situations where having
this kind of
relationship makes sense in data warehousing.

For example, think about a record of student attendance in classes. In this case,
the fact
table would consist of 3 dimensions: the student dimension, the time dimension, and
the
class dimension. This factless fact table would look like the following:
The only measure that you can possibly attach to each combination is "1" to show
the
presence of that particular combination. However, adding a fact that always shows 1
is
redundant because we can simply use the COUNT function in SQL to answer the same
questions.

Factless fact tables offer the most flexibility in data warehouse design. For
example,
one can easily answer the following questions with this factless fact table:

. How many students attended a particular class on a particular day?


. How many classes on average does a student attend on a given day?

Without using a factless fact table, we will need two separate fact tables to
answer the
above two questions. With the above factless fact table, it becomes the only fact
table
that's needed.
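
Both questions reduce to COUNT queries over the factless fact table. ATTENDANCE_FACT and
its key columns are assumed names used only for this sketch.

-- How many students attended a particular class on a particular day?
SELECT COUNT(*) AS students_attended
FROM   ATTENDANCE_FACT
WHERE  CLASS_KEY = 101
AND    DATE_KEY  = 20030115;

-- How many classes, on average, does a student attend on a given day?
-- (multiplying by 1.0 avoids integer division of the average in some databases)
SELECT AVG(classes_per_day * 1.0) AS avg_classes_per_student
FROM  (SELECT STUDENT_KEY, DATE_KEY, COUNT(*) AS classes_per_day
       FROM   ATTENDANCE_FACT
       GROUP  BY STUDENT_KEY, DATE_KEY) per_student_day;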

Junk Dimension

In data warehouse design, frequently we run into a situation where there are yes/no
indicator fields in the source system. Through business analysis, we know it is
necessary to keep that information in the fact table. However, if we keep all those
indicator fields in the fact table, not only do we need to build many small dimension
tables, but the amount of information stored in the fact table also increases
tremendously, leading to possible performance and management issues.

Junk dimension is the way to solve this problem. In a junk dimension, we combine
these
indicator fields into a single dimension. This way, we'll only need to build a
single
dimension table, and the number of fields in the fact table, as well as the size of
the fact
table, can be decreased. The content in the junk dimension table is the combination
of
all possible values of the individual indicator fields.

Let's look at an example. Assuming that we have the following fact table:
Fact Table Before Junk Dimension
Fact Table With Junk Dimension

In this example, the last 3 fields are all indicator fields. In this existing
format, each one
of them is a dimension. Using the junk dimension principle, we can combine them
into a
single junk dimension, resulting in the following fact table:

Note that now the number of dimensions in the fact table went from 7 to 5.

The content of the junk dimension table would look like the following:
Junk Dimension Example

In this case, we have 3 possible values for the TXN_CODE field, 2 possible values
for
the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results
in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.

By using a junk dimension to replace the 3 indicator fields, we have decreased the
number of dimensions by 2 and also decreased the number of fields in the fact table
by
2. This will result in a data warehousing environment that offers better performance
as well as being easier to manage.
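
Populating the junk dimension is commonly done by cross-joining the allowed values of
each indicator field. The sketch below is hedged: JUNK_DIM, JUNK_KEY and the example
TXN_CODE values ('SALE', 'RETURN', 'VOID') are assumptions, and the VALUES row
constructors plus ROW_NUMBER() are supported by most, but not all, relational databases.

-- 3 transaction codes x 2 coupon flags x 2 prepay flags = 12 rows.
INSERT INTO JUNK_DIM (JUNK_KEY, TXN_CODE, COUPON_IND, PREPAY_IND)
SELECT ROW_NUMBER() OVER (ORDER BY t.TXN_CODE, c.COUPON_IND, p.PREPAY_IND),
       t.TXN_CODE,
       c.COUPON_IND,
       p.PREPAY_IND
FROM       (VALUES ('SALE'), ('RETURN'), ('VOID')) AS t (TXN_CODE)
CROSS JOIN (VALUES ('Y'), ('N'))                   AS c (COUPON_IND)
CROSS JOIN (VALUES ('Y'), ('N'))                   AS p (PREPAY_IND);

The fact table then stores the single JUNK_KEY instead of the three indicator columns.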

Conformed Dimension

A conformed dimension is a dimension that has exactly the same meaning and content
when being referred from different fact tables. A conformed dimension can refer to
multiple tables in multiple data marts within the same organization. For two
dimension
tables to be considered as conformed, they must either be identical or one must be
a
subset of another. There cannot be any other type of difference between the two
tables.
For example, two dimension tables that are exactly the same except for the primary
key
are not considered conformed dimensions.

Why is conformed dimension important? This goes back to the definition of data
warehouse being "integrated." Integrated means that even if a particular entity had

different meanings and different attributes in the source systems, there must be a
single
version of this entity once the data flows into the data warehouse.

The time dimension is a common conformed dimension in an organization. Usually the
only rules to consider with the time dimension are whether there is a fiscal year in
addition to the calendar year and the definition of a week. Fortunately, both are
relatively easy to resolve. In the case of fiscal vs calendar year, one may go with
either
fiscal or calendar, or an alternative is to have two separate conformed dimensions,
one
for fiscal year and one for calendar year. The definition of a week is also
something that
can be different in large organizations: Finance may use Saturday to Friday, while
marketing may use Sunday to Saturday. In this case, we should decide on a
definition
and move on. The nice thing about the time dimension is once these rules are set,
the
values in the dimension table will never change. For example, October 16th will
never
become the 15th day in October.
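
A hedged sketch of such a conformed date dimension (all names assumed) that carries both
calendar and fiscal attributes plus the agreed week definition, so every fact table in
every data mart joins to the same table and the same rules:

CREATE TABLE DIM_DATE (
    DATE_KEY        INTEGER PRIMARY KEY,   -- e.g. 20031016
    FULL_DATE       DATE NOT NULL,
    CAL_YEAR        INTEGER,
    CAL_MONTH       INTEGER,
    WEEK_START_DATE DATE,                  -- per the agreed Saturday-to-Friday or Sunday-to-Saturday rule
    FISCAL_YEAR     INTEGER,
    FISCAL_QUARTER  INTEGER
);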

Not all conformed dimensions are as easy to produce as the time dimension. An
example is the customer dimension. In any organization with some history, there is
a
high likelihood that different customer databases exist in different parts of the
organization. To achieve a conformed customer dimension means those data must be
compared against each other, rules must be set, and data must be cleansed. In
addition, when we are doing incremental data loads into the data warehouse, we'll
need
to apply the same rules to the new values to make sure we are only adding truly new

customers to the customer dimension.

Building a conformed dimension is also part of the process in master data management,
or MDM. In MDM, one must not only make sure the master data dimensions are
conformed, but that conformity needs to be brought back to the source systems.

Glossary


Aggregation: One way of speeding up query performance. Facts are summed up for selected
dimensions from the original fact table. The resulting aggregate table will have fewer
rows, thus making queries that can use them go faster.

Attribute: Attributes represent a single type of information in a dimension. For


example, year is an
attribute in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when
being
referred to from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below),
but data marts have
a more limited audience and/or data content.

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making process (as
defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data


warehouse system.

Dimension: The same category of information. For example, year, month, day, and
week are all part of
the Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a
dimensional model, there are two types of tables: dimension tables and fact tables.
Dimension tables record information on each dimension, and fact tables record the
facts, or measures.

Dimensional Table: Dimension tables store records related to this particular


dimension. No facts are
stored in a dimensional table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational
database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from
one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically
includes two
types of columns: fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling
down. All
attributes in a hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a
type
of metadata.

Metric: A measured value. For example, total sales is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional


cubes.

OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users
a quick way of slicing and dicing the data.
ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema,


different hierarchies in a dimension can be extended into their own dimensional
tables.
Therefore, a dimension can have more than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension
is represented by a single dimension table.

Master Data Management

What is Master Data Management

Master Data Management (MDM) refers to the process of creating and managing data
that an organization must have as a single master copy, called the master data.
Usually,
master data can include customers, vendors, employees, and products, but can differ

by different industries and even different companies within the same industry. MDM
is
important because it offers the enterprise a single version of the truth. Without a
clearly
defined master data, the enterprise runs the risk of having multiple copies of data
that
are inconsistent with one another.

MDM is typically more important in larger organizations. In fact, the bigger the
organization, the more important the discipline of MDM is, because a bigger
organization means that there are more disparate systems within the company, and
the
difficulty on providing a single source of truth, as well as the benefit of having
master
data, grows with each additional data source. A particularly big challenge to
maintaining
master data occurs when there is a merger/acquisition. Each of the organizations
will
have its own master data, and how to merge the two sets of data will be
challenging.
Let's take a look at the customer files: The two companies will likely have
different
unique identifiers for each customer. Addresses and phone numbers may not match.
One may have a person's maiden name and the other the current last name. One may
have a nickname (such as "Bill") and the other may have the full name (such as
"William"). All these contribute to the difficulty in creating and maintain in a
single set of
master data.

At the heart of the master data management program is the definition of the master
data. Therefore, it is essential that we identify who is responsible for defining
and
enforcing the definition. Due to the importance of master data, a dedicated person
or
team should be appointed. At the minimum, a data steward should be identified. The
responsible party can also be a group -- such as a data governance committee or a
data governance council.
Master Data Management vs Data Warehousing

Based on the discussions so far, it seems like Master Data Management and Data
Warehousing have a lot in common. For example, the effort of data transformation
and
cleansing is very similar to an ETL process in data warehousing, and in fact they
can
use the same ETL tools. In the real world, it is not uncommon to see MDM and data
warehousing fall into the same project. On the other hand, it is important to call
out the
main differences between the two:

1) Different Goals

The main purpose of a data warehouse is to analyze data in a multidimensional


fashion,
while the main purpose of MDM is to create and maintain a single source of truth
for a
particular dimension within the organization. In addition, MDM requires solving the
root
cause of the inconsistent metadata, because master data needs to be propagated back

to the source system in some way. In data warehousing, solving the root cause is
not
always needed, as it may be enough just to have a consistent view at the data
warehousing level rather than having to ensure consistency at the data source
level.

2) Different Types of Data

Master Data Management is only applied to entities and not transactional data,
while a
data warehouse includes data that are both transactional and non-transactional in
nature. The easiest way to think about this is that MDM only affects data that
exists in
dimensional tables and not in fact tables, while a data warehousing environment includes
both dimensional tables and fact tables.

3) Different Reporting Needs

In data warehousing, it is important to deliver to end users the proper types of


reports
using the proper type of reporting tool to facilitate analysis. In MDM, the
reporting needs
are very different -- it is far more important to be able to provide reports on
data
governance, data quality, and compliance, rather than reports based on analytical
needs.

4) Where Data Is Used

In a data warehouse, usually the only usage of this "single source of truth" is for

applications that access the data warehouse directly, or applications that access
systems that source their data straight from the data warehouse. Most of the time,
the
original data sources are not affected. In master data management, on the other
hand,
we often need to have a strategy to get a copy of the master data back to the
source
system. This poses challenges that do not exist in a data warehousing environment.
For
example, how do we sync the data back with the original source? Once a day? Once an
hour? How do we handle cases where the data was modified as it went through the
cleansing process? And how much modification do we need to make to the source
system so it can use the master data? These questions represent some of the
challenges MDM faces. Unfortunately, there is no easy answer to those questions, as
the solution depends on a variety of factors specific to the organization, such as how
many source systems there are, how easy / costly it is to modify the source system, and
even how internal politics play out.

Data Warehousing Concepts

Posted on December 15, 2007 by sailu

Data Warehouse Concepts

What is a Data Warehouse? According to Inmon, a famous author of several
data warehouse books, "A data warehouse is a subject-oriented, integrated,
time-variant, non-volatile collection of data in support of management's
decision making process".

Example: In order to store data, over the years, many application designers
in each branch have made their individual decisions as to how an
application and database should be built. So source systems will be
different in naming conventions, variable measurements, encoding
structures, and physical attributes of data. Consider a bank that has got
several branches in several countries, has millions of customers and the
lines of business of the enterprise are savings, and loans. The following
example explains how the data is integrated from source systems to target
systems.

Example of Source Data

System Name       Attribute Name              Column Name                 Datatype       Values
Source System 1   Customer Application Date   CUSTOMER_APPLICATION_DATE   NUMERIC(8,0)   11012005
Source System 2   Customer Application Date   CUST_APPLICATION_DATE       DATE           11012005
Source System 3   Application Date            APPLICATION_DATE            DATE           01NOV2005

In the aforementioned example, attribute name, column name, datatype and


values are entirely different from one source system to another. This
inconsistency in data can be avoided by integrating the data into a data
warehouse with good standards.

Example of Target Data(Data Warehouse)

Target System   Attribute Name              Column Name                 Datatype   Values
Record #1       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005
Record #2       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005
Record #3       Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE       01112005

In the above example of target data, attribute names, column names, and
data types are consistent throughout the target system. This is how data
from various source systems is integrated and accurately stored into the
data warehouse.


Data Warehouse & Data Mart

A data warehouse is a relational/multidimensional database that is


designed for query and analysis rather than transaction processing. A data
warehouse usually contains historical data that is derived from transaction
data. It separates analysis workload from transaction workload and enables
a business to consolidate data from several sources.

In addition to a relational/multidimensional database, a data warehouse


environment often consists of an ETL solution, an OLAP engine, client
analysis tools, and other applications that manage the process of gathering
data and delivering it to business users.
There are three types of data warehouses:

1. Enterprise Data Warehouse - An enterprise data warehouse provides
a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad enterprise-wide
scope, but unlike the real enterprise data warehouse, data is
refreshed in near real time and used for routine business activity. One
of the typical applications of the ODS is to hold recent data before it is
migrated to the data warehouse. Typically, the ODS is not conceptually
equivalent to the data warehouse, although it does store data with a
deeper level of history than that of the OLTP data.
3. Data Mart - A data mart is a subset of the data warehouse and it supports a
particular region, business unit or business function.

Data warehouses and data marts are built on dimensional data modeling
where fact tables are connected with dimension tables. This is most useful
for users to access data since a database can be visualized as a cube of
several dimensions. A data warehouse provides an opportunity for slicing
and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of data warehouse that is designed for a
particular line of business, such as sales, marketing, or finance. In a
dependent data mart, data can be derived from an enterprise-wide data
warehouse. In an independent data mart, data can be collected directly from
sources.

Figure 1.12 : Data Warehouse and Datamarts


General Information

In general, an organization's objective is to earn money by selling a product
or by providing service for the product. An organization may be at one place
or may have several branches. When we consider an example of an
organization selling products throughout the world, the four major
dimensions are product, location, time and organization. Dimension tables
have been explained in detail under the section Dimensions. With this
example, we will try to provide a detailed explanation of the STAR
SCHEMA.


What is Star Schema?

Star Schema is a relational database schema for representing


multidimensional data. It is the simplest form of data warehouse schema
that contains one or more dimensions and fact tables. It is called a star
schema because the entity-relationship diagram between dimensions and
fact tables resembles a star where one fact table is connected to multiple
dimensions. The center of the star schema consists of a large fact table and
it points towards the dimension tables. The advantages of a star schema are
slicing down, performance increase and easy understanding of data.

Steps in designing Star Schema

. Identify a business process for analysis (like sales).
. Identify measures or facts (sales dollar).
. Identify dimensions for facts (product dimension, location dimension,
time dimension, organization dimension).
. List the columns that describe each dimension (region name, branch
name, etc.).
. Determine the lowest level of summary in a fact table (sales dollar).

Important aspects of Star Schema & Snow Flake Schema

. In a star schema every dimension will have a primary key.
. In a star schema, a dimension table will not have any parent table.
. Whereas in a snowflake schema, a dimension table will have one or
more parent tables.
. Hierarchies for the dimensions are stored in the dimensional table
itself in a star schema.
. Whereas hierarchies are broken into separate tables in a snowflake
schema. These hierarchies help to drill down the data from the topmost
hierarchy to the lowermost hierarchy.


Glossary


Hierarchy

A logical structure that uses ordered levels as a means of organizing data. A


hierarchy can be used to define data aggregation; for example, in a time
dimension, a hierarchy might be used to aggregate data from the Month
level to the Quarter level, from the Quarter level to the Year level. A
hierarchy can also be used to define a navigational drill path, regardless of
whether the levels in the hierarchy represent aggregated totals or not.


Level

A position in a hierarchy. For example, a time dimension might have a


hierarchy that represents data at the Month, Quarter, and Year levels.


Fact Table
A table in a star schema that contains facts and is connected to dimensions. A
fact table typically has two types of columns: those that contain facts and
those that are foreign keys to dimension tables. The primary key of a fact
table is usually a composite key that is made up of all of its foreign keys. A
fact table might contain either detail level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called
summary tables). A fact table usually contains facts with the same level of
aggregation.

Example of Star Schema: Figure 1.6

In the example figure 1.6, the sales fact table is connected to the dimensions
location, product, time and organization. It shows that data can be sliced
across all dimensions and it is also possible for the data to be aggregated
across multiple dimensions. "Sales Dollar" in the sales fact table can be
calculated across all dimensions independently or in a combined manner,
which is explained below.

. Sales Dollar value for a particular product


. Sales Dollar value for a product in a location
. Sales Dollar value for a product in a year within a location
. Sales Dollar value for a product in a year within a location sold or
serviced by an employee


Snowflake Schema

A snowflake schema is a term that describes a star schema structure
normalized through the use of outrigger tables, i.e., dimension table
hierarchies are broken into simpler tables. In the star schema example we had 4
dimensions (location, product, time, organization) and a fact table (sales).

In the snowflake schema, the example diagram shown below has 4 dimension
tables, 4 lookup tables and 1 fact table. The reason is that the
hierarchies (category, branch, state, and month) are being broken out of the
dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME)
respectively and shown separately. In OLAP, this snowflake schema
approach increases the number of joins and results in poorer performance when
retrieving data. A few organizations try to normalize the dimension tables to
save space. Since dimension tables hold relatively little space, the snowflake
schema approach may be avoided.

Example of Snowflake Schema: Figure 1.7
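
A hedged DDL sketch of how one hierarchy (product category) is snowflaked out of the
PRODUCT dimension into its own lookup table; all table and column names here are assumed
for illustration only.

-- Lookup (outrigger) table holding the higher level of the hierarchy.
CREATE TABLE CATEGORY_LOOKUP (
    CATEGORY_KEY  INTEGER PRIMARY KEY,
    CATEGORY_NAME VARCHAR(50)
);

-- The product dimension now references the category instead of storing it.
CREATE TABLE PRODUCT_DIM (
    PRODUCT_KEY  INTEGER PRIMARY KEY,
    PRODUCT_NAME VARCHAR(50),
    CATEGORY_KEY INTEGER REFERENCES CATEGORY_LOOKUP (CATEGORY_KEY)
);

-- Reporting by category now requires one extra join, which is the performance
-- trade-off mentioned above.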


Fact Table

The centralized table in a star schema is called the FACT table. A fact table
typically has two types of columns: those that contain facts and those that
are foreign keys to dimension tables. The primary key of a fact table is
usually a composite key that is made up of all of its foreign keys. In the
example fig 1.6, "Sales Dollar" is a fact (measure) and it can be added across
several dimensions. Fact tables store different types of measures:
additive, non-additive and semi-additive measures.

Measure Types

. Additive - Measures that can be added across all dimensions.
. Non-Additive - Measures that cannot be added across any dimension.
. Semi-Additive - Measures that can be added across some dimensions
and not others.

A fact table might contain either detail level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called
summary tables). In the real world, it is possible to have a fact table that
contains no measures or facts. These tables are called Factless Fact tables.

Steps in designing Fact Table

. Identify a business process for analysis (like sales).
. Identify measures or facts (sales dollar).
. Identify dimensions for facts (product dimension, location dimension,
time dimension, organization dimension).
. List the columns that describe each dimension (region name, branch
name, etc.).
. Determine the lowest level of summary in a fact table (sales dollar).

Example of a Fact Table with an Additive Measure in Star Schema: Figure


1.6

In the example figure 1.6, the sales fact table is connected to the dimensions
location, product, time and organization. The measure "Sales Dollar" in the sales
fact table can be added across all dimensions independently or in a combined
manner, which is explained below (see the SQL sketch after this list).

. Sales Dollar value for a particular product


. Sales Dollar value for a product in a location
. Sales Dollar value for a product in a year within a location
. Sales Dollar value for a product in a year within a location sold or
serviced by an employee
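
These roll-ups correspond to grouping the same fact at different combinations of
dimensions. The sketch below uses assumed table and column names loosely based on
figure 1.6; they are not defined anywhere in this document.

-- Sales Dollar for a particular product.
SELECT p.PRODUCT_NAME, SUM(f.SALES_DOLLAR) AS SALES_DOLLAR
FROM   SALES_FACT  f
JOIN   PRODUCT_DIM p ON p.PRODUCT_KEY = f.PRODUCT_KEY
GROUP  BY p.PRODUCT_NAME;

-- Sales Dollar for a product in a year within a location: add the extra
-- dimensions to the joins and to the GROUP BY.
SELECT p.PRODUCT_NAME, t.CAL_YEAR, l.LOCATION_NAME,
       SUM(f.SALES_DOLLAR) AS SALES_DOLLAR
FROM   SALES_FACT   f
JOIN   PRODUCT_DIM  p ON p.PRODUCT_KEY  = f.PRODUCT_KEY
JOIN   TIME_DIM     t ON t.TIME_KEY     = f.TIME_KEY
JOIN   LOCATION_DIM l ON l.LOCATION_KEY = f.LOCATION_KEY
GROUP  BY p.PRODUCT_NAME, t.CAL_YEAR, l.LOCATION_NAME;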

Database - RDBMS

There are a number of relational databases to store data. A relational


database contains normalized data stored in tables. Tables contain records
and columns. RDBMS makes it easy to work with individual records. Each
row contains a unique instance of data for the categories defined by the
columns.

RDBMS are used in OLTP applications (e.g. ATM cards) very frequently,
and sometimes data warehouses may also use relational databases. Please
refer to Relational data modeling for details on how data from a
source system is normalized and stored in RDBMS databases.

Popular RDBMS Databases

RDBMS Name             Company Name
Oracle                 Oracle Corporation
IBM DB2 UDB            IBM Corporation
IBM Informix           IBM Corporation
Microsoft SQL Server   Microsoft
Sybase                 Sybase Corporation
Teradata               NCR

Informatica Some Interview Questions

Posted on February 20, 2008 by sailu

1. What is source qualifier?


2. Difference between DSS & OLTP?
3. Explain grouped cross tab?
4. Hierarchy of DWH?
5. How many repositories can we create in Informatica?
6. What is surrogate key?
7. What is difference between Mapplet and reusable transformation?
8. What is aggregate awareness?
9. Explain reference cursor?
10. What are parallel queries and query hints?
11. DWH architecture?
12. What are cursors?
13. Advantages of de normalized data?
14. What is operational data source (ODS)?
15. What is meta data and system catalog?
16. What is factless fact schema?
17. What is a conformed dimension?
18. What is the capacity of power cube?
19. Difference between PowerPlay transformer and power play reports?
20. What is IQD file?
21. What is Cognos script editor?
22. What is the difference between macros and prompts?
23. What is power play plug in?
24. Which kind of index is preferred in DWH?
25. What is hash partition?
26. What is DTM session?
27. How can you define a transformation? What are different types of
transformations in Informatica?
28. What is mapplet?
29. What is query panel?
30. What is a look up function? What is default transformation for the look up
function?
31. What is difference between a connected look up and unconnected look up?
32. What is staging area?
33. What is data merging, data cleansing and sampling?
34. What is update strategy and what are the options for update strategy?
35. OLAP architecture?
36. What is subject area?
37. Why do we use DSS database for OLAP tools?

Introduction to Informatica PowerCenter 7.1

Posted on December 11, 2007 by sailu

. Informatica is one of the most popular ETL (Extraction, Transformation
and Loading) tools in the market today. Informatica PowerCenter
provides an environment to load data into a centralized location such as
an Operational Data Store (ODS), a data mart or a data warehouse. You
can extract the data from various data sources such as flat files, any
database or even COBOL files, transform the data based on your business
logic and load data into different types of targets including files and
relational databases.

Informatica Provides the following components:


. Informatica Repository
. Informatica Client
. Informatica Server

Informatica Repository: The Repository is the core of the Informatica suite.
The repository database contains a set of metadata tables that the Informatica
tools and applications access. The Informatica Client and Server access the
repository to save and retrieve metadata.

Informatica Client: The PowerCenter Client is comprised of applications that you use
to manage the repository, design mappings and mapplets, create sessions and workflows
to load the data, and monitor workflow progress. The Informatica Client consists of
three client applications.

. Repository Manager
. Designer
. Server Manager

In this tutorial, you use the following applications and tools:

. Repository Manager. Use the Repository Manager to create and administer the
metadata repository. You use the Repository Manager to create a repository user and
group. You create a folder to store the metadata you create in the lessons.
. Repository Server Administration Console. Use the Repository Server Administration
Console to administer the Repository Servers and repositories.
. Designer. Use the Designer to create mappings that contain transformation
instructions for the PowerCenter Server. Before you can create mappings, you must add
source and target definitions to the repository. Designer comprises the following tools:

. Source Analyzer. Import or create source definitions.


. Warehouse Designer. Import or create target definitions.
. Mapping Designer. Create mappings that the PowerCenter Server uses to extract,
transform, and load
data.
. Workflow Manager. Use the Workflow Manager to create and run workflows and tasks.
A workflow is a
set of instructions that describes how and when to run tasks related to extracting,
transforming, and loading
data.
. Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running
workflows for each
PowerCenter Server.

Informatica Server: The Informatica Server extracts the source data, performs the
data transformation and loads the
transformed data into the targets. Sources accessed by Powercenter

. Relational: Sybase, Oracle, IBM DB2, Informix, MS SQL Server


and Teradata.
. File: Fixed and delimited flat files, COBOL files and XML files
. Extended: If you use PowerCenter, you can purchase additional
PowerConnect products to connect to other business sources such as
SAP R/3, Siebel, etc.
. Mainframes: If you use PowerCenter, you can purchase additional
PowerConnect products to connect to IBM DB2 on MVS
. Others: MS Access and MS Excel

Creating Repository Users

Posted on December 12, 2007 by sailu

Creating Repository Users and Groups

You can create a repository user profile for everyone working in the repository,
each with a separate user name and
password. You can also create user groups and assign each user to one or more
groups. Then, grant repository
privileges to each group, so users in the group can perform tasks within the
repository (such as use the Designer or
create workflows).

The repository user profile is not the same as the database user profile. While a
particular user might not have access
to a database as a database user, that same person can have privileges to a
repository in the database as
a repository user.

Informatica tools include two types of security:

. Privileges. Repository-wide security that controls which task or set of tasks a


single user or group of users can
access.

. Permissions. Security assigned to individual folders within the repository.

PowerCenter uses the following privileges:

. Use Designer

. Browse Repository

. Use Repository Manager

. Use Workflow Manager

. Workflow Operator
. Administer Repository

. Administer Server

. Super User

You can perform various tasks for each privilege. Privileges depend on your group
membership. Every repository user belongs to at least one group. For example, the user
who administers the repository belongs to the Administrators group. By default, you
receive the privileges assigned to your group. While it is most common to assign
privileges by group, the repository administrator, who has either the Super User or
Administer Repository privilege, can also grant privileges to individual users.

An administrator can perform the following tasks

. Create groups.

. Assign privileges to groups.

. Create users and assign them to groups.

In the following steps, you will perform the following tasks:

1. Connect to the repository as an Administrator. If necessary, ask your


administrator for the user name and
password. Otherwise, ask your administrator to complete the lessons in this chapter
for you.

2. Create a group called Group1. To do this, you need to log in to the repository
as the

Administrator.

3. Assign privileges to the Group1 group.

4. Create a new user.

Connecting to the Repository

To perform the following tasks, you need to connect to the repository. If you are
already

connected to the repository, disconnect and connect again to log in as the


Administrator.

Otherwise, ask your administrator to perform the tasks in this chapter for you.

To connect to the Repository,

1. Launch the Repository Manager.

A list of all repositories appears in the Navigator.

2. Double-click your repository.

3. Enter the repository user name and password for the Administrator user. Click
Connect.
The dialog box expands to enter additional information.

4. Enter the host name and port number needed to connect to the repository
database.

5. Click Connect.

You are now connected to the repository as the Administrator user.

Creating Source Tables and Definitions

Posted on December 12, 2007 by sailu

Creating Source Tables

Most of the data warehouses already have existing source tables or flat files.
Before you create source
definitions, you need to create the source tables in the database. In this lesson,
you run an SQL script in the
Warehouse Designer to create sample source tables. The SQL script creates sources
with table names and data.

Note: These SQL Scripts come along with Informatica Power Center software.

When you run the SQL script, you create the following source tables:

. CUSTOMERS

. DEPARTMENT

. DISTRIBUTORS

. EMPLOYEES

. ITEMS

. ITEMS_IN_PROMOTIONS

. JOBS

. MANUFACTURERS

. ORDERS

. ORDER_ITEMS

. PROMOTIONS

. STORES
Generally, you use the Warehouse Designer to create target tables in the target
database. The Warehouse Designer
generates SQL based on the definitions in the workspace. However, we will use this
feature to generate the source
tutorial tables from the tutorial SQL scripts that ship with the product.

To create the sample source tables:

1. Launch the Designer, double-click the icon for your repository, and log into the

repository.

Use your user profile to open the connection.

2. Double-click the Tutorial_yourname folder.

3. Choose Tools-Warehouse Designer to switch to the Warehouse Designer.

4. Choose Targets-Generate/Execute SQL.

The Database Object Generation dialog box gives you several options for creating
tables.

5. Click the Connect button to connect to the source database.

6. Select the ODBC data source you created for connecting to the source database.

7. Enter the database user name and password and click the Connect button.

You now have an open connection to the source database. You know that you are
connected when the Disconnect
button displays and the ODBC name of the source database appears in the dialog box.

8. Make sure the Output window is open at the bottom of the Designer.

If it is not open, choose View-Output.

9. Click the browse button to find the SQL file.

Note : The SQL file is installed in the Tutorial folder in the PowerCenter Client
installation directory.

10. Select the SQL file appropriate to the source database platform you are using.
Click Open.

Alternatively, you can enter the file name and path of the SQL file.

Platform File

Informix SMPL_INF.SQL

Microsoft SQL Server SMPL_MS.SQL

Oracle SMPL_ORA.SQL
Sybase SQL Server SMPL_SYB.SQL

DB2 SMPL_DB2.SQL

Teradata SMPL_TERA_SQL

11. Click Execute SQL file.

The database now executes the SQL script to create the sample source database
objects and to insert values into the
source tables. While the script is running, the Output window displays the
progress.

12. When the script completes, click Disconnect, and then click Close.

Creating Source Definitions

Now we are ready to create the source definitions in the repository based on the
source tables created in the previous
session. The repository contains a description of source tables, not the actual
data contained in them. After you add
these source definitions to the repository, you can use them in a mapping.

To import the sample source definitions:

1. In the Designer, choose Tools-Source Analyzer to open the Source Analyzer.

2. Double-click the tutorial folder to view its contents.

Every folder contains nodes for sources, targets, schemas, mappings, mapplets, and

reusable transformations.

3. Choose Sources-Import from Database.

4. Select the ODBC data source to access the database containing the source tables.

5. Enter the user name and password to connect to this database. Also, enter the
name of

the source table owner, if necessary.

In Oracle, the owner name is the same as the user name. Make sure that the owner
name
is in all caps (for example, JDOE).

6. Click Connect.

7. In the Select tables list, expand the database owner and the TABLES heading.

If you click the All button, you can see all tables in the source database.

You should now see a list of all the tables you created by running the SQL script
in addition to any tables already in
the database.

8. Select the following tables:

. CUSTOMERS

. DEPARTMENT

. DISTRIBUTORS

. EMPLOYEES

. ITEMS

. ITEMS_IN_PROMOTIONS
. JOBS

. MANUFACTURERS

. ORDERS

. ORDER_ITEMS

. PROMOTIONS

. STORES

Tip: Hold down the Ctrl key to select multiple tables. Or, hold down the Shift key
to

select a block of tables. You may need to scroll down the list of tables to select
all tables.

9. Click OK to import the source definitions into the repository.

Creating Target Definitions and Tables

Posted on December 12, 2007 by sailu

You can import target definitions from existing target tables, or you can create
the definitions and then generate and
run the SQL to create the target tables. In this session,we shall create a target
definition in the Warehouse Designer,
and then create a target table based on the definition.

Creating Target Definitions

The next step is to create the metadata for the target tables in the repository.
The actual table that the target
definition describes does not exist yet.

Target definitions define the structure of tables in the target database, or the
structure of file targets the PowerCenter
Server creates when you run a workflow. If you add a target definition to the
repository that does not exist in a
relational database, you need to create target tables in your target database. You
do this by generating and executing
the necessary SQL code within the Warehouse Designer.

In the following steps, you will copy the EMPLOYEES source definition into the
Warehouse Designer to create the
target definition. Then, you will modify the target definition by deleting and
adding columns to create the definition
you want.

1. In the Designer, switch to the Warehouse Designer.


2. Click and drag the EMPLOYEES source definition from the Navigator to the
Warehouse Designer workspace.

The Designer creates a new target definition, EMPLOYEES, with the same column
definitions as the EMPLOYEES
source definition and the same database type.

Next, you will modify the target column definitions.

3. Double-click the EMPLOYEES target definition to open it.

The Edit Tables dialog box appears.

4. Click Rename and name the target definition T_EMPLOYEES.

Note: If you need to change the database type for the target definition (for example, if
your source is Oracle and the target is Teradata), you can select the correct database
type when you edit the target definition.

5. Click the Columns tab.

The target column definitions are the same as the EMPLOYEES source definition.

6. Select the JOB_ID column and click the delete button.

7. Delete the following columns:

. ADDRESS1

. ADDRESS2

. CITY
. STATE

. POSTAL_CODE

. HOME_PHONE

. EMAIL

When you finish, the target definition should look similar to the following target
definition:

Note that the EMPLOYEE_ID column is a primary key. The primary key cannot accept
null values. The Designer
automatically selects Not Null and disables the Not Null option. You now have a
column ready to receive data from
the EMPLOYEE_ID column in the EMPLOYEES source table.

Note: If you want to add a business name for any column, scroll to the right and
enter it.

8. Click OK to save your changes and close the dialog box.

9. Choose Repository-Save.

Creating Target Tables

You can use the Warehouse Designer to generate and run an SQL script to create target tables.
Note: When you use the Warehouse Designer to generate SQL, you can choose to drop
the table in the database
before creating it. To do this, select the Drop Table option. If the target
database already contains tables,
make sure it does not contain a table with the same name as the table you plan to
create. If the table exists in
the database, you lose the existing table and data.

To create the target T_EMPLOYEES table:

1. In the workspace, select the T_EMPLOYEES target definition.

2. Choose Targets-Generate/Execute SQL.

The dialog box to run the SQL script appears.

3. In the Filename field, enter the following text:

C:\[your installation directory]\MKT_EMP.SQL

If you installed the client software in a different location, enter the appropriate
drive letter and directory.

4. If you are connected to the source database from the previous lesson, click
Disconnect, and then click Connect.

5. Select the ODBC data source to connect to the target database.

6. Enter the necessary user name and password, and then click Connect.

7. Select the Create Table, Drop Table, and Primary Key options.

8. Click the Generate and Execute button.

The Designer runs the DDL code needed to create T_EMPLOYEES. If you want to review
the actual code, click
Edit SQL file to open the MKT_EMP.SQL file.
9. Click Close to exit.
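For reference, the generated MKT_EMP.SQL is plain DDL. A minimal sketch of what it might contain is shown below; the column list is illustrative only, since the actual columns depend on your EMPLOYEES source definition and the options you selected.

DROP TABLE T_EMPLOYEES;
CREATE TABLE T_EMPLOYEES (
    EMPLOYEE_ID   NUMBER NOT NULL,      -- primary key carried over from the source
    LAST_NAME     VARCHAR2(40),         -- illustrative remaining columns
    FIRST_NAME    VARCHAR2(40),
    OFFICE_PHONE  VARCHAR2(24),
    CONSTRAINT PK_T_EMPLOYEES PRIMARY KEY (EMPLOYEE_ID)
);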

What is Data Warehousing?


A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems (Ref: Wikipedia). A data warehouse is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. It is a repository of integrated information, available for queries and analysis.
What are fundamental stages of Data Warehousing?
Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact the operational system's performance.
Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.
Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
What is Dimensional Modeling?
The dimensional data model involves two types of tables and is different from the 3rd normal form. This concept uses a Fact table, which contains the measurements of the business, and Dimension tables, which contain the context (dimension of calculation) of the measurements.
What is Fact table?
A fact table contains the measurements of business processes, as well as the foreign keys to the dimension tables. For example, if your business process is "paper production", then "average production of paper by one machine" or "weekly production of paper" would be considered measurements of the business process.
What is Dimension table?
A dimension table contains textual attributes of the measurements stored in the fact tables. A dimension table is a collection of hierarchies, categories and logic which the user can use to traverse hierarchy nodes.
What are the Different methods of loading Dimension tables?
There are two different ways to load data into dimension tables.
Conventional (Slow):
All the constraints and keys are validated against the data before it is loaded; this way data integrity is maintained.
Direct (Fast):
All the constraints and keys are disabled before the data is loaded. Once data is loaded, it is validated against all the constraints and keys. If data is found invalid or dirty, it is not included in the index and all future processes are skipped on this data.
What is OLTP?
OLTP is the abbreviation of On-Line Transaction Processing. This system is an application that modifies data the instant it receives it, and has a large number of concurrent users.

What is OLAP?
OLAP is abbreviation of Online Analytical Processing. This system is an application
that collects,
manages, processes and presents multidimensional data for analysis and management
purposes.
What is the difference between OLTP and OLAP?
Data Source
OLTP: Operational data is from the original data source of the data.
OLAP: Consolidated data is from various sources.
Process Goal
OLTP: Snapshot of business processes which perform fundamental business tasks.
OLAP: Multi-dimensional views of business activities for planning and decision making.
Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries run by the system to update the aggregated data.
Database Design
OLTP: Normalized, small database. Speed is not an issue due to the smaller database, and normalization will not degrade performance. This adopts the entity relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue due to the larger database, and de-normalizing will improve performance as there will be fewer tables to scan while performing tasks. This adopts the star, snowflake or fact constellation model of subject-oriented database design.
Describe the foreign key columns in the fact table and dimension tables.
Foreign keys of dimension tables are primary keys of entity tables.
Foreign keys of fact tables are primary keys of dimension tables.
What is Data Mining?
Data Mining is the process of analyzing data from different perspectives and
summarizing it into useful
information.
What is the difference between view and materialized view?
A view takes the output of a query and makes it appear like a virtual table and it
can be used in place of
tables.
A materialized view provides indirect access to table data by storing the results
of a query in a separate
schema object.
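A minimal Oracle-style sketch of the difference (the fact_sales table and its columns are hypothetical):

-- Ordinary view: no data stored; the query runs against fact_sales every time
CREATE VIEW v_daily_sales AS
SELECT date_key, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY date_key;

-- Materialized view: the result set is stored as a separate schema object and refreshed
CREATE MATERIALIZED VIEW mv_daily_sales
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT date_key, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY date_key;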
What is ER Diagram?
Entity Relationship Diagrams are a major data modelling tool and will help organize
the data in your
project into entities and define the relationships between the entities. This
process has proved to enable
the analyst to produce a good database structure so that the data can be stored and
retrieved in the most efficient manner.
An entity-relationship (ER) diagram is a specialized graphic that illustrates the
interrelationships between
entities in a database. It is a type of diagram used in data modeling for relational databases. These diagrams
show the structure of each table and the links between tables.
What is ODS?
ODS is the abbreviation of Operational Data Store. It is a database structure that is a repository for near real-time operational data rather than long-term trend data. The ODS may further become the enterprise shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.
What is ETL?
ETL is the abbreviation of extract, transform, and load. ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it doesn't really matter that the data is in different forms or formats. The data can come from any source. ETL is powerful enough to handle such data disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data - using rules or lookup tables, or creating combinations with other data - to convert it to the desired state. Finally, the load function is used to write the resulting data to a target database.
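When both the source and the target are relational, a single extract-transform-load step can be expressed directly in SQL. A hedged sketch, with hypothetical staging and warehouse tables:

-- Extract from a staging copy of the source, transform (cleanup + lookup + derived measure),
-- and load into the target fact table in one statement.
INSERT INTO fact_sales (date_key, product_key, sales_amount)
SELECT s.date_key,
       p.product_key,                                  -- lookup against the product dimension
       s.quantity * s.unit_price                       -- derive the measure
FROM   stg_sales s
JOIN   dim_product p
  ON   p.product_code = UPPER(TRIM(s.product_code));   -- simple transformation rule on the key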
What is VLDB?
VLDB is abbreviation of Very Large DataBase. A one terabyte database would normally
be considered to
be a VLDB. Typically, these are decision support systems or transaction processing
applications serving
large numbers of users.

Is the OLTP database design optimal for a Data Warehouse?


No. OLTP database tables are normalized, which adds additional time to queries before they return results. Additionally, an OLTP database is smaller and does not contain long-period (many years of) data, which needs to be analyzed. An OLTP system is basically an ER model and not a dimensional model. If a complex query is executed on an OLTP system, it may cause a heavy overhead on the OLTP server that will affect the normal business processes.
If de-normalization improves data warehouse processes, why is the fact table in normal form?
Foreign keys of fact tables are primary keys of dimension tables. The fact table consists mainly of columns that are keys to other tables (plus measures), which by itself makes it a normal form table.
What are lookup tables?
A lookup table is a table used against the target table, based upon the primary key of the target; it updates the target by allowing only modified (new or updated) records through, based on the lookup condition.
What are Aggregate tables?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table, which may have millions of records. Aggregate tables reduce the load on the database server, increase query performance and return results quickly.
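A hedged sketch of building such an aggregate table from a detailed fact table (all names are hypothetical):

CREATE TABLE agg_sales_month AS
SELECT d.year_num,
       d.month_num,
       p.product_group,
       SUM(f.sales_amount) AS sales_amount
FROM   fact_sales f
JOIN   dim_date    d ON d.date_key    = f.date_key
JOIN   dim_product p ON p.product_key = f.product_key
GROUP BY d.year_num, d.month_num, p.product_group;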
What is real time data-warehousing?
Data warehousing captures business activity data. Real-time data warehousing
captures business activity
data as it occurs. As soon as the business activity is complete and there is data
about it, the completed
activity data flows into the data warehouse and becomes available instantly.
What are conformed dimensions?
Conformed dimensions mean the exact same thing with every possible fact table to
which they are joined.
They are common to the cubes.
What is conformed fact?
Conformed facts are measures that are defined the same way (same name, definition and units) in every fact table in which they appear, so they can be compared and combined across multiple Data Marts and fact tables.
How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible
dates that may appear
in the data. 100 years may be represented in a time dimension, with one row per
day.
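Instead of a procedural loop, the same load can be done set-based. A sketch in Oracle SQL (the dim_date columns are illustrative), generating one row per day for roughly 100 years:

INSERT INTO dim_date (date_key, calendar_date, day_name, month_name, year_num)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),           -- e.g. 20240405
       d,
       TO_CHAR(d, 'Day'),
       TO_CHAR(d, 'Month'),
       EXTRACT(YEAR FROM d)
FROM  (SELECT DATE '1970-01-01' + LEVEL - 1 AS d
       FROM   dual
       CONNECT BY LEVEL <= 36525);                  -- about 100 years of days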
What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse, i.e. the amount of detail you are willing to capture for each transactional fact.
What are non-additive facts?
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table. However, they are not considered useless; if there are changes in the dimensions, the same facts can still be useful.
What is factless facts table?
A fact table which does not contain any numeric fact columns is called a factless fact table.
What are slowly changing dimensions (SCD)?
SCD is the abbreviation of Slowly Changing Dimensions. SCD applies to cases where the attribute for a record varies over time.
There are three different types of SCD.
1) SCD1: The new record replaces the original record. Only one record exists in the database - the current data.
2) SCD2: A new record is added into the customer dimension table. Two records exist in the database - the current data and the previous history data.
3) SCD3: The original data is modified to include the new data. One record exists in the database - the new information is attached to the old information in the same row.
What is hybrid slowly changing dimension?
Hybrid SCDs are a combination of both SCD1 and SCD2. It may happen that in a table, some columns are important and we need to track changes for them, i.e. capture the historical data for them, whereas for some columns, even if the data changes, we don't care.
What is BUS Schema?
A BUS Schema is composed of a master suite of conformed dimensions and standardized definitions of facts.
What is a Star Schema?
A star schema is a way of organizing the tables such that we can retrieve results from the database quickly in the warehouse environment.
What is a Snowflake Schema?
In a Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.
Differences between star and snowflake schema?
Star schema - A single fact table with N dimensions; all dimensions are linked directly to the fact table. This schema is de-normalized and results in simple joins and less complex queries, as well as faster results.
Snowflake schema - Any dimension with extended (sub) dimensions is known as a snowflake schema; dimensions may be interlinked or may have one-to-many relationships with other tables. This schema is normalized and results in complex joins and very complex queries, as well as slower results.
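As an illustration of the structural difference (hypothetical names), a snowflaked product dimension moves the product-group attributes into their own table:

-- Star: all product attributes on one de-normalized dimension table
CREATE TABLE dim_product (
    product_key        NUMBER PRIMARY KEY,
    product_name       VARCHAR2(100),
    product_group_name VARCHAR2(100)
);

-- Snowflake: the group attributes are normalized out into a sub-dimension
CREATE TABLE dim_product_group (
    product_group_key  NUMBER PRIMARY KEY,
    product_group_name VARCHAR2(100)
);
CREATE TABLE dim_product_sf (
    product_key        NUMBER PRIMARY KEY,
    product_name       VARCHAR2(100),
    product_group_key  NUMBER REFERENCES dim_product_group (product_group_key)
);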

What is Difference between ER Modeling and Dimensional Modeling?


ER modeling is used for normalizing the OLTP database design. Dimensional modeling
is used for de-
normalizing the ROLAP/MOLAP design.
What is degenerate dimension table?
Values in a fact table which are neither dimension keys nor measures (for example, transaction reference numbers) are called degenerate dimensions.
Why is Data Modeling Important?
Data modeling is probably the most labor intensive and time consuming part of the
development process.
The goal of the data model is to make sure that all the data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end-users.
In computer science, data modeling is the process of creating a data model by
applying a data model
theory to create a data model instance. A data model theory is a formal data model
description. When
data modelling, we are structuring and organizing data. These data structures are
then typically
implemented in a database management system. In addition to defining and organizing
the data, data
modeling will impose (implicitly or explicitly) constraints or limitations on the
data placed within the
structure.
Managing large quantities of structured and unstructured data is a primary function
of information
systems. Data models describe structured data for storage in data management
systems such as
relational databases. They typically do not describe unstructured data, such as
word processing
documents, email messages, pictures, digital audio, and video. (Reference :
Wikipedia)
What is surrogate key?
A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. It is useful because the natural primary key can change, which makes updates more difficult. Surrogate keys are always integer or numeric.
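A minimal Oracle-style sketch (hypothetical names): the surrogate key comes from a sequence, while the natural key from the source system is kept as an ordinary attribute:

CREATE SEQUENCE seq_dim_customer;

CREATE TABLE dim_customer (
    customer_key  NUMBER PRIMARY KEY,   -- surrogate key: a meaningless unique integer
    customer_id   VARCHAR2(20),         -- natural key from the source system
    customer_name VARCHAR2(100)
);

INSERT INTO dim_customer (customer_key, customer_id, customer_name)
VALUES (seq_dim_customer.NEXTVAL, 'C-1001', 'Ramesh');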
What is Data Mart?
A data mart (DM) is a specialized version of a data warehouse (DW). Like data
warehouses, data marts
contain a snapshot of operational data that helps business people to strategize
based on analyses of past
trends and experiences. The key difference is that the creation of a data mart is
predicated on a specific,
predefined need for a certain grouping and configuration of select data. A data
mart configuration
emphasizes easy access to relevant information (Reference: Wikipedia). Data Marts are designed to help managers make strategic decisions about their business.
What is the difference between OLAP and data warehouse?
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data, managing aggregations and partitioning information into cubes for in-depth visualization.
What is a Cube and Linked Cube with reference to data warehouse?
Cubes are logical representations of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values. The linking in a linked cube ensures that the data in the cubes remains consistent.
What is junk dimension?
A number of very small dimensions might be lumped together to form a single dimension, a junk dimension - the attributes are not closely related. Grouping random flags and text attributes in a dimension and moving them to a separate sub-dimension is known as a junk dimension.
What is snapshot with reference to data warehouse?
You can disconnect the report from the catalog to which it is attached by saving
the report with a
snapshot of the data.
What is active data warehousing?
An active data warehouse provides information that enables decision-makers within
an organization to
manage customer relationships nimbly, efficiently and proactively.
What is the difference between data warehousing and business intelligence?
Data warehousing deals with all aspects of managing the development, implementation
and operation of
a data warehouse or data mart including meta data management, data acquisition,
data cleansing, data
transformation, storage management, data distribution, data archiving, operational
reporting, analytical
reporting, security management, backup/recovery planning, etc. Business
intelligence, on the other hand,
is a set of software tools that enable an organization to analyze measurable
aspects of their business
such as sales performance, profitability, operational efficiency, effectiveness of
marketing campaigns,
market penetration among certain customer groups, cost trends, anomalies and
exceptions, etc.
Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools.

Collection of Data Warehousing Interview Questions. These questions are frequently asked in top companies like HP and IBM.

What is a lookup table?
Why should you put your data warehouse on a different system than your OLTP system?

What are Aggregate tables?


What's A Data warehouse
What is ODS?
What is a dimension table?
What is Dimensional Modelling? Why is it important ?
Why is Data Modeling Important?
What is data mining?
What is ETL?
Why are OLTP database designs not generally a good idea for a Data Warehouse?
What is Fact table?
What are conformed dimensions?
What are the Different methods of loading Dimension tables?
What is conformed fact?
What are Data Marts?
What is a level of Granularity of a fact table?
How are the Dimension tables designed?
What are non-additive facts?
What type of Indexing mechanism do we need to use for a typical datawarehouse?
What is a Snowflake Schema?
What is real time data-warehousing?
What are slowly changing dimensions?
What are Semi-additive and factless facts and in which scenario will you use such
kinds of fact tables?
Differences between star and snowflake schemas?
What is a Star Schema?
What is a general purpose scheduling tool?
What is ER Diagram?
Which columns go to the fact table and which columns go the dimension table?
What are modeling tools available in the Market?
Name some of modeling tools available in the Market?
How do you load the time dimension?
Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you put your TX logs on?
What is the difference between E-R Modeling and Dimensional Modeling?
Why fact table is in normal form?
What are the advantages of data mining over traditional approaches?
What are the various ETL tools in the Market?
What is a CUBE in datawarehousing concept?
What are data validation strategies for data mart validation after the loading process?

What is the datatype of the surrogate key?


What is degenerate dimension table?
What does level of Granularity of a fact table signify?
What is the Difference between OLTP and OLAP?
What is SCD1, SCD2, SCD3?
What is Dimensional Modelling?
What are the methodologies of Data Warehousing.?
What is a linked cube?
What is the main difference between Inmon and Kimball philosophies of data
warehousing?
What is Data Warehousing Hierarchy?
What is the main difference between schemas in an RDBMS and schemas in a Data Warehouse?
What is hybrid slowly changing dimension?
What are the different architecture of datawarehouse?
What are the various ETL tools in the Market?
What is VLDB?
What are Data Marts ?
What are the steps to build the datawarehouse ?
What is incremental loading? What is batch processing? What is a cross-reference table? What is an aggregate fact table?
What is a junk dimension? What is the difference between a junk dimension and a degenerate dimension?
What are the possible data marts in Retail sales.?
What is the definition of normalized and denormalized view and what are the
differences between them?
What is meant by metadata in context of a Datawarehouse and how it is important?
Differences between star and snowflake schemas?
Difference between Snow flake and Star Schema. What are situations where Snow flake
Schema is
better than Star Schema to use and when the opposite is true?
What is VLDB?
What are the data types present in BO? And what happens if we implement a view in the designer and the report?
Can a dimension table contains numeric values?
What is the difference between view and materialized view?
What is a surrogate key? Where do we use it? Explain with examples.
What is ER Diagram?
What is aggregate table and aggregate fact table ... any examples of both?
What is active data warehousing?
Why do we override the execute method in Struts? Please give the details.
What is the difference between Data Warehousing and Business Intelligence?
What is the difference between OLAP and a data warehouse?
What is a factless fact table? Where have you used it in your project?
Why Denormalization is promoted in Universe Designing?
What is the difference between ODS and OLTP?
What is the difference between datawarehouse and BI?
Are OLAP databases called decision support systems? True/false?
Explain in detail about SCD type 1, type 2 and type 3.
What is snapshot?
What is the difference between datawarehouse and BI?
What are non-additive facts in detail?
What is BUS Schema?
What are the various Reporting tools in the Market?
What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?

DataWarehousing Concepts and Interview Questions expected in a job interview


How do you index a dimension table?
Answer: clustered index on the dim key, and non clustered indexes (individual) on attribute columns which are used in the query's "where clause".
Purpose: this question is critical to ask if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model. Many of them don't know how to index. They don't know how different the physical tables are in Oracle compared to in Teradata. This question is not essential if you are looking for a report or ETL developer. It's good for them to know, but it's not essential.
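A SQL Server-style sketch of the indexing described above (table and column names are hypothetical):

-- clustered index on the dim key
CREATE CLUSTERED INDEX ix_dim_customer_key ON dim_customer (customer_key);
-- individual non clustered indexes on attributes used in WHERE clauses
CREATE NONCLUSTERED INDEX ix_dim_customer_city    ON dim_customer (city);
CREATE NONCLUSTERED INDEX ix_dim_customer_segment ON dim_customer (customer_segment);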
Tell me what you know about William Inmon?
Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he's not as popular as Inmon. If you ask who Barry Devlin is or who Claudia Imhoff is, 99.9% of the candidates wouldn't know. But every decent practitioner in data warehousing should know about Inmon and Kimball.
Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You'll be surprised (especially if you are interviewing a report developer) how many candidates don't know the answer. If someone is applying for a BI architect role and has never heard about Inmon, you should worry.
How do we build a real time data warehouse?
Answer: if the candidate asks "Do you mean real time or near real time?" it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both Normalised DW and Dimensional DW):
a) By storing previous periods' data in the warehouse then putting a view on top of it pointing to the source system's current period data. "Current period" is usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is 1 hour.
b) By storing previous periods' data in the warehouse then using some kind of synchronous mechanism to propagate the current period's data. An example of a synchronous data propagation mechanism is SQL Server 2008's Change Tracking or the old school's trigger.
Near real time DW is built using an asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s - 1.5 mins frequency).
Purpose: to test if the candidate understands complex, non-traditional mechanisms and follows the latest trends. Real time DW was considered impossible 5 years ago and only developed in the last 5 years. If the DW is normalised it's easier to make it real time than if the DW is dimensional, as there's dim key lookup involved.
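A hedged sketch of approach (a), assuming the warehouse holds data up to yesterday and a database link (src_oltp) exposes the source system's current-day rows; all names are hypothetical:

CREATE OR REPLACE VIEW v_sales_realtime AS
SELECT date_key, product_key, sales_amount
FROM   fact_sales                                      -- previous periods, already in the DW
UNION ALL
SELECT TO_NUMBER(TO_CHAR(s.sale_date, 'YYYYMMDD')),    -- current period, read from the source
       p.product_key,
       s.sales_amount
FROM   sales@src_oltp s
JOIN   dim_product p ON p.product_code = s.product_code
WHERE  s.sale_date >= TRUNC(SYSDATE);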
What is the difference between a data mart and a data warehouse?
Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all stars. An excellent candidate will say all the above answers, plus they will say that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).
Purpose: The question has 3 different levels of answer, so we can see how deep the candidate's knowledge in data warehousing is.
What is the purpose of having a multidimensional database?
Answer: Many candidates don't know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying "an MDB is an OLAP database". Many will say "Oh... I see", but actually they are still puzzled, so it will take a good few moments before they are back to earth again. So ask again: "What is the purpose of having an OLAP database?" The answer is performance and easier data exploration. An MDB (aka cube) is a hundred times faster than a relational DB for returning an aggregate. An MDB will be very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.
Purpose: This question is irrelevant to report or ETL developers, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it's their bread and butter.
Why do you need a staging area?
Answer: Because:
a) Some data transformations/manipulations from the source system to the DWH can't be done on the fly, but require several stages and therefore need to "be landed on disk first"
b) The time to extract data from the source system is limited (e.g. we were only given a 1 hour window) so we just "get everything we need out first and process later"
c) For traceability and consistency, i.e. some data transforms are simple and some are complex, but for consistency we put all of them on stage first, then pick them up from stage for further processing
d) Some data is required by more than 1 part of the warehouse (e.g. ODS and DDS) and we want to minimise the impact to the source system's workload. So rather than reading twice from the source system, we "land" the data on the staging then both the ODS and the DDS read the data from staging.
Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.
How do you decide that you need to keep it as 1 dimension or split it into 2
dimensions? Take for
example dim product: there are attributes which are at product code level and there
are attributes
which are at product group level. Should we keep them all in 1 dimension (product)
or split them
into 2 dimensions (product and product group)?
Answer: Depends on how they are going to be used, as I explained in my article "One or two dimensions".
Purpose: To test if the candidate is conversant in dimensional modelling. This
question especially is
relevant for data architects and cube developers and less relevant for a report or
ETL developer.
Fact table columns are usually numeric. In what case does a fact table have a varchar column?
Answer: degenerate dimension
Purpose: to check if the candidate has ever involved in detailed design of
warehouse tables.
What kind of dimension is a "degenerate dimension"? Give me an example.
Answer: A "dimension" which stays in the fact table. It is usually the reference number of the transaction. For example: Transaction ID, payment ref and order ID.
Purpose: Just another question to test the fundamentals.
What is snow flaking? What are the advantages and disadvantages?
Answer: In dimensional modelling, snow flaking is breaking a dimension into several tables by normalising it. The advantages are: a) performance when processing dimensions in SSAS, b) flexibility if the sub dim is used in several places e.g. city is used in dim customer and dim supplier (or in insurance DW: dim policy holder and dim broker), c) one place to update, and d) the DW load is quicker as there are less duplications of data. The disadvantages are: a) more difficult in "navigating the star*", i.e. you need to join a few tables, b) worse "sum group by*" query performance (compared to "pure star*"), c) more flexible in accommodating requirements, i.e. the city attributes for dim supplier don't have to be the same as the city attributes for dim customer, d) the DW load is simpler as you don't have to integrate the city.
*: a "star" is a fact table with all its dimensions, "navigating" means browsing/querying, "sum group by" is a SQL select statement with a "group by" clause, a pure star is a fact table with all its dimensions and none of the dims are snow-flaked.
Purpose: Snow flaking is one of the classic debates in the dimensional modelling community. It is useful to check if the candidate understands the reasons rather than just "following blindly". This question is applicable particularly for data architects and OLAP designers. If their answers are way off then you should worry. But it is also relevant to ETL and report developers as they will be populating and querying the structure.

Advanced DataWarehousing Concepts and Interview Questions


How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}
Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I'd create 3 additional columns: an IsCurrent flag, "Valid From" and "Valid To" (they are datetime columns). With regards to the ETL, I'd check first if the row already exists by comparing the natural key. If it exists then "expire the row" and insert a new row. Set the "Valid From" date to today's date or the current date time.
An experienced candidate (particularly a DW ETL developer) will not set the "Valid From" date to the current date time, but to the time when the ETL started. This is so that all the rows in the same load will have the same Valid From, which is 1 millisecond after the expiry time of the previous version, thus avoiding issues with ETL workflows that run across midnight.
Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is considered the basic/fundamental. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.
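A hedged sketch of the table structure and the "expire and insert" step described above (Oracle-style; the names, the seq_dim_customer sequence and the :bind variables are placeholders):

CREATE TABLE dim_customer (
    customer_key  NUMBER PRIMARY KEY,   -- surrogate dim key
    customer_id   VARCHAR2(20),         -- natural key from the source
    customer_name VARCHAR2(100),
    is_current    CHAR(1),
    valid_from    DATE,
    valid_to      DATE
);

-- if the natural key already exists with changed attributes: expire the current row...
UPDATE dim_customer
SET    is_current = 'N',
       valid_to   = :etl_start_time
WHERE  customer_id = :customer_id
AND    is_current  = 'Y';

-- ...then insert the new version, valid from the ETL start time
INSERT INTO dim_customer
    (customer_key, customer_id, customer_name, is_current, valid_from, valid_to)
VALUES
    (seq_dim_customer.NEXTVAL, :customer_id, :new_name, 'Y', :etl_start_time, DATE '9999-12-31');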
How do you index a fact table? And explain why. {H}
Answer: Index all the dim key columns, individually, non clustered (SQL Server) or
bitmap (Oracle). The
dim key columns are used to join to the dimension tables, so if they are indexed
the join will be faster. An
exceptional candidate will suggest 3 additional things: a) index the fact key
separately, b) consider
creating a covering index in the right order on the combination of dim keys, and c)
if the fact table is
partitioned the partitioning key must be included in all indexes.
Purpose: Many people know data warehousing only in theory or only in the logical data model. This question is designed to separate those who have actually built a data warehouse from those who haven't.
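A sketch of the indexing described above, in both SQL Server and Oracle flavours (hypothetical names):

-- SQL Server: individual non clustered indexes on each dim key
CREATE NONCLUSTERED INDEX ix_fact_sales_date    ON fact_sales (date_key);
CREATE NONCLUSTERED INDEX ix_fact_sales_product ON fact_sales (product_key);

-- Oracle: bitmap indexes on each dim key
CREATE BITMAP INDEX bx_fact_sales_date    ON fact_sales (date_key);
CREATE BITMAP INDEX bx_fact_sales_product ON fact_sales (product_key);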
In the source system, your customer record changes like this: customer1 and
customer2 now
becomes one company called customer99. Explain a) impact to the customer dim
(SCD1), b)
impact to the fact tables. {M}
Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do a soft delete on the customer2 row by updating the IsActive flag column (a hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99's SK.
Purpose: This is a common problem that everybody in data warehousing encounters. By
asking this
question we will know if the candidate has enough experience in data warehousing.
If they have not come
across this (probably they are new in DW), we want to know if they have the
capability to deal with it or
not.
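A hedged sketch of the updates described above (hypothetical names; customer1, customer2 and customer99 stand for the business keys and name involved):

-- SCD1: overwrite customer1's attributes with the new company name
UPDATE dim_customer
SET    customer_name = 'customer99'
WHERE  customer_id = 'customer1';

-- soft delete customer2 (no hard delete)
UPDATE dim_customer
SET    is_active = 'N'
WHERE  customer_id = 'customer2';

-- repoint fact rows from customer2's surrogate key to customer1's (now customer99)
UPDATE fact_sales
SET    customer_key = (SELECT customer_key FROM dim_customer WHERE customer_id = 'customer1')
WHERE  customer_key IN (SELECT customer_key FROM dim_customer WHERE customer_id = 'customer2');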
Question: What are the differences between the Kimball approach and Inmon's? Which one is better and why?
Answer: if you are looking for a junior role e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon's we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that's the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball's & Inmon's approaches have advantages and disadvantages.
Purpose: a) to see if the candidate understands the core principles of data warehousing or they just "know the skin", b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there's no right or wrong answer) or if they are blindly using Kimball for every situation.
Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantages/disadvantages?
Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 or to the dim key of the newly created dim rows. We can also not load that row (so the total of the measure will be different from the source system) if the business requirement prefers it. In this case we load the fact row to a quarantine area complete with error processing, DQ indicator and audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources of further information on it.
Purpose: again this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate's experience level is up to the expectation or not.
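A hedged sketch of the first option (load the row with dim key 0), assuming an "Unknown" row with key 0 exists in each dimension; all names are hypothetical:

INSERT INTO fact_sales (date_key, product_key, customer_key, sales_amount)
SELECT s.date_key,
       NVL(p.product_key, 0),                 -- 0 points to the "Unknown" row in dim_product
       NVL(c.customer_key, 0),                -- 0 points to the "Unknown" row in dim_customer
       s.sales_amount
FROM   stg_sales s
LEFT JOIN dim_product  p ON p.product_code = s.product_code
LEFT JOIN dim_customer c ON c.customer_id  = s.customer_id;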
Please tell me your experience on your last 3 data warehouse projects. What were
your roles in
those projects? What were the issues and how did you solve them?
Answer: There's no wrong or right answer here. With this question you are looking for a) whether they have done similar things to your current project, b) whether they have done the same role as the role you are offering, c) whether they faced the same issues as your current DW project.
Purpose: Some of the reasons why we pay more to certain candidates compared to the others are: a) they have done it before, so they can deliver quicker than those who haven't, b) they come from our competitors, so we would know what's happening there and we can make a better system than theirs, c) they have solved similar issues, so we could "borrow their techniques".
What are the advantages of having a normalised DW compared to dimensional DW? What
are the
advantages of dimensional DW compared to normalised DW?
Answer: The advantages of dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes to the data model, b) performance, e.g. you can query it faster than a normalised model, c) it's quicker and simpler to develop than a normalised DW and easier to maintain.
Purpose: to see if the candidate has seen "the other side of the coin". Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.
What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF.
Answer: No column is transitively dependent on the PK. For example, column2 is dependent on column1 (the PK) and column3 is dependent on column2. In this case column3 is "transitively dependent" on column1. To make it 3rd NF we need to split it into 2 tables: table1 which has column1 & column2 and table2 which has column2 and column3.
Purpose: A lot of people talk about "3rd normal form" but they don't know what it means. This is to test if the candidate is one of those people. If they can't answer 3rd NF, ask 2nd NF. If they can't answer 2nd NF, ask 1st NF.
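A concrete illustration with hypothetical names: dept_name depends on dept_id, which depends on emp_id, so the table is split exactly as described above:

-- not in 3rd NF: dept_name is transitively dependent on emp_id via dept_id
CREATE TABLE employee_flat (
    emp_id    NUMBER PRIMARY KEY,
    dept_id   NUMBER,
    dept_name VARCHAR2(50)
);

-- in 3rd NF: split into two tables
CREATE TABLE employee (
    emp_id  NUMBER PRIMARY KEY,
    dept_id NUMBER
);
CREATE TABLE department (
    dept_id   NUMBER PRIMARY KEY,
    dept_name VARCHAR2(50)
);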
Tell me how to design a data warehouse, i.e. what are the steps of doing
dimensional modelling?
Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball's Toolkit book, chapter 2). Steps 3 and 4 could be reversed (add the facts first, then create the dims), but steps 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.
Purpose: This question is for a data architect or data warehouse architect to see if they can do their job. It's not a question for an ETL, report or cube developer.
How do you join 2 fact tables?
Answer: It's a trap question. You don't usually join 2 fact tables, especially if they have different grains. When designing a dimensional model, you include all the necessary measures in the same fact table. If the measure you need is located on another fact table, then there's something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.
It is possible to join 2 fact tables, i.e. using the common dim keys. But the performance is usually horrible, hence people don't do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dim3key and FactTable2 has dim1key and dim2key then you could join them like this:

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2

from

( select dim1key, dim2key, sum(measure1) as measure1

from FactTable1

group by dim1key, dim2key

) f1

join FactTable2 f2

on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

So if we don't join 2 fact tables that way, how do we do it? The answer is using the fact key column. It is a good practice (especially in SQL Server because of the concept of the clustered index) to have a fact key column to enable us to identify rows on the fact table. The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2

from FactTable1 f1

join FactTable2 f2

on f2.fact1key = f1.factkey

I implemented this technique originally for self joining, but then expanded the usage to joining to other fact tables. But this must be used on an exception basis rather than as the norm.
Purpose: not to trap the candidate of course, but to see if they have the experience of dealing with a problem which doesn't happen every day.
Explain the concepts and capabilities of Business Intelligence.

Business Intelligence helps to manage data by applying different skills and technologies while addressing security and quality risks. This also helps in achieving a better understanding of data. Business intelligence can be considered as the collective information; it helps in making predictions about business operations using data gathered in a warehouse. Business intelligence applications help to tackle sales, financial, production and other business data. They help in better decision making and can also be considered a decision support system.

Name some of the standard Business Intelligence tools in the


market.

Business intelligence tools are to report, analyze and present data. Few of the
tools available in the
market are:

. Eclipse BIRT Project:- Based on eclipse. Mainly used for web applications and it
is open source.
. Freereporting.com:- It is a free web based reporting tool.
. JasperSoft:- BI tool used for reporting, ETL etc.
. Pentaho:- Has data mining, dashboard and workflow capabilities.
. Openl:- A web application used for OLAP reporting.

Explain the Dashboard in the business intelligence.

A dashboard in business intelligence allows huge amounts of data and reports to be read in a single graphical interface. Dashboards help in making faster decisions by relying on measurable data seen at a glance. They can also be used to get into the details of this data to analyze the root cause of any business performance. A dashboard represents the business data and business state at a high level, and can also be used for cost control. Example of the need for a dashboard: banks run thousands of ATMs; they need to know how much cash is deposited, how much is left, etc.

SAS Business Intelligence.

SAS business intelligence has analytical capabilities like statistics, reporting,


data mining, predictions,
forecasting and optimization. They help in getting data in the format desired. It
helps in improving quality
of data.

Explain the SQL Server 2005 Business Intelligence components.

. SQL Server Integration Services:- Used for data transformation and creation. Used in data acquisition from a source system.
. SQL Server Analysis Services: Allows data discovery using data mining. Using
business logic it
supports data enhancement.
. SQL Server Reporting Services:- Used for Data presentation and distribution
access.

Question: From where do you get the Logical Query of your Request?
Answer: The logical SQL generated by the server can be viewed in BI Answers. If I have not understood the question, please raise your voice.
Question: Major Challenges You Faced While Creating the RPD?
Answer: Every now and then there are problems with the database connections, but the main problems while creating the repository RPD files come with complex schemas made on OLTP systems consisting of a lot of joins, and with checking the results. The type of join made needs to be checked. By default it is an inner join, but sometimes the requirement demands other types of joins. There are also a lot of problems with date formats.
Question: What are Global Filters and how do they differ from Column Filters?
Answer: Column filter - simply a filter applied on a column, which we can use to restrict the column values while pulling the data, or in charts to see the related content.
Global filter - Not sure. I understand this filter will have an impact across the application, but I really don't understand where and how it can be used. I have heard of global variables but not global filters.
How do you make the Delivery Profilers work?
When we use the SA System, how does the SA Server understand that it needs to use it for getting the User Profile information?
Where do you configure the Scheduler?
Answer: We configure the OBIEE scheduler in the database.
Question: How do you hide certain columns from a user?
Answer: Application access level security - do not add the column in the report, or do not add the column in the presentation layer.
Question: How can we enable drills on a given column's data?
Answer: To enable drill down for a column, it should be included in the hierarchy in OBIEE. Hyperion IR has a drill-anywhere feature where you don't have to define it and can drill to any available column.
Question: Is drill down possible without the attribute being a part of a hierarchical dimension?
Answer: No.
Question: How do you do conditional formatting?
Answer: While creating a chart in BI Answers, you can define the conditions and apply colour formatting.
Question: What is Guided Navigation?
Answer: I think it is just the arrangement of hyperlinks to guide the user to
navigate between the reports
to do the analysis.
How is the Webcat file deployed across environments?
Question: How do the users created differ at the RPD/Answers/Dashboards level?
Answer: RPD users can do administrator tasks like adding new data sources, creating hierarchies and changing column names, whereas Answers users may create new charts and edit those charts, and Dashboard users may only view and analyse the dashboard, or can edit the dashboard by adding/removing chart objects.
Question: How does Online/Offline Mode impact Dev and Deployment?
Answer: Online Mode - You can make changes in the RPD file and push in changes which will be immediately visible to the users who are already connected. This feature may be used in a production environment.
Offline mode - can be useful in a test or development environment.
Questions: Explain the schema in your last project. What happens if you Reconcile/Sync both (the RPD and the DB)?

Q.What is Business Intelligence?


A.Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. Business Intelligence is the process of exploring data in order to take strategic decisions, whether to drive profitability or to manage costs.
It is a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. This technology is based on customer and profit oriented models that reduce operating costs and provide increased profitability by improving productivity, sales and service, and helps speed up decision making.

Q.Name some of the standard Business Intelligence tools in the market?


A.Some of the standard Business Intelligence tools in the market, according to their performance:
1.MICROSTRATEGY
2.BUSINESS OBJECTS,CRYSTAL REPORTS
3.COGNOS REPORT NET
4.MS-OLAP SERVICES
5.SAS
6.Business objects
7.Hyperion
8.Microsoft integrated services

Q.What is OLAP?
A.OLAP stands for Online Analytical Processing. It is used for analytical reporting. It helps you to do business analysis of your data, which normal reporting tools do not support. This is the major difference between a reporting tool and an OLAP tool. It is a gateway between the business user and the DWH.

Q.What is OLAP, MOLAP, ROLAP, DOLAP, HOLAP? Examples?


A.Cubes in a data warehouse are stored in three different modes. A relational storage model is called Relational Online Analytical Processing mode or ROLAP, while a Multidimensional Online Analytical Processing mode is called MOLAP. When dimensions are stored in a combination of the two modes then it is known as Hybrid Online Analytical Processing mode or HOLAP.

1.MOLAP
This is the traditional mode in OLAP analysis. In MOLAP data is stored in form of
multidimensional cubes
and not in relational databases. The advantages of this mode is that it provides
excellent query
performance and the cubes are built for fast data retrieval. All calculations are
pre-generated when the
cube is created and can be easily applied while querying data.
The disadvantages of this model are that it can handle only a limited amount of
data. Since all
calculations have been pre-built when the cube was created, the cube cannot be
derived from a large
volume of data. This deficiency can be bypassed by including only summary level
calculations while
constructing the cube. This model also requires huge additional investment as cube
technology is
proprietary and the knowledge base may not exist in the organization.

2.ROLAP
The underlying data in this model is stored in relational databases. Since the data is stored in relational databases, this model gives the appearance of traditional OLAP's slicing and dicing functionality. The advantage of this model is that it can handle a large amount of data and can leverage all the functionalities of the relational database.
The disadvantages are that the performance is slow and each ROLAP report is an SQL
query with all
the limitations of the genre. It is also limited by SQL functionality. ROLAP
vendors have tried to mitigate
this problem by building into the tool out-of-the-box complex functions as well as
providing the users with
an ability to define their own functions.
3.HOLAP
HOLAP technology tries to combine the strengths of the above two models. For
summary type
information HOLAP leverages cube technology and for drilling down into details it
uses the ROLAP
model.

Q.Comparing the use of MOLAP, HOLAP and ROLAP


A.The type of storage medium impacts on cube processing time, cube storage and cube
browsing speed.
Some of the factors that affect MOLAP storage are:

1.Cube browsing is the fastest when using MOLAP. This is so even in cases where no aggregations have been done. The data is stored in a compressed multidimensional format and can be accessed more quickly than in the relational database. Browsing is very slow in ROLAP and about the same in HOLAP. Processing time is slower in ROLAP, especially at higher levels of aggregation.

2.MOLAP storage takes up more space than HOLAP as data is copied, and at very low levels of aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space as data is not duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP aggregations.

3.All data is stored in the cube in MOLAP and data can be viewed even when the
original data source is
not available. In ROLAP data cannot be viewed unless connected to the data source.

4.MOLAP can handle very limited data only as all data is stored in the cube.

Q.How do you import universes and users from BusinessObjects 6.5 to XI R2? It is showing some ODBC error; is there any setting to change?
A.You can import universes through the import option in the file menu. If your ODBC driver is not connecting, then you can check your database driver.

Q.What are the various modules in Business Objects product Suite?


A.
1.Supervisor :It is the control center for the administration and security of your
entire BusinessObjects
deployment.

2.Designer: It is the tool used to create, manage and distribute universes for BusinessObjects and WebIntelligence users. A universe is a file that contains connection parameters for one or more database middleware, and SQL structures called objects that map to actual SQL structures in the database such as columns and tables.

BusinessObjects Full client Reporting tool: Helps to create BusinessObjects reports based on the universe and also from other data sources.
BusinessObjects Thin client Reporting tool: Helps to query and analyze on the universe and also share the reports among other users. It doesn't require any software; it just needs a web browser and a system connected to the BusinessObjects server.
3.Auditor: This tool is used to monitor and analyze user and system activity.
4.Application Foundation : This module covers a set of products which is used for
Enterprise
Performance Management (EPM). The tools are
1.Dashboard manager
2.Scorecard
3.Performance Management Applications

Q.What is Hyperion? Is it an OLAP tool? What is the difference between OLAP and ETL tools? What is the future of the OLAP and ETL market for the next five years?
A.It is a Business Intelligence tool. Brio, which was an independent product bought over by Hyperion, has been renamed Hyperion Intelligence.

Q.Is Hyperion an OLAP tool?


You can analyse data schemas using this tool.
1.OLAP: It is an online analytical processing tool. There are various products available for data analysis.
2.ETL: Extract, Transform and Load. This is a product to extract the data from multiple (or single) sources, transform the data and load it into a table, flat file or simply a target.
There is quite a bit of competition in the market with regard to ETL products as well as OLAP products. These tools will definitely be widely used for data load and data analysis purposes.

Q.Explain the Dashboard in the business intelligence.


A.
1.Dashboard: A dashboard in business intelligence allows huge amounts of data and reports to be read in a single graphical interface. Dashboards help in making faster decisions by relying on measurable data seen at a glance. They can also be used to get into the details of this data to analyze the root cause of any business performance. A dashboard represents the business data and business state at a high level, and can also be used for cost control. Example of the need for a dashboard: banks run thousands of ATMs; they need to know how much cash is deposited, how much is left, etc.

2.SAS Business Intelligence : SAS business intelligence has analytical capabilities


like statistics,
reporting, data mining, predictions, forecasting and optimization. They help in
getting data in the format
desired. It helps in improving quality of data.

Q.Explain the SQL Server 2005 Business Intelligence components.


A.
1.SQL Server Integration Services:- Used for data transformation and creation. Used in data acquisition from a source system.
2.SQL Server Analysis Services: Allows data discovery using data mining. Using
business logic it
supports data enhancement.
3.SQL Server Reporting Services:- Used for Data presentation and distribution
access.

Q.How would you improve the performance of the reports.


A.Performance tuning of the reports starts with analyzing the problem. The problem could be with the database, the Universe or the report itself.

1.Analyzing the database


a.Run the SQL from the report on an Oracle client like SQL Navigator or Toad after passing in all the parameters.
b.Identify if the SQL takes considerably less time than the report. If yes, then the problem is with the Universe or with the report; if no, then
c.Run an explain plan on the SQL
d.Look to see if all the statistics are computed, indexes are built and the indexes are used
e.Check to see if aggregate tables can be used (aggregate tables are useful if the data can be condensed to 1/10th of the fact data)
f.Check to see if data has increased and whether usage of materialized views could help.

1.Creating materialized views enables you to pre-run the complex joins and store the data.
2.Most DW environments have day-old data, hence they don't have a lot of overhead.
3.Running a report against a single materialized table is always faster than running against multiple tables with complex joins.
4.Indexes can be created on this materialized view to further increase the performance.

g.Check to see if the performance of the SQL can be increased by using hints; if
yes, then add a hint to
the report SQL and freeze the SQL. This might have the additional overhead of
maintaining the report.
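As referenced in point f above, a minimal Oracle-style sketch of such a materialized view is given below. The table and column names (sales_fact, product_dim, sale_amt) are hypothetical, used only for illustration:

CREATE MATERIALIZED VIEW mv_sales_by_product
BUILD IMMEDIATE
REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1   -- refresh once a day, matching day-old DW data
AS
SELECT p.product_id,
       p.product_name,
       SUM(f.sale_amt) AS total_sales,
       COUNT(*)        AS sale_count
FROM   sales_fact  f
JOIN   product_dim p ON p.product_id = f.product_id
GROUP BY p.product_id, p.product_name;

-- An index on the materialized view can further speed up report queries
CREATE INDEX ix_mv_sales_prod ON mv_sales_by_product (product_id);

The report SQL (or a BO derived table) can then read from mv_sales_by_product instead of joining the fact and dimension tables at run time.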

2.Analyzing the Universe


a.Check if all the outer joins are properly created and remove unnecessary outer
joins.
b.Business Objects as such does not use Multi Pass SQL. Multi Pass SQL is a
technique a software product uses to
break down a complex SQL statement into multiple smaller SQLs. Hence a query using one fact
table and three
dimension tables can be broken down into two: one between the dimension tables and
the second
between the first result and the fact table. This can be achieved in BO by creating
Derived Tables. The
derived table would be based on the three dimension tables, and the reports can hence
use one derived table
and one fact table instead of four tables.

c.The Keys tab allows you to define index awareness for an object. Index awareness
is the ability to take
advantage of the indexes on key columns to speed data retrieval.

1.In a typical data warehousing environment surrogate keys are used as primary keys
instead of
natural keys; this primary key may not be meaningful to the end user, but Designer
can take advantage of
the indexes on key columns to speed data retrieval.
2.The only disadvantage is that it would not return duplicate data unless the duplicate
data has
separate keys.
d.Check to see if the size of the universe has increased recently.
e.Try to create a different universe for new requirements.
f.Under extreme conditions the AUTOPARSE parameter in the param file can be turned
off, but this could be
too risky if not handled properly.

3.Analyzing the Report


a.Check to see if there are any conditions which could be pushed into the universe as
filters.
b.Check to see if a formula is used multiple times; if so, turn it into a variable.
c.Check if there are any variables which are not used, and remove them.
d.Remove any additional filters or alerters on the report.

1. What is Business Intelligence?


Business Intelligence is a process for increasing the competitive advantage of a
business by intelligent
use of available data in decision making.
The five key stages of Business Intelligence:
. Data Sourcing
. Data Analysis
. Situation Awareness
. Risk Assessment
. Decision Support
2. What is a Universe in Business Intelligence?
A "universe" is Business Objects terminology (Business Objects also happens to
be the name of the
company). The universe is the interfacing layer between the client and the
data warehouse. The universe
defines the relationships among the various tables in the data warehouse.
Or
Universe is a semantic layer between the database and the user interface (reports).

3. What is OLAP in Business Intelligence?


Online Analytical Processing, a category of software tools that provides analysis
of data stored in a
database. OLAP tools enable users to analyze different dimensions of
multidimensional data. For
example, it provides time series and trend analysis views. The chief component of
OLAP is the OLAP
server, which sits between a client and a database management system (DBMS). The
OLAP server
understands how data is organized in the database and has special functions for
analyzing the data.
A good OLAP interface writes efficient SQL and reads accurate data from the
database. Designing and
architecting it well requires good knowledge of the database and an understanding of the report requirements.
4. What are the various modules in Business Objects product?
Business Objects Reporter - Reporting & analyzing tool
Designer - Universe creation, database interaction, connectivity
Supervisor - For administrative purposes
Webintelligence - Access of report data through the internet
BroadCast Agent - For scheduling the reports
Data Integrator - The ETL tool of Business Objects, designed to handle huge amounts
of data
5. What is OLAP, MOLAP, ROLAP, DOLAP, HOLAP? Explain with Examples?
OLAP - On-Line Analytical Processing.
Designates a category of applications and technologies that allow the collection,
storage, manipulation
and reproduction of multidimensional data, with the goal of analysis.
MOLAP - Multidimensional OLAP.
This term more specifically designates a cartesian (cube) data structure. In effect, MOLAP
contrasts with
ROLAP. In the former, joins between tables are already computed in advance, which enhances
performance. In the
latter, joins are computed at the time of the request.
Targeted at groups of users because it's a shared environment. Data is stored in
an exclusive server-
based format. It performs more complex analysis of data.
DOLAP - Desktop OLAP.
Small OLAP products for local multidimensional analysis Desktop OLAP. There can be
a mini
multidimensional database (using Personal Express), or extraction of a datacube
(using Business
Objects).
Designed for low-end, single, departmental users. Data is stored in cubes on the
desktop. It's like having
your own spreadsheet. Since the data is local, end users don't have to worry about
performance hits
against the server.
ROLAP - Relational OLAP.
Designates one or several star schemas stored in relational databases. This
technology permits
multidimensional analysis with data stored in relational databases.
Used for large departments or groups because it supports large amounts of data and
users.
HOLAP:Hybridization of OLAP, which can include any of the above.
6. Why an infocube has maximum of 16 dimensions?
It depends upon the Database limits provided to define the Foreign key constraint,
e.g. in Sql Server
2005, the recommended max limit for foreign keys is 253, but you can define more.
7. What is BAS? What is the function?
The Business Application Support (BAS) functional area at SLAC provides
administrative computing
services to the Business Services Division and Human Resources Department. We are
responsible for
software development and maintenance of the PeopleSoft applications and
consultation to customers
with their computer-related tasks.
8. Name some of the standard Business Intelligence tools in the Market?
Some of the standard Business Intelligence tools in the market According to their
performance
. MICROSTRATEGY
. BUSINESS OBJECTS,CRYSTAL REPORTS
. COGNOS REPORT NET
. MS-OLAP SERVICES
Or
. Seagate Crystal report
. SAS
. Business objects
. Microstrategy
. Cognos
. Microsoft OLAP
. Hyperion
. Microsoft integrated services
9. How do we enhance the functionality of the reports in BO?
You can format the BO reports by using the various features available. You can turn
table reports into
2-dimensional or 3-dimensional charts. You can apply an Alert to show some data in
a different format,
based on some business rule. You can also create prompts, which will ask the user
to give some input
values before seeing the report; this way they will see only filtered data. There
are many similar
options available to enhance the reports.
10. What are dashboards?
A management reporting tool to gauge how well the organization company is
performing. It normally uses
"traffic-lights" or "smiley faces" to determine the status.

Explain about Auditing in BO XI R2? What is the use of it?


Auditor is used by the Business Objects administrators to know complete
information about the business
intelligence system.
. It monitors the entire BI system at a glance.
. It analyzes usage and change impact.
. It optimises the BI deployment.
How do we Tune the BO Reports for Performance Improvement?
We can tune the report by using index awareness in the universe.
Why can we not create an aggregate on an ODS Object?
. An Operational Data Store has very low data latency. Data is moved to the ODS mostly on an
event basis rather
than by the time-based ETL used for the Data Warehouse/Data Mart.
. The ODS is closer to the OLTP system. We don't normally prefer to store aggregated
data in OLTP, and so it
is with the ODS.
. Unlike the data warehouse, where data is HISTORICAL, the ODS is near real time (NRT). So
data aggregation
is less important in the ODS as data keeps changing.
What is hierarchy relationship in a dimension.
whether it is:
1. 1:1
2. 1:m
3. m:m
1:M
How to connect GDE to the Co>Operating system in Ab Initio?
We can connect Ab Initio GDE with the Co>Operating system using Run->Settings. There
you can specify the
host IP address and the connection type. Refer to the Ab Initio help for further details.
Explain the name of some standard Business Intelligence tools in the market?
Some of the standard Business Intelligence tools in the market, according to their
performance:
1)MICROSTRATEGY
2)BUSINESS OBJECTS,CRYSTAL REPORTS
3)COGNOS REPORT NET
4)MS-OLAP SERVICES
What are the various modules in Business Objects product Suite?
Supervisor:
Supervisor is the control center for the administration and security of your entire
BusinessObjects
deployment.
Designer:
Designer is the tool used to create, manage and distribute universes for
BusinessObjects and
WebIntelligence users. A universe is a file that contains connection parameters for
one or more database
middleware and SQL structures called objects that map to actual SQL structures in the
database such as
columns, tables and databases.
BusinessObjects Full client Reporting tool:
Helps to create BusinessObjects reports based on the universe and also from
other data sources.
BusinessObjects Thin client Reporting tool:
Helps to query and analyze data on the universe and also share the reports among other
users. It doesn't
require any software; you just need a web browser and a system connected to the
BusinessObjects server.
Auditor:
Tool used for monitoring and analyzing user and system activity.
Application Foundation:
This module covers a set of products which is used for Enterprise Performance
Management (EPM). The
tools are
. Dashboard manager
. Scorecard
. Performance Management Applications

Q.What is Business Intelligence?


A.Business intelligence (BI) is a broad category of applications and technologies
for gathering, storing,
analyzing, and providing access to data to help enterprise users make better
business
decisions. Business Intelligence is the process of exploring data in order to take
strategic decisions, whether to
drive profitability or to manage costs.
It is a broad category of application programs and technologies for gathering,
storing, analyzing, and
providing access to data to help enterprise users make better business decisions.
BI applications include
the activities of decision support, query and reporting, online analytical
processing (OLAP), statistical
analysis, forecasting, and data mining. This technology is based on customer- and profit-
oriented models that
reduce operating costs and provide increased profitability by improving
productivity, sales and service, and
by improving decision-making capabilities.

Q.Name some of the standard Business Intelligence tools in the market?


A.Some of the standard Business Intelligence tools in the market, according to their
performance:
1.MICROSTRATEGY
2.BUSINESS OBJECTS,CRYSTAL REPORTS
3.COGNOS REPORT NET
4.MS-OLAP SERVICES
5.SAS
6.Business objects
7.Hyperion
8.Microsoft integrated services

Q.What is OLAP?
A.OLAP stands for Online Analytical Processing. It is used for analytical
reporting. It helps to do
business analysis of your data, which normal reporting tools do not support. This is
the major difference between a reporting tool and an OLAP tool. It is a gateway between
the business user
and the DWH.

Q.What is OLAP, MOLAP, ROLAP, DOLAP, HOLAP? Examples?


A.Cubes in a data warehouse are stored in three different modes. A relational
storage model is called
Relational Online Analytical Processing mode or ROLAP, while a Multidimensional
Online Analytical
Processing mode is called MOLAP. When dimensions are stored in a combination of the
two modes then
it is known as Hybrid Online Analytical Processing mode or HOLAP.

1.MOLAP
This is the traditional mode in OLAP analysis. In MOLAP data is stored in form of
multidimensional cubes
and not in relational databases. The advantages of this mode is that it provides
excellent query
performance and the cubes are built for fast data retrieval. All calculations are
pre-generated when the
cube is created and can be easily applied while querying data.
The disadvantages of this model are that it can handle only a limited amount of
data. Since all
calculations have been pre-built when the cube was created, the cube cannot be
derived from a large
volume of data. This deficiency can be bypassed by including only summary level
calculations while
constructing the cube. This model also requires huge additional investment as cube
technology is
proprietary and the knowledge base may not exist in the organization.

2.ROLAP
The underlying data in this model is stored in relational databases. Since the data
is stored in relational
databases, this model gives the appearance of traditional OLAP's slicing and dicing
functionality. The
advantage of this model is that it can handle a large amount of data and can leverage
all the functionalities of
the relational database.
The disadvantages are that the performance is slow and each ROLAP report is an SQL
query with all
the limitations of the genre. It is also limited by SQL functionality. ROLAP
vendors have tried to mitigate
this problem by building into the tool out-of-the-box complex functions as well as
providing the users with
an ability to define their own functions.
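To illustrate the point that a ROLAP report ultimately resolves to an SQL query against the star schema, here is a minimal hypothetical example; the sales_fact, date_dim and product_dim tables and their columns are assumed for illustration only:

SELECT d.calendar_year,
       p.product_name,
       SUM(f.sale_amt) AS total_sales
FROM   sales_fact  f
JOIN   date_dim    d ON d.date_key    = f.date_key
JOIN   product_dim p ON p.product_key = f.product_key
GROUP BY d.calendar_year, p.product_name
ORDER BY d.calendar_year, total_sales DESC;

Every slice, dice or drill operation in a ROLAP tool is translated into a query of this kind, which is why ROLAP performance is bounded by the SQL engine.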

3.HOLAP
HOLAP technology tries to combine the strengths of the above two models. For
summary type
information HOLAP leverages cube technology and for drilling down into details it
uses the ROLAP
model.

Q.Comparing the use of MOLAP, HOLAP and ROLAP


A.The type of storage medium impacts on cube processing time, cube storage and cube
browsing speed.
Some of the factors that affect MOLAP storage are:
1.Cube browsing is the fastest when using MOLAP. This is so even in cases where no
aggregations
have been done. The data is stored in a compressed multidimensional format and can
be accessed more
quickly than in the relational database. Browsing is very slow in ROLAP and about the
same in HOLAP.
Processing time is slower in ROLAP, especially at higher levels of aggregation.

2.MOLAP storage takes up more space than HOLAP as data is copied and at very low
levels of
aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space
as data is not
duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP
aggregations.

3.All data is stored in the cube in MOLAP and data can be viewed even when the
original data source is
not available. In ROLAP data cannot be viewed unless connected to the data source.

4.MOLAP can handle very limited data only as all data is stored in the cube.

Q.How to import universes and users from Business Objects 6.5 to XI R2? It is showing
some
ODBC error; is there any setting to change?
A.You can import universes through the Import option in the File menu. If your ODBC driver is
not connecting, then you
can check your database driver.

Question: How do you generally approach your Analytics Project?
Answer: Any project should start from defining the scope of the project, and the
approach
should be not to deviate from the scope.
Then the project should be functionally divided into smaller modules, generally done
by
project managers along with technical and functional leads.
The functional leads then decide on mainly three things:
1. According to the defined scope of the project they start gathering requirements
while
interacting with the clients.
2. They have a discussion with the technical leads and try to reach a solution.
3. Technical leads decide what schemas to create and what requirements are going
to be
fulfilled by each schema.
Technical leads discuss all this with the developers and try to close the requirements.

Simultaneously, testing and deployment are planned in a phased manner.

Question: How we are going to decide which schema we are going to implement in the
data
warehouse?
Answer: One way is what is mentioned in Question above.
If you ask me to blindly create schemas for the warehouse without knowing any
requirements,
I will simply first divide the schemas on the basis of functional areas of an
Organisation
which are similar to the modules in an ERP like sales, finance, purchase,
inventory,
production, HR etc.
I will broadly describe the expected analysis an organisation would like to do in
every
module. I think this way you would be able to complete at least 40-50 % of the
requirements. To move ahead, study the data and business and you can create few
more
schemas.
Question: What are the Challenges You Faced while making Reports?
Answer: Making a report has never been a difficult task. But the problem comes when users
are reluctant to adopt a new system. I have experienced that if you are not able to
create
the report in exactly the way they are used to seeing it, they will keep asking for
changes. Your
approach should be to first show them what they want to see and then add more
information
to the report.
in the report.
Question: What will you do when your Report is not Fetching the Right Data?
Answer: This is the biggest problem in report creation and verification. There
could be two
reasons for a report not fetching the right data.
1. Mostly clients do not have correct data in their database, and on top of that, to
correct
the results they make some changes at the report level to bring the desired result,
which
you may not be aware of while creating the reports. Clients try to match the data
with their
existing reports and you never get the correct results. You try to discover the
issues and
at a later stage come to know of all these problems, and you are held responsible for
the
delay. Hence always consult the SPOC (Single Point of Contact) and try to understand
the
logic they have used to generate their reports.
2. If the database values are correct, then there could be a problem with the
joins and
relations in the schema. You need to discover that by analysing and digging deep into
the
matter.
Question: How does Analytics process your request when you create your requests?
Answer: If the question means how the Oracle BI Analytics Server processes the
user requests, the
answer is: the Oracle BI server converts the logical SQL submitted by the client into
optimised physical SQL,
which is then sent to the backend database. In between, it also performs various
tasks like converting the
user operations (such as user selections) into a logical SQL, checking and verifying
credentials, breaking
the request into threads (as Oracle BI is a multi-threaded server), processing the
requests, managing the
cached results, and converting the results received from the database into a user-
presentable form, etc.

What is the relation between EME, GDE and the Co>Operating system?

Ans. EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the
Co>Operating system
can be said to be the Ab Initio server.
The relation between the Co>Operating system, EME and GDE is as follows:
The Co>Operating system is the Ab Initio server. It is installed on a particular
O.S. platform, which is called the
NATIVE O.S. Coming to the EME, it is just like the repository in Informatica; it holds
the
metadata, transformations, db config files, and source and target information. Coming to
the GDE, it is the end-user
environment where we can develop the graphs (mappings, just like in Informatica).
The designer uses the GDE, designs the graphs and saves them to the EME or to a sandbox at the user side,
whereas the EME is at the server side.

What is the use of aggregation when we have rollup?

As we know, the rollup component in Ab Initio is used to summarize a group of data records.
Then where will we
use aggregation?
Ans: Aggregation and Rollup both can summarize the data, but rollup is much more
convenient to use. In
terms of understanding how a particular summarization is being done, rollup is much more
explanatory compared to
aggregate. Rollup can also do some other things, like input and output filtering
of records.
Aggregate and rollup perform the same action; rollup can produce intermediate
results in main memory, whereas Aggregate does not support intermediate results.
What kinds of layouts does Ab Initio support?
Basically there are serial and parallel layouts supported by Ab Initio. A graph can
have both at the same
time. The parallel one depends on the degree of data parallelism. If the multi-file
system is 4-way parallel,
then a component in a graph can run 4-way parallel if the layout is defined to be
the same as the
degree of parallelism.

How can you run a graph infinitely?

To run a graph infinitely, the end script in the graph should call the .ksh file of
the graph. Thus if the name
of the graph is abc.mp then in the end script of the graph there should be a call
to abc.ksh.
Like this the graph will run infinitely.

How do you add default rules in transformer?

Double click on the transform parameter of parameter tab page of component


properties, it will open
transform editor. In the transform editor click on the Edit menu and then select
Add Default Rules from the
dropdown. It will show two options - 1) Match Names 2) Wildcard.

Do you know what a local lookup is?

If your lookup file is a multifile and partitioned/sorted on a particular key, then
the local lookup function can be
used ahead of a lookup function call. This is local to a particular partition,
depending on the key.
A Lookup File consists of data records which can be held in main memory. This makes
the transform
function retrieve the records much faster than retrieving them from disk. It allows
the transform component to
process the data records of multiple files quickly.

What is the difference between look-up file and look-up, with a relevant example?

Generally a Lookup file represents one or more serial files (flat files). The amount
of data is small enough to
be held in memory. This allows transform functions to retrieve records much more
quickly than they could
retrieve them from disk.
A lookup is a component of an Ab Initio graph where we can store data and retrieve it
by using a key
parameter.
A lookup file is the physical file where the data for the lookup is stored.
How many components in your most complicated graph? It depends on the type of
components you use.
Usually avoid using too many complicated transform functions in a graph.

Explain what is lookup?

Lookup is basically a specific dataset which is keyed. This can be used to map
values as per the data
present in a particular file (serial/multi file). The dataset can be static as well
as dynamic (in case the lookup
file is being generated in a previous phase and used as a lookup file in the current
phase). Sometimes, hash
joins can be replaced by using reformat and lookup if one of the inputs to the join
contains a small number of
records with a slim record length.
Ab Initio has built-in functions to retrieve values using the key for the lookup.
What is a ramp limit?
The limit parameter contains an integer that represents a number of reject events.
The ramp parameter contains a real number that represents a rate of reject events
relative to the number of
records processed.
Number of bad records allowed = limit + number of records * ramp.
ramp is basically a percentage value (from 0 to 1).
These two together provide the threshold value of bad records.
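As a small worked example (the numbers are hypothetical): with limit = 50, ramp = 0.001 and 100,000 input records, the reject threshold is 50 + 100,000 * 0.001 = 150, so the component can tolerate up to 150 bad records before it aborts.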

Have you worked with packages?

Multistage transform components by default uses packages. However user can create
his own set of
functions in a transfer function and can include this in other transfer functions.

Have you used rollup component? Describe how.

If the user wants to group the records on particular field values then rollup is the
best way to do that. Rollup is
a multi-stage transform function and it contains the following mandatory functions.

1. initialise
2. rollup
3. finalise
You also need to declare a temporary variable if you want to get counts for a
particular group.
For each group, it first calls the initialise function once, followed by
rollup function calls for each
of the records in the group, and finally calls the finalise function once at the end
of the last rollup call.

How do you add default rules in transformer?

Add Default Rules - opens the Add Default Rules dialog. Select one of the
following: Match Names -
generates a set of rules that copies input fields to output fields
with the same name; Use
Wildcard (.*) Rule - generates one rule that copies input fields to output fields
with the same name.
1)If it is not already displayed, display the Transform Editor Grid.
2)Click the Business Rules tab if it is not already displayed.
3)Select Edit > Add Default Rules.
In the case of Reformat, if the destination field names are the same as or a subset of the source
fields, then there is no need to
write anything in the reformat xfr if you do not want any real transform
other than reducing the
set of fields or splitting the flow into a number of flows to achieve the functionality.

What is the difference between partitioning with key and round robin?

Partition by Key or hash partition -> This is a partitioning technique which is
used to partition data when
the keys are diverse. If one key value is present in large volume then there can be a large
data skew. But this method
is used more often for parallel data processing.
Round robin partition is another partitioning technique to uniformly distribute the
data on each of the
destination data partitions. The skew is zero in this case when no of records is
divisible by number of
partitions. A real life example is how a pack of 52 cards is distributed among 4
players in a round-robin
manner.

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory
join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with
proper driving port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the transfer
functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph as long as possible in MFS. For this, input files should be
partitioned, and if possible the
output file should also be partitioned.
How do you truncate a table?
From Ab Initio, run the Run SQL component using the DDL "truncate table ...", or
use the Truncate Table component in Ab Initio.

Have you ever encountered an error called "depth not equal"?


When two components are linked together, if their layouts do not match then this
problem can occur
during the compilation of the graph. A solution to this problem would be to use a
partitioning component in
between if there was a change in layout.

What is the function you would use to transfer a string into a decimal?

In this case no specific function is required if the size of the string and decimal
is same. Just use decimal
cast with the size in the transform function and will suffice. For example, if the
source field is defined as
string(8) and the destination as decimal(8) then (say the field name is field1).
out.field :: (decimal(8)) in.field
If the destination field size is smaller than the input, then the string_substring
function can be used like
the following.
Say the destination field is decimal(5):
out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)) /*
string_lrtrim used to trim leading and
trailing spaces */
What are primary keys and foreign keys?
In an RDBMS the relationship between two tables is represented as a primary key and
foreign key
relationship. The primary key table is the parent table and the foreign key table
is the child table. The
criterion for relating the two tables is that there should be a matching column.
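A minimal SQL sketch of this parent/child relationship; the customer and orders tables are hypothetical examples, not taken from the text above:

CREATE TABLE customer (
    customer_id   NUMBER        PRIMARY KEY,   -- parent table: primary key
    customer_name VARCHAR2(100)
);

CREATE TABLE orders (
    order_id     NUMBER PRIMARY KEY,
    customer_id  NUMBER NOT NULL,
    order_amount NUMBER(12,2),
    CONSTRAINT fk_orders_customer              -- child table: matching column refers back to the parent
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);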
What is the difference between clustered and non-clustered indices? ...and why do
you use a
clustered index?
What is an outer join?
An outer join is used when one wants to select all the records from a port,
whether they have
satisfied the join criteria or not.
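A hedged SQL illustration of an outer join, using the same hypothetical customer and orders tables as above: the LEFT OUTER JOIN keeps every customer even when no matching order exists.

SELECT c.customer_id,
       c.customer_name,
       o.order_id            -- NULL for customers with no orders
FROM   customer c
LEFT OUTER JOIN orders o
       ON o.customer_id = c.customer_id;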
What are Cartesian joins?
joins two tables without a join key. Key should be {}.
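For comparison, a Cartesian (cross) join in SQL simply omits the join condition, producing every combination of rows; the tables are the same hypothetical ones as above:

-- Every customer paired with every order: row count = customers x orders
SELECT c.customer_id, o.order_id
FROM   customer c
CROSS JOIN orders o;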
What is the purpose of having stored procedures in a database?
The main purpose of a stored procedure is to reduce network traffic: the SQL
statements are stored and executed on the database server,
so execution is much faster.
Why might you create a stored procedure with the 'with recompile' option?
Recompile is useful when the tables referenced by the stored proc undergo a lot
of
modification/deletion/addition of data. Due to the heavy modification activity the
execution plan becomes
outdated and hence the stored proc performance goes down. If we create the stored
proc with the recompile
option, SQL Server won't cache a plan for this stored proc and it will be
recompiled every time it is run.
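A minimal SQL Server (T-SQL) sketch of the option, with a hypothetical procedure and table:

CREATE PROCEDURE dbo.usp_GetOrdersByStatus
    @Status VARCHAR(20)
WITH RECOMPILE            -- a fresh execution plan is built on every call
AS
BEGIN
    SELECT order_id, customer_id, order_amount
    FROM   dbo.orders
    WHERE  order_status = @Status;
END;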

What is a cursor? Within a cursor, how would you update fields on the row just
fetched?

The Oracle engine uses work areas for internal processing in order to execute SQL
statements; such a work area is called a
cursor. There are two types of cursors: implicit cursors and explicit
cursors. Implicit cursors are used for
internal processing, and explicit cursors are opened by the user when the data is required.
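A hedged PL/SQL sketch of updating the row just fetched, using FOR UPDATE and WHERE CURRENT OF on a hypothetical orders table:

DECLARE
    CURSOR c_orders IS
        SELECT order_id, order_amount
        FROM   orders
        WHERE  order_status = 'OPEN'
        FOR UPDATE;                        -- lock the rows so they can be updated via the cursor
BEGIN
    FOR r IN c_orders LOOP
        UPDATE orders
        SET    order_amount = r.order_amount * 1.1
        WHERE  CURRENT OF c_orders;        -- updates exactly the row just fetched
    END LOOP;
    COMMIT;
END;
/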

How would you find out whether a SQL query is using the indices you expect?

The explain plan can be reviewed to check the execution plan of the query. This would
show whether the expected
indexes are used or not.
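In Oracle, a minimal way to do this (table and index names are hypothetical) is:

EXPLAIN PLAN FOR
SELECT *
FROM   orders
WHERE  customer_id = 1001;

-- Show the plan; the operation column reveals whether an INDEX RANGE SCAN
-- on the expected index (e.g. a hypothetical ix_orders_customer) was chosen or a full table scan.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);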

How can you force the optimizer to use a particular index?

Use hints /*+ */; these act as directives to the optimizer.


select /*+ index(a index_name) full(b) */ *
from table1 a, table2 b
where b.col1 = a.col1
  and b.col2 = 'sid'
  and b.col3 = 1;
When using multiple DML statements to perform a single unit of work, is it
preferable to use implicit or
explicit transactions, and why?
Explicit transactions are preferable, because the DML statements form a single unit of work:
wrapping them in one explicit transaction ensures that they all commit or all roll back together.
Describe the elements you would review to ensure multiple scheduled "batch" jobs do
not "collide" with
each other.
Review the dependencies between the jobs: every job may depend upon another job. For
example, a job should
execute only if the preceding job completes successfully; otherwise it should not run.

Describe the process steps you would perform when defragmenting a data table.

This table contains mission critical data.


There are several ways to do this:
1) We can move the table within the same or to another tablespace and rebuild all the
indexes on the table.
"alter table ... move" reclaims the fragmented space in the table;
"analyze table table_name compute statistics" captures the updated statistics.
2) A reorg could also be done by taking a dump (export) of the table, truncating the table, and importing
the dump back into
the table.
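A hedged Oracle sketch of option 1; the tablespace, table and index names are placeholders:

-- Rebuild the table segment; this invalidates its indexes
ALTER TABLE orders MOVE TABLESPACE users_data;

-- Rebuild each index afterwards
ALTER INDEX pk_orders REBUILD;
ALTER INDEX ix_orders_customer REBUILD;

-- Refresh optimizer statistics
ANALYZE TABLE orders COMPUTE STATISTICS;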

Explain the difference between the "truncate" and "delete" commands.

The difference between the TRUNCATE and DELETE statements is that TRUNCATE belongs to the DDL
commands
whereas DELETE belongs to the DML commands. A rollback cannot be performed in the case of a
Truncate statement,
whereas a rollback can be performed for a Delete statement. A "WHERE" clause cannot be used
with Truncate,
whereas a "WHERE" clause can be used in a DELETE statement.
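The contrast in a short hypothetical SQL snippet:

-- DML: selective, logged, can be rolled back
DELETE FROM orders WHERE order_status = 'CANCELLED';
ROLLBACK;                 -- the cancelled orders are restored

-- DDL: removes all rows, no WHERE clause, cannot be rolled back
TRUNCATE TABLE orders;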

What is the difference between a DB config and a CFG file?

A .dbc file has the information required for Ab Initio to connect to the database
to extract or load tables or
views, while a .cfg file is the table configuration file created by db_config when
using components like
Load DB Table.

Describe the "Grant/Revoke" DDL facility and how it is implemented.

Basically, this is a part of the DBA's responsibilities. GRANT gives permissions, for


example GRANT
CREATE TABLE, CREATE VIEW and many more.
REVOKE means cancelling the grant (permissions). So, both the Grant and Revoke commands depend
upon the
DBA.
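A small hypothetical example of the facility; user and table names are placeholders:

-- Give a reporting user read access and the ability to create views
GRANT SELECT ON orders TO report_user;
GRANT CREATE VIEW TO report_user;

-- Take the read access away again
REVOKE SELECT ON orders FROM report_user;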


Overview: Data Warehouse Ab Initio Interview Questions, as asked in various
companies.
What is the latest version that is available in Ab-initio?
How to take the input data from an excel sheet?
How will you test a dbc file from command prompt ?
Which one is faster for processing fixed length dmls or delimited dmls and why ?
What are the continuous components in Ab Initio?
What is meant by fancing in abinitio ?
What is the relation between EME , GDE and Co-operating system ?
What is the use of aggregation when we have rollup, as we know the rollup component in
Ab Initio is used to
summarize a group of data records? Then where will we use aggregation?
Describe the process steps you would perform when defragmenting a data table. This
table contains
mission critical data.
Explain the difference between the "truncate" and "delete" commands.
When running a stored procedure definition script how would you guarantee the
definition could be "rolled
back" in the event of problems.
Describe the "Grant/Revoke" DDL facility and how it is implemented.
Describe how you would ensure that database object definitions (Tables, Indices,
Constraints, Triggers,
Users, Logins, Connection Options, and Server Options etc) are consistent and
repeatable between
multiple database instances (i.e.: a test and production copy of a database).
What is the difference between a DB config and a CFG file?
What about DML changes dynamically?
What is backward compatibility in abinitio?
What kinds of layouts does Ab Initio support?
How do you add default rules in transformer?
Have you used rollup component? Describe how.
What are primary keys and foreign keys?
What is an outer join?
What are Cartesian joins?
What is the purpose of having stored procedures in a database?
What is a cursor? Within a cursor, how would you update fields on the row just
fetched?
How would you find out whether a SQL query is using the indices you expect?
How can you force the optimizer to use a particular index?
When using multiple DML statements to perform a single unit of work, is it
preferable to use implicit or
explicit transactions, and why.
Describe the elements you would review to ensure multiple scheduled "batch" jobs do
not "collide" with
each other.
What is semi-join
How to get DML using Utilities in UNIX?
What is driving port? When do you use it?
What is local and formal parameter
What is BROADCASTING and REPLICATE?

Explain what is lookup?


Have you worked with packages?
How to create repository in abinitio for stand alone system(LOCAL NT)?
What is the difference between .dbc and .cfg file?
What does dependency analysis mean in Ab Initio?
What do you have to give the value for the Record Required parameter for a natural
join?
When do you use Partition by Expression?
What is Adhoc File System? Give me a scenario where you used it.
What are the different commands that you used when writing wrappers?
What do the hidden files in a sandbox represent and what does start.ksh represent?
How can we test Ab Initio manually and with automation?
What is the difference between sandbox and EME, can we perform checkin and checkout
through
sandbox/ Can anybody explain checkin and checkout?
What does layout means in terms of Ab Initio
What are different things that you have to consider when loading data into a table?

How to Create Surrogate Key using Ab Initio?


Can anyone give me an example of a real-time start script in the graph?
What are differences between different GDE versions(1.10,1.11,1.12,1.13and 1.15)?
What are
differences between different versions of Co-op?
Do you know what a local lookup is?
How many components in your most complicated graph?
How to handle if DML changes dynamically in abinitio
Explain what is lookup?
Have you worked with packages?
How to run the graph without GDE?
What are the different versions and releases of ABinitio (GDE and Co-op version)
What is the Difference between DML Expression and XFR Expression ?
How Does MAXCORE works?
What is $mpjret? Where it is used in ab-initio?
How do you convert 4-way MFS to 8-way mfs?

What is skew and skew measurement?


What is the importance of EME in abinitio?
How do you add default rules in transformer?
What is difference between file and table in abinitio
How to create a computer program that computes the monthly interest charge on a
credit card account?
What is .abinitiorc and What it contain?
What do you mean by .profile in Abinitio and what does it contains?
What is data mapping and data modelling?
What is the difference between partitioning with key and round robin?
Can anyone tell me what happens when the graph run? i.e The Co-operating System
will be at the host,
We are running the graph at some other place. How the
How would you do performance tuning for already built graph ? Can you let me know
some examples?
How to execute the graph from start to end stages? Tell me and how to run graph in
non-Abinitio
system?
What are the most commonly used components in an Ab Initio graph? Can anybody give
me a practical
example of a transformation of data, say customer data in a credit card company, into
meaningful output
based on business rules?
Can we load multiple files?
Can anyone please explain the environment variables with an example?
Explain the differences between api and utility mode?
Please let me know whether we have ab initio GDE version 1.14 and what is the
latest GDE version and
Co-op version?
What are the Graph parameter?
How to find the number of arguments defined in graph..
What is the difference between rollup and scan?
How to work with parameterized graphs?
Please give us insight on Enterprise Meta Environment, and some possible questions
on that.
What are delta table and master table?
What error would you get when you use Partition by Round Robin and Join?

How do you count the number of records in a flat file?


How do you connect EME to Abinitio Server?
Have you ever encountered an error called 'depth not equal'? (This occurs when you extensively create
graphs; it is a trick question.)
What is the difference between a DB config and a CFG file?
Do you know what a local lookup is?
What is the difference between look-up file and look-up, with a relevant example?
Have you worked with packages?
In which scenarios would you use Partition by Key and also, Partition by Round
Robin and differences
between the both?
What are the different dimension tables that you used and some columns in the fact
table?
What is the difference between a Scan component and a RollUp component?
How do we handle DML that changes dynamically?
What is m_dump
What is the syntax of m_dump command?
Have you used rollup component? Describe how.
How do you improve the performance of a graph?
How many components are there in your most complicated graph?
What is the function you would use to transfer a string into a decimal?
For data parallelism, we can use partition components. For component parallelism,
we can use replicate
component. Like this which component(s) can we use for pipeline parallelism?
What is AB_LOCAL expression where do you use it in ab-initio?
What is meant by the Co>Operating System and why is it special for Ab Initio?
How do you retrieve data from a database as a source? Which component is used for this?
How can you run a graph infinitely?
How do we run sequences of jobs, where the output of job A is the input to job B? How do we co-ordinate
the jobs?
How do you truncate a table?

What is a ramp limit?


What is the difference between dbc and cfg? When do you use these two?
What are the compilation errors you came across while executing your graphs?
What is depth_error?
What is the difference between conventional loading and direct loading? When is each used in real time?
During the execution of graph, let us say you lost the network connection, would
you have to start the
process all over again or does it start from where it stopped?
What are the different types of partitions and scenarios.
What does dependency analysis mean in Ab Initio?
What does unused port in join component do?
Define Multi file system. Can you create multifile system on the same server? Also,
if you have a table
that has Name, Address, Status, Position attributes, can Name and Address be on one
partition and
Status and Position in the other partition?
What is a sandbox? Did the Co>Operating System version 2.8 have a sandbox? If not, how would you store
the respective files?
How did you do version control? Which tool did you use?
How do you troubleshoot performance issues in graph?
What are the usual errors that you encounter during ETL process apart from
compilation process?
Were you involved in production support? What were the different kinds of problems
that you
encountered?
How do you count the number of records in a multifile system without using GDE?
What does Scan and Rollup component do and give a scenario where you used them?
Did you ever use user-defined functions or packages? If yes, give a scenario.
What is difference between Redefine Format and Reformat components?
Sometimes you have to use dynamic length strings. Can you give me one circumstance
where you need
it?
Why might you create a stored procedure with the 'with recompile' option?
How many parallelisms are in Abinitio? Please give a definition of each.
How do you schedule graphs in Ab Initio, like workflow scheduling in Informatica? And where must we use
Unix shell scripting in Ab Initio?
How to Improve Performance of graphs in Ab initio? Give some examples or tips.

SQL

SQL interview questions and answers

By admin | July 14, 2008

1. What are two methods of retrieving SQL?


2. What cursor type do you use to retrieve multiple recordsets?
3. What is the difference between a "where" clause and a "having" clause? - "Where"
is a kind of
restiriction statement. You use where clause to restrict all the data from DB.Where
clause is using before
result retrieving. But Having clause is using after retrieving the data.Having
clause is a kind of filtering
command.
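For example, a minimal sketch (the employees table and its columns are hypothetical):

-- hypothetical table: employees(department, salary)
-- WHERE filters individual rows before grouping;
-- HAVING filters the groups produced by GROUP BY.
SELECT department, COUNT(*) AS employee_count
FROM employees
WHERE salary > 30000          -- row-level filter, applied first
GROUP BY department
HAVING COUNT(*) > 5;          -- group-level filter, applied after grouping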
4. What is the basic form of a SQL statement to read data out of a table? The basic form to read data
out of a table is "SELECT * FROM table_name;". An answer such as "SELECT * FROM table_name WHERE
xyz = 'whatever';" cannot be called the basic form because of the WHERE clause.
5. What structure can you implement for the database to speed up table reads? - Following the rules of
DB tuning we have to: 1] properly use indexes (different types of indexes); 2] properly locate different DB
objects across different tablespaces, files and so on; 3] create a special space (tablespace) to hold some of
the data with special datatypes (for example CLOB, LOB and so on).
6. What are the tradeoffs with having indexes? - 1. Faster selects, slower updates.
2. Extra storage space
to store indexes. Updates are slower because in addition to updating the table you
have to update the index.
7. What is a "join"? - �join� used to connect two or more tables logically with or
without common field.
8. What is "normalization"? "Denormalization"? Why do you sometimes want to
denormalize? -
Normalizing data means eliminating redundant information from a table and
organizing the data so that
future changes to the table are easier. Denormalization means allowing redundancy
in a table. The main
benefit of denormalization is improved performance with simplified data retrieval
and manipulation. This is
done by reduction in the number of joins needed for data processing.
9. What is a "constraint"? - A constraint allows you to apply simple referential
integrity checks to a table.
There are four primary types of constraints that are currently supported by SQL
Server: PRIMARY/UNIQUE
- enforces uniqueness of a particular table column. DEFAULT - specifies a default
value for a column in case
an insert operation does not provide one. FOREIGN KEY - validates that every value
in a column exists in a
column of another table. CHECK - checks that every value stored in a column is in
some specified list. Each
type of constraint performs a specific type of action. Default is not a constraint.
NOT NULL is one more
constraint which does not allow values in the specific column to be null. And also
it the only constraint which
is not a table level constraint.
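As a rough illustration (the table and column names are made up, and a customers table is assumed to
already exist):

-- hypothetical orders table showing the constraint types described above
CREATE TABLE orders (
    order_id    INT NOT NULL PRIMARY KEY,                         -- PRIMARY KEY: unique and not null
    customer_id INT NOT NULL REFERENCES customers(customer_id),   -- FOREIGN KEY: must exist in customers
    status      VARCHAR(10) DEFAULT 'NEW'                         -- DEFAULT: used when no value is supplied
                CHECK (status IN ('NEW', 'PAID', 'SHIPPED'))      -- CHECK: value must be in the specified list
);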
10. What types of index data structures can you have? - An index helps to faster
search values in tables.
The three most commonly used index-types are: - B-Tree: builds a tree of possible
values with a list of row
IDs that have the leaf value. Needs a lot of space and is the default index type
for most databases. - Bitmap:
string of bits for each possible value of the column. Each bit string has one bit
for each row. Needs only a little space and is very fast (however, the domain of values cannot be large,
e.g. SEX(m,f), degree(BS,MS,PHD)). - Hash:
A hashing algorithm is used to assign a set of characters to represent a text
string such as a composite of keys
or partial keys, and compresses the underlying data. Takes longer to build and is
supported by relatively few
databases.
11. What is a "primary key"? - A PRIMARY INDEX or PRIMARY KEY is something which
comes mainly
from
database theory. From its behavior is almost the same as an UNIQUE INDEX, i.e.
there may only be one of
each value in this column. If you call such an INDEX PRIMARY instead of UNIQUE, you
say something
about
your table design, which I am not able to explain in few words. Primary Key is a
type of a constraint
enforcing uniqueness and data integrity for each row of a table. All columns
participating in a primary key
constraint must possess the NOT NULL property.
12. What is a "functional dependency"? How does it relate to database table design?
- Functional
dependency relates to how one object depends upon the other in the database. for
example,
procedure/function sp2 may be called by procedure sp1. Then we say that sp1 has
functional dependency on
sp2.
13. What is a "trigger"? - Triggers are stored procedures created in order to
enforce integrity rules in a
database. A trigger is executed every time a data-modification operation occurs
(i.e., insert, update or
delete). Triggers are executed automatically on occurance of one of the data-
modification operations. A
trigger is a database object directly associated with a particular table. It fires
whenever a specific
statement/type of statement is issued against that table. The types of statements
are insert,update,delete and
query statements. Basically, trigger is a set of SQL statements A trigger is a
solution to the restrictions of a
constraint. For instance: 1.A database column cannot carry PSEUDO columns as
criteria where a trigger can.
2. A database constraint cannot refer old and new values for a row where a trigger
can.
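A minimal SQL Server-style sketch (the employees and salary_audit tables and their columns are
hypothetical):

-- hypothetical tables: employees(employee_id, salary), salary_audit(...)
-- Fires automatically after every UPDATE on employees and records salary changes.
CREATE TRIGGER trg_employees_salary_audit
ON employees
AFTER UPDATE
AS
BEGIN
    INSERT INTO salary_audit (employee_id, old_salary, new_salary, changed_at)
    SELECT d.employee_id, d.salary, i.salary, GETDATE()
    FROM deleted d
    JOIN inserted i ON i.employee_id = d.employee_id
    WHERE i.salary <> d.salary;
END;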
14. Why can a "group by" or "order by" clause be expensive to process? - Processing
of "group by" or
"order by" clause often requires creation of Temporary tables to process the
results of the query. Which
depending of the result set can be very expensive.
15. What is "index covering" of a query? - Index covering means that "Data can be
found only using
indexes, without touching the tables"
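For example, in SQL Server a covering index for a particular query might look like this (table and column
names are assumptions):

-- hypothetical orders table
-- The query below can be answered entirely from the index,
-- because every referenced column is part of the index definition.
CREATE INDEX ix_orders_customer
    ON orders (customer_id)
    INCLUDE (order_date, total_amount);

SELECT order_date, total_amount
FROM orders
WHERE customer_id = 42;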
16. What types of join algorithms can you have?
17. What is a SQL view? - The output of a query can be stored as a view. A view acts like a small table which
meets our criteria. A view is a precompiled SQL query which is used to select data from one or more tables.
A view is like a table but it doesn't physically take any space. A view is a good way to present data in a
particular format if you use that query quite often. Views can also be used to restrict users from accessing
the tables directly.
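A small sketch (table and column names are made up):

-- hypothetical customers table
-- Presents only active customers and hides the underlying table
CREATE VIEW active_customers AS
SELECT customer_id, first_name, last_name
FROM customers
WHERE is_active = 1;

-- Users can then query the view as if it were a table
SELECT * FROM active_customers WHERE last_name = 'Smith';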

Linux command line Q&A

By admin | July 15, 2008

1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should
you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In
order to view its contents without first having to decompress it, use the _________
utility.
zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat displays
a file.
4. You suspect that you have two commands with the same name as the command is not
producing the expected results. What command can you use to determine the location
of the
command being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command
can you use to determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory so you
issue the
command ln /data /home/bob/datalink but the command fails. What option should you
use
in this command line to be successful.
Use the -F option
In order to create a link to a directory you must use the -F option.
7. When you issue the command ls -l, the first character of the resulting display
represents the
file's ___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar.
What command should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you
use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users' home directories. What
utility should
you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to
split it into four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command
should you type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used
and each
user�s defined quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.

SQL knowledge is usually basic knowledge required for almost all database related
technical jobs. Therefore it is good to know some SQL Interview questions and
answers. This post will mainly contain "generic" SQL questions and will focus on
questions that allow testing the candidate's knowledge about sql itself but also
logical
thinking. It will start from basic questions and finish on questions and answers
for
experienced candidates. If you are after a broader set of questions, I recommend visiting the links at the
bottom that will point you to more interview questions and answers related to SQL Server.
I will start with one general sql interview question and then go into basic sql
questions
and increase the difficulty. I will explain questions using standard sql knowledge
but at
the end I will add comments related to SQL Server.

Who is it for?

. People doing SQL related Interviews (face to face)


. Recruiters trying to check the candidate's proficiency with SQL
. Candidates who can prepare better for the interview (You won't get explicit
answers here)

These questions are mainly small tasks where the candidate can present not only their SQL knowledge but
also analytical skills and relational database understanding. Remember, if you know exactly what you need
(or how you work), make sure you include these kinds of questions and make them very clear to the
candidate, so they have a chance to answer them (without guessing).
SQL INTERVIEW QUESTIONS

Below is a list of questions in this blog post so you can test your knowledge
without
seeing answers. If you would like to see the questions and answers, please scroll down.
Question: What type of joins have you used?
Question: How can you combine two tables/views together? For instance one table
contains 100 rows and the other one contains 200 rows, have exactly the same fields

and you want to show a query with all data (300 rows). This sql interview question
can
get complicated.
Question: What is the difference between where and having clause?
Question: How would you apply a date range filter?
Question: What type of wildcards have you used? This is usually one of the mandatory SQL interview
questions.
Question: How do you find orphans?
Question: How would you solve the following sql queries using today's date?
First day of previous month
First day of current month
Last day of previous month
Last day of current month
Question: You have a table that records website traffic. The table contains website
name
(multiple websites), page name, IP address and UTC date time. What would be the
query
to show all websites visited in the last 30 days with the total number of visits, total number of unique
page views and total number of unique visitors (using IP address)?
Question: How to display the top 5 employees with the highest number of sales (total) and display the
position as a field. Note that if two employees have the same total sales values they should receive the
same position; in other words, Top 5 employees might return more than 5 employees.
Question: How to get accurate age of an employee using SQL?
Question: This is a SQL Server interview question. You have three fields: ID, Date and Total. Your table
contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to
show only one row per day. The row with the highest ID per day should be returned and the rest should be
hidden from users (not returned).
Question: How to return truly random data from a table? Let's say the top 100 random rows?
Question: How to create recursive query in SQL Server?

GENERAL SQL INTERVIEW QUESTIONS AND ANSWERS

Question: How long have you used SQL for? Did you have any breaks?
Answer: SQL skills vary a lot depending on the type of job and the experience of the candidate, so I
wouldn't pay too much attention to this SQL interview question, but it is always worth having this
information before asking SQL tasks, so you know if you are dealing with someone who is truly interested in
SQL (they might have just 1 year of experience but be really good at it and at answering the questions) or
someone who doesn't pay much attention to gaining proper knowledge and has been like that for many
years (which doesn't always mean you don't want them).

BASIC SQL INTERVIEW QUESTIONS AND ANSWERS

Question: What type of joins have you used?


Answer: Joins knowledge is a MUST HAVE. This interview question is quite nice because most people have
used inner join and (left/right) outer join, which is rather mandatory knowledge, but those more
experienced will also mention cross join and self-join. In SQL Server you can also get full outer join.
Question: How can you combine two tables/views together? For instance one table
contains 100 rows and the other one contains 200 rows, have exactly the same
fields and you want to show a query with all data (300 rows). This sql interview
question can get complicated.
Answer: You use UNION operator. You can drill down this question and ask what is
the
different between UNION and UNION ALL (the first one removes duplicates (not always

desirable)� in other words shows only DISTINCT rows�.Union ALL just combines so it
is
also faster). More tricky question are how to sort the view (you use order by at
the last
query), how to name fields so they appear in query results/view schema (first query
field
names are used). How to filter groups when you use union using SQL (you would
create
separate query or use common table expression (CTE) or use unions in from with ().
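A quick sketch with hypothetical tables that share the same fields:

-- hypothetical tables: customers_2010 and customers_2011 with identical columns
-- UNION removes duplicate rows; UNION ALL keeps them (and is faster)
SELECT customer_id, first_name, last_name FROM customers_2010
UNION ALL
SELECT customer_id, first_name, last_name FROM customers_2011
ORDER BY last_name;   -- ORDER BY applies to the combined result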
Question: What is the difference between where and having clause?
Answer: In SQL, WHERE filters data at the lowest row level. HAVING filters data after GROUP BY has been
performed, so it filters on "groups".
Question: How would you apply a date range filter?
Answer: This is a tricky question. You can use simple conditions with >= and <= (or similar), or use
BETWEEN ... AND, but the trick is to know your exact data type. Sometimes date fields contain a time
portion and that is where the query can go wrong, so it is recommended to use some date-related
functions to remove the time issue. In SQL Server a common function used for that is DATEDIFF. You also
have to be aware of different time zones and the server time zone.
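A small sketch of a safe date range filter (the orders table and column are hypothetical):

-- hypothetical orders table
-- The whole of January 2011, regardless of any time portion stored in order_date
SELECT *
FROM orders
WHERE order_date >= '2011-01-01'
  AND order_date <  '2011-02-01';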
Question: What type of wildcards have you used? This is usually one of the mandatory SQL interview
questions.
Answer: The first question is: what is a wildcard? Wildcards are special characters that allow matching
strings without having an exact match. In simple words they work like 'contains' or 'begins with'. Wildcard
characters are software specific; in SQL Server we have %, which represents any group of characters, _,
which represents exactly one (any) character, and you also get [], where [ab] means the letter a or b in a
specific place.
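For example (the customers table is made up):

-- hypothetical customers table
SELECT * FROM customers WHERE last_name LIKE 'Sm%';      -- begins with "Sm"
SELECT * FROM customers WHERE last_name LIKE '%son';     -- ends with "son"
SELECT * FROM customers WHERE last_name LIKE '_mith';    -- any single character, then "mith"
SELECT * FROM customers WHERE last_name LIKE '[JT]ones'; -- "Jones" or "Tones" (SQL Server specific)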
Question: How do you find orphans?
Answer: This is a more comprehensive SQL and database interview question. First of all we test whether
the candidate knows what an orphan is. An orphan is a foreign key value in the "child" table which doesn't
exist in the primary key column of the parent table. To find them you can use a left outer join (important:
child table on the left side) with the join condition on the primary/foreign key columns and a where clause
where the primary key is null. Adding DISTINCT or COUNT to the select is common practice. In SQL Server
you can also use EXCEPT, which will show all unique values from the first query that don't exist in the
second query.
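A sketch of the left-join approach (orders and customers are hypothetical tables):

-- hypothetical tables: orders (child) and customers (parent)
-- Orders whose customer_id has no matching row in customers
SELECT DISTINCT o.customer_id
FROM orders o
LEFT OUTER JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;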
Question: How would you solve the following sql queries using today's date?
First day of previous month
First day of current month
Last day of previous month
Last day of current month
Answer: These tasks require a good grasp of SQL functions but also logical thinking, which is one of the
primary skills involved in solving SQL questions. In this case I provided links to actual answers with code
samples. Experienced people should give the correct answer almost immediately. People with less
experience might need more time or would require some help (Google).
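One possible SQL Server sketch, using DATEADD/DATEDIFF month arithmetic (other approaches work too):

-- Month arithmetic anchored at "month zero" (1900-01-01)
SELECT
  DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)                   AS FirstDayOfPreviousMonth,
  DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)                       AS FirstDayOfCurrentMonth,
  DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0))     AS LastDayOfPreviousMonth,
  DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 1, 0)) AS LastDayOfCurrentMonth;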

INTERMEDIATE SQL INTERVIEW QUESTIONS AND ANSWERS

Question: You have a table that records website traffic. The table contains website

name (multiple websites), page name, IP address and UTC date time. What would
be the query to show all websites visited in the last 30 days with the total number of visits, total number
of unique page views and total number of unique visitors (using
IP Address)?
Answer: This test is mainly about a good understanding of aggregate functions and date/time handling. We
need to group by website and filter the data using DATEDIFF, but the trick here is to use the correct time
zone. If I want to do that using UTC time then I could use GetUTCDate() in SQL Server, and the final answer
relies on calculated fields using aggregate functions that I will list on separate lines below:
TotalNumberOfClicks = Count(*) 'nothing special here
TotalUniqueVisitors = Count(distinct IPAddress) 'we count the IPAddress field but only unique IP addresses.
The next field should logically come second but, as it is more complicated, I put it third.
TotalNumberOfUniquePageViews = Count(distinct PageName + IPAddress) 'this one is tricky: to get unique
page views we need to count visits per page, but only for unique IP addresses, so I combined PageName
with IPAddress and counted the unique values.

Just to explain: one page could receive three visits from two unique visitors, and another page could
receive one visit from an IP that also visited the first page. In that case the number of unique visitors is 2,
the number of unique page views is 3 (two distinct visitor/page pairs on the first page plus one on the
second) and the total number of visits is 4.
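Putting it together, a possible SQL Server sketch (the web_traffic table and its column names are
assumptions based on the question):

-- hypothetical table: web_traffic(WebsiteName, PageName, IPAddress, VisitDateUtc)
SELECT
    WebsiteName,
    COUNT(*)                              AS TotalNumberOfVisits,
    COUNT(DISTINCT PageName + IPAddress)  AS TotalNumberOfUniquePageViews,
    COUNT(DISTINCT IPAddress)             AS TotalUniqueVisitors
FROM web_traffic
WHERE DATEDIFF(DAY, VisitDateUtc, GETUTCDATE()) <= 30
GROUP BY WebsiteName;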
Question: How to display the top 5 employees with the highest number of sales (total) and display the
position as a field. Note that if two employees have the same total sales values they should receive the
same position; in other words, Top 5 employees might return more than 5 employees.
Answer: Microsoft introduced ranking functions in SQL Server 2005 and they are ideal for solving this query.
The RANK() function can be used to do that; DENSE_RANK() can also be used. Actually the question is
ambiguous, because if your two top employees have the same total sales, which position should the third
employee get: 2 (the DENSE_RANK() behaviour) or 3 (the RANK() behaviour)? In order to filter the query, a
Common Table Expression (CTE) can be used, or the query can be put inside FROM using brackets ().
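A possible sketch (the sales table and its columns are hypothetical):

-- hypothetical table: sales(employee_id, sale_amount)
WITH ranked AS (
    SELECT
        employee_id,
        SUM(sale_amount) AS total_sales,
        RANK() OVER (ORDER BY SUM(sale_amount) DESC) AS position
    FROM sales
    GROUP BY employee_id
)
SELECT employee_id, total_sales, position
FROM ranked
WHERE position <= 5;   -- ties share a position, so more than 5 rows may come back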
Now that we covered basic and intermediate questions let's continue with more
complicate ones. These questions and answers are suitable for experienced
candidates:

ADVANCED SQL INTERVIEW QUESTIONS AND ANSWERS

Question: How to get accurate age of an employee using SQL?


Answer: The word accurate is crucial here. The short answer is that you have to play with several functions.
For a more comprehensive answer see the following link: SQL Age Function - Calculate accurate age using
SQL Server.
Question: This is a SQL Server interview question. You have three fields: ID, Date and Total. Your table
contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to
show only one row per day. The row with the highest ID per day should be returned and the rest should be
hidden from users (not returned).
To better picture the question below is sample data and sample output:
ID, Date, Total
1, 2011-12-22, 50
2, 2011-12-22, 150
The correct result is:
2, 2011-12-22, 150
The correct output is single row for 2011-12-22 date and this row was chosen
because it
has the highest ID (2>1)
Answer: Usually GROUP BY and an aggregate function (MAX/MIN) are used, but in this case that alone will
not work. Removing duplicates with this kind of rule is not so easy; however, SQL Server provides ranking
functions and the candidate can use the DENSE_RANK function partitioned by Date and ordered by ID
(descending), then use a CTE (or a derived table in FROM) and filter it using rank = 1. There are several
other ways to solve that, but I found this way to be the most efficient and simple.
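A sketch of that approach (the table name daily_totals is an assumption):

-- hypothetical table: daily_totals(ID, Date, Total)
WITH ranked AS (
    SELECT ID, [Date], Total,
           DENSE_RANK() OVER (PARTITION BY [Date] ORDER BY ID DESC) AS rnk
    FROM daily_totals
)
SELECT ID, [Date], Total
FROM ranked
WHERE rnk = 1;   -- keeps only the row with the highest ID for each date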
Question: How to return truly random data from a table? Let's say the top 100 random rows?
I must admit I didn't answer this SQL interview question correctly a few years back.
Answer: Again, this is more of a SQL Server answer: you can do that using the NEWID() function in the
ORDER BY clause and using TOP 100 in the SELECT. There is also the TABLESAMPLE clause, but it is not
truly random, as it operates on pages rather than rows, and it might also not return the number of rows
you wanted.
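For example (any table works; customers is a placeholder name):

-- NEWID() generates a fresh GUID per row, so the sort order is effectively random
SELECT TOP 100 *
FROM customers
ORDER BY NEWID();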
Question: How to create recursive query in SQL Server?
Answer: The first question is actually: what is a recursive query? The most common example is a
parent-child hierarchy, for instance an employee hierarchy where an employee can have only one manager
and a manager can have none or many employees reporting to them. A recursive query can be created in
SQL using a stored procedure, but you can also use a CTE (Common Table Expression); for more
information visit SQL Interview question - recursive query (Microsoft). It might also be worth asking about
performance, as CTEs are not always very fast, but in this case I don't know which one would perform
better.
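A minimal recursive CTE sketch for such a hierarchy (an employees table with employee_id/manager_id
columns is assumed):

-- hypothetical table: employees(employee_id, manager_id)
WITH emp_tree AS (
    -- anchor member: top-level employees (no manager)
    SELECT employee_id, manager_id, 0 AS emp_level
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- recursive member: direct reports of the previous level
    SELECT e.employee_id, e.manager_id, t.emp_level + 1
    FROM employees e
    JOIN emp_tree t ON e.manager_id = t.employee_id
)
SELECT employee_id, manager_id, emp_level
FROM emp_tree;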
I will try to find time to add more questions soon. Feel free to suggest new
questions
(add comments).
The following link shows SQL Queries Examples from beginner to advanced

See also:
SSIS Interview questions and answers
SSRS Interview questions and answers
SQL Server Interview questions and answers
In this blog I will post SQL query examples, as learning from examples is usually very effective (sometimes
better than any tutorial), and this can also help with SQL interview questions and answers. I will start from
basic SQL queries and go on to advanced and complex queries. I will use SQL Server 2008 R2 and I will try
to remember to add comments for features that are new. The database I use is called
AdventureWorksDW2008R2; it is a Microsoft training database that you can download from the Microsoft
site. I will be posting new samples for the next several weeks.

BASIC SQL QUERIES

Date Related Examples


In this example I will show several date related functions.
--today's date
SELECT GETDATE() TodaysDate
,GETUTCDATE() UTCTodaysDate
,CAST(GETDATE() AS float) TodaysDate
,CAST(GETDATE() AS INT) TodaysDate
--is string date
SELECT ISDATE('2011-01-01') CheckIfValueIsDate
,ISDATE('20110101') ThisIsStillDate
,CAST('20110101' as int) as ChangeToInteger
,CAST(CAST('20110101' as Datetime) as int) as ChangeToDateThanInteger
-- date difference and add days
SELECT DATEDIFF(d,'2011-01-05','2011-01-15') as DayDifference
,DATEADD(d, -5, GETDATE()) as MinusFiveDays
Extract website name from link
Text manipulation functions are very common in SQL and in this example I will show how to extract the
website name from a given link. I will store the page link in a variable so it is easier to read the main part
of the code that performs all the work.
declare @PageLink as nvarchar(1000) = 'http://www.sql-server-business-intelligence.com/sql-server/interview-questions-and-answers/sql-interview-questions-and-answers-pdf-download'
select substring(replace(@PageLink,'http://www.',''), 0,
       charindex('/', replace(@PageLink,'http://www.',''))) as WebsiteName
I have used the replace function to remove http://www. Then I used charindex to find the /, which is the
first character after the website name I am after, and then I used the substring function to extract all
characters up to the forward-slash position.
Top 5 Employees - single table
The first example is very simple. I have written a SQL query to show the TOP 5 employees with the highest
BaseRate, and to do that I use ORDER BY BaseRate DESC.
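The original post showed the query as a screenshot; a sketch of what it contained, assuming the
AdventureWorksDW2008R2 DimEmployee table:

-- AdventureWorksDW2008R2 sample database
SELECT TOP 5 FirstName, LastName, BaseRate
FROM dbo.DimEmployee
ORDER BY BaseRate DESC;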

---

Top 5 Customer - two tables


In the next query I want to show the TOP 5 customers with the highest total sales amount. To do that I join
the Customer table with the InternetSales table on CustomerKey, group by FirstName and LastName (for
large tables it would be worth adding DateOfBirth to avoid grouping different people together) and order
by TotalSales, which is the SUM of the SalesAmount field.
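Again the query appeared as a screenshot in the original post; a sketch of it, assuming
AdventureWorksDW2008R2's DimCustomer and FactInternetSales tables:

-- AdventureWorksDW2008R2 sample database
SELECT TOP 5 c.FirstName, c.LastName, SUM(s.SalesAmount) AS TotalSales
FROM dbo.DimCustomer c
JOIN dbo.FactInternetSales s ON s.CustomerKey = c.CustomerKey
GROUP BY c.FirstName, c.LastName
ORDER BY TotalSales DESC;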

---

Departments with high level of female sick leave hours


In this SQL query I want to show the total number of female employees per department where sick leave
hours are more than 40, but only for departments where there are at least three females who meet the
criteria. So first I filter on Gender and SickLeaveHours, then I group by department, and then I filter the
groups and keep only the departments where the total number of females meeting the criteria is at least
three.
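A sketch of the described query against AdventureWorksDW2008R2's DimEmployee table (the exact
thresholds follow the description above):

-- AdventureWorksDW2008R2 sample database
SELECT DepartmentName, COUNT(*) AS FemaleHighSickLeaveCount
FROM dbo.DimEmployee
WHERE Gender = 'F'
  AND SickLeaveHours > 40
GROUP BY DepartmentName
HAVING COUNT(*) >= 3;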

---

ADVANCED SQL QUERIES

Show fields in every table that contain a certain word


This SQL query is one of my favourite ones. I use INFORMATION_SCHEMA.COLUMNS to find fields in every
table (or view) of the selected database that contain the word 'geo' in the field name.
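The screenshot is not reproduced here; a sketch of that metadata query, following the description above:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%geo%'
ORDER BY TABLE_NAME, COLUMN_NAME;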


Technical Sample Questions

Oracle Sample Questions : Oracle Interview Questions

1. What are the various types of queries ?

Answer: The types of queries are:

. Normal Queries
. Sub Queries
. Co-related queries
. Nested queries
. Compound queries

2. What is a transaction ?

Answer: A transaction is a set of SQL statements between any two COMMIT and
ROLLBACK statements.

3. What is implicit cursor and how is it used by Oracle ?

Answer: An implicit cursor is a cursor which is internally created by Oracle. It is created by Oracle for each
individual SQL statement.

4. Which of the following is not a schema object : Indexes, tables, public


synonyms, triggers and packages ?

Answer: Public synonyms

5. What is PL/SQL?

Answer: PL/SQL is Oracle's Procedural Language extension to SQL.The language


includes object oriented programming techniques such as encapsulation, function
overloading, information hiding (all but inheritance), and so, brings state-of-the-
art
programming to the Oracle database server and a variety of Oracle tools.

6. Is there a PL/SQL Engine in SQL*Plus?

Answer: No. Unlike Oracle Forms, SQL*Plus does not have a PL/SQL engine. Thus, all your PL/SQL blocks are
sent directly to the database engine for execution. This makes it much more efficient, as SQL statements
are not stripped off and sent to the database individually.

7. Is there a limit on the size of a PL/SQL block?

Answer: Currently, the maximum parsed/compiled size of a PL/SQL block is 64K and
the maximum code size is 100K.You can run the following select statement to query
the size of an existing package or procedure. SQL> select * from dba_object_size
where name = 'procedure_name'

8. Can one read/write files from PL/SQL?

Answer: Included in Oracle 7.3 is a UTL_FILE package that can read and write
files.The directory you intend writing to has to be in your INIT.ORA file (see
UTL_FILE_DIR=...parameter).Before Oracle 7.3 the only means of writing a file was
to use DBMS_OUTPUT with the SQL*Plus SPOOL command.
DECLARE
fileHandler UTL_FILE.FILE_TYPE;
BEGIN
fileHandler := UTL_FILE.FOPEN('/home/oracle/tmp', 'myoutput','W');
UTL_FILE.PUTF(fileHandler, 'Value of func1 is %sn', func1(1));
UTL_FILE.FCLOSE(fileHandler);
END;

9. How can I protect my PL/SQL source code?

Answer: PL/SQL V2.2, available with Oracle7.2, implements a binary wrapper for
PL/SQL programs to protect the source code.This is done via a standalone utility
that
transforms the PL/SQL source code into portable binary object code (somewhat
larger than the original).This way you can distribute software without having to
worry about exposing your proprietary algorithms and methods. SQL*Plus and SQL*DBA will still
understand and know how to execute such scripts. Just be careful: there is no "decode" command
available. The syntax is: wrap iname=myscript.sql oname=xxxx.yyy

10. Can one use dynamic SQL within PL/SQL? OR Can you use a DDL in a
procedure ? How ?

Answer: From PL/SQL V2.1 one can use the DBMS_SQL package to execute dynamic
SQL statements.
Eg: CREATE OR REPLACE PROCEDURE DYNSQL AS
cur integer;
rc integer;
BEGIN
cur := DBMS_SQL.OPEN_CURSOR;
DBMS_SQL.PARSE(cur,'CREATE TABLE X (Y DATE)', DBMS_SQL.NATIVE);
rc := DBMS_SQL.EXECUTE(cur);
DBMS_SQL.CLOSE_CURSOR(cur);
END;

21. What are the various types of Exceptions ?

Answer: User defined and Predefined Exceptions.

22. Can we define exceptions twice in same block ?

Answer: No.

23. What is the difference between a procedure and a function ?

Answer: Functions return a single variable by value whereas procedures do not


return any variable by value.Rather they return multiple variables by passing
variables by reference through their OUT parameter.

24. Can you have two functions with the same name in a PL/SQL block ?

Answer: Yes.

25. Can you have two stored functions with the same name ?

Answer: Yes.

26. Can you call a stored function in the constraint of a table ?

Answer: No.

27. What are the various types of parameter modes in a procedure ?

Answer: IN, OUT AND INOUT.

28. What is Over Loading and what are its restrictions ?

Answer: Overloading means an object performing different functions depending upon the number of
parameters or the data type of the parameters passed to it.

29. Can functions be overloaded ?


Answer: Yes.

30. Can 2 functions have same name & input parameters but differ only by
return datatype

Answer: No.

31. What are the constructs of a procedure, function or a package ?

Answer: The constructs of a procedure, function or a package are :

. variables and constants


. cursors
. exceptions

32. Why Create or Replace and not Drop and recreate procedures ?

Answer: So that Grants are not dropped.

33. Can you pass parameters in packages ? How ?

Answer: Yes.You can pass parameters to procedures or functions in a package.

34. What are the parts of a database trigger ?

Answer: The parts of a trigger are:

. A triggering event or statement


. A trigger restriction
. A trigger action

35. What are the various types of database triggers ?

Answer: There are 12 types of triggers, they are combination of :

. Insert, Delete and Update Triggers.


. Before and After Triggers.
. Row and Statement Triggers.

36. What is the advantage of a stored procedure over a database trigger ?

Answer: We have control over the firing of a stored procedure but we have no
control over the firing of a trigger.
37. What is the maximum no.of statements that can be specified in a trigger
statement ?

Answer: One.

38. Can views be specified in a trigger statement ?

Answer: No

39. What are the values of :new and :old in Insert/Delete/Update Triggers ?

Answer: INSERT : new = new value, old = NULL


DELETE : new = NULL, old = old value
UPDATE : new = new value, old = old value

40. What are cascading triggers? What is the maximum no of cascading triggers
at a time?

Answer: When a statement in a trigger body causes another trigger to be fired, the
triggers are said to be cascading.Max = 32.

41. What are mutating triggers ?

Answer: A mutating-table situation arises when a trigger issues a SELECT on (or modifies) the very table on
which the trigger is written.

42. What are constraining triggers ?

Answer: A trigger issuing an Insert/Update on a table having a referential integrity constraint on the
triggering table.

43. Describe Oracle database's physical and logical structure ?

Answer:

. Physical : Data files, Redo Log files, Control file.


. Logical : Tables, Views, Tablespaces, etc.

44. Can you increase the size of a tablespace ? How ?

Answer: Yes, by adding datafiles to it.

45. Can you increase the size of datafiles ? How ?

Answer: No (for Oracle 7.0)


Yes (for Oracle 7.3 by using the Resize clause )
46. What is the use of Control files ?

Answer: Contains pointers to locations of various data files, redo log files, etc.

47. What is the use of Data Dictionary ?

Answer: It is used by Oracle to store information about various physical and logical Oracle structures, e.g.
tables, tablespaces, datafiles, etc.

48. What are the advantages of clusters ?

Answer: Access time reduced for joins.

49. What are the disadvantages of clusters ?

Answer: The time for Insert increases.

50. Can Long/Long RAW be clustered ?

Answer: No.

51. Can null keys be entered in cluster index, normal index ?

Answer: Yes.

52. Can Check constraint be used for self referential integrity ? How ?

Answer: Yes.In the CHECK condition for a column of a table, we can reference some
other column of the same table and thus enforce self referential integrity.

53. What are the min.extents allocated to a rollback extent ?

Answer: Two

54. What are the states of a rollback segment ? What is the difference between
partly available and needs recovery ?

Answer: The various states of a rollback segment are :

. ONLINE
. OFFLINE
. PARTLY AVAILABLE
. NEEDS RECOVERY
. INVALID.
55. What is the difference between unique key and primary key ?

Answer: Unique key can be null; Primary key cannot be null.

56. An insert statement followed by a create table statement followed by


rollback ? Will the rows be inserted ?

Answer: No.

57. Can you define multiple savepoints ?

Answer: Yes.

58. Can you Rollback to any savepoint ?

Answer: Yes.

59. What is the maximum no.of columns a table can have ?

Answer: 254.

60. What is the significance of the & and && operators in PL SQL ?

Answer: The & operator means that the PL SQL block requires user input for a
variable.
The && operator means that the value of this variable should be the same as
inputted by the user previously for this same variable

61. Can you pass a parameter to a cursor ?

Answer: Explicit cursors can take parameters, as the example below shows.A cursor
parameter can appear in a query wherever a constant can appear.

CURSOR c1 (median IN NUMBER) IS


SELECT job, ename FROM emp WHERE sal > median;

62. What are the various types of RollBack Segments ?

Answer: The types of Rollback sagments are as follows :

. Public Available to all instances


. Private Available to specific instance

63. Can you use %RowCount as a parameter to a cursor ?


Answer: Yes

64. Is the query below allowed :


Select sal, ename Into x From emp Where ename = 'KING' (Where x is a record of
Number(4) and Char(15))

Answer: Yes

65. Is the assignment given below allowed :


ABC = PQR (Where ABC and PQR are records)

Answer: Yes

66. Is this for loop allowed :


For x in &Start..&End Loop

Answer: Yes

67. How many rows will the following SQL return :


Select * from emp Where rownum < 10;

Answer: 9 rows

68. How many rows will the following SQL return :


Select * from emp Where rownum = 10;

Answer: No rows

69. Which symbol precedes the path to the table in the remote database?

Answer: @

70. Are views automatically updated when base tables are updated ?

Answer: Yes

71. Can a trigger be written for a view?

Answer: No

72. If all the values from a cursor have been fetched and another fetch is
issued, the output will be : error, last record or first record ?

Answer: Last Record


73. A table has the following data : [[5, Null, 10]].What will the average
function return ?

Answer: 7.5

74. Is Sysdate a system variable or a system function?

Answer: System Function

75. Consider a sequence whose currval is 1 and which gets incremented by 1; by using the nextval
reference we get the next number, 2. Suppose at this point we issue a rollback and again issue a nextval.
What will the output be?

Answer: 3

76. Definition of relational DataBase by Dr.Codd (IBM)?

Answer: A Relational Database is a database where all data visible to the user is
organized strictly as tables of data values and where all database operations work
on
these tables.

77. What is Multi Threaded Server (MTS)?

Answer: In a Single Threaded Architecture (or a dedicated server configuration) the database manager
creates a separate process for each database user. But in MTS the database manager can assign multiple
users (multiple user processes) to a single dispatcher (server process), a controlling process that queues
requests for work, thus reducing the database's memory requirements and resources.

78. Which are initial RDBMS, Hierarchical & N/w database ?

Answer:

. RDBMS - R system
. Hierarchical - IMS
. N/W - DBTG

79. Difference between Oracle 6 and Oracle 7

Answer:

ORACLE 7                          ORACLE 6
Cost based optimizer              Rule based optimizer
Shared SQL Area                   SQL area allocated for each user
Multi Threaded Server             Single Threaded Server
Hash Clusters                     Only B-Tree indexing
Rollback size adjustment          No provision
Truncate command                  No provision
Distributed Database              Distributed Query
Table replication & snapshots     No provision
Client/Server Tech                No provision

80. What is Functional Dependency

Answer: Given a relation R, attribute Y of R is functionally dependent on attribute X of R if and only if each
X-value has associated with it precisely one Y-value in R.

81. What is Auditing ?

Answer: The database has the ability to audit all actions that take place within it: a) login attempts,
b) object access, c) database actions.
Result of Greatest(1,NULL) or Least(1,NULL): NULL

82. While designing in client/server, what are the 2 important things to be considered?

Answer: Network overhead (traffic), and the speed and load of the client and server.

83. What are the disadvantages of SQL ?

Answer: Disadvantages of SQL are :

. Cannot drop a field


. Cannot rename a field
. Cannot manage memory
. Procedural Language option not provided
. Index on view or index on index not provided
. View updation problem

84. When to create indexes ?

Answer: To be created when table is queried for less than 2% or 4% to 25% of the
table rows.

85. How can you avoid indexes ?

Answer: To make index access path unavailable


. Use FULL hint to optimizer for full table scan
. Use INDEX or AND-EQUAL hint to optimizer to use one index or set to indexes
instead of another.
. Use an expression in the Where Clause of the SQL.

86. What is the result of the following SQL :


Select 1 from dual UNION Select 'A' from dual;

Answer: Error

87. Can database trigger written on synonym of a table and if it can be then
what would be the effect if original table is accessed.

Answer: Yes, database trigger would fire.

88. Can you alter synonym of view or view ?

Answer: No

89. Can you create index on view

Answer: No.

90. What is the difference between a view and a synonym ?

Answer: A synonym is just a second name for a table, used for multiple links to a database object. A view
can be created over many tables, with virtual columns and with conditions. A synonym, however, can also
be created on a view.

91. What's the length of SQL integer ?

Answer: 32 bit length

92. What is the difference between foreign key and reference key ?

Answer: Foreign key is the key i.e.attribute which refers to another table primary
key. Reference key is the primary key of table referred by another table.

93. Can dual table be deleted, dropped or altered or updated or inserted ?

Answer: Yes

94. If content of dual is updated to some value computation takes place or not ?

Answer: Yes
95. If any other table same as dual is created would it act similar to dual?

Answer: Yes

96. For which relational operators in a where clause is an index not used?

Answer: <>, LIKE '%...', NOT, functions applied to the indexed field, field + constant, field || ''

97. .Assume that there are multiple databases running on one machine.How can
you switch from one to another ?

Answer: Changing the ORACLE_SID

98. What are the advantages of Oracle ?

Answer:

. Portability: Oracle is ported to more platforms than any of its competitors, running on more than 100
hardware platforms and 20 networking protocols.
. Market Presence: Oracle is by far the largest RDBMS vendor and spends more on R & D than most of its
competitors earn in total revenue. This market clout means that you are unlikely to be left in the lurch by
Oracle and there are always lots of third party interfaces available.
. Backup and Recovery: Oracle provides industrial strength support for on-line backup and recovery and
good software fault tolerance to disk failure. You can also do point-in-time recovery.
. Performance: Speed of a 'tuned' Oracle database and application is quite good, even with large
databases. Oracle can manage > 100GB databases.
. Multiple database support: Oracle has a superior ability to manage multiple databases within the same
transaction using a two-phase commit protocol.

99. What is a forward declaration ? What is its use ?

Answer: PL/SQL requires that you declare an identifier before using it.Therefore,
you
must declare a subprogram before calling it.This declaration at the start of a
subprogram is called forward declaration.A forward declaration consists of a
subprogram specification terminated by a semicolon.

100. What are actual and formal parameters ?

Answer: Actual Parameters : Subprograms pass information using parameters.The


variables or expressions referenced in the parameter list of a subprogram call are
actual parameters.For example, the following procedure call lists two actual
parameters named emp_num and amount:
Eg.raise_salary(emp_num, amount);

Formal Parameters : The variables declared in a subprogram specification and


referenced in the subprogram body are formal parameters.For example, the following
procedure declares two formal parameters named emp_id and increase:
Eg.PROCEDURE raise_salary (emp_id INTEGER, increase REAL) IS current_salary
REAL;

101. What are the types of Notation ?

Answer: Positional, named and mixed notation (and there are restrictions on how they may be combined).

102. What all important parameters of the init.ora are supposed to be


increased if you want to increase the SGA size ?

Answer: In our case, db_block_buffers was changed from 60 to 1000 (std values are
60, 550 & 3500) shared_pool_size was changed from 3.5MB to 9MB (std values are
3.5, 5 & 9MB) open_cursors was changed from 200 to 300 (std values are 200 &
300) db_block_size was changed from 2048 (2K) to 4096 (4K) {at the time of
database creation}. The initial SGA was around 4MB when the server RAM was 32MB
and The new SGA was around 13MB when the server RAM was increased to 128MB.

103. .If I have an execute privilege on a procedure in another users


schema, can I execute his procedure even though I do not have privileges
on the tables within the procedure ?

Answer: Yes

104. What are various types of joins ?

Answer: Types of joins are:

. Equijoins
. Non-equijoins
. self join
. outer join

105. What is a package cursor ?

Answer: A package cursor is a cursor which you declare in the package specification

without an SQL statement.The SQL statement for the cursor is attached dynamically
at runtime from calling procedures.

106. If you insert a row in a table, then create another table and then say
Rollback. In this case will the row be inserted?
Answer: Yes. Because CREATE TABLE is a DDL statement, it commits automatically as soon as it is
executed. The DDL commits the transaction even if the create statement fails internally (e.g. a "table
already exists" error) rather than syntactically.


Sample Technical Questions

Unix Sample Questions

Questions on file management in Unix

Following are some unix sample questions.

1. How are devices represented in UNIX?

Answer:

All devices are represented by files called special files that are located in the /dev
directory. Thus, device files and other files are named and accessed in the same
way. A 'regular file' is just an ordinary data file in the disk. A 'block special
file'
represents a device with characteristics similar to a disk (data transfer in terms
of
blocks). A 'character special file' represents a device with characteristics
similar to a
keyboard (data transfer is by stream of bits in sequential order).

2. What is 'inode'?

Answer:

All UNIX files have its description stored in a structure called 'inode'. The inode

contains info about the file-size, its location, time of last access, time of last
modification, permission and so on. Directories are also represented as files and
have an associated inode. In addition to descriptions about the file, the inode
contains pointers to the data blocks of the file. If the file is large, inode has
indirect
pointer to a block of pointers to additional data blocks (this further aggregates
for
larger files). A block is typically 8k.

Inode consists of the following fields:

o File owner identifier


o File type
o File access permissions
o File access times
o Number of links
o File size
o Location of the file data

3. Brief about the directory representation in UNIX

Answer:

A Unix directory is a file containing a correspondence between filenames and


inodes.
A directory is a special file that the kernel maintains. Only kernel modifies
directories, but processes can read directories. The contents of a directory are a
list
of filename and inode number pairs. When new directories are created, kernel makes
two entries named '.' (refers to the directory itself) and '..' (refers to parent
directory). System call for creating directory is mkdir (pathname, mode).

4. What are the Unix system calls for I/O?

Answer:

o open(pathname,flag,mode) - open file


o creat(pathname,mode) - create file
o close(filedes) - close an open file
o read(filedes,buffer,bytes) - read data from an open file
o write(filedes,buffer,bytes) - write data to an open file
o lseek(filedes,offset,from) - position an open file
o dup(filedes) - duplicate an existing file descriptor
o dup2(oldfd,newfd) - duplicate to a desired file descriptor
o fcntl(filedes,cmd,arg) - change properties of an open file
o ioctl(filedes,request,arg) - change the behaviour of an open file

The difference between fcntl anf ioctl is that the former is intended for any open
file,
while the latter is for device-specific operations.

5. How do you change File Access Permissions?

Answer:

Every file has following attributes:

o owner's user ID ( 16 bit integer )


o owner's group ID ( 16 bit integer )
o File access mode word
'r w x -r w x- r w x'
(user permission-group permission-others permission)
r-read, w-write, x-execute

To change the access mode, we use chmod(filename,mode).

Example 1:

To change mode of myfile to 'rw-rw-r--' (ie. read, write permission for user -
read,write permission for group - only read permission for others) we give the args

as:
chmod(myfile,0664) .
Each operation is represented by discrete values
'r' is 4
'w' is 2
'x' is 1
Therefore, for 'rw' the value is 6(4+2).

Example 2:
To change mode of myfile to 'rwxr--r--' we give the args as:
chmod(myfile,0744).

6. What are links and symbolic links in UNIX file system?

Answer:

A link is a second name (not a file) for a file. Links can be used to assign
more than one name to a file, but cannot be used to assign a directory more
than one name or link filenames on different computers.

A symbolic link is a file that only contains the name of another file. An operation on the symbolic link is
directed to the file pointed to by it. Both of the limitations of links are eliminated in symbolic links.

Commands for linking files are:


Link ln filename1 filename2
Symbolic link ln -s filename1 filename2

7. What is a FIFO?

Answer:
FIFO are otherwise called as 'named pipes'. FIFO (first-in-first-out) is a special
file which is said to be data transient. Once data is read from named pipe, it
cannot be read again. Also, data can be read only in the order written. It is
used in interprocess communication where a process writes to one end of the
pipe (producer) and the other reads from the other end (consumer).

8. How do you create special files like named pipes and device files?

Answer:

The system call mknod creates special files in the following sequence.

1. kernel assigns new inode,


2. sets the file type to indicate that the file is a pipe, directory or special
file,
3. If it is a device file, it makes the other entries like major, minor device
numbers.

For example:
If the device is a disk, major device number refers to the disk controller and
minor device number is the disk.

9. Discuss the mount and unmount system calls

Answer:

The privileged mount system call is used to attach a file system to a directory
of another file system; the unmount system call detaches a file system. When
you mount another file system on to your directory, you are essentially
splicing one directory tree onto a branch in another directory tree. The first
argument to mount call is the mount point, that is , a directory in the current
file naming system. The second argument is the file system to mount to that
point. When you insert a cdrom to your unix system's drive, the file system in
the cdrom automatically mounts to /dev/cdrom in your system.

10. How does the inode map to data block of a file?

Answer:

Inode has 13 block addresses. The first 10 are direct block addresses of the
first 10 data blocks in the file. The 11th address points to a one-level index
block. The 12th address points to a two-level (double in-direction) index
block. The 13th address points to a three-level(triple in-direction)index block.
This provides a very large maximum file size with efficient access to large
files, but also small files are accessed directly in one disk read.
11. What is a shell?

Answer:

A shell is an interactive user interface to an operating system services that


allows an user to enter commands as character strings or through a graphical
user interface. The shell converts them to system calls to the OS or forks off a
process to execute the command. System call results and other information
from the OS are presented to the user through an interactive interface.
Commonly used shells are sh, csh, ksh etc.

12. Brief about the initial process sequence while the system boots up.

Answer:

While booting, special process called the 'swapper' or 'scheduler' is created


with Process-ID 0. The swapper manages memory allocation for processes
and influences CPU allocation.

The swapper in turn creates 3 children:

the process dispatcher,


vhand and
dbflush
with IDs 1,2 and 3 respectively.

This is done by executing the file /etc/init. Process dispatcher gives birth to
the shell. Unix keeps track of all the processes in an internal data structure
called the Process Table (listing command is ps -el).

13. What are various IDs associated with a process?

Answer:

Unix identifies each process with a unique integer called ProcessID. The
process that executes the request for creation of a process is called the
'parent process' whose PID is 'Parent Process ID'. Every process is associated
with a particular user called the 'owner' who has privileges over the process.
The identification for the user is 'UserID'. Owner is the user who executes the
process. Process also has 'Effective User ID' which determines the access
privileges for accessing resources like files.
getpid()  - process id
getppid() - parent process id
getuid()  - user id
geteuid() - effective user id
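
A minimal sketch (POSIX) that prints these identifiers:

#include <stdio.h>
#include <unistd.h>

int main()
{
    printf("pid  = %d\n", (int) getpid());   /* process id           */
    printf("ppid = %d\n", (int) getppid());  /* parent process id    */
    printf("uid  = %d\n", (int) getuid());   /* real user id (owner) */
    printf("euid = %d\n", (int) geteuid());  /* effective user id    */
    return 0;
}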

14. Explain fork() system call.

Answer:

The fork() system call is used to create a new process from an existing process. The new process is called the child process, and the existing process is called the parent. We can tell which is which by checking the return value of fork(): the parent gets the child's PID as the return value, while the child gets 0.
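
A small sketch showing how the return value of fork() distinguishes the two processes (error handling kept minimal):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
    pid_t pid = fork();

    if (pid == 0)
        printf("child:  my pid is %d\n", (int) getpid());
    else if (pid > 0)
        printf("parent: created child with pid %d\n", (int) pid);
    else
        perror("fork");          /* fork() returns -1 on failure */
    return 0;
}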

15. Predict the output of the following program code

#include <stdio.h>
#include <unistd.h>
int main()
{
    fork();
    printf("Hello World!");
}

Answer:

Hello World!Hello World!

Explanation:

The fork creates a child that is a duplicate of the parent process. The child begins execution from the fork(). All the statements after the call to fork() are executed twice (once by the parent process and once by the child). Statements before the fork() are executed only by the parent process.

16. Predict the output of the following program code

#include <stdio.h>
#include <unistd.h>
int main()
{
    fork(); fork(); fork();
    printf("Hello World!");
}

Answer:

"Hello World" will be printed 8 times.

Explanation:

It is printed 2^n times, where n is the number of calls to fork() (here n = 3, so 2^3 = 8).


17. List the system calls used for process management:

Answer:

System call  Description
fork()       To create a new process
exec()       To execute a new program in a process
wait()       To wait until a created process completes its execution
exit()       To exit from a process execution
getpid()     To get the process identifier of the current process
getppid()    To get the parent process identifier
nice()       To bias the existing priority of a process
brk()        To increase/decrease the data segment size of a process

18. How can you get/set an environment variable from a program?

Answer:

Getting the value of an environment variable is done by using getenv().
Setting the value of an environment variable is done by using putenv().
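
A minimal sketch; MYVAR and its value are illustrative names:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    static char setting[] = "MYVAR=hello";   /* putenv() keeps a pointer to this string */
    char *home = getenv("HOME");             /* read an existing variable               */

    printf("HOME  = %s\n", home ? home : "(not set)");
    putenv(setting);                         /* add/overwrite MYVAR                     */
    printf("MYVAR = %s\n", getenv("MYVAR"));
    return 0;
}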

19. How can a parent and child process communicate?

Answer:

A parent and child can communicate through any of the normal inter-process
communication schemes (pipes, sockets, message queues, shared memory),
but also have some special ways to communicate that take advantage of their
relationship as a parent and child. One of the most obvious is that the parent
can get the exit status of the child.
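
As a sketch of the simplest such channel, the parent collecting the child's exit status with wait() (the status value 7 is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int status;

    if (fork() == 0)
        exit(7);                            /* child terminates with status 7      */

    wait(&status);                          /* parent blocks until the child ends  */
    if (WIFEXITED(status))
        printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}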

20. What is a zombie?

Answer:

When a program forks and the child finishes before the parent, the kernel still keeps some of its information about the child in case the parent might need it - for example, the parent may need to check the child's exit status. To be able to get this information, the parent calls wait(). In the interval between the child terminating and the parent calling wait(), the child is said to be a 'zombie' (if you do 'ps', the child will have a 'Z' in its status field to indicate this).
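
A small sketch that deliberately creates a zombie for a short while - run 'ps -el' from another terminal during the sleep to see the 'Z' state; the 30-second sleep is arbitrary:

#include <stdlib.h>
#include <unistd.h>

int main()
{
    if (fork() == 0)
        exit(0);        /* child terminates immediately                            */

    sleep(30);          /* parent has not called wait() yet, so the child          */
                        /* remains a zombie during this interval                   */
    return 0;           /* when the parent exits, init adopts and reaps the child  */
}
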
21. What are the process states in Unix?

Answer:

As a process executes it changes state according to its circumstances. Unix processes have the following states:

Running : The process is either running or it is ready to run.
Waiting : The process is waiting for an event or for a resource.
Stopped : The process has been stopped, usually by receiving a signal.
Zombie  : The process is dead but has not yet been removed from the process table.

22. What Happens when you execute a program?

Answer:

When you execute a program on your UNIX system, the system creates a special environment for that program. This environment contains everything needed for the system to run the program as if no other program were running on the system. Each process has a process context, which is everything that is unique about the state of the program you are currently running. Every time you execute a program the UNIX system does a fork, which performs a series of operations to create a process context and then execute your program in that context. The steps include the following:

Allocate a slot in the process table, a list of currently running programs kept by UNIX.
Assign a unique process identifier (PID) to the process.
Copy the context of the parent, the process that requested the spawning of the new process.
Return the new PID to the parent process. This enables the parent process to examine or control the new process directly.

After the fork is complete, UNIX runs your program.

23. What Happens when you execute a command?

Answer:

When you enter the 'ls' command to look at the contents of your current working directory, UNIX does a series of things to create an environment for ls and then run it:

The shell has UNIX perform a fork. This creates a new process that the shell will use to run the ls program.
The shell has UNIX perform an exec of the ls program. This replaces the shell program and data with the program and data for ls and then starts running that new program.
The ls program is loaded into the new process context, replacing the text and data of the shell.
The ls program performs its task, listing the contents of the current directory.
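
A stripped-down sketch of that fork-and-exec sequence (argument parsing and error handling are omitted; 'ls -l' stands in for the user's command):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    char *argv[] = { "ls", "-l", NULL };

    if (fork() == 0) {               /* new process for the command        */
        execvp(argv[0], argv);       /* replace the child's image with ls  */
        perror("execvp");            /* reached only if the exec fails     */
        return 1;
    }
    wait(NULL);                      /* the "shell" waits for the command  */
    return 0;
}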

24. What is a Daemon?


Answer:

A daemon is a process that detaches itself from the terminal and runs, disconnected, in the background, waiting for requests and responding to them. It can also be defined as a background process that does not belong to a terminal session. Many system functions are commonly performed by daemons, including the sendmail daemon, which handles mail, and the NNTP daemon, which handles USENET news. Many other daemons may exist. Some of the most common daemons are:

init: Takes over the basic running of the system when the kernel has finished the boot process.
inetd: Responsible for starting network services that do not have their own stand-alone daemons. For example, inetd usually takes care of incoming rlogin, telnet, and ftp connections.
cron: Responsible for running repetitive tasks on a regular schedule.
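
A minimal sketch of how a process may detach itself in this way (a real daemon would also redirect its standard file descriptors, handle signals, and so on):

#include <stdlib.h>
#include <unistd.h>

int main()
{
    if (fork() != 0)
        exit(0);        /* parent exits; the child keeps running in the background      */

    setsid();           /* start a new session, detaching from the controlling terminal */
    chdir("/");         /* avoid keeping any directory in use                           */

    for (;;)            /* ... wait for and service requests here ...                   */
        sleep(60);
}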

25. What is 'ps' command for?

Answer:

The ps command prints the process status for some or all of the running processes. The information given includes the process identification number (PID), the amount of time the process has taken to execute so far, etc.

26. How would you kill a process?

Answer:

The kill command takes the PID as one argument; this identifies which process to terminate. The PID of a process can be obtained using the 'ps' command.

27. What is an advantage of executing a process in background?

Answer:

The most common reason to put a process in the background is to allow you
to do something else interactively without waiting for the process to
complete. At the end of the command you add the special background
symbol, &. This symbol tells your shell to execute the given command in the
background.

Example:
cp *.* ../backup& (cp is for copy)

28. How do you execute one program from within another?


Answer:

The exec family of system calls (e.g. execlp() and execvp()) is used to run one program from within another. The execlp call overlays the existing program with the new one, runs it and exits; the original program gets back control only when an error occurs.
execlp(file_name, arg0, arg1, ..., NULL); //the argument list must end with NULL
A variant of execlp called execvp is used when the number of arguments is not known in advance.
execvp(file_name, argument_array); //the argument array must be terminated by NULL

29. What is IPC? What are the various schemes available?

Answer:

The term IPC (Inter-Process Communication) describes various ways by which different processes running on an operating system communicate with each other. The various schemes available are as follows:

Pipes:
A one-way communication scheme through which related processes can communicate. The limitation is that the two processes must have a common ancestor (a parent-child relationship). This limitation was removed with the introduction of named pipes (FIFOs).

Message Queues :
Message queues can be used between related and unrelated processes
running on a machine.

Shared Memory:
This is the fastest of all IPC schemes. The memory to be shared is mapped
into the address space of the processes (that are sharing). The speed
achieved is attributed to the fact that there is no kernel involvement. But this
scheme needs synchronization.

Various forms of synchronization are mutexes, condition variables, read-write locks, record locks, and semaphores.
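
As an illustration of the shared-memory scheme, a minimal System V shared memory sketch; synchronization is reduced to a crude wait() on the child, and the size and permission values are illustrative:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int   shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);  /* create a 4 KB segment    */
    char *mem   = shmat(shmid, NULL, 0);                        /* map it into this process */

    if (fork() == 0) {                      /* the child inherits the attachment          */
        strcpy(mem, "hello from child");    /* child writes into the shared segment       */
        return 0;
    }
    wait(NULL);                             /* crude synchronization: wait for the child  */
    printf("parent read: %s\n", mem);

    shmdt(mem);                             /* detach and remove the segment              */
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}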

30. What is the difference between Swapping and Paging?

Answer:

Swapping:
The whole process is moved from the swap device to the main memory for execution, so the process size must be less than or equal to the available main memory. Swapping is easier to implement but adds overhead to the system, and it does not handle memory as flexibly as paging does.

Paging:
Only the required memory pages are moved to main memory from the swap device for execution, so the process size does not matter. Paging gives the concept of virtual memory: it provides greater flexibility in mapping the virtual address space into the physical memory of the machine, allows more processes to fit in the main memory simultaneously, and allows a process to be larger than the available physical memory. Demand paging systems handle memory more flexibly.

31. What is the major difference between Historic Unix and Unix System V in terms of Memory Management?

Answer:

Historic Unix uses swapping - the entire process is transferred to the main memory from the swap device - whereas Unix System V uses demand paging - only the needed parts of the process are moved to the main memory. Historic Unix uses one swap device, while Unix System V allows multiple swap devices.

32. What is the main goal of the Memory Management?

Answer:

It decides which processes should reside in the main memory, manages the parts of the virtual address space of a process that are not core resident, and monitors the available main memory, periodically writing processes out to the swap device so that more processes can fit in the main memory simultaneously.

33. What is a Map?

Answer:

A Map is an array whose entries record the addresses of the free space in the swap device (the allocatable resource) and the number of resource units available at each address. This allows first-fit allocation of contiguous blocks of a resource. Initially the Map contains a single entry: the address (block offset from the start of the swap area) and the total number of resources. The kernel treats each unit of the Map as a group of disk blocks, and on allocation and freeing of resources it updates the Map so that it stays accurate.
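
A simplified user-space sketch of first-fit allocation from such a map; the structure fields, the malloc_map() name and the initial values are illustrative, not the kernel's actual code:

#include <stdio.h>

struct map { int addr; int units; };        /* one free-space entry: start + size       */

static struct map swapmap[] = { {1, 10000}, {0, 0} };   /* initially a single entry     */

/* First-fit: return the start address of the first hole that is large enough. */
int malloc_map(struct map *mp, int units)
{
    for (; mp->units > 0; mp++) {
        if (mp->units >= units) {
            int addr = mp->addr;
            mp->addr  += units;             /* shrink the hole from the front           */
            mp->units -= units;
            return addr;
        }
    }
    return -1;                              /* no hole big enough                       */
}

int main()
{
    printf("%d\n", malloc_map(swapmap, 100));   /* prints 1   */
    printf("%d\n", malloc_map(swapmap, 50));    /* prints 101 */
    return 0;
}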

34. What scheme does the Kernel in Unix System V follow while choosing a swap
device
among the multiple swap devices?

Answer:

The Kernel follows a round-robin scheme when choosing a swap device among the multiple swap devices in Unix System V.

35. What is a Region?

Answer:

A Region is a continuous area of a process's address space (such as text, data and stack). The kernel maintains regions in a region table, with per-process entries that are local to the process. Regions are sharable among processes.

36. What are the events done by the Kernel after a process is being swapped out
from
the main memory?

Answer:

When the Kernel swaps a process out of the primary memory, it performs the following:

o Kernel decrements the reference count of each region of the process; if the reference count becomes zero, it swaps the region out of the main memory,
o Kernel allocates space for the swapped process in the swap device,
o Kernel locks the other swapping processes while the current swapping operation is going on,
o Kernel saves the swap address of the region in the region table.

37. Is the process the same before and after the swap? Give reasons.

Answer:

Before swapping, the process resides in the primary memory in its original form. The regions (text, data and stack) may not be fully occupied by the process; there may be a few empty slots in any of the regions, and while swapping the process out the Kernel does not bother about these empty slots.

After swapping, the process resides on the swap (secondary memory) device. The swapped-out regions contain only the occupied slots, not the empty slots that were present before.

While swapping the process back into the main memory, the Kernel refers to the Process Memory Map and assigns main memory accordingly, taking care of the empty slots in the regions. So the process image is logically the same, but its layout in memory before and after the swap is not identical.

38. What do you mean by u-area (user area) or u-block?

Answer:

This contains the private data that is manipulated only by the Kernel. This is
local to the Process, i.e. each process is allocated a u-area.

39. What are the entities that are swapped out of the main memory while swapping
the
process out of the main memory?

Answer:

All memory space occupied by the process, process's u-area, and Kernel stack
are swapped out, theoretically.

Practically, if the process's u-area contains the Address Translation Tables for
the process then Kernel implementations do not swap the u-area.

40. What is Fork swap?

Answer:

fork() is a system call to create a child process. When the parent process calls fork(), the child process is created and, if there is a shortage of memory, the child process is placed in the ready-to-run state on the swap device while the parent returns to user mode without being swapped. When memory becomes available the child process is swapped into the main memory.

41. What is Expansion swap?

Answer:

When a process requires more memory than is currently allocated to it, the Kernel performs an Expansion swap. To do this the Kernel reserves enough space in the swap device. Then the address translation mapping is adjusted for the new virtual address space, but physical memory is not allocated. Finally, the Kernel swaps the process into the assigned space on the swap device. Later, when the Kernel swaps the process back into the main memory, it allocates memory according to the new address translation mapping.

42. How the Swapper works?

Answer:

The swapper is the only process that swaps other processes. The swapper operates only in Kernel mode and it does not use system calls; instead it uses internal Kernel functions for swapping. It is the archetype of all kernel processes.

43. What are the processes that are not bothered by the swapper? Give Reason.

Answer:

Zombie processes: they do not take up any physical memory.
Processes locked in memory: for example, processes that are in the middle of updating a region, which stay locked until the operation completes.
Also, the Kernel prefers to swap out sleeping processes rather than 'ready-to-run' processes, as ready-to-run processes have a higher probability of being scheduled than sleeping processes.

44. What are the requirements for a swapper to work?

Answer:

The swapper works at the highest scheduling priority. First it looks for any sleeping process to swap out; if none is found it looks for a ready-to-run process. The major requirements are that a ready-to-run process must have been core-resident for at least 2 seconds before being swapped out, and a process to be swapped in must have been resident in the swap device for at least 2 seconds. If these requirements are not satisfied, the swapper goes into a wait state on that event; it is awakened once a second by the Kernel.

45. What are the criteria for choosing a process for swapping into memory from the
swap device?

Answer:

The resident time of the processes in the swap device, the priority of the
processes and the amount of time the processes had been swapped out.
46. What are the criteria for choosing a process for swapping out of the memory to
the
swap device?

Answer:

The process's memory resident time,
the priority of the process, and
the nice value.

47. What do you mean by nice value?

Answer:

The nice value controls (increments or decrements) the priority of a process. This value is set (and returned) by the nice() system call, and the process priority is calculated as:

Priority = ("recent CPU usage" / constant) + (base priority) + (nice value)

Only the superuser can supply a negative nice value (i.e. raise a process's priority); an ordinary user can only lower the priority of his own processes. The nice() system call works for the running process only; the nice value of one process cannot affect the nice value of another process.
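
For illustration, a process can lower its own priority with a positive increment (the value 5 is arbitrary; only the superuser may pass a negative increment):

#include <stdio.h>
#include <unistd.h>

int main()
{
    int newnice = nice(5);      /* add 5 to this process's nice value */
    printf("new nice value: %d\n", newnice);
    return 0;
}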

48. What are conditions on which deadlock can occur while swapping the processes?

Answer:

All processes in the main memory are asleep,
all 'ready-to-run' processes are swapped out,
there is no space in the swap device for a new process that needs to be swapped out of the main memory, and
there is no space in the main memory for an incoming process.

49. What are conditions for a machine to support Demand Paging?

Answer:

The memory architecture must be based on pages, and
the machine must support 'restartable' instructions.

50. What is 'the principle of locality'?

Answer:
It is the nature of processes to refer only to a small subset of their total data space at a time, i.e. the process frequently calls the same subroutines or executes loop instructions.

51. What is the working set of a process?

Answer:

The set of pages that were referenced by the process in its last 'n' memory references, where 'n' is called the window of the working set of the process.

52. What is the window of the working set of a process?

Answer:

The window of the working set of a process is the number 'n' of most recent memory references that are examined to form the working set of the process.

53. What is called a page fault?

Answer:

A page fault refers to the situation where a process addresses a page in its working set but fails to find that page in main memory. On a page fault the kernel updates the working set by reading the page in from the secondary device.

54. What are data structures that are used for Demand Paging?

Answer:

The Kernel contains 4 data structures for demand paging. They are:

o Page table entries,
o Disk block descriptors,
o Page frame data table (pfdata),
o Swap-use table.

55. What are the bits that support the demand paging?

Answer:

Valid, Reference, Modify, Copy on write, and Age. These bits are part of the page table entry, which also includes the physical address of the page and the protection bits:

Page address | Age | Copy on write | Modify | Reference | Valid | Protection
56. How the Kernel handles the fork() system call in traditional Unix and in the
System V
Unix, while swapping?

Answer:

In traditional (swapping) Unix, the Kernel makes a duplicate copy of the parent's address space and attaches it to the child process. In System V Unix, the Kernel instead manipulates the region tables, page table, and pfdata table entries by incrementing the reference count of the region table entries of shared regions.

57. Difference between the fork() and vfork() system call?

Answer:

During the fork() system call the Kernel makes a copy of the parent process's address space and attaches it to the child process. The vfork() system call does not make any copy of the parent's address space, so it is faster than fork(). The child process created by vfork() is expected to call exec() almost immediately; until then it executes in the parent's address space (and can therefore overwrite the parent's data and stack), and the parent process is suspended until the child calls exec() or exits.

58. What is BSS(Block Started by Symbol)?

Answer:

BSS is the section of a program's address space that holds uninitialized data. The executable file records only how much space this data needs; the kernel allocates that space when the program starts and initializes it to zero at run time.
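
A small illustration of which variables end up in BSS as opposed to the initialized data section:

#include <stdio.h>

int counter;          /* uninitialized global: placed in BSS, zero-filled at run time      */
int limit = 100;      /* initialized global: placed in the data section of the executable  */

int main()
{
    printf("%d %d\n", counter, limit);    /* prints "0 100" */
    return 0;
}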

59. What is Page-Stealer process?


Answer:

This is the Kernel process that makes room for incoming pages by swapping out memory pages that are not part of the working set of a process. The Page-Stealer is created by the Kernel at system initialization and is invoked throughout the lifetime of the system. The Kernel locks a region when a process faults on a page in that region, so that the page stealer cannot steal the page that is being faulted in.

60. Name two paging states for a page in memory?

Answer:

The two paging states are:

The page is aging and is not yet eligible for swapping,
The page is eligible for swapping but not yet eligible for reassignment to another virtual address space.

61. What are the phases of swapping a page from the memory?

Answer:

Page stealer finds the page eligible for swapping and places the page number
in the list of pages to be swapped.

Kernel copies the page to a swap device when necessary and clears the valid
bit in the page table entry, decrements the pfdata reference count, and places
the pfdata table entry at the end of the free list if its reference count is 0.

62. What is page fault? Its types?

Answer:

Page fault refers to the situation of not having a page in the main memory
when any process references it.

There are two types of page fault:

Validity fault,
Protection fault.

63. In what way the Fault Handlers and the Interrupt handlers are different?

Answer:

A fault handler is also an interrupt handler, with the difference that interrupt handlers cannot sleep whereas fault handlers sleep in the context of the process that caused the memory fault. The fault refers to the running process, and no arbitrary process is put to sleep.
64. What is validity fault?

Answer:

If a process references a page in the main memory whose valid bit is not set, it results in a validity fault.

The valid bit is not set for pages that are outside the virtual address space of the process, and for pages that are part of the virtual address space of the process but have no physical address assigned to them yet.

65. What does the swapping system do if it identifies the illegal page for
swapping?

Answer:

If the disk block descriptor does not contain any record of the faulted page, then the attempted memory reference is invalid and the kernel sends a "segmentation violation" signal to the offending process. This happens when the swapping system identifies an invalid memory reference.

66. What are states that the page can be in, after causing a page fault?

Answer:

On a swap device and not in memory,
On the free page list in the main memory,
In an executable file,
Marked "demand zero",
Marked "demand fill".

67. In what way the validity fault handler concludes?

Answer:

It sets the valid bit of the page and clears the modify bit.
It recalculates the process priority.

68. At what mode the fault handler executes?

Answer:

At the Kernel Mode.

69. What do you mean by the protection fault?


Answer:

A protection fault occurs when a process accesses a page for which it does not have access permission. A process also incurs a protection fault when it attempts to write to a page whose copy on write bit was set during the fork() system call.

70. How the Kernel handles the copy on write bit of a page, when the bit is set?

Answer:

Where the copy on write bit of a page is set and the page is shared by more than one process, the Kernel allocates a new page and copies the contents to it; the other processes retain their references to the old page. After copying, the Kernel updates the page table entry with the new page number and then decrements the reference count of the old pfdata table entry.

Where the copy on write bit is set and no other process is sharing the page, the Kernel allows the physical page to be reused by the process. It clears the copy on write bit and disassociates the page from its disk copy (if one exists), because another process may share the disk copy. It then removes the pfdata table entry from the page queue, since the new copy of the virtual page is not on the swap device, decrements the swap-use count for the page and, if the count drops to 0, frees the swap space.

71. For which kind of fault the page is checked first?

Answer:

The page is first checked for a validity fault. As soon as the page is found to be invalid (valid bit clear), the validity fault handler is invoked and the process incurs a validity page fault. After the Kernel handles the validity fault, the process will then incur a protection fault if one is present.

72. In what way the protection fault handler concludes?

Answer:

After the fault handler finishes execution, it sets the modify and protection bits and clears the copy on write bit. It recalculates the process priority and checks for signals.
73. How the Kernel handles both the page stealer and the fault handler?

Answer:

The page stealer and the fault handler thrash when memory is short. If the sum of the working sets of all processes is greater than the physical memory, the fault handler will usually sleep because it cannot allocate pages for a process. This reduces system throughput because the Kernel spends too much time on overhead, rearranging memory at a frantic pace.
