DW BI Questions
What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
Explanatory Note:
In a department store, when we pay at the check-out counter, the sales person at the counter keys in all the data into a "Point-Of-Sale" machine. That data is transaction data and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase orders for them. Such a report will come out of the OLAP system.
What is data mart?
Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.
What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval.
What is a dimension?
If I just say "20 kg", it does not mean anything. But "20 kg of rice (product) sold to Ramesh (customer) on 5th April (date)" makes meaningful sense. Product, customer and date here are dimensions that qualify the measure. Dimensions are mutually independent.
What is a fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but
not always) numerical values that
can be aggregated.
Additive measures can be used with any aggregation function, like SUM(), AVG() etc. Sales quantity is an example.
Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but MAX() or MIN() of the balance might be useful. Similarly, consider a price rate or currency rate: SUM is meaningless on a rate, but an average might be useful.
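As a small illustration in SQL (the table and column names here are assumed for the example), an additive measure like sales quantity can be summed across any dimension, while a semi-additive measure like account balance is better served by MAX, MIN or AVG:

-- Additive measure: SUM is meaningful across any dimension
SELECT product_id, SUM(sales_qty) AS total_qty
FROM   fact_sales
GROUP  BY product_id;

-- Semi-additive measure: SUM of balances across dates is meaningless,
-- but MAX(), MIN() or AVG() of the balance is useful
SELECT account_id, MAX(balance) AS max_balance, AVG(balance) AS avg_balance
FROM   fact_account_balance
GROUP  BY account_id;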
What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the primary keys from all the dimension tables flow into the fact table (as foreign keys) where the measures are stored. The entity-relationship diagram of such a model looks like a star, hence the name.
Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table.
What is snow-flake schema?
Consider the same fact table that stores sales quantity for each product and customer at a certain time, with keys from the customer, product and time dimension tables flowing into it. Additionally, all the products can be further grouped under different product families stored in a separate table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.
Note
Snow-flaking increases the degree of normalization in the design.
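A minimal SQL sketch of the star and snow-flake structures described above (all table and column names are illustrative, not taken from any specific source):

CREATE TABLE dim_product_family (
  product_family_key INT PRIMARY KEY,
  family_name        VARCHAR(100)
);

-- Snow-flaked dimension: product references its family stored in a separate table
CREATE TABLE dim_product (
  product_key        INT PRIMARY KEY,
  product_name       VARCHAR(100),
  product_family_key INT REFERENCES dim_product_family(product_family_key)
);

CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE dim_date     (date_key     INT PRIMARY KEY, calendar_date DATE);

-- Star: the centralized fact table holds the dimension keys plus the measure
CREATE TABLE fact_sales (
  product_key  INT REFERENCES dim_product(product_key),
  customer_key INT REFERENCES dim_customer(customer_key),
  date_key     INT REFERENCES dim_date(date_key),
  sales_qty    NUMERIC
);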
What are the different types of dimension?
In a data warehouse we commonly come across the following types of dimension:
1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension
Based on how frequently the data inside a dimension changes, dimensions can be classified further (for example, as slowly changing dimensions, discussed below).
A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension: both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.
A degenerated dimension is a dimension that is derived from the fact table and does not have its own dimension table. A dimension key such as a transaction number, receipt number or invoice number does not have any other associated attributes and hence cannot be designed as a dimension table.
A junk dimension is a single table that groups together a number of small, low-cardinality flags and indicators. These junk dimension attributes might not be related to each other. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not fit into the other dimension tables. One may want to read an interesting document, De-clutter with Junk (Dimension).
Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire". This is often referred to as a 'role-playing dimension'.
What is SCD?
SCD stands for slowly changing dimension, i.e. a dimension where the data changes slowly. SCDs can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Types 1, 2 and 3 are the most common.
Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation; it just means that, even if the value of an attribute changes, the change is not applied, and the table continues to hold the previously loaded data.
Type 1:
A Type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the latest data whenever there is a change, and because of this update we lose the previous values.
Type 2:
A Type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Suppose there is a customer C1 under group G1 first, and later on the customer is moved to group G2. Then there will be two separate records in the dimension table, like below:
Key | Customer | Group | Start Date   | End Date
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005
2   | C1       | G2    | 1st Jan 2006 | NULL
Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that this record is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2, ... etc.) to denote the different versions of the record.
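A hedged SQL sketch of the Type 2 load logic for the example above (table and column names, the surrogate key value and the dates are illustrative; real implementations usually generate the surrogate key from a sequence or let the ETL tool handle it):

-- Close the currently active record for customer C1
UPDATE dim_customer
SET    end_date = CURRENT_DATE
WHERE  customer = 'C1'
AND    end_date IS NULL;

-- Insert a new row with a new surrogate key for the new group
INSERT INTO dim_customer (dim_key, customer, cust_group, start_date, end_date)
VALUES (2, 'C1', 'G2', CURRENT_DATE, NULL);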
Type 3:
A Type 3 dimension keeps a limited history by storing the previous value of an attribute in a separate column instead of creating a new row. For the same example:
Key | Customer | Previous Group | Current Group
1   | C1       | G1             | G2
This is only good when you do not need to store many consecutive changes and when the date of the change is not required to be stored.
Type 6:
A Type 6 dimension is a hybrid of Types 1, 2 and 3 (1+2+3). It acts very much like Type 2, except that one extra column is added to denote which record is the current record.
Key | Customer | Group | Start Date   | End Date      | Current Flag
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005 | N
2   | C1       | G2    | 1st Jan 2006 | NULL          | Y
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys from different dimension tables. It is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less-fact table joining the teacher and student keys. Such a fact table will then be able to answer queries like "which students are taught by a given teacher" or "which teachers teach a given student".
A fact-less-fact table can only answer 'optimistic' (positive) queries, but cannot answer a negative query. Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer a query like "which students are not being taught by any teacher".
Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a tutor); if there is a student who is not being taught by any teacher, that student's key does not appear in this table, thereby reducing the coverage of the table.
What is a coverage fact?
A coverage fact table attempts to answer this, often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 x 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
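A hedged SQL sketch of the difference (table and column names are assumed): the fact-less fact can answer only the positive question, while the coverage fact, with its flag, can also answer the negative one:

-- Positive query on the fact-less fact: which students does teacher T1 teach?
SELECT student_key
FROM   fact_teaching
WHERE  teacher_key = 'T1';

-- Negative query on the coverage fact: which students is teacher T1 NOT teaching?
SELECT student_key
FROM   fact_teaching_coverage
WHERE  teacher_key = 'T1'
AND    teaching_flag = 0;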
What are incident and snapshot facts?
A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and the measurements vary with respect to time. Now it might so happen that the business is not able to capture all of its measures for every point in time. Those unavailable measurements can either be kept empty (NULL) or be filled up with the last available measurement. The first case is an example of an incident fact and the second one is an example of a snapshot fact.
What is granularity?
A data warehouse usually captures data with the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements on that data warehouse need the same degree of detail.
Consider a retail chain: each shop manager can access the data warehouse and see which products were sold, by whom and in what quantity on any given date. Thus the data warehouse helps the shop managers with detail-level data that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care which sales girl in London sold the highest number of chopsticks or which shop is the best seller of brown bread. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in eastern Europe. Such data is aggregated in nature, because sales of goods in eastern Europe is derived by summing up the individual sales data from each shop in eastern Europe.
What is slicing-dicing?
Slicing means showing a slice of the data, given a certain set of dimensions (e.g. Product), values (e.g. Brown Bread) and measures (e.g. Sales).
Dicing means viewing that slice with respect to different dimensions and at different levels of aggregation.
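In SQL terms (illustrative table and column names), slicing fixes a dimension value while dicing regroups the same slice by other dimensions:

-- Slice: total sales for the product 'Brown Bread'
SELECT SUM(f.sales)
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
WHERE  p.product_name = 'Brown Bread';

-- Dice: the same slice viewed by region and month
SELECT s.region, d.month, SUM(f.sales)
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
JOIN   dim_shop    s ON s.shop_key    = f.shop_key
JOIN   dim_date    d ON d.date_key    = f.date_key
WHERE  p.product_name = 'Brown Bread'
GROUP  BY s.region, d.month;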
What is drill-through?
Drill-through is the process of going from summary data to the detail-level data.
Consider the above example on retail shops. If the CEO finds out that sales in eastern Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to more detailed levels and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declining sales. And the method he followed to obtain the details from the aggregated data is called drill-through.
Informatica Questions
Connected Lookup
Unconnected Lookup
Router
Filter
Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is one which is not modified once it is built; it remains the same during the session run. On the other hand, a dynamic cache is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.
How can we update a record in target table without using Update strategy?
A target table can be updated without using an 'Update Strategy'. For this, we need to define the key of the target table at the Informatica level and then connect the key and the field we want to update in the mapping target. At the session level, we should set the target property to "Update as Update" and check the "Update" check-box.
Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address", and suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update the Customer Address field for all matching Customer IDs.
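With these settings, the statement that the Integration Service effectively issues against the target would look roughly like the parameterized SQL below (a sketch only; the actual generated statement depends on which ports are connected):

UPDATE Customer
SET    Customer_Address = ?   -- value of the connected non-key port
WHERE  Customer_ID      = ?;  -- key defined at the Informatica level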
Q1. Suppose we have duplicate records in the source system and we want to load only unique records into the target. How can we handle this?
Ans. If the source is relational, we can simply check the Select Distinct option of the Source Qualifier. Now suppose the source system is a flat file. Here, in the Source Qualifier, you will not be able to select the Distinct clause, as it is disabled for flat file sources. Hence the next approach may be to use a Sorter transformation and check its Distinct option. When we select the Distinct option, all the columns are selected as keys, in ascending order by default.
Sorter Transformation DISTINCT clause
Other ways to handle duplicate records in the source are to use an Aggregator transformation with the Group By checkbox checked on the ports that have duplicate data; here you have the flexibility to select either the last or the first of the duplicate records. Apart from that, using a Dynamic Lookup Cache on the target table, associating the input ports with the lookup ports and checking the Insert Else Update option will also help to eliminate duplicate records from the source and hence load unique records into the target.
Q2. Suppose we have some serial numbers in a flat file source. We want to load the
serial numbers in two target files
one containing the EVEN serial numbers and the other file having the ODD ones.
Ans. After the Source Qualifier place a Router Transformation. Create two Groups
namely EVEN and ODD, with
filter conditions as MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively. Then
output the two groups
into two flat file targets.
Router Transformation Groups Tab
Q3. Suppose in our source table we have the data below, with one column per subject:
Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85
We want to load the target table as below. Describe your approach.
Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Sam          | Life Science     | 70
Sam          | Physical Science | 80
John         | Maths            | 75
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Maths            | 80
Tom          | Life Science     | 100
Tom          | Physical Science | 85
Ans. Here, to convert the columns to rows, we have to use a Normalizer transformation, followed by an Expression transformation to decode the column taken into consideration. For more details on how the mapping is performed please visit Working with Normalizer.
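For comparison, the same columns-to-rows reshaping expressed in plain SQL (the table name student_marks is assumed) is a simple UNION ALL, which is essentially what the Normalizer does inside the mapping:

SELECT student_name, 'Maths' AS subject_name, maths AS marks FROM student_marks
UNION ALL
SELECT student_name, 'Life Science', life_science FROM student_marks
UNION ALL
SELECT student_name, 'Physical Science', physical_science FROM student_marks;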
Q4. Name the transformations which convert one row to many rows, i.e. increase the input:output row count. Also, what is the name of the reverse transformation?
Q5. Suppose we have a source table and we want to load three target tables based on
source rows such that first row
moves to first target table, secord row in second target table, third row in third
target table, fourth row again in first
target table so on and so forth. Describe your approach.
Router Transformation Groups Tab
Ans. We can clearly understand that we need a Router transformation to route or filter the source data to the three target tables. Now the question is what the filter conditions will be. First of all, we need an Expression transformation where we have all the source table columns and, along with them, another i/o port, say seq_num, which gets a sequence number for each source row from the NEXTVAL port of a Sequence Generator (start value 0, increment by 1). Now the filter conditions for the three router groups will be:
MOD(seq_num, 3) = 1 for the first target,
MOD(seq_num, 3) = 2 for the second target, and
MOD(seq_num, 3) = 0 for the third target.
Q6. Suppose we have ten source flat files of same structure. How can we load all
the files in target database in a
single batch run using a single mapping.
Ans. After we create a mapping to load data into the target database from flat files, we move on to the session properties of the Source Qualifier. To load a set of source files we need to create a file, say final.txt, containing the source flat file names (ten files in our case), and set the Source filetype option to Indirect. Next, point to this flat file final.txt, fully qualified, through the Source file directory and Source filename properties.
Session Property Flat File
Q7. How can we implement aggregation without using an Aggregator transformation in Informatica?
Ans. We will use a very basic property of the Expression transformation: at any time we can access the previous row's data as well as the currently processed data in an expression transformation. So simple Sorter, Expression and Filter transformations are all we need to achieve aggregation at the Informatica level.
Q8. Suppose we have the source data below:
Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Tom          | Maths            | 80
Sam          | Physical Science | 80
John         | Maths            | 75
Sam          | Life Science     | 70
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Life Science     | 100
Tom          | Physical Science | 85
Mapping using sorter and Aggregator
We want to load the target table as below. Describe your approach.
Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85
Ans. Here our scenario is to convert many rows into one row, and the transformation which helps us achieve this is the Aggregator. With STUDENT_NAME as the GROUP BY port, the output subject columns are populated using conditional aggregation, for example MAX(MARKS, SUBJECT_NAME = 'Maths') for the Maths column, and similarly for Life Science and Physical Science.
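The equivalent rows-to-columns reshaping in plain SQL (again assuming a student_marks table) uses conditional aggregation, which mirrors the conditional aggregate expressions used in the Aggregator:

SELECT student_name,
       MAX(CASE WHEN subject_name = 'Maths'            THEN marks END) AS maths,
       MAX(CASE WHEN subject_name = 'Life Science'     THEN marks END) AS life_science,
       MAX(CASE WHEN subject_name = 'Physical Science' THEN marks END) AS physical_science
FROM   student_marks
GROUP  BY student_name;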
Q9. What is a Source Qualifier? What are the tasks we can perform using a SQ and
why it is an ACTIVE
transformation?
Ans. The Source Qualifier transformation represents the rows that the Integration Service reads from a relational or flat file source when it runs a session. Using the SQ we can perform the following tasks:
. We can configure the SQ to join data (both INNER as well as OUTER joins) originating from the same source database.
. We can use a source filter to reduce the number of rows the Integration Service queries.
. We can specify a number of sorted ports, and the Integration Service adds an ORDER BY clause to the default SQL query.
. We can choose the Select Distinct option for relational databases, and the Integration Service adds a SELECT DISTINCT clause to the default SQL query.
. We can also write a Custom/User Defined SQL query, which overrides the default query in the SQ, by changing the default settings of the transformation properties.
. We also have the option to write Pre- and Post-SQL statements to be executed before and after the SQ query in the source database.
Since the transformation provides the Select Distinct property, the Integration Service can add a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows returned by the database to the Integration Service; hence it is an Active transformation.
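For example, with Select Distinct enabled and Number Of Sorted Ports set to 2, the default query generated by the Source Qualifier would look something like the sketch below (assuming a hypothetical CUSTOMERS source):

SELECT DISTINCT CUSTOMERS.CUSTOMER_ID,
       CUSTOMERS.CUSTOMER_NAME,
       CUSTOMERS.CUSTOMER_ADDRESS
FROM   CUSTOMERS
ORDER  BY CUSTOMERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME;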
Q10. What happens to a mapping if we alter the datatypes between Source and its
corresponding Source Qualifier?
Ans. The Source Qualifier transformation displays the transformation datatypes, which determine how the source database binds data when the Integration Service reads it. If we alter the datatypes so that they no longer match the datatypes in the source definition, the Designer marks the mapping as invalid when we save it.
Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports
property in the SQ and then we add
Custom SQL Query. Explain what will happen.
Ans. Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join, Source Filter, Number Of Sorted Ports and Select Distinct settings in the Source Qualifier transformation. Hence only the user-defined SQL query will be fired against the database, and all the other options will be ignored.
Q12. Describe the situations where we will use the Source Filter, Select Distinct
and Number Of Sorted Ports
properties of Source Qualifier transformation.
Ans. Source Filter option is used basically to reduce the number of rows the
Integration Service queries so as to
improve performance.
Select Distinct option is used when we want the Integration Service to select
unique values from a source, filtering
out unnecessary data earlier in the data flow, which might improve performance.
The Number Of Sorted Ports option is used when we want the source data to arrive in sorted order, so that it can be used by following transformations like Aggregator or Joiner, which, when configured for sorted input, improve performance.
Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query
and the OUTPUT PORTS
order in SQ transformation do not match?
Ans. A mismatch or change in the order of the list of selected columns, relative to the connected transformation output ports, may result in session failure.
Q14. What happens if in the Source Filter property of SQ transformation we include
keyword WHERE say, WHERE
CUSTOMERS.CUSTOMER_ID > 1000.
Ans. We use source filter to reduce the number of source records. If we include the
string WHERE in the source filter,
the Integration Service fails the session.
Q15. Describe the scenarios where we go for Joiner transformation instead of Source
Qualifier transformation.
Ans. We use the Joiner transformation to join source data from heterogeneous sources, as well as to join flat files. Use the Joiner transformation when we need to join the following types of sources:
. Data from different relational databases
. Data from different flat files
. Relational sources and flat files
Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?
Ans. Sybase supports a maximum of 16 columns in an ORDER BY clause, so when the source is Sybase we should not sort more than 16 ports.
Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to
Target tables TGT1 and
TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?
Ans. In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier transformations in the mapping to specify the required loading order.
Target Load Plan
Target Load Plan Ordering
Q18. Suppose we have a Source Qualifier transformation that populates two target
tables. How do you ensure TGT2
is loaded after TGT1?
Ans. In the Workflow Manager, we can configure Constraint-based load ordering for the session. The Integration Service then orders the target load on a row-by-row basis: for every row generated by an active source, the Integration Service loads the corresponding transformed row first into the primary key table, then into the foreign key table.
Constraint based loading
Hence if we have one Source Qualifier transformation that provides data for
multiple target tables having primary
and foreign key relationships, we will go for Constraint based load ordering.
Only the rows that meet the Filter Condition pass through the Filter transformation
to the next transformation in the
pipeline. TRUE and FALSE are the implicit return values from any filter condition
we set. If the filter condition
evaluates to NULL, the row is assumed to be FALSE.
The numeric equivalent of FALSE is zero (0) and any non-zero value is the
equivalent of TRUE.
Ans.
SQ Source Filter: the Source Qualifier transformation filters rows as they are read from the source, and it can only filter rows from relational sources.
Filter Transformation: the Filter transformation filters rows from within the mapping, after the data has been read, and it can filter rows coming from any type of source system.
Ans. A Joiner is an Active and Connected transformation used to join source data
from the same source system or
from two related heterogeneous sources residing in different locations or file
systems.
The Joiner transformation joins sources with at least one matching column. The
Joiner transformation uses a
condition that matches one or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline or a master
and a detail branch. The master
pipeline ends at the Joiner transformation, while the detail pipeline continues to
the target.
The join condition contains ports from both input sources that must match for the
Integration Service to join two
rows. Depending on the type of join selected, the Integration Service either adds
the row to the result set or discards
the row.
The Joiner transformation produces result sets based on the join type, condition,
and input data sources. Hence it is
an Active transformation.
Q22. State the limitations where we cannot use Joiner in the mapping pipeline.
Ans. The Joiner transformation accepts input from most transformations. However,
following are the limitations:
. The Joiner transformation cannot be used when either of the input pipelines contains an Update Strategy transformation.
. Joiner transformation cannot be used if we connect a Sequence Generator
transformation directly before the
Joiner transformation.
Q23. Out of the two input pipelines of a joiner, which one will you set as the
master pipeline?
Ans. During a session run, the Integration Service compares each row of the master
source against the detail source.
The master and detail sources need to be configured for optimal performance.
To improve performance for an Unsorted Joiner transformation, use the source with
fewer rows as the master
source. The fewer unique rows in the master, the fewer iterations of the join
comparison occur, which speeds the join
process.
To improve performance for a Sorted Joiner transformation, use the source with
fewer duplicate key values as the
master source.
Blocking logic is possible if master and detail input to the Joiner transformation
originate from different sources.
Otherwise, it does not use blocking logic. Instead, it stores more rows in the
cache.
Q24. What are the different types of Joins available in Joiner Transformation?
Ans. In SQL, a join is a relational operator that combines data from multiple
tables into a single result set. The Joiner
transformation is similar to an SQL join except that data can originate from
different types of sources.
. Normal
. Master Outer
. Detail Outer
. Full Outer
Join Type property of Joiner Transformation
Note: A normal or master outer join performs faster than a full outer or detail
outer join.
Ans.
. In a normal join , the Integration Service discards all rows of data from the
master and detail source that do
not match, based on the join condition.
. A master outer join keeps all rows of data from the detail source and the
matching rows from the master
source. It discards the unmatched rows from the master source.
. A detail outer join keeps all rows of data from the master source and the
matching rows from the detail
source. It discards the unmatched rows from the detail source.
. A full outer join keeps all rows of data from both the master and detail sources.
Q26. Describe the impact of number of join conditions and join order in a Joiner
Transformation.
Ans. We can define one or more conditions based on equality between the specified
master and detail sources. Both
ports in a condition must have the same datatype.
If we need to use two ports in the join condition with non-matching datatypes we
must convert the datatypes so that
they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increases the time necessary to join two
sources.
The order of the ports in the join condition can impact the performance of the
Joiner transformation. If we use
multiple ports in the join condition, the Integration Service compares the ports in
the order we specified.
Note that the Joiner transformation does not match null values. For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service does not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports
tab of the joiner, and then join on
the default values.
Note: If a result set includes fields that do not contain data in either of the
sources, the Joiner transformation
populates the empty fields with null values. If we know that a field will return a
NULL and we do not want to insert
NULLs in the target, set a default value on the Ports tab for the corresponding
port.
Q28. Suppose we configure Sorter transformations in the master and detail pipelines
with the following sorted ports
in order: ITEM_NO, ITEM_NAME, PRICE.
When we configure the join condition, what are the guidelines we need to follow to
maintain the sort order?
Ans. If we have sorted both the master and detail pipelines in the order of the ports ITEM_NO, ITEM_NAME, PRICE, we must ensure that:
. ITEM_NO is used as the first join condition.
. If we add a second join condition, it must be ITEM_NAME; if we want to use PRICE in the join condition, we must also include ITEM_NAME.
. We must not skip a sorted port (e.g. join on ITEM_NO and PRICE only), as that would lose the sort order and the session would fail.
Ans. The best option is to place the Joiner transformation directly after the sort
origin to maintain sorted data.
However do not place any of the following transformations between the sort origin
and the Joiner transformation:
. Custom
. Unsorted Aggregator
. Normalizer
. Rank
. Union transformation
. XML Parser transformation
. XML Generator transformation
. Mapplet [if it contains any one of the above mentioned transformations]
Q30. Suppose we have the EMP table as our source. In the target we want to view
those employees whose salary is
greater than or equal to the average salary for their departments. Describe your
mapping approach.
After the Source Qualifier of the EMP table, place a Sorter transformation and sort based on the DEPTNO port.
Sorter Ports Tab
Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE
SALARY for each
(GROUP BY) DEPTNO.
When we perform this aggregation, we lose the data for individual employees.
To maintain employee data, we must pass a branch of the pipeline to the Aggregator
Transformation and pass a
branch with the same sorted source data to the Joiner transformation to maintain
the original data.
When we join both branches of the pipeline, we join the aggregated data with the
original data.
Aggregator Ports Tab
Aggregator Properties Tab
So next we need a sorted Joiner transformation to join the sorted aggregated data with the original data, based on DEPTNO. Here we will take the aggregated pipeline as the Master and the original data flow as the Detail pipeline.
Joiner Condition Tab
Joiner Properties Tab
After that we need a Filter Transformation to filter out the employees having
salary less than average salary for their
department.
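For reference, the same requirement expressed as a single SQL query (assuming the classic EMP table with SAL and DEPTNO columns):

SELECT e.*
FROM   emp e
JOIN   (SELECT deptno, AVG(sal) AS avg_sal
        FROM   emp
        GROUP  BY deptno) d
  ON   e.deptno = d.deptno
WHERE  e.sal >= d.avg_sal;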
Ans. The main properties of the Sequence Generator transformation are:
Property                | Description
Start Value             | The start value of the generated sequence, used by the Integration Service if the Cycle option is enabled.
Increment By            | The difference between two consecutive values generated from the NEXTVAL port.
End Value               | The maximum value the Integration Service will generate.
Current Value           | The current value of the sequence, i.e. the value from which generation starts.
Cycle                   | If enabled, the Integration Service cycles back to the Start Value after reaching the End Value.
Number of Cached Values | The number of sequential values the Integration Service caches at a time.
Reset                   | If enabled (non-reusable transformations only), the sequence restarts from the original Current Value after each session run.
Q. Suppose we connect the NEXTVAL port of a Sequence Generator to the surrogate key columns of two target tables. Will the surrogate keys in both the target tables be the same? If not, how can we flow the same sequence values into both of them?
Ans. When we connect the NEXTVAL output port of the Sequence Generator directly to
the surrogate key columns
of the target tables, the Sequence number will not be the same.
A block of sequence numbers is sent to one target table's surrogate key column; the second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table has received its block.
So suppose we have 5 rows coming from the source: the targets will have sequence values TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10) [taking Start Value 0, Current Value 1 and Increment By 1].
Now suppose the requirement is like that we need to have the same surrogate keys in
both the targets.
Then the easiest way to handle the situation is to put an Expression transformation in between the Sequence Generator and the target tables. The SeqGen will pass unique values to the Expression transformation, and the rows are then routed from the Expression transformation to both targets, so both receive the same key values.
Q34. Suppose we have 100 records coming from the source. Now for a target column
population we used a Sequence
generator.
Suppose the Current Value is 0 and End Value of Sequence generator is set to 80.
What will happen?
Ans. End Value is the maximum value the Sequence Generator will generate. After it reaches the End Value (80 here, with 100 source rows still coming), the session fails with an overflow error.
Q35. What changes do we observe when we promote a non-reusable Sequence Generator to a reusable one? And what happens if we set the Number of Cached Values to 0 for a reusable transformation?
Ans. When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default and the Reset property is disabled.
When we try to set the Number of Cached Values property of a Reusable Sequence
Generator to 0 in the
Transformation Developer we encounter the following error message:
The number of cached values must be greater than zero for reusable sequence
transformation.
Ans. Apart from aggregate expressions, the Informatica Aggregator also supports non-aggregate expressions and conditional clauses.
Ans. We can enable the session option, Incremental Aggregation for a session that
includes an Aggregator
Transformation. When the Integration Service performs incremental aggregation, it
actually passes changed source
data through the mapping and uses the historical cache data to perform aggregate
calculations incrementally.
Q41. What are the performance considerations when working with Aggregator
Transformation?
Ans.
. Filter the unnecessary data before aggregating it. Place a Filter transformation
in the mapping before the
Aggregator transformation to reduce unnecessary aggregation.
. Improve performance by connecting only the necessary input/output ports to
subsequent transformations,
thereby reducing the size of the data cache.
. Use Sorted input which reduces the amount of data cached and improves session
performance.
Q42. What differs when we choose Sorted Input for Aggregator Transformation?
Ans. The Integration Service creates index and data cache files in memory to process the Aggregator transformation. If the Integration Service requires more space than allocated for the index and data cache sizes in the transformation properties, it stores overflow values in cache files, i.e. it pages to disk. One way to increase session performance is to increase the index and data cache sizes in the transformation properties. But when we check Sorted Input, the Integration Service uses memory to process the Aggregator transformation and does not create cache files.
Q43. Under what conditions will selecting Sorted Input in the aggregator still not boost session performance?
Ans. Sorted input does not help when:
. The aggregate expression contains nested aggregate functions.
. The session uses incremental aggregation.
Q44. Under what condition selecting Sorted Input in aggregator may fail the
session?
Ans.
. If the input data is not sorted correctly, the session will fail.
. Even if the input data is properly sorted, the session may fail if the sort order of the ports and the group by ports of the aggregator are not in the same order.
Q45. Suppose we do not group by on any ports of the aggregator what will be the
output.
Ans. If we do not group values, the Integration Service will return only the last
row for the input rows.
Q46. What is the expected value if the column in an aggregator transform is neither
a group by nor an aggregate
expression?
Ans. The Integration Service produces one row for each group based on the group by ports. A column which is neither part of the group by key nor an aggregate expression returns the value from the last record of the group received. However, if we specifically use the FIRST function, the Integration Service returns the value from the first row of the group. So the default behaviour is that of the LAST function.
Q47. Give one example each of Conditional Aggregation, a Non-Aggregate expression and Nested Aggregation.
Ans.
We can use conditional clauses in an aggregate expression to reduce the number of rows used in the aggregation; the conditional clause can be any clause that evaluates to TRUE or FALSE. For example, SUM(SALARY, SALARY > 1000) sums only the salaries greater than 1000.
An example of a non-aggregate expression in an Aggregator is IIF(MAX(QUANTITY) > 0, MAX(QUANTITY), 0).
An example of nested aggregation is MAX(COUNT(ITEM)), where one aggregate function is nested inside another.
Q49. How does a Rank Transform differ from Aggregator Transform functions MAX and
MIN?
Ans. Like the Aggregator transformation, the Rank transformation lets us group
information. The Rank Transform
allows us to select a group of top or bottom values, not just one value as in case
of Aggregator MAX, MIN functions.
Ans. The Rank port is an input/output port used to specify the column on which we want to rank the source values. By default, Informatica creates an output port RANKINDEX for each Rank transformation; it stores the ranking position of each row within its group.
Ans. Rank transformation lets us group information. We can configure one of its
input/output ports as a group by
port. For each unique value in the group port, the transformation creates a group
of rows falling within the rank
definition (top or bottom, and a particular number in each rank).
Ans. If two rank values match, they receive the same value in the rank index and
the transformation skips the next
value.
Ans.
. We can connect ports from only one transformation to the Rank transformation.
. We can select the top or bottom rank.
. We need to select the Number of records in each rank.
. We can designate only one Rank port in a Rank transformation.
Ans. During a session, the Integration Service compares each input row with the rows in the data cache. If the input row out-ranks a cached row, the Integration Service replaces the cached row with the input row. If we configure the Rank transformation to rank based on different groups, the Integration Service ranks incrementally for each group it finds. The Integration Service creates an index cache to store the group information and a data cache for the row data.
Ans. Rank transformation can return the strings at the top or the bottom of a
session sort order. When the Integration
Service runs in Unicode mode, it sorts character data in the session using the
selected sort order associated with the
Code Page of IS which may be French, German, etc. When the Integration Service runs
in ASCII mode, it ignores this
setting and uses a binary sort order to sort character data.
Ans. When the Sorter transformation is configured to treat output rows as distinct,
it assigns all ports as part of the
sort key. The Integration Service discards duplicate rows compared during the sort
operation. The number of Input
Rows will vary as compared with the Output rows and hence it is an Active
transformation.
Ans. The Case Sensitive property determines whether the Integration Service
considers case when sorting data.
When we enable the Case Sensitive property, the Integration Service sorts uppercase
characters higher than
lowercase characters.
Ans. The Integration Service passes all incoming data into the Sorter Cache before
Sorter transformation performs the
sort operation.
The Integration Service uses the Sorter Cache Size property to determine the
maximum amount of memory it can
allocate to perform the sort operation. If it cannot allocate enough memory, the
Integration Service fails the session.
For best performance, configure Sorter cache size with a value less than or equal
to the amount of available physical
RAM on the Integration Service machine.
If the amount of incoming data is greater than the amount of Sorter cache size, the
Integration Service temporarily
stores data in the Sorter transformation work directory. The Integration Service
requires disk space of at least twice
the amount of incoming data when storing data in the work directory.
Ans.
. All input groups and the output group must have matching ports. The precision,
datatype, and scale must be
identical across all groups.
. We can create multiple input groups, but only one default output group.
. The Union transformation does not remove duplicate rows.
. We cannot use a Sequence Generator or Update Strategy transformation upstream
from a Union
transformation.
. The Union transformation does not generate transactions.
General questions
Q63. What is the difference between Static and Dynamic Lookup Cache?
Ans. Lookups are cached by default in Informatica. A lookup cache can be either non-persistent or persistent: the Integration Service saves or deletes the lookup cache files after a successful session run based on whether the lookup cache is marked as persistent or not. A cached lookup can further be static or dynamic: a static cache is not modified once it is built and remains the same during the session run, whereas a dynamic cache is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.
A Mapplet is a reusable object created in the Mapplet Designer which contains a set
of transformations and lets us
reuse the transformation logic in multiple mappings. A Mapplet can contain as many
transformations as we need.
Like a reusable transformation when we use a mapplet in a mapping, we use an
instance of the mapplet and any
change made to the mapplet is inherited by all instances of the mapplet.
Q66. What are the transformations that are not supported in Mapplet?
Ans. Normalizer, Cobol sources, XML sources, XML Source Qualifier transformations,
Target definitions, Pre- and
post- session Stored Procedures, Other Mapplets.
. PMERR_DATA- Stores data and metadata about a transformation row error and its
corresponding source
row.
. PMERR_MSG- Stores metadata about an error and the error message.
. PMERR_SESS- Stores metadata about the session.
. PMERR_TRANS- Stores metadata about the source and transformation ports, such as
name and datatype,
when a transformation error occurs.
Ans. When we issue the STOP command on the executing session task, the Integration
Service stops reading data
from source. It continues processing, writing and committing the data to targets.
If the Integration Service cannot
finish processing and committing data, we can issue the abort command.
Ans. Yes, we can copy a session to a new folder or repository, provided the corresponding mapping is already there.
This article attempts to refresh your Unix skills in the form of a question/answer based Unix tutorial on Unix command lines. The commands discussed here are particularly useful for developers working in middle-tier (e.g. ETL) systems, where they may need to interact with several *nix source systems for data retrieval.
How to print/display the first line of a file?
The easiest way is to use the [head] command:
$> head -1 file.txt
No prize for guessing that if you specify [head -2] it will print the first 2 records of the file.
Another way is to use the [sed] command. [sed] is a very powerful text editor which can be used for various text manipulation purposes like this:
$> sed '2,$ d' file.txt
How does the above command work? The 'd' parameter basically tells [sed] to delete all the records from the display, from line 2 to the last line of the file (the last line is represented by the $ symbol). Of course it does not actually delete those lines from the file; it just does not display them on the standard output screen. So you only see the remaining line, which is the 1st line.
How to print/display the last line of a file?
The easiest way is to use the [tail] command:
$> tail -1 file.txt
If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
From our previous answer we already know that '$' stands for the last line of the file. So '$ p' basically prints (p for print) the last line on the standard output screen. The '-n' switch takes [sed] to silent mode so that [sed] does not print anything else in the output.
How to print/display the n-th line of a file?
The easiest way to do it is by using [sed]. Based on what we already know about [sed] from our previous examples, we can quickly deduce this command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' test
Of course you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1
How to remove the first line / header from a file in Unix script?
We already know how [sed] can be used to delete a certain line from the output, by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
But the issue with the above command is that it just prints out all the lines except the first line of the file on the standard output; it does not really change the file in-place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output to some other file and then rename it back to the original file, like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or you can use the inbuilt [sed] switch '-i', which changes the file in-place. See below:
$> sed -i '1 d' file.txt
How to remove the last line/ trailer from a file in Unix script?
Always remember that the [sed] switch '$' refers to the last line. Using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt
If you want to remove line <m> to line <n> from a given file, you can accomplish
the task in the similar method
shown above. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt
This is a bit tricky. Suppose your file contains 100 lines and you want to remove the last 5 lines. Now if you know how many lines are there in the file, then you can simply use the above shown method and remove all the lines from 96 to 100, like below:
$> sed -i '96,100 d' file.txt # alternative to the command [head -95 file.txt]
But you will not always know the number of lines present in the file (the file may be generated dynamically, etc.). In that case there are many different ways to solve the problem. Some of them are quite complex and fancy, but let's first do it in a way that is easy to understand and remember. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`; sed -i "`expr $tt - 4`,$tt d" file.txt
As you can see there are two commands. The first one (before the semi-colon) calculates the total number of lines present in the file and stores it in a variable called 'tt'. The second command (after the semi-colon) uses the variable and works in exactly the way shown in the previous example.
How to find the length of a particular line in a file?
We already know how to print one line from a file, which is this:
$> sed -n '<n> p' file.txt
where <n> is to be replaced by the actual line number that you want to print. Now once you know it, it is easy to print out the length of this line by piping it to the [wc] command with the '-c' switch:
$> sed -n '35 p' file.txt | wc -c
The above command will print the length of the 35th line in file.txt.
How to get the n-th word of a line in Unix?
Assuming the words in the line are separated by spaces, we can use the [cut] command. [cut] is a very powerful and useful command and it's really easy. All you have to do to get the n-th word from the line is issue a command of the form cut -f<n> -d' '. For example:
$> echo "A quick brown fox jumped over the lazy cat" | cut -f4 -d' '
fox
How to reverse a string in Unix?
Simple. Use the [rev] command:
$> echo "unix" | rev
xinu
How to get the last word of a line in Unix?
We will make use of two commands that we learnt above to solve this: [rev] and [cut]. Here we go.
Let's imagine the line is: "C for Cat", and we need "Cat". First we reverse the line and get "taC rof C". Then we cut the first word and get "taC". And then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat
How to get the first field (or first column) from the output of a command?
We know we can do it with [cut]. For example, the below command extracts the first field from the output of the [wc -c] command:
$> wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this: [awk]. [awk] is a very powerful command for text pattern scanning and processing. Here we will see how we may use [awk] to extract the first field (or first column) from the output of another command. As above, suppose I want to print the first column of the [wc -c] output. Here is how it goes:
$> wc -c file.txt | awk '{print $1}'
109
The basic syntax of [awk] is:
awk '<pattern> {<action>}' file
In the action space we have asked [awk] to take the action of printing the first column ($1). More on [awk] later.
How to replace the n-th line in a file with a new line in Unix?
This can be done in two steps. The first step is to remove the n-th line, and the second step is to insert a new line at the n-th position. Here we go (shown for line 10):
$> sed -i'' '10 d' file.txt # d stands for delete
$> sed -i'' '10 i This is the new line' file.txt # i stands for insert
How to show non-printable characters in a file?
Open the file in the VI editor. Go to VI command mode by pressing [Escape] and then [:]. Then type [set list]. This will show you all the non-printable characters, e.g. Ctrl-M characters (^M), in the file.
In order to know the file type of a particular file, use the [file] command like below:
$> file file.txt
If you want to know the technical MIME type of the file, use the '-i' switch:
$> file -i file.zip
file.zip: application/x-zip
How to connect to a database and run a query from a Unix shell script?
You will be using the same [sqlplus] command to connect to the database that you use normally, even outside the shell script. To understand this, let's take an example. In this example we will connect to the database, fire a query and get the output printed from the unix shell (the connection string and query below are only illustrative):
$> cnt=`sqlplus -s user/password@dbname << EOF
SET HEADING OFF;
SELECT COUNT(*) FROM customer;
EXIT;
EOF`
$> echo $cnt
If you connect to the database in this method, the advantage is that you will be able to pass Unix-side shell variable values to the database: any $variable referenced inside the here-document (between << EOF and the closing EOF) is expanded by the shell before [sqlplus] sees the query.
In the same way, you can also execute a PL/SQL block from the shell script by placing the block (BEGIN ... END;) inside the here-document, followed by EXIT; and the closing EOF.
How to check the command line arguments in a UNIX command in Shell Script?
In a bash shell, you can access the command line arguments using the $0, $1, $2, ... variables, where $0 prints the command name, $1 prints the first input parameter of the command, $2 the second input parameter of the command, and so on.
How to make a shell script exit with an error?
Just put an [exit] command in the shell script with a return value other than 0. This is because the exit code of a successful Unix program is zero. So if you write
exit -1
inside your program, your program will report an error and exit immediately.
Normally the [ls -lt] command lists files/folders sorted by modification time. If you want to list them alphabetically, then simply specify: [ls -l]
How to check if the last command was successful?
$> echo $?
If the output is 0, the last command was successful; any non-zero value indicates failure.
How to check if a file is present in a particular directory in Unix?
This can be done in many ways. Based on what we have learnt so far, we can make use of the [ls] and [$?] commands to do this. See below:
$> ls -l file.txt; echo $?
If the file exists, the [ls] command will be successful, hence [echo $?] will print 0. If the file does not exist, the [ls] command will fail and [echo $?] will print a non-zero value.
The standard command to see this is [ps]. But [ps] only shows you the snapshot of
the processes at that instance. If
you need to monitor the processes for a certain period of time and need to refresh
the results in each interval,
consider using the [top] command.
$> ps -ef
If you wish to see the % of memory usage and CPU usage, then consider the below switches:
$> ps aux
If you wish to use this command inside some shell script, or if you want to customize the output of the [ps] command, you may use the '-o' switch like below. By using the '-o' switch, you can specify the columns that you want [ps] to print out.
$> ps -e -o stime,user,pid,args,%mem,%cpu
You can list down all the running processes using the [ps] command, and then 'grep' your user name or process name to see if the process is running. See below:
$> ps -ef | grep "process_name"
In Linux based systems, you can easily access the CPU and memory details from
the /proc/cpuinfo and
/proc/meminfo, like this:
$>cat /proc/meminfo
$>cat /proc/cpuinfo
Just try the above commands on your system to see how they work.
Remember Codd's rules? Or the ACID properties of a database? Maybe you still hold these basic properties close to your heart, or maybe you no longer remember them. Let's revisit these ideas once again.
A database is a collection of data for one or more uses. Databases are usually integrated and offer both data storage and retrieval.
Codd's Rules
Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer of the relational model for databases.
Rule 1 (The information rule): All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.
Rule 3 (Systematic treatment of null values): The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number" in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
Rule 4 (Active online catalog): The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
Rule 6 (The view updating rule): All views that are theoretically updatable must be updatable by the system.
Rule 7 (High-level insert, update, and delete): The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set, rather than just for a single row in a single table.
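For example, a single set-level SQL statement operates on every row that satisfies the predicate, rather than on one row at a time (illustrative table):

-- One statement updates the whole retrievable set
UPDATE employee
SET    salary = salary * 1.10
WHERE  department = 'Sales';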
Rule 8 (Physical data independence): Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.
Rule 9 (Logical data independence): Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
ACID Properties
Atomicity: Atomicity requires that each transaction is all or nothing: if one part of the transaction fails, the entire transaction fails and the database state is left unchanged.
Consistency: Consistency ensures that any transaction brings the database from one valid state to another; data written to the database must satisfy all defined rules and constraints.
Isolation: Isolation refers to the requirement that other operations cannot access or see data that has been modified during a transaction that has not yet completed. Each transaction must remain unaware of other concurrently executing transactions, except that one transaction may be forced to wait for the completion of another transaction that has modified data that the waiting transaction requires.
Durability: Durability is the DBMS's guarantee that once the user has been notified
of a transaction's success, the
transaction will not be lost. The transaction's data changes will survive system
failure, and that all integrity
constraints have been satisfied, so the DBMS won't need to reverse the transaction.
Many DBMSs implement
durability by writing transactions into a transaction log that can be reprocessed
to recreate the system state right
before any later failure.
Why people hate Project Managers - a must read for would-be managers
"Project Managers" are inevitable. Love them or hate them, but if you are in a project, you have to accept them. They are omnipresent in any project. They intervene too much in technical matters without much knowledge. They create unrealistic targets and nonsensical methods of achieving them. And they invariably fail to acknowledge individual hard work. Are they of any use?
Remember, not all project managers are hated! So the following reasons, of course, do not apply to them.
Generally, project managers are not answerable to their subordinates. They are self-paced and semi-autocratic. These allowances provide them the opportunity to spend time lazily. Many project managers spend more time surfing the internet than evaluating the performance of their subordinates.
The cure for their laziness is pro-activeness, which can help them spend quality time in office.
I know of a project manager "Harry" (name changed), who used to receive work from the client and assign it to his subordinate "John". Once John finished the work and sent Harry an email, Harry used to copy the contents of John's mail and reply back to the client. Since Harry never "forwarded" John's mail directly to the client, the client was always oblivious to the actual person (John) doing their work. The client always used to send appreciation mails to Harry only, and John was never credited for the work he did.
The advice for would-be project managers here is to remain conscious of individual contributions and give them their due credit whenever possible.
Proper planning makes things easy. What do you think is the main difference between a NASA space project and a service-industry IT project? The project members in that NASA project are the same kind of engineers that you have in your project. Maybe many of them passed out from the same graduate schools. The same set of people who made one project a marvellous success fail miserably in some other project. There is nothing wrong with those people. But there is something wrong with the leader leading that set of people. A NASA project succeeds because of meticulous and realistic planning, whereas the other project slogs.
Don't let new tools and technologies outsmart you. The technology space is ever changing. Try to keep pace with it. Install the software and tools that are being used in your project on your laptop. Play with them. Know what their features are and what their limitations are. Read blogs on them. Start your own blog and write something interesting in it on a regular basis. Be savvy. Otherwise you will be fooled by your own people.
Testing in data warehouse projects is, till date, a less explored area. However, if not done properly, this can be a major reason for data warehousing project failures - especially in the user acceptance phase. Given here is a mind-map that will help a project manager think through all the aspects of testing in data warehousing.
Testing Mindmap
DWBI Testing
1. Why is it important?
. To make the code bug-free
. To ensure data quality
. To increase the credibility of BI reports
. More BI projects fail after commissioning due to quality issues
2. What constitutes DWBI Testing?
. Performance Testing
. Functional Testing
. Canned Report Testing
. Ad-hoc testing
. Load Reconciliation
4. Why is it difficult?
. Limited Testing Tool
. Automated Testing not always possible
. Data traceability not always available
. Requires extensive functional knowledge
. Metadata management tool often fails
. Deals with bulk data - has performance impact
. Number of data conditions are huge
Use the above mind-map to plan and prepare the testing activity for your data
warehousing project.
An enterprise data warehouse often fetches records from several disparate systems and stores them centrally in an enterprise-wide warehouse. But what is the guarantee that the quality of data will not degrade in the process of centralization?
Data Reconciliation
Many data warehouses are built on an n-tier architecture, with multiple data extraction and data insertion jobs between two consecutive tiers. As it happens, the nature of the data changes as it passes from one tier to the next. Data reconciliation is the method of reconciling, or tying up, the data between any two consecutive tiers (layers).
Why is reconciliation required?
In the process of extracting data from one source and then transforming the data and loading it to the next layer, the whole nature of the data can change considerably. It might also happen that some information is lost while transforming the data. A reconciliation process helps to identify such loss of information.
One of the major reasons for information loss is loading failures or errors during loading; such errors can occur for several reasons. Failure due to any such issue can result in potential information loss, leading to unreliable data quality for business process decision making.
Furthermore, if such issues are not rectified at the earliest, they become even more costly to "patch" later. Therefore it is highly suggested that a proper data reconciliation process be in place in any data Extraction-Transformation-Load (ETL) process.
Data reconciliation is often confused with the process of data quality testing. Even worse, sometimes the data reconciliation process is used to investigate and pinpoint data issues. While data reconciliation may be a part of data quality assurance, these two things are not necessarily the same.
A successful reconciliation process should only indicate whether or not the data is correct. It will not indicate why the data is not correct. The reconciliation process answers the "what" part of the question, not the "why" part.
Methods of Data Reconciliation
Master data reconciliation is the method of reconciling only the master data between source and target. Master data are generally unchanging or slowly changing in nature, and no aggregation operation is done on the dataset. That is, the granularity of the data remains the same in both source and target. That is why master data reconciliation is often relatively easy and quick to implement.
In a business process, "customer", "product", "employee" etc. are some good examples of master data. Ensuring that the total number of customers in the source systems matches exactly with the total number of customers in the target system is an example of customer master data reconciliation.
Some common measures used in master data reconciliation are record counts of such master entities (for example, the number of customers, products or employees) compared between source and target.
Sales quantity, revenue, tax amount, service usage etc. are examples of transactional data. Transactional data form the very base of BI reports, so any mismatch in transactional data has a direct impact on the reliability of the report and the whole BI system in general. That is why a reconciliation mechanism must be in place in order to detect such a discrepancy beforehand (meaning, before the data reaches the final business users).
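The sketch below shows what such checks can look like in SQL Server style SQL. The staging and warehouse table names (stg_customer, dim_customer, stg_sales, fact_sales) and their columns are assumptions used purely for illustration.
-- Master data reconciliation: compare customer counts between source and target
SELECT
    (SELECT COUNT(*) FROM stg_customer) AS source_customer_count,
    (SELECT COUNT(*) FROM dim_customer) AS target_customer_count;
-- Transactional data reconciliation: compare total quantity and revenue
SELECT
    (SELECT SUM(sales_qty)   FROM stg_sales)  AS source_sales_qty,
    (SELECT SUM(sales_qty)   FROM fact_sales) AS target_sales_qty,
    (SELECT SUM(revenue_amt) FROM stg_sales)  AS source_revenue,
    (SELECT SUM(revenue_amt) FROM fact_sales) AS target_revenue;
A mismatch in any of these pairs indicates that records were lost or altered between the two tiers and should raise an alert before the data reaches the business users.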
This paper outlines some of the most important (and equally neglected) things that one must consider before and during the design phase of a data warehouse. In our experience, we have seen data warehouse designers often miss out on these items merely because they thought them too trivial to deserve their attention. Guess what, at the end of the day such neglect costs them heavily, as it cuts short the overall ROI of the data warehouse.
Here we outline some data warehouse gotchas that you should be aware of.
In a top-down design approach people often start by visualizing the end data, and so they realize the complexity associated with data analytics first. As they tend to see more details of it, they tend to devote more time to designing the analytical or reporting solutions and less time to designing the background ETL work that deals with data extraction / cleaning / transformation etc. They often live under the assumption that it will be comparatively easy to map the source data from the existing systems, since users already have a better understanding of the source systems. Moreover, they assume the need for and complexity of cleansing / profiling of the source data will be less, since the data is already coming from standard source systems.
From budgeting and costing standpoints also, an architect prefers to push the case for data reporting and analytics over background ETL, as the former can be more easily presented to senior management than the latter in order to get them to sanction the budget. This leads to a disproportionate budget between the background ETL and frontend reporting tasks.
Users often do not know what they want from the data until they start to see the data. As and when development progresses and more and more data visualization becomes possible, users start wishing even more out of their data. This phenomenon is unavoidable, and designers must allocate extra time to accommodate such ad-hoc requirements.
Many requirements that were implicit in the beginning become explicit and indispensable in the later phases of the project. Since you cannot avoid it, make sure that you already have adequate time allocated in your project plan beforehand.
3. Issues will be discovered in the source system that went undetected till date
The power of an integrated data warehouse becomes apparent when you start discovering discrepancies and issues in the existing, supposedly stable source systems. The real problem, however, is that designers often make the wrong assumption that the source systems or upstream systems are fault free. And that is why they do not allocate any time or resource in their project plan to deal with those issues.
Data warehouse developers do discover issues in the source systems. And those issues take a lot of time to get fixed. More often than not, those issues are not even fixed in the source (to minimize the impact on business) and some workaround is suggested to deal with them at the data warehouse level directly (although that is not generally a good idea). Source system issues confuse everybody and require more administrative time (than technical time) to resolve, as DW developers need to identify the issues and make their case to the source-system owners to prove that the issue(s) does exist. These are huge time wasters and are often not incorporated in the project plan.
4. You will need to validate data not being validated in source systems
Source systems do not always give you correct data. A lot of validations and checks are not done at the source system level (e.g. OLTP systems), and each time a validation check is skipped, it creates the danger of sending unexpected data to the data warehouse level. Therefore, before you can actually process data in the data warehouse, you will need to perform some validation checks at your end to ensure the expected data availability.
This is again unavoidable. If you do not make those checks, that will cause issues on your side, including things like data loading errors, reconciliation failures and even data integrity threats. Hence ensure that proper time and resource allocation are there to work on these items.
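As an illustration, the sketch below shows the kind of pre-load validation check this refers to, run against a hypothetical staging table stg_orders; the table and column names are assumptions.
-- Reject the load (or divert rows) if mandatory keys are missing
SELECT COUNT(*) AS rows_with_missing_keys
FROM   stg_orders
WHERE  customer_id IS NULL
   OR  product_id  IS NULL;
-- Detect duplicate business keys before they reach the warehouse
SELECT   order_id, COUNT(*) AS duplicate_count
FROM     stg_orders
GROUP BY order_id
HAVING   COUNT(*) > 1;
If either query returns rows, the batch can be failed or the offending records routed to an error table for investigation.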
5. User training will not be sufficient and users will not put their training to use
You will face natural resistance from the existing business users, who will show huge inertia against accepting the new system. In order to ease things, adequate user training sessions are generally arranged for the users of the data warehouse. But you will notice that "adequate" training is not "sufficient" for them (mainly because they need to unlearn a lot of things in order to learn the use of the new data warehouse).
Even if you arrange adequate training for the users, you will find that the users are not really putting their training to use when it comes to doing things in the new data warehouse. That's often because facts and figures from the new data warehouse challenge their existing convictions, and they are reluctant to accept them wholeheartedly.
User training and acceptance is probably the single most important non-technical challenge that makes or breaks a data warehouse. No matter what amount of effort you put in as a designer to design the data warehouse - if the users are not using it - the data warehouse is as good as a failure. As the old Sanskrit saying goes, "a tree is known by the name of its fruit"; the success of a data warehouse is measured by the information it produces. If the information is not relevant to the users and they are reluctant to use it, you have lost the purpose. Hence make all possible efforts to connect with the users and train them to use the data warehouse. Mere 'adequate' training is not 'sufficient' here.
That is because the users often belong to different departments of the company, and even though each one of them knows the business of her own department pretty well, she would not know the business of the other departments that well. And when you take the data from all these departments and try to combine them into an integrated data warehouse, you will often discover that the business rule suggested by one user is completely opposite to the business rule suggested by another.
Such cases generally get involved and need collaboration between multiple parties to come to a conclusion. It's better to consider such cases well ahead, during the planning phase, to avoid late surprises.
A very minutely done volumetric estimate in the starting phase of the project can go awry later. This happens for several reasons, e.g. a slight change in the standard business metrics may create a huge impact on the volumetric estimates.
For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum. While calculating the volume and size of your data warehouse, you have used this measure in several places. Now if the customer base actually increases by 10% instead of 7%, that would mean 30,000 more customers. In a fact table of granularity customer, product, day, this would mean 30,000 X 10 X 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, then the fact table would now require (30,000 X 365 X 10 X 1 KB) / (1024 X 1024) = 100+ GB more disk space from only one table.
When a user looks at one value in your report and says, "I think it's not right", the onus is on you to prove the correctness or validity of that data. Nobody is going to help you prove how right your data warehouse is. For this reason, it is absolutely necessary to build a solid data reconciliation framework for your data warehouse - a reconciliation framework that can trigger an early alarm whenever something does not match between source and target, so that you get enough time to investigate (and if required, fix) the issue.
Such a reconciliation framework, however indispensable, is not easy to create. Not only does it require a huge amount of effort and expertise, it also tends to run on the same production server at almost the same time as the production load and eats up a lot of performance. Moreover, such a reconciliation framework is often not a client-side requirement, making it even more difficult for you to allocate time and budget. But not doing it would be a much bigger mistake to make.
It's important to set the expectation in the very beginning of the project about
the huge maintenance cost
implications.
10. Amount of time needed to refresh your data warehouse is going to be your top concern
You need to load data into the data warehouse, generally at least daily (although sometimes more frequently than this) and also monthly / quarterly / yearly etc. Loading the latest data into the data warehouse ensures that your reports are all up-to-date. However, the time required to load data (refresh time) is going to be more than what you have calculated, and that too is going to increase day by day.
One of the major hindrances to the acceptance of a data warehouse by its users is its performance. I have seen too many cases where reports generated from the data warehouse miss SLA and severely damage the dependability and credibility of the data warehouse. In fact, I have seen cases where the daily load runs for more than a day and never completes in time to generate the daily report. There have been other famous cases of SLA breach as well.
I cannot stress this enough, but performance considerations are hugely important for the success of a data warehouse - more important than you thought. Do everything necessary to make your data warehouse perform well: reduce overhead, maintain servers, cut off complexities, do regular system performance tests (SPT) and weigh the performance against industry benchmarks, make SPT a part of user acceptance test (UAT), etc.
Well, you may be questioning how large is large. The answer: it depends. You must ask yourself if the large size of the model is really justified. The more complex your model is, the more prone it is to contain design errors. For example, you may want to try to limit your models to not more than 200 tables. To be able to do that, ask yourself in the early phase of data modelling whether every entity you are adding is really required.
If you consciously try to keep things simple, most likely you will also be able to avoid the menace of over modelling.
Over modelling leads to over engineering, which leads to overwork without any defined purpose. A person who does modelling just for the sake of modelling often ends up doing over modelling.
All of the above are sure signs of over modelling, which only increases your burden (of coding, of loading, of maintaining, of securing, of using).
The purpose of the model determines the level of detail that you want to keep in the model. If you are unsure about the purpose, you will definitely end up designing a model that is too detailed or too brief for the purpose.
Clarity is also very important. For example - do you clearly know the data types that you should be using for all the business attributes? Or do you end up using some speculative data types (and lengths)?
Modern data modelling tools come with different concepts for declaring data (e.g. the domain and enumeration concepts in ERWin) that help bring clarity to the model. So, before you start building, pause for a moment and ask yourself if you really understand the purpose of the model.
Violation of Normalization
When the tables in the model satisfy higher levels of normal forms, they are less likely to store redundant or contradictory data. But there is no hard and fast rule about maintaining those normal forms. A modeller is allowed to violate these rules for a good purpose (such as to increase performance), and such a relaxation is called denormalization.
But the problem occurs when a modeller violates the normal forms deliberately without a clearly defined purpose. Such reckless violation breaks apart the whole design principle behind the data model and often renders the model unusable. So if you are unsure of something, just stick to the rules. Don't get driven by vague purposes.
The above figure shows a general hierarchical relationship between a customer and its related categories. Let's say a customer can fall under the following categories - Consumer, Business, Corporate and Wholesaler. Given this condition, "ConsumerFlag" is a redundant column on the Customer table.
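A minimal sketch of what such a design looks like is given below; the table and column names are assumptions made purely for illustration.
-- The category is already captured through the foreign key to customer_category
CREATE TABLE customer_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(20)      -- Consumer, Business, Corporate, Wholesaler
);
CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    category_id   INT REFERENCES customer_category (category_id),
    consumer_flag CHAR(1)          -- redundant: derivable from category_id
);
Because the flag can always be derived from category_id, storing it separately invites contradictory data, for instance a row flagged as consumer that points to the Corporate category.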
Theoretically speaking there is no issue with such a model, at least until one
tries to create the ETL programming
(extraction-transformation-loading) code behind these tables.
Even query optimization becomes difficult when one disassociates the surrogate key from the natural key. The reason being, since the surrogate key takes the place of the primary key, a unique index is applied on that column. Any query based on the natural key identifier then leads to a full table scan, as that query cannot take advantage of the unique index on the surrogate key.
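The sketch below illustrates the point with a hypothetical product dimension (product_key as the surrogate key, product_code as the natural key); the names, the literal value and the extra index are assumptions, not part of the original text.
-- A unique index exists only on the surrogate key
CREATE UNIQUE INDEX ux_dim_product_key ON dim_product (product_key);
-- This lookup by the natural key cannot use that index and scans the table
SELECT product_key, product_name
FROM   dim_product
WHERE  product_code = 'P-10023';
-- A separate index on the natural key avoids the full table scan
CREATE INDEX ix_dim_product_code ON dim_product (product_code);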
If the answer to the above questions is "YES", don't use the surrogate key.
This paper introduces the subject of data mining in simple lucid language and moves
on to build more complex
concepts. Start here if you are a beginner.
Frankly, I dislike the term "data mining". Not because I hate the subject of data mining itself, but because this term is so over-used, misused, exploited and commercialized, and so often conveyed in an inaccurate manner, in inappropriate places and with intentional vagueness.
So when I decided to write about what data mining is, I was convinced that I needed to write about what is NOT data mining first, in order to build a formal definition of data mining.
What is Data Mining? (And what it is not)
Now the question is: what does the above definition really mean, and how does it differ from finding information from databases? We often store information in databases (as in data warehouses) and retrieve the information from the database when we need it. Is that data mining? The answer is "no". We will soon see why that is so.
Let's start with the big picture first. It all starts with something called "Knowledge Discovery in Databases". Data mining is basically one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is basically divided into 5 steps:
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Evaluation
Notice here the term "Knowledge", as in Knowledge Discovery in Databases (KDD). Why do we say "Knowledge"? Why not "information" or "data"?
This is because there are differences among the terms "data", "information" and "knowledge". Let's understand this difference through one example.
You run a local departmental store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day.
For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this information in your store database. This is data. Any time you want to know who the visitors are that buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week from your store, you can again query your database and you'd get the answer - that's also information.
But suppose there are 1000 other customers who also buy candles from you every Sunday (mostly, with some percentage of variation) and all of them are Christian by religion. So, you can conclude that Alex, Jessica and Paul must also be Christian.
Now, the religion of Alex, Jessica and Paul was not given to you as data. It could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the "knowledge" that you discovered. And this discovery was done through a process called "Data Mining".
Now there is a chance that you are wrong about Alex, Jessica and Paul. But there is a fair chance that you are actually right. That is why it is very important to "evaluate" the result of the KDD process.
I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important to understand our first question - why retrieving information from deep within your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it's still not data mining.
As long as you are not dealing with predictive analysis or discovering "new" patterns from the existing data, you are not doing data mining.
When it comes to applying data mining, your imagination is the only barrier (not really; there are technological hindrances as well, as we will see later). But it's true that data mining is applied in almost any field, from genetics to human rights violations. One of the most important applications is in "Machine Learning". Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by the application of machine learning, as solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of the application of data mining that enables a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have any driver and it moves around the road all by itself. The way it maneuvers and overcomes obstacles is by taking the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.
. Voice recognition
Think of Siri on the iPhone. How does it understand your commands? Clearly it's not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts better to your voice as you keep using it more and more.
DNA sequences contain biological information. One of the many approaches to DNA sequencing is sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.
Now consider this. What if "Linda" were an automated machine? You could probably still have the same kind of conversation, but it would probably have felt much more unnatural.
I know the above example is a bit of an overshoot, but you get the idea. Machines do not understand natural language. And it's a challenge to make them understand it. And until we do that, we won't be able to build a really useful human-computer interface.
Now if the above examples interest you, then let's continue learning more about data mining. One of the first tasks that we have to do next is to understand the different approaches that are used in the field of data mining. The list below shows most of the important methods:
Anomaly Detection
This is the method of detecting patterns in a given data set that do not conform to an established normal behavior. It is applied in a number of different fields, such as network intrusion detection, share market fraud detection etc.
Clustering
This is the method of grouping data objects so that objects in the same group are more similar to each other than to those in other groups, without using any predefined labels.
Classification
This method is used for the task of generalizing known structure to apply to new
data. For example, an email
program might attempt to classify an email as legitimate or spam.
Regression
This attempts to find a function that models the data with the least error. The above example of autonomous driving uses this method.
Next we will learn about each of these methods in greater detail, with examples of their applications.
SQL Questions
Check out the article Q100139 from the Microsoft knowledge base and, of course, there's much more information available on the net. It will be a good idea to get hold of any RDBMS fundamentals text book, especially the one by C. J. Date. Most of the time, it will be okay if you can explain up to third normal form.
Both primary key and unique enforce uniqueness of the column on which they are defined. But by default primary key creates a clustered index on the column, whereas unique creates a non-clustered index by default. Another major difference is that primary key does not allow NULLs, but a unique key allows one NULL only.
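A minimal T-SQL sketch, using an illustrative table name, showing the two constraints side by side:
CREATE TABLE dbo.Flight
(
    Flight_ID  INT        NOT NULL PRIMARY KEY, -- clustered index by default, no NULLs
    Flight_Num VARCHAR(8) NULL UNIQUE           -- non-clustered index by default, one NULL allowed
);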
What are user defined data types and when should you go for them?
User defined data types let you extend the base SQL Server data types by providing a descriptive name and format to the database. Take for example, in your database there is a column called Flight_Num which appears in many tables. In all these tables it should be varchar(8). In this case you could create a user defined data type called Flight_num_type of varchar(8) and use it across all your tables.
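As a sketch (the type name follows the example above; the table using it is hypothetical, and which statement applies depends on your SQL Server version):
-- SQL Server 2005 and later
CREATE TYPE Flight_num_type FROM varchar(8) NOT NULL;
-- (On older releases the equivalent was:  EXEC sp_addtype Flight_num_type, 'varchar(8)', 'NOT NULL')
-- The type can then be used wherever the column appears
CREATE TABLE dbo.FlightSchedule
(
    Schedule_ID INT PRIMARY KEY,
    Flight_Num  Flight_num_type
);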
What is bit data type and what's the information that can be stored inside a bit
column?
Bit data type is used to store Boolean information like 1 or 0 (true or false).
Until SQL Server 6.5 bit data
type could hold either a 1 or 0 and there was no support for NULL. But from SQL
Server 7.0 onwards, bit
data type can represent a third state, which is NULL.
A candidate key is one that can identify each row of a table uniquely. Generally a
candidate key becomes
the primary key of the table. If the table has more than one candidate key, one of
them will become the
primary key, and the rest are called alternate keys.
A key formed by combining two or more columns is called a composite key.
A transaction is a logical unit of work in which, all the steps must be performed
or none. ACID stands for
Atomicity, Consistency, Isolation, Durability. These are the properties of a
transaction. For more
information and explanation of these properties, see SQL Server books online or
any RDBMS fundamentals text book.
What type of index will get created after executing the above statement?
What is the maximum size of a row?
8060 bytes. Do not be surprised by questions like 'What is the maximum number of columns per table'. Check out SQL Server books online for the page titled: "Maximum Capacity Specifications".
Lock escalation is the process of converting a lot of low-level locks (like row locks and page locks) into higher-level locks (like table locks). Every lock is a memory structure; too many locks would mean more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. The lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it's dynamically managed by SQL Server.
What's the difference between DELETE TABLE and TRUNCATE TABLE commands?
DELETE TABLE is a logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it will not log the deletion of each row; instead it logs the de-allocation of the data pages of the table, which makes it faster. Of course, TRUNCATE TABLE can be rolled back when it is executed inside an explicit transaction.
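A quick sketch of the two commands against a hypothetical staging table:
-- Row-by-row, fully logged delete
DELETE FROM dbo.SalesStaging;
-- Page de-allocation only, minimally logged and much faster
TRUNCATE TABLE dbo.SalesStaging;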
Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more information.
What are the new features introduced in SQL Server 2000 (or the latest release of
SQL Server at the time
of your interview)? What changed between the previous version of SQL Server and the
current version?
This question is generally asked to see how current your knowledge is. Generally there is a section at the beginning of the books online titled "What's New", which has all such information.
Of course, reading just
that is not enough, you should have tried those things to better answer the
questions. Also check out the
section titled "Backward Compatibility" in books online which talks about the
changes that have taken
place in the new version.
Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY
For an explanation of these constraints see books online for the pages titled:
"Constraints" and "CREATE
TABLE", "ALTER TABLE"
What is an index? What are the types of indexes? How many clustered indexes can be
created on a
table? I create a separate index on each column of a table. what are the advantages
and disadvantages
of this approach?
Indexes in SQL Server are similar to the indexes in books. They help SQL Server
retrieve the data quicker.
Indexes are of two types. Clustered indexes and non-clustered indexes. When you
create a clustered
index on a table, all the rows in the table are stored in the order of the
clustered index key. So, there can
be only one clustered index per table. Non-clustered indexes have their own storage
separate from the
table data storage. Non-clustered indexes are stored as B-tree structures (as are clustered indexes), with the leaf-level nodes having the index key and its row locator. The row locator could be the RID or the clustered index key, depending upon the absence or presence of a clustered index on the table.
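For reference, a minimal sketch of creating both kinds of index on a hypothetical Orders table:
-- Only one clustered index is allowed per table
CREATE CLUSTERED INDEX ix_Orders_OrderID
    ON dbo.Orders (Order_ID);
-- Any number of non-clustered indexes can be added
CREATE NONCLUSTERED INDEX ix_Orders_CustomerID
    ON dbo.Orders (Customer_ID);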
RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels, 0 through 5, offering different levels of performance and fault tolerance.
MSDN has some information about RAID levels and for detailed information, check out
the RAID
advisory board's homepage
What are the steps you will take to improve performance of a poor performing query?
This is a very open ended question and there could be a lot of reasons behind the
poor performance of a
query. But some general issues that you could talk about would be: No indexes,
table scans, missing or
out of date statistics, blocking, excess recompilations of stored procedures,
procedures and triggers
without SET NOCOUNT ON, poorly written query with unnecessarily complicated joins,
too much
normalization, excess usage of cursors and temporary tables.
A number of tools and techniques can help you troubleshoot performance problems; also download the white paper on performance tuning SQL Server from the Microsoft web site.
What are the steps you will take, if you are tasked with securing an SQL Server?
Again this is another open ended question. Here are some things you could talk
about: Preferring NT
authentication, using server, database and application roles to control access to
the data, securing the
physical database files using NTFS permissions, using an unguessable SA password,
restricting physical
access to the SQL Server, renaming the Administrator account on the SQL Server
computer, disabling the
Guest account, enabling auditing, using multi-protocol encryption, setting up SSL,
setting up firewalls,
isolating SQL Server from the web server etc.
Read the white paper on SQL Server security from the Microsoft website. Also check out My SQL Server security best practices.
What is a deadlock and what is a live lock? How will you go about resolving deadlocks?
A deadlock is a situation when two processes, each having a lock on one piece of data, attempt to acquire a lock on the other's piece. Each process would wait indefinitely for the other to release the lock, unless one of the user processes is terminated. SQL Server detects deadlocks and terminates one user's process. A livelock is one where a request for an exclusive lock is repeatedly denied because a series of overlapping shared locks keeps interfering.
Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books online. Also check out the article Q169960 from the Microsoft knowledge base.
Blocking happens when one connection from an application holds a lock and a second
connection
requires a conflicting lock type. This forces the second connection to wait,
blocked on the first.
Read up the following topics in SQL Server books online: Understanding and avoiding
blocking, Coding
efficient transactions.
Many of us are used to creating databases from the Enterprise Manager or by just issuing the command CREATE DATABASE followed by the database name. But what if you have to create a database with two file groups, one on drive C and the other on drive D, with the log on drive E, with an initial size of 600 MB and with a growth factor of 15%? That's why, being a DBA, you should be familiar with the CREATE DATABASE syntax. Check out SQL Server books online for more information; a sketch of such a statement is given below.
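The following is a minimal sketch only; the database name, logical file names and folder paths are assumptions chosen to match the scenario described above.
CREATE DATABASE Sales
ON PRIMARY
    ( NAME = Sales_Data1,
      FILENAME = 'C:\SQLData\Sales_Data1.mdf',
      SIZE = 600MB, FILEGROWTH = 15% ),
FILEGROUP FG_Secondary
    ( NAME = Sales_Data2,
      FILENAME = 'D:\SQLData\Sales_Data2.ndf',
      SIZE = 600MB, FILEGROWTH = 15% )
LOG ON
    ( NAME = Sales_Log,
      FILENAME = 'E:\SQLLogs\Sales_Log.ldf',
      SIZE = 600MB, FILEGROWTH = 15% );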
How to restart SQL Server in single user mode? How to start SQL Server in minimal
configuration
mode?
SQL Server can be started from command line, using the SQLSERVR.EXE. This EXE has
some very
important parameters with which a DBA should be familiar with. -m is used for
starting SQL Server in
single user mode and -f is used to start the SQL Server in minimal configuration
mode. Check out SQL
Server books online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for
database
maintenance?
DBCC CHECKDB,
DBCC CHECKTABLE,
DBCC CHECKCATALOG,
DBCC CHECKALLOC,
DBCC SHOWCONTIG,
DBCC SHRINKDATABASE,
DBCC SHRINKFILE etc.
But there are a whole load of DBCC commands which are very useful for DBAs. Check
out SQL Server
books online for more information.
What are statistics, under what circumstances they go out of date, how do you
update them?
UPDATE STATISTICS,
STATS_DATE,
DBCC SHOW_STATISTICS,
CREATE STATISTICS,
DROP STATISTICS,
sp_autostats,
sp_createstats,
sp_updatestats
What are the different ways of moving data/databases between servers and databases
in SQL Server?
There are lots of options available, you have to choose your option depending upon
your requirements.
Some of the options you have are:
BACKUP/RESTORE,
Detaching and attaching databases,
Replication,
DTS,
BCP,
Log shipping,
INSERT...SELECT,
SELECT...INTO,
creating INSERT scripts to generate data.
Types of backups you can create in SQL Server 7.0+ are full database backup, differential database backup, transaction log backup and filegroup backup. Check out the BACKUP and RESTORE commands in SQL Server books online. Be prepared to write the commands in your interview; a sketch is given below. Books online also has information on the detailed backup/restore architecture and on when one should go for a particular kind of backup.
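A minimal sketch of the most common commands; the database name and backup paths are illustrative only.
-- Full database backup
BACKUP DATABASE Sales TO DISK = 'D:\Backup\Sales_full.bak';
-- Differential backup (changes since the last full backup)
BACKUP DATABASE Sales TO DISK = 'D:\Backup\Sales_diff.bak' WITH DIFFERENTIAL;
-- Transaction log backup
BACKUP LOG Sales TO DISK = 'D:\Backup\Sales_log.trn';
-- Restore sequence: full backup first, then the log, then recover
RESTORE DATABASE Sales FROM DISK = 'D:\Backup\Sales_full.bak' WITH NORECOVERY;
RESTORE LOG Sales FROM DISK = 'D:\Backup\Sales_log.trn' WITH RECOVERY;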
What is database replication? What are the different types of replication you can
set up in SQL Server?
* Snapshot replication
* Transactional replication (with immediate updating subscribers, with queued
updating subscribers)
* Merge replication
See SQL Server books online for in-depth coverage on replication. Be prepared to
explain how different
replication agents function, what are the main system tables used in replication
etc.
How to determine the service pack currently installed on SQL Server?
The global variable @@Version stores the build number of the sqlservr.exe, which is
used to determine the
service pack installed. To know more about this process visit SQL Server service
packs and versions.
What are cursors? Explain different types of cursors. What are the disadvantages of
cursors? How can
you avoid cursors?
Types of cursors:
Static,
Dynamic,
Forward-only,
Keyset-driven.
Disadvantages of cursors: Each time you fetch a row from the cursor, it results in a network roundtrip, whereas a normal SELECT query makes only one round trip, however large the resultset is. Cursors are also costly because they require more resources and temporary storage (resulting in more IO operations). Further, there are restrictions on the SELECT statements that can be used with some types of cursors.
Most of the time, set-based operations can be used instead of cursors. Here is an example: suppose you have to give a flat hike to your employees based on some criteria, such as their current salary band. Instead of looping through the employees with a cursor, a single set-based UPDATE like the sketch below does the job.
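The criteria below are purely hypothetical, as are the table and column names:
-- 10% hike for employees earning below 30000, 5% for everyone else,
-- done in a single set-based statement with no cursor
UPDATE emp
SET    basicsal = CASE
                      WHEN basicsal < 30000 THEN basicsal * 1.10
                      ELSE basicsal * 1.05
                  END;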
Another situation in which developers tend to use cursors: you need to call a stored procedure when a column in a particular row meets a certain condition. You don't have to use cursors for this. This can be achieved using a WHILE loop, as long as there is a unique key to identify each row, as in the sketch below.
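A rough T-SQL sketch of that pattern; the orders table, the needs_processing flag and the procedure name are assumptions used only to illustrate the loop.
DECLARE @id INT;
SELECT @id = MIN(order_id) FROM orders WHERE needs_processing = 1;
WHILE @id IS NOT NULL
BEGIN
    EXEC dbo.process_order @order_id = @id;   -- hypothetical procedure
    SELECT @id = MIN(order_id)
    FROM   orders
    WHERE  needs_processing = 1 AND order_id > @id;
END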
Write down the general syntax for a SELECT statements covering all the options.
Here's the basic syntax: (Also checkout SELECT in books online for advanced
syntax).
SELECT select_list
[INTO new_table]
FROM table_source
[WHERE search_condition]
[GROUP BY group_by_expression]
[HAVING search_condition]
[ORDER BY order_expression [ASC | DESC]]
Joins are used in queries to explain how different tables are related. Joins also
let you select data from a
table depending upon data from another table.
Types of joins:
INNER JOINs,
OUTER JOINs,
CROSS JOINs
For more information see pages from books online titled: "Join Fundamentals" and
"Using Joins".
Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT
What is an extended stored procedure? Can you instantiate a COM object by using T-SQL?
An extended stored procedure is a function inside a DLL (typically written in a language like C/C++) that is registered with SQL Server and can be called from T-SQL just like a normal stored procedure. Yes, you can instantiate a COM object (written in languages like VB, VC++) from T-SQL by using the sp_OACreate stored procedure.
What is the system function to get the current user's user id?
USER_ID(). Also check out other system functions like
USER_NAME(),
SYSTEM_USER,
SESSION_USER,
CURRENT_USER,
USER,
SUSER_SID(),
HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a
trigger on demand?
Triggers are a special kind of stored procedure that gets executed automatically when an INSERT, UPDATE or DELETE operation takes place on a table.
In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one
for UPDATE and one
for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could
create multiple
triggers per each action. But in 7.0 there's no way to control the order in which
the triggers fire. In SQL
Server 2000 you could specify which trigger fires first or fires last using
sp_settriggerorder
Triggers cannot be invoked on demand. They get triggered only when an associated
action (INSERT,
UPDATE, DELETE) happens on the table on which they are defined.
Triggers are generally used to implement business rules, auditing. Triggers can
also be used to extend the
referential integrity checks, but wherever possible, use constraints for this
purpose, instead of triggers, as
constraints are much faster.
Till SQL Server 7.0, triggers fire only after the data modification operation
happens. So in a way, they are
called post triggers. But in SQL Server 2000 you could create pre triggers also.
Search SQL Server 2000
books online for INSTEAD OF triggers.
Also check out books online for 'inserted table', 'deleted table' and
COLUMNS_UPDATED()
There is a trigger defined for INSERT operations on a table, in an OLTP system. The
trigger is written to
instantiate a COM object and pass the newly inserted rows to it for some custom
processing.
Instantiating COM objects is a time consuming process and since you are doing it
from within a trigger, it
slows down the data insertion process. Same is the case with sending emails from
triggers. This scenario
can be better implemented by logging all the necessary data into a separate table,
and have a job which
periodically checks this table and does the needful.
Here is an advanced query using a LEFT OUTER JOIN that even returns the employees without managers (the super bosses).
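A sketch under assumed names: an emp table with empid, mgrid and empname, where mgrid refers back to empid.
SELECT e.empname AS employee,
       m.empname AS manager
FROM   emp e
LEFT OUTER JOIN emp m
       ON e.mgrid = m.empid;   -- super bosses come back with a NULL manager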
1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In order to
view its contents without first having to decompress it, use the _________ utility.
zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat
displays a file.
4. You suspect that you have two commands with the same name as the command is not
producing the
expected results. What command can you use to determine the location of the command
being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command can you use to
determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory so you issue the command ln /data /home/bob/datalink but the command fails. What option should you use in this command line to be successful?
Use the -s option
Hard links to directories are not normally allowed, so to link to a directory you should create a symbolic link with the -s option, e.g. ln -s /data /home/bob/datalink. (GNU ln's -F/-d option merely lets the superuser attempt a directory hard link, which most filesystems refuse.)
7. When you issue the command ls -l, the first character of the resulting display represents the file's ___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar. What command
should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users' home directories. What utility should you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to split it into
four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command should you
type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used and each user's defined quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.
Why do you need indexing? Where is it stored and what do you mean by a schema object? For what purpose are we using views?
Indexing is used for faster searches, i.e. to retrieve data faster from the tables. A schema contains a set of tables; basically a schema means a logical separation of the database. A view is created for easier, customized retrieval of data: it is a customized virtual table, and we can create a single view over multiple tables. The only drawback is that a materialized view needs to be refreshed in order to show updated data.
Triggers are fired implicitly on the tables/views on which they are created. There are various advantages of using a trigger. Some of them are:
. Suppose we need to validate a DML statement (insert/update/delete) that modifies a table; we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed on that table.
. Another reason for using triggers can be the automatic updating of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created.
. Triggers can be used to enforce constraints. For example, if insert/update/delete statements should not be allowed on a particular table after office hours, a trigger can be used to enforce this constraint.
. Triggers can be used to publish information about database events to subscribers. A database event can be a system event, like database startup or shutdown, or a user event, like user login or logoff.
UNION will remove the duplicate rows from the result set while UNION ALL doesn't.
Both will result in deleting all the rows in the table. A TRUNCATE call cannot be rolled back, as it is a DDL command, and all the space for that table is released back to the server; TRUNCATE is much faster. A DELETE call, on the other hand, is a DML command and can be rolled back.
Which system table contains information on constraints on all the tables created?
USER_CONSTRAINTS. This data dictionary view contains information on the constraints on all the tables created by the current user.
Explain normalization?
Normalization is the process of reducing redundancy and maintaining data integrity by organizing data into well-structured tables. The commonly used normal forms are first normal form, second normal form, third normal form and fourth normal form.
How to find out the database name from the SQL*Plus command prompt?
Select * from global_name;
This will give the database name to which you are currently connected.
What is the difference between a correlated subquery and a nested subquery?
A correlated subquery runs once for each row selected by the outer query. It contains a reference to a value from the row selected by the outer query.
A nested subquery runs only once for the entire nesting (outer) query. It does not contain any reference to the outer query row.
For example,
Correlated Subquery:
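A sketch of the correlated form, assuming the same emp table as the nested example below:
select empname, basicsal, deptno
from   emp e
where  basicsal = (select max(basicsal)
                   from   emp
                   where  deptno = e.deptno)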
Nested Subquery:
select empname, basicsal, deptno
from   emp
where  (deptno, basicsal) in (select deptno, max(basicsal)
                              from   emp
                              group by deptno)
Which wildcard characters can be used with the LIKE operator?
1. % (percent)
2. _ (underscore)
% matches zero or more characters and underscore matches exactly one character.
What is a database?
A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated. See this URL: http://www.webopedia.com/TERM/d/database.html
What are the advantages and disadvantages of primary key and foreign key in SQL?
Primary key
Advantages
1) It is a unique key on which all the other candidate keys are functionally dependent.
Disadvantage
1) There can be more than one key on which all the other attributes are dependent.
Foreign key
Advantage
1) It allows referencing another table using the primary key of the other table.
Which date function is used to find the difference between two dates?
datediff. A minimal example with illustrative dates is shown below; its output is 5.
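For example (the dates are assumptions, chosen to be five days apart):
SELECT DATEDIFF(day, '2024-04-01', '2024-04-06') AS diff_in_days;
-- returns 5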
This article is a step-by-step instruction for those who want to install the Oracle 10g database on their computer. This document provides guidelines to install the Oracle 10g database in a Microsoft Windows environment. If you use an operating system other than Microsoft Windows, the process is not too much different from that of Microsoft Windows, since Oracle uses the Oracle Universal Installer to install its software.
For more information about installing Oracle 10g under operating systems other than
Microsoft
Windows, please refer to this URL :
http://www.oracle.com/pls/db102/homepage
You can download the Oracle 10g database from www.oracle.com. You must register and create an account before you can download the software. The example in this document uses Oracle Database 10g Release 2 (10.2.0.1.0) for Microsoft Windows.
1. Uninstall all Oracle components using the Oracle Universal Installer (OUI).
2. Run regedit.exe and delete the HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE key. This contains the registry entries for all Oracle products.
3. Delete any references to Oracle services left behind in the following part of the registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ora*. It should be pretty obvious which ones relate to Oracle.
4. Reboot your machine.
5. Delete the C:\Oracle directory, or whatever directory is your Oracle_Base.
6. Delete the C:\Program Files\Oracle directory.
7. Empty the contents of your c:\temp directory.
8. Empty your recycle bin.
1. Insert the Oracle CD; the autorun window opens automatically. If you are installing from a network or hard disk, click setup.exe in the installation folder.
2. The Oracle Universal Installer (OUI) will run and display the Select Installation Method window. Here you choose the installation method and the Oracle home location, which by default is something like c:\oracle\product\10.2.0\db_1. A summary screen then lists the space requirements and the new products to be installed. Click Install to start the installation.
6. The Install window appears showing installation progress.
7. At the end of the installation phase, the Configuration Assistants window appears. This window lists the configuration assistants that are started automatically.
If you are creating a database, then the Database Configuration Assistant starts automatically in a separate window.
At the end of database creation, you are prompted to unlock user accounts to make the accounts accessible. The SYS and SYSTEM accounts are already unlocked. Click OK to bypass password management.
Note: Oracle 10g still keeps the scott / tiger username and password (UID=scott, PWD=tiger) from the old versions of Oracle. In the old versions of Oracle, the scott/tiger user ID was available by default, but not in Oracle 10g. If you want to use the scott/tiger account, you must unlock it by clicking "Password Management" at the last window.
The Password Management window will appear like the one shown below. Find the user name "Scott" and uncheck the "Lock Account?" column for that user name.
8. Your installation and database creation is now complete. The End of Installation
window
displays several important URLs, one of which is for Enterprise Manager.
9. You can navigate to this URL in your browser and log in as the SYS user with the
associated
password, and connect as SYSDBA. You use Enterprise Manager to perform common
database
administration tasks
Note: you can access Oracle Enterprise Manager using a browser by typing the URL shown above into your browser. Instead of typing the IP address, you can also access Enterprise Manager by typing http://localhost:1158/em or http://[yourComputerName]:1158/em, or by clicking Start >> All Programs >> Oracle - [YourOracleHome_home1] >> Database Control - [yourOracleID] in the Windows menu.
By default, use the user ID SYSTEM, with the password that you chose at the beginning of the installation, to connect to the database, SQL*Plus, etc. If you want to use another user ID, you may create a new user.
Data Modeling
A data model is a logical map that represents the inherent properties of the data independent of software, hardware, or machine performance considerations. The model shows data elements grouped into records, as well as the associations around those records.
Since the data model is the basis for data implementation regardless of software or hardware platforms, the data model should present descriptions about the data in an abstract manner which does not mention detailed information specific to any hardware or software, such as bit manipulation or index addition.
There are two generally accepted meanings of the term data model. The first is that the data model can be some sort of theory about the formal description of the data's structure and use, without any mention of heavy technical terms related to information technology. The second is that a data model instance is the application of the data model theory in order to create a model that meets the requirements of some application, such as those used in a business enterprise.
The structural part of a data model theory refers to the collection of data structures which make up the data when it is being created. These data structures represent entities and objects in the database model. For instance, the data model may be that of a business enterprise involved in the sale of toys.
The real-life things of interest would include customers, company staff and of course the toy items. Since the database which will keep the records of these things of interest cannot understand the real meaning of customers, company staff and toy items, a data representation of these real-life things has to be created.
The integrity part of a data model refers to the collection of rules which governs the constraints on the data structures so that structural integrity can be achieved. In the integrity aspect of a data model, the formal definition of an extensive set of rules and the consistent application of data is defined, so that the data can be used for its intended purpose.
Techniques are defined on how to maintain data in the data resource and to ensure that the data consistently contains values which are faithful to their source while at the same time accurate at their destination. This is to ensure that the data will always have data value integrity, data structure integrity, data retention integrity, and data derivation integrity.
The manipulation part of a data model refers to the collection of operators which can be applied to the data structures. These operations include the query and update of data within the database. This is important because not all data can be allowed to be altered or deleted. The data manipulation part works hand in hand with the integrity part so that the data model can result in a high-quality database for the data consumers to enjoy.
As an example, let us take the relational model. The data model defined in the structural part refers to the modified concept of the mathematical relation: the data is represented as n-ary relations, each of which is a subset of the Cartesian product of n domains. The integrity part refers to expressions in first-order logic, and the manipulation part refers to the relational algebra as well as tuple and domain calculus.
Data Modeling is a method used to define and analyze data requirements needed to
support
the business functions of an enterprise. These data requirements are recorded as a
conceptual data model with associated data definitions. Data modeling defines the
relationships between data elements and structures.
Data modeling can be used for a wide array of purposes. It is an act of exploring data-oriented structures without considering any specific applications that the data will be used in. It is like a conceptual definition of an entity and its real-life counterparts, which is anything that is of interest to the organization implementing a database.
Data models are the products of data modeling. In general, there are three data model styles, namely the conceptual data model, the logical data model and the physical data model.
The conceptual data model is often called the domain model. It describes the
semantics of a
business organization as this model consists of entity classes which represent
things of
significance to an organization and the relationships of these entities.
Relationships are
defined as assertions about associations between various pairs of entity classes.
The
conceptual data model is commonly used to explore domain concepts with project
stakeholders. Conceptual models may be created to explore high-level static business structures and concepts, but they can also be used as precursors or alternatives to logical data models.
The logical data model is used in exploring domain concepts and other key areas
such as
relationships and domain problems. The logical data models could be defined for the
scope
of a single project or for the whole enterprise. The logical data model describes
semantics
related to particular data manipulation methods and such descriptions include those
of
tables, columns, object oriented classes, XML tags and many other things. The
logical data
model depicts some logical entity types, the data attributes to describe those
entities and
relations among the entities.
The physical data model is used in the design of the internal database schema. This design defines data tables, data columns for the tables and the relationships among the tables. Among other things, the physical data model is concerned with descriptions of the physical means by which data is stored. This storage aspect embraces concerns such as hard disk partitioning, CPU usage optimization, creation of table spaces and others.
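As a rough illustration of the three styles, the sketch below (Python, with an assumed "customer places order" domain and invented names) states the conceptual model as a comment, the logical model as attribute-bearing classes, and the physical model as SQLite tables.

import sqlite3
from dataclasses import dataclass

# Conceptual level: only the entities and the relationship between them.
#   Customer --places--> Order

# Logical level: entity types and the attributes that describe them.
@dataclass
class Customer:
    customer_id: int
    name: str

@dataclass
class Order:
    order_id: int
    customer_id: int   # the relationship back to Customer
    total: float

# Physical level: tables, columns and keys exactly as the database stores them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        total       REAL
    );
""")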
Data modeling also focuses on the structure of data within a domain. This structure is described in a dedicated grammar for an artificial language used for that domain. But as always, the description of the data structure will never mention a specific implementation of any database management system or any specific vendor.
Sometimes, having different data modelers could lead to confusion as they could
potentially
produce different data models within the same domain. The difference could stem
from
different levels of abstraction in the data models. This can be overcome by coming
up with
generic data modeling methods.
For instance, generic data modeling could take advantage of generic patterns in a business organization. An example is the concept of a Party, which includes Persons and Organizations. A generic data model for this entity may be easier to implement without creating conflicts along the way.
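A minimal sketch of that Party pattern, with hypothetical fields, might look like this: Persons and Organizations share one supertype, so anything that relates to a Party (addresses, roles, agreements) is modeled only once.

from dataclasses import dataclass

@dataclass
class Party:
    party_id: int
    name: str

@dataclass
class Person(Party):
    date_of_birth: str = ""       # hypothetical attribute

@dataclass
class Organization(Party):
    registration_no: str = ""     # hypothetical attribute

parties: list[Party] = [
    Person(1, "Michelle", "1952-03-05"),
    Organization(2, "Acme Toys Ltd"),
]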
This data model represents events, entities and objects in the real world that are
of interest
to the company. It is subject oriented and includes all aspects of the real world,
primarily
activities pertaining to the business.
To use lay terms, a data model can be considered a road map to get one employee
from
point A to point B in the least mileage, most scenery and shortest time of travel.
In the science of computing, data models are structured and organized data
structures that
are implemented in a database management system. Aside from defining and organizing
business data, data modeling also includes implicitly and explicitly imposing
constraints and
limitations on the data within the data structure.
A data model may be an instance of a conceptual schema, a logical schema or a physical schema.
The physical schema, as the name implies, is the description of the physical means
for
storing data. This can include definitions for storage requirements in hard terms
like
computers, central processing units, network cables, routers and others.
Data Architects and Business Analysts usually work hand in hand to make an
efficient data
model for an organization. To come up with a good Common Data Model output, they
need
to be guided by the following:
1. They have to be sure about database concepts like cardinality, normalization and
optionality;
2. They have to have in-depth knowledge of the actual rules of the business and its requirements;
3. They should be more interested in the final resulting database than the data
model.
A data model describes the structure of the database within a business and, in effect, the underlying structure of the business as well. It can be thought of as a grammar for an artificial language used in business or any other undertaking.
In the real world, kinds of things are represented as entities in the data model. These entities can hold information as attributes, as well as relationships. Irrespective of how data is represented in the computer system, the data model describes the company data.
It is always advised to have a good conceptual data model to describe the semantics
of a
given subject area. A conceptual data model is a collection of assertions
pertaining to the
nature of information used by the company. Entities should be named with natural
language
instead of a technical term. Relationships which are properly named also form
concrete
assertions about the subject.
Common Data Modeling defines the unifying structure that allows heterogeneous business environments to interoperate. A Common Data Model is very critical to a business organization.
In Common Data Modeling, Business Architects and analysts need to face the data
first
before defining a common data or abstraction layer so that they will not be bound
to a
particular schema and thus make the Business Enterprise more flexible.
Business Schemas are the underlying definition of all business-related activities. Data Models are actually instances of Business Schemas: conceptual, logical and physical schemas. These schemas have several aspects of definition and they usually form a concrete basis for the design of Business Data Architecture.
Data Modeling is actually a vast field but having a Common Data Model for a certain
domain
can answer problems with many different models operating in a homogeneous
environment.
To make Common Data Models, modelers need to focus on one standard of Data
Abstraction. They need to agree on certain elements to be concretely rendered so
uniformity
and consistency is obtained.
Generic patterns can be used to attain a Common Data Model. Some of these patterns
include using entities such as "party" to refer to persons and organizations, or
"product
types", "activity type", "geographic area" among others. Robust Common Data Models
explicitly include versions of these entities.
A good approach to Common Data Modeling is to a have a generic Data Model which
consists of generic types of entity like class, relationships, individual thing and
others. Each
instance of these classes can have subtypes.
The Common Data Modeling process may obey some of these rules (a small sketch follows the list):
1. Attributes are to be treated as relationships with other entities.
2. Entities are defined under the very nature of a Business Activity, rule, policy
or structure
but not the role that it plays within a given context.
3. Entities must have a local identifier in an exchange file or database. This identifier must be unique and artificial, and relationships should not be used as part of the local identifier.
4. Relationships, activities and the effects of events should not be represented by attributes but by entity types.
5. Types of relationships should be defined on a generic or high level. The highest
level is
defined as a relationship between one individual thing with another individual
thing.
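A hedged illustration of rules 1 to 4, using invented sample data: every thing receives an artificial local identifier, and a "works for" relationship is stored as an entity of its own rather than as an attribute of Person.

import itertools

_next_id = itertools.count(1)

def new_entity(entity_type, **facts):
    # Rule 3: an artificial, locally unique identifier for every entity.
    return {"id": next(_next_id), "type": entity_type, **facts}

michelle = new_entity("Person", name="Michelle")        # hypothetical data
acme = new_entity("Organization", name="Acme Toys")
# Rules 1 and 4: the relationship is itself an entity, not an attribute of Person.
employment = new_entity("Relationship", kind="works for",
                        from_id=michelle["id"], to_id=acme["id"])
print(employment)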
Data Modeling often uses the Entity-Relationship Model (ERM). This model is a
representation of structured data. This type of Data Modeling can be used to
describe any
ontology (the term used to describe the overview and classification of terms and
their
respective relationships) for a certain area of interest.
Common Data Modeling is one of the core considerations when setting up a business
data
warehouse. Any serious company wanting to have a data warehouse will have to be
first
serious about data models. Building a data model takes time and it is not unusual
for
companies to spend two to five years just doing it.
Data Models should reflect practical, real-world operations, and that is why a common data modeling method combining forward, reverse and vertical methods makes perfect sense: it seamlessly integrates disparate data coming in, whether top-down or bottom-up, from different sources and triggering events.
Business Architects, analysts and data modelers work together to look around and
look for
the best practices found in the industry. These best practices are then synthesized
into the
enterprise model to reflect the current state of the business and the future it
wants to get
into.
A good Enterprise Data Model should strike a balance between conceptual entities
and
functional entities based on practical, real and available industry standard data.
Conceptual entities are defined within the company and take on the data values defined by the company. Examples of conceptual entities are product status, marital status, customer types, etc.
On the other hand, functional entities refer to entities that are already well
defined, industry
standard data ready to be placed into database tables. Examples of functional
entities are
D&B Paydex Rating and FICO Score.
Businesses usually start simply and grow more complex as they progress. A business may start by selling goods or providing services to clients. The goods and services delivered, as well as the money received, are recorded and then reused. Over time, transactions pile up one over another and the set-up can get more and more complex. Despite the complexity, the business is still essentially a simple entity that has just grown in complexity.
This happens when the business does not have a very defined common data modeling
method. Many software applications could not provide ways to integrate real world
data and
data within the data architecture.
This scenario, where there is no common business model, can worsen when multiple disparate systems are used within the company and each system has differing views on the underlying data structures.
Business Intelligence can perform more efficiently with a Common Data Modeling Method. As its name implies, Business Intelligence processes billions of records from the data warehouse so that a variety of statistical analyses can be reported and recommendations on innovations that give the company a competitive edge can be presented.
With a Common Data Modeling Method, processes can be made faster because the internal structure of the data is closer to reality than it would be without a data model. It should be noted that the common set-up of today's business involves having data sources in as many geographical locations as possible.
Entity-Relationship
The Entity-Relationship or E-R model deals with real-world entities; it includes a set of objects and the relationships among them. An entity is an object that exists and is easily distinguishable from others. People, for example, can be distinguished from one another through various methods, for instance by social security numbers.
[Figure: http://www.learn.geekinterview.com/images/dm01.gif]
[Figure: http://www.learn.geekinterview.com/images/dm02.gif]
An entity can be concrete, like a book, person, or place, or it can be abstract, like a holiday. An entity set is a set of entities that share something; for example, the multiple holders of a bank account would be considered an entity set. Entity sets do not need to be disjoint.
Here is an example: the entity set employee (all employees of a bank) and the entity set customer (all customers of the bank) may have members in common, since an employee may also be a customer of the bank. This puts such a person into both sets. Employees who are customers will have both employee numbers and account numbers.
Relationships and Relationship Sets
A relationship is an association among several entities, and a relationship set is a collection of relationships of the same type.
One should remember that the role of an entity is the function it has in a relationship. Consider an example: the relationship 'works-for' could be defined as ordered pairs of employee entities. The first employee entity takes the role of a manager or supervisor, whereas the other takes the role of worker or associate.
Relationships can also have descriptive attributes. This can be seen in the example of a date (as in the last date of access to an account); this date is an attribute of the customer-account relationship set.
Attributes
A particular set of entities and the relationships between them can be defined in a number of ways. The differentiating factor is how you deal with the attributes. Consider a set of employees as an entity; this time let us say that the set's attributes are employee name and phone number.
In some instances the phone number should be considered an entity in its own right, with its own attributes being the location and the uniqueness of the number itself. Now we have two entity sets, and the relationship between them is defined through the phone number attribute. This defines the association not only between the employees but also between
the employee phone numbers. This new definition allows us to more accurately
reflect the
real world.
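As a small sketch of that second approach (names invented), the phone number becomes an entity with its own attributes, and Employee simply holds a relationship to it:

from dataclasses import dataclass, field

@dataclass
class PhoneNumber:               # phone number promoted to an entity
    number: str
    location: str                # attribute of the phone number itself

@dataclass
class Employee:
    name: str
    phones: list = field(default_factory=list)   # relationship to PhoneNumber

emp = Employee("Michelle")
emp.phones.append(PhoneNumber("555-0100", "desk"))
print(emp)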
Basically what constitutes an entity and what constitutes an attribute depends
largely on
the structure of the situation that is being modeled, as well as the semantics
associated
with the attributes in question.
The entities and the relationships between them can be represented graphically in an entity-relationship diagram. Yet this is only possible when we keep in mind what components are involved in creating this type of model.
Some of the different variations of the Entity-Relationship diagram you will see are:
* Diamonds are omitted - a link between entities indicates a relationship.
Fewer symbols means a clearer picture, but what happens to descriptive attributes? In this case we have to create an intersection entity to hold the attributes instead.
* Also, a range of numbers can be used to indicate the cardinality options of a relationship.
E.g. (0,1) indicates minimum zero (optional) and maximum one. We can also use (0,n), (1,1) or (1,n). This notation is typically placed at the near end of the link; it is confusing at first, but it gives us more information.
This allows us to map composite attributes and record derived attributes. We can
then use
subclasses and super classes. This structure is generalization and specialization.
Summary
Entity-Relationship diagrams are a very important data modeling tool that can help
organize
the data in a project into categories defining entities and the relationships
between entities.
This process has proved time and again to allow the analyst to create a nice
database
structure and helps to store the data correctly.
Entity
A data entity represents both real and abstract things about which data is being stored. The types of entities fall into classes (roles, events, locations, and concepts). These could be
employees, payments, campuses, books, and so on. Specific examples of an entity are
called instances.
Relationship
A relationship is a natural association that exists between one or more entities, such as employees processing payments. Cardinality is the number of occurrences of one entity for a single occurrence of the related entity; for example, an employee may process many payments, or might not process any, depending on the nature of their job.
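A tiny sketch of that cardinality, with made-up occurrences: one employee may be related to many payment occurrences, another to none.

payments_by_employee = {
    "E1": ["P10", "P11", "P12"],   # this employee processes many payments
    "E2": [],                      # this employee processes none
}

for emp, payments in payments_by_employee.items():
    print(emp, "processed", len(payments), "payments")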
Attribute
An attribute is a property or characteristic of an entity or a relationship, such as an employee's name or phone number.
Concept Oriented Model
The Concept Oriented model is built around a top concept structure and a bottom concept structure that complement each other. This structure constitutes a lattice, which is described as an order satisfying certain properties. Each item is defined as a combination of other super-items taken from the related super-concepts.
In simple words, the Concept Oriented model is based on lattice theory, an ordering of sets. Each element is defined as a combination of its super-concepts. The top concept structure provides the most abstract view, with no items, whereas the bottom concept structure is more specific and provides a much more detailed representation of the model.
Each step within a path from top to bottom normally has a name that corresponds to its concept. The number of such paths from the top of the model to the bottom of the model is its dimensionality. Each dimension corresponds to one variable or one attribute, and thus is supposed to be one-valued.
All one valued attributes are also directed upward within the structure. Yet if we
were to
reverse the direction of the dimensions then we would have what is often called
sub-
dimensions or reverse dimensions.
These dimensions also correspond to attributes or properties, but they take many values from the sub-concepts rather than the super-concepts to form normal dimensions.
After we have explored the dimensionality of the concept model structure we move
forward
to address the relations between the concepts.
When speaking of relations, each concept is related to its super-concepts, yet the concept also acts as a relation with regard to those super-concepts; being a relation is a relative role. More specifically, each item is an instance of a relation for the corresponding super-items, and at the same time it is an object linked to other objects by means of the relations among its sub-items. This brings us to grouping and aggregation.
Let's continue to think of these items in terms of relations. This way we can imagine that each item has a number of 'parents' from the super-concepts as well as a number of sub-items from the sub-concepts.
An item is interpreted as a group, set or category for its sub-items, yet it is also a member of the sets, groups and categories represented by its super-items. You can see the dual roles of the items when you consider them in this light.
Continuing to think of our items in this light, we can now see that each problem domain represented in a concept model has differing levels of detail.
We can easily indicate multiple source concepts and enforce input constraints upon them. The constraints are then propagated downward to the bottom level, which is the most specific level of all.
Once we have completed this step, the result is transported back up the chain of levels toward a target concept. We can then begin an operation of moving one of the source concepts downward, basically choosing one of its sub-concepts with more detail. This is called Drill Down. We also have an operation known as Roll Up, which is the process of moving up by selecting some super-concept with less detail.
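A rough sketch of Drill Down and Roll Up, using an assumed two-level product hierarchy: moving up to a super-concept aggregates the detail, and moving down exposes it again.

detail = {                               # bottom, most specific level
    ("Toys", "Dolls"): 120,
    ("Toys", "Blocks"): 80,
    ("Books", "Fiction"): 200,
}

def roll_up(rows):
    """Aggregate sub-concept values up to their super-concept."""
    totals = {}
    for (family, product), qty in rows.items():
        totals[family] = totals.get(family, 0) + qty
    return totals

def drill_down(rows, family):
    """Return the detailed sub-concept rows behind one super-concept."""
    return {k: v for k, v in rows.items() if k[0] == family}

print(roll_up(detail))            # {'Toys': 200, 'Books': 200}
print(drill_down(detail, "Toys"))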
With all this talk of constraints, you are probably wondering what they are as well. Simply put, for each concept we can indicate constraints that the corresponding items need to satisfy. This forces us to describe the actual properties by indicating a path resembling a zigzag pattern in the concept structure.
The zigzag path goes down when needed to get more detailed information, and it goes up to retrieve more general information. Using this we can easily express the constraints in terms of the other items and where they can be found.
Objects and references both have corresponding structural and behavioral methods. A simple consequence of having concepts is that objects are presented and accessed in an indirect manner; this is done by the concept using custom references with domain-specific structures as well as functions.
Now you might also want to know what a sub-concept and a super-concept are. Remember, there is always an upward-directed arrow in the concept graph from a sub-concept to any of its corresponding super-concepts. The sub-concept is therefore associated with the start of that arrow and the super-concept is associated with the end of that arrow.
Sub-concepts also have two parts; for example, Order Parts = <Products, Orders> or Order Operations = <Orders, Operations>.
So now we have dissected the main components of the Concept Oriented Model, which is clearly important for associating data with the concepts related to that data. We can now understand the uses and functions of this model with a bit more clarity.
The Concept Oriented model is complex and definitely worthy of further research. Anyone whose curiosity has been sparked by this article is encouraged to look further into the model, and perhaps even explore its additional functions and uses, since it can be applied to many situations.
Object-Relational Model
Some of the benefits that are offered by the Object-Relational Model include:
. Extensibility - Users are able to extend the capability of the database server; this can be done by defining new data types, as well as user-defined patterns. This allows the user to store and manage richer data.
. Complex types - Users can define new data types that combine one or more of the currently existing data types. Complex types provide greater flexibility in organizing data in a structure made up of columns and tables.
. Inheritance - Users are able to define objects or types and tables that inherit the properties of other objects, as well as add new properties that are specific to the object that has been defined.
Object-relational database management systems, also known as ORDBMS, add new and extensive object storage capabilities to the relational models at the center of the more modern information systems of today. These services assimilate the management of conventional fielded data, more complex objects such as time-series or detailed geospatial data, and varied binary media such as audio, video, images, and applets.
This is possible because the model encapsulates methods with data structures, so the ORDBMS server can execute complex analytical and data management operations to explore and change multimedia and other more complex objects.
What are some of the functions and advantages of the Object-Relational Model?
It can be said that the object-relational model is an evolutionary technology: this approach has taken on the robust transaction and performance management aspects of its predecessors and the flexibility of the object-oriented model (we will address this in a later article).
Database developers can now work with somewhat familiar tabular structures and data definitions but with more power and capabilities, all the while assimilating new object management possibilities. Also, the query and procedural languages and the call interfaces in object-relational database management systems are familiar.
The main function of the object-relational model is to combine the convenience of the relational model with the object model. Object-relational models allow users to define data types, functions, and operators. As a direct result, the functionality and performance of this model are optimized. The massive scalability of the object-relational model is its most notable advantage, and it can be seen at work in many of today's vendor products.
The History of the Object-Relational Model
The Object Oriented model moved into the spotlight in the 1990s; the idea of being able to store object-oriented data was a hit, but what happened to the relational data? Later in the 1990s the Object-Relational model was developed, combining the advantages of its most successful predecessors, such as user-defined data types, user-defined functions, and inheritance and sub-classes.
This model grew from the research conducted in the 1990s. The researchers' main goal was to extend the capabilities of the relational model by including object-oriented concepts. It was a success.
The easiest way to begin mapping between a persistent class and a table is one-to-one. In such a case, all of the attributes in the persistent class are represented by all of the columns of the table, and each instance of the business class is stored in a row of that table.
Because of this, two types of class-to-table modeling methods have been adopted by most users to help overcome the issues caused by differences between the relational and object models. The two methods are known as SUBSET mapping and SUPERSET mapping.
SUBSET mapping is used to create projection classes for tables with a sizable number of columns. A projection class contains just enough information to enable the user to choose a row for complete retrieval from the database.
This essentially reduces the amount of information passed through the network. This type of mapping can also be used to help map a class inheritance tree to a table by using filters.
Now let's consider SUPERSET mapping. With a persistent class, the superset mapping method holds attributes taken from columns of more than one table. This particular method of mapping is also known as table spanning.
Mapping using the SUPERSET method is meant to create view classes that sit over the underlying data model, or to map a class inheritance tree to a database by using a vertical mapping tactic.
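A hedged Python sketch of the two mappings, with assumed table and column names: SUBSET mapping builds a projection class from a few columns of one wide table, while SUPERSET mapping spans two tables to build a single view class.

from dataclasses import dataclass

@dataclass
class CustomerProjection:        # SUBSET: just enough columns to pick a row
    customer_id: int
    name: str

@dataclass
class CustomerAccountView:       # SUPERSET: attributes drawn from two tables
    customer_id: int             # from the assumed CUSTOMER table
    name: str                    # from CUSTOMER
    balance: float               # from the assumed ACCOUNT table

def load_projection(customer_row):
    """Build the projection class from a full CUSTOMER row (a dict here)."""
    return CustomerProjection(customer_row["customer_id"], customer_row["name"])

def load_view(customer_row, account_row):
    """Span CUSTOMER and ACCOUNT rows into one view object (table spanning)."""
    return CustomerAccountView(customer_row["customer_id"],
                               customer_row["name"],
                               account_row["balance"])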
There are many other aspects and advantages to this model. The Object-Relational model does what no other single model before it could do. By combining the strongest points of those that came before it, this model has surpassed expectations and taken on a definitive role in database technology. Whatever models follow it, this model is here to stay.
[Figure: http://www.learn.geekinterview.com/images/dm05a.png]
The Object model, also referred to as the object-oriented model, was designed to provide full-featured database programming capability while retaining native language compatibility. Typical applications include systems such as patient record systems, all of which have very complex relationships between data.
When you search the web for concrete information on the object model, don't be surprised to end up with mixed results, none of which plainly state 'Object Model'. You will instead turn up results for Document Object Models and Component Object Models.
This is because the Object Model has been modified just slightly to apply to
different
instances. We will touch on that before moving on.
So what exactly is a Document Object Model? You might see it often referred to as a DOM. This model is a platform- and language-neutral interface that allows programs or scripts to dynamically access and update the content, structure, and style of documents. The document can then be processed further and the results can be incorporated back into the contents of the page.
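As a small, concrete illustration of the DOM idea (using Python's standard xml.dom.minidom rather than a browser), a script can reach into the document tree, read an element's text and update it in place:

from xml.dom import minidom

doc = minidom.parseString("<page><title>Old title</title></page>")
title = doc.getElementsByTagName("title")[0]   # navigate the document tree
print(title.firstChild.data)                   # -> Old title
title.firstChild.data = "New title"            # update the content in place
print(doc.toxml())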
The Component Object Model, also referred to as COM, is basically a component software architecture that enables users to build applications and systems from components supplied by different software vendors. Component Object Models are the underlying design that forms the foundation for some higher-level software services, including those provided by OLE. Any PC user may be surprised to learn that COM is also known as ActiveX, a technology we are all familiar with, especially those of us who spend a lot of time surfing the internet.
Many traditional operating systems were designed to deal with only application binaries and not the actual components. Because of this, the benefits of good component-oriented designs have, until now, never gone beyond the compilation step. In a world that is object-centric it is confusing that our operating systems still cannot recognize objects. Instead our operating systems have been dealing only with application binaries, or EXEs. This prevented objects in one process from communicating with objects in a different process using their own defined methods.
History
The object model really hit the programming scene in the mid-1990s. Around October of 1998 the first specification of the Document Object Model was released by the W3C; it was known as DOM 1. DOM 2 followed in 2000, surpassing the older version by including specifics on the style sheet object model and style information manipulation. Most recently, DOM 3 wowed the programming world with its release in 2004. Thus far there have been no more current releases; as of now we are still using the DOM 3 model, and it has served us well.
The history of the Component Object Model is a bit lengthier; we will summarize its more dramatic points. DDE was one of the very first methods of inter-process communication. It allowed sending and receiving communications or messages between applications; this is also sometimes referred to as a conversation between applications. At this point it is important to point out that Microsoft is the leading Component Object Model vendor, and the history of COM is based richly on the information and discoveries made by Microsoft.
The budding technology of COM was the base of OLE, which means Object Linking and Embedding. This was one of the most successful technologies introduced with Windows. OLE was soon being added to applications like Word and Excel in 1991 and on into 1992. It was not until 1996 that Microsoft truly realized the potential of the discovery: OLE custom controls could expand a web browser's capability enough to present content.
From that point the vendor has been integrating aspects of COM into many of its applications, such as Microsoft Office. There is no way to tell how far or how long the evolution of object modeling will travel; we need only sit back and watch as it transforms our software and applications into tools that help us mold and shape our future in technology.
The first example we will cover is the Document Object Model. This example is a remake of a more detailed one; the information has been reduced in order to focus on the important features of the model. The example can be seen below:
[Figure: http://www.learn.geekinterview.com/images/dm06.png]
[Figure: http://www.learn.geekinterview.com/images/dm07.png]
By looking at the example provided above, we can clearly see the process in which the Document Object Model is used. The sample model is designed to show the way in which the document is linked to each element and the corresponding text that is linked to those elements.
Now we will take a quick look at a simple Component Object Model example. This particular example has been based on one of the models provided by Microsoft.
In the example above you see two different types of arrows. The solid arrows are used to indicate USES, whereas the dashed arrows represent OPTIONALLY USES. The boxes with green outlined text represent the aspects provided with WDTF. The blue highlighted text is your Implement or Modify example, the red expresses the implementation of your own action interface, and the text highlighted in black indicates your operating system or driver API. By viewing this sample of the Component Object Model, we can see how the components are linked and the way in which they communicate with one another.
So after exploring the Object model we can safely come to the conclusion that the
Object
model does serve an important purpose that no model before it was able to grasp.
Though
the model has been modified to fit with specific instances the main use is to model
object
data.
Microsoft is one of the more notable vendors that have put the Object Model in the limelight; it will be interesting to see what heights this model reaches with their assistance. I will be keeping a close eye out for the next evolutionary change in the Object Model.
The Associative data model is a database model unlike any of those we spoke of in prior articles. Unlike the relational model, which is record-based and deals with entities and attributes, this model works with entities that have a discrete, independent existence, and their relationships are modeled as associations.
The Associative model is based on a subject-verb-object syntax, with strong parallels to sentences built in English and other languages. Some examples of such phrases are:
. Cyan is a Color
. Marc is a Musician
. Musicians play Instruments
. Swings are in a Park
. A Park is in a City
(the verb in each phrase carries the association)
By studying the examples above it is easy to see that the verb is actually the means of association. The association's sole purpose is to identify the relationship between the subject and the object.
The Associative database has two structures: a set of items and a set of links that connect them together. In the item structure, each entry must contain a unique identifier, a type, and a name. Entries in the link structure must also have a unique identifier, along with identifiers for the related source, subject, verb, and object.
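A minimal sketch of those two structures, with invented identifiers: a set of items (each with a unique identifier, a type and a name) and a set of links whose verb carries the association. The source identifier mentioned above is omitted for brevity.

items = {                          # item id -> (type, name)
    1: ("Person", "Marc"),
    2: ("Profession", "Musician"),
    3: ("Verb", "is a"),
}
links = {                          # link id -> subject, verb, object item ids
    100: {"subject": 1, "verb": 3, "object": 2},
}

def render(link_id):
    link = links[link_id]
    return " ".join(items[link[k]][1] for k in ("subject", "verb", "object"))

print(render(100))   # -> Marc is a Musician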
The Associative model structure is efficient with storage, for there is no need to set aside space for data that is not yet available. This differs from the relational model structure, where a minimum of a single null byte is stored for missing data in any given row; some relational databases even set aside the maximum room for a specified column in each row.
The Associative database makes the storage of custom data for each user, or for other needs, clear-cut and economical when considering maintenance or network resources. When different kinds of data need to be stored, the Associative model is able to manage the task more effectively than the relational model.
With the Associative model there are entities and associations. An entity is identified as discrete and has an independent existence, whereas an association depends on other things. Let's try to simplify this a little before moving on.
Let's say the entity is an organization; the associations would be its customers and employees. It is possible for the entity to have many business roles at the same time, and each role would be recorded as an association. When circumstances change, one or more of the associations may no longer apply, but the entity will continue to endure.
The Associative model is designed to store metadata in the same structures where the data itself is stored. This metadata describes the structure of the database and how different kinds of data can interconnect. Simple data structures alone are not enough to deliver a database capable of storing the variety of data that a modern business requires, along with the protection and management that is important for internet implementation.
The Associative model is built from chapters, and the user's view of the content of the database is controlled by their profile, which is a list of chapters. When links exist between items in chapters inside and outside of a specific profile, those links will not be visible to the user.
A combination of chapters and profiles can simplify tailoring the database to specific users or even subject groups. Data that is relevant to one user group would remain unseen by another, and would be replaced by a different data set.
With the Associative model there are no records. When assembling all of the current information on a complex order, the data store needs to be revisited multiple times, which could pose a disadvantage. Some calculations suggest that an Associative database would need as many as four times the data reads of a relational database.
All changes and deletions in the Associative model are effected by adding links to the database. However, we must note that a deleted association is not actually deleted itself; rather, it is linked to an assertion that it has been deleted. Similarly, when an entity is renamed it is not actually renamed but rather linked to its new name.
In order to reduce the complexity that results from the parameterization required by heftier software packages, we can rely on chapters, profiles and database engines that expect the stored data to differ between individual entities or associations. To switch program functions on or off in a database, the use of 'flags' has become common practice.
Packages that are based on an Associative model would use the structure of the database along with the metadata to control this process. This can ultimately lead to the generalization of what are often lengthy and costly implementation processes.
Such a generalization would produce considerable cost reductions for users purchasing or implementing bigger software packages, and could reduce risks related to post-implementation changes as well.
How well does the Associative Model suit the demands of data?
Some ask if there is still an ongoing demand for a better database. Honestly, there will always be that demand. The weaker points of the current relational model are now apparent because the character of the data we need to store keeps changing. Binary structures that support multimedia have posed real challenges for relational databases, in the same way that object-oriented programming methods did.
When we look back on object databases we can see that they have not conquered the market, and neither have their cousins, the hybrid relational products with their object extensions.
So will the Associative model solve some of the issues surrounding the relational model? The answer is not entirely clear. Though it may resolve some issues, it is not completely clear how efficiently the model will manage when set against bigger binary blocks of data.
Areas of the Associative database design do seem simpler than the relational model's; still, as we have pointed out, there are also areas that call for careful attention. Issues related to the creation of chapters remain daunting at best.
Even so, if the concept of the Associative model proves itself to be genuinely feasible and able to produce a new and efficient database, then others could bring to life products built upon its base ideas.
There is definitely an undeniable demand for a faster operating database model that
will
scale up to bigger servers and down to the smaller devices. It will be an
interesting journey
to witness; I personally would like to see if the future databases built using this
model can
make their mark in the market.
The term Hierarchical Model covers a broad concept spectrum. It often refers to set-ups such as multi-level models where there are various levels of information or data, all related by some larger form.
You can see from the above figure that the supplementing information or details branch out from the main or core topic, creating a 'tree'-like form. This allows for a visual relationship of each aspect and enables the user to track how the data is related.
There are many other ways to create this type of model, this is one of the simplest
and is
used the most often.
An example of information you would use the Hierarchical model to record would be
the
levels within an organization, the information would flow such as:
[Figure: http://www.learn.geekinterview.com/images/dm09.png]
So the Hierarchical model for this scenario would look much like the one below. As you can see this model is substantially larger; the benefit of the Hierarchical model is that it allows for continuous growth, though it can take up a lot of room.
With each addition of data a new branch of the 'tree' is formed, adding to the information as a whole as well as to the size.
Hierarchical models allow for a visual parent/ child relationship between data
sets,
organizational information, or even mathematics.
The idea for these models is to begin with the smallest details; in the example above that would be the sections.
From the smallest details you would move up (it is often easiest to think of the model as a hierarchy) to the subdivisions, above the subdivisions you find departments, and finally you end at one 'parent', the organization.
Once finished you can sit back and view the entire 'family' of data and clearly distinguish how it is related.
The first mainframe database management systems were essentially the birth place of
the
Hierarchical model.
The hierarchical relationships between varying data made it easier to seek and find
specific
information.
Though the model is ideal for viewing relationships concerning data, many applications no longer use it. Still, some are finding that the Hierarchical model is ideal for data analysis.
Perhaps the most well known use of the Hierarchical model is the Family Tree, but
people
began realizing that the model could not only display the relationships between
people but
also those between mathematics, organizations, departments and their employees and
employee skills, the possibilities are endless.
Simply put, this type of model displays hierarchies in data starting from one 'parent' and branching into other data according to its relation to the previous data.
Though the Hierarchical model is rarely used today, some of its few uses include file systems and XML documents. The tree-like structure is ideal for relating repeated data, and though it is not currently applied often, the model can be applied to many situations.
The Hierarchical model can present some issues when focusing on data analysis. There is the issue of independence of observations: when data is related it tends to share some type of background information linking it together, and therefore the data is not entirely independent.
Though the tree like structure is perhaps the simplest and also the most desirable
form for
new users there are other types or structures for this model.
A hierarchy can also be structured as an outline or indented list; this form can be found in the nesting of XML documents. The example below presents information similar to that shown above, but uses indentation rather than the tree form.
. ORGANISATION
  o Department 1
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 3
      . Section 1
      . Section 2
      . Section 3
  o Department 2
    . Subdivision 1
      . Section 1
      . Section 2
      . Section 3
    . Subdivision 2
      . Section 1
      . Section 2
      . Section 3
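The same organization hierarchy can be sketched as a nested structure in code (Python here), where each node has exactly one parent, any number of children, and new branches can be added at any time:

organisation = {
    "Department 1": {
        "Subdivision 1": ["Section 1", "Section 2", "Section 3"],
        "Subdivision 2": ["Section 1", "Section 2", "Section 3"],
    },
    "Department 2": {
        "Subdivision 1": ["Section 1", "Section 2", "Section 3"],
    },
}

# Adding a new branch (category) later is a single assignment:
organisation["Department 2"]["Subdivision 2"] = ["Section 1"]
print(organisation)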
One thing you must keep in mind at all times is that no matter what type of
structure you
use for the model you need to be able to add categories at any time, as well as
delete them.
http://www.learn.geekinterview.com/images/dm10.png
An idea to ensure that this is possible is to use a list view or tree view with
expandable and
collapsible categories.
You can also use the model in a visual form, something involving a cylinder or
pyramid or
even a cube, this visual presentation of the data would be most suitable for a
presentation
of data to a group of professionals.
This form would be better for smaller, less detailed levels. There is an example below using some of the same information from above, but shown more compactly.
There are various structures for the Hierarchical Model; in fact there are many more than those shown here.
The type you use all depends on the data you are using. The methods differ
according to
whether your data is people related, mathematical related, or just simple
statistics.
To summarize the Hierarchical model:
1. The model expresses the relationships between pieces of information: how they are related and what they are most closely related to.
2. The model has many different structures and forms. Each is best used depending on the type of data being recorded, the amount of data being recorded, and who it is being recorded for.
3. Speaking in parent/child terms, data can have many children but only one parent.
4. The model begins with core data and branches off into supplementing or smaller related data.
5. One must remember to start with the smallest detail and work their way up.
If you keep to these simple and compacted guidelines your own Hierarchical Model
will be
successful, clean, clear, and well built. The point is to present information in a
simple and
easy to read manner.
The Multi-Dimensional Model
Because OLAP is online, it provides information quickly, and iterative queries are often posed during interactive sessions. Because of the analytical nature of OLAP, the queries are often complex. The multi-dimensional model is used to answer this kind of complex query. The model is important because it brings simplicity.
This helps users understand the databases and enables software to plot a course
through
the databases effectively.
The analysts know what measures they want to see, what dimensions and attributes make the data important, and in what ways the dimensions of their work are organized into levels and hierarchies.
Let us touch on what the logical cubes and logical measures are before we move on
to more
complicated details.
Logical cubes are designed to organize measures that have the same exact
dimensions.
Measures that are in the same cube have the same relationship to other logical
objects;
they can easily be analyzed and shown together.
With logical measures, the cells of the logical cube are filled with facts collected about an organization's operations or functions. The measures are organized according to the dimensions, which typically include a time dimension.
Analytic databases contain summaries of historical data, taken from data in legacy systems as well as from other data sources such as syndicated sources. The normally accepted amount of historical data for analytic applications is about three years' worth.
The measures are static and are trusted to be consistent while they are being used to help make informed decisions. They are updated often; most applications update data by adding to the dimensions of a measure. These updates give users a concrete historical record of a specific organizational activity for an interval, which is very productive.
The lowest level of a measure is called the grain. Often this level of data is never seen, yet it has a direct effect on the type of analysis that can be done. This level also determines whether the analysts can obtain answers to questions such as: when are men most likely to place orders for custom purchases?
Logical cubes and measures were relatively simple and easy to digest. Now we will consider logical dimensions, which are a little more complex. Dimensions have a unique set of values that define and categorize the data.
These form the edges of the logical cubes, and through this the measures inside the cubes as well. The measures themselves are usually multi-dimensional; because of this, a value within a measure should be qualified by a member of every dimension in order to be meaningful.
A hierarchy is a way of organizing data at each level of aggregation. When looking at data, developers use hierarchical dimensions to identify trends at a specific level, drill down to lower levels to see what is causing such trends, and roll up to higher levels to view how those trends affect the larger sections of the organization.
Back to the levels: each level represents a position in the hierarchy, and the levels above the most detailed level contain aggregated values for the levels beneath them.
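A compact sketch of a logical cube, with invented products and a single month-to-quarter hierarchy level: each cell of the sales_qty measure is qualified by one member of every dimension, and the hierarchy lets us aggregate the detailed cells up a level.

sales_qty = {                      # (product, month) -> measure value
    ("Dolls", "Jan"): 120,
    ("Dolls", "Feb"): 90,
    ("Blocks", "Jan"): 60,
}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}   # one hierarchy level

def roll_up_to_quarter(cells):
    """Aggregate the measure from the month level to the quarter level."""
    out = {}
    for (product, month), qty in cells.items():
        key = (product, month_to_quarter[month])
        out[key] = out.get(key, 0) + qty
    return out

print(roll_up_to_quarter(sales_qty))   # {('Dolls', 'Q1'): 210, ('Blocks', 'Q1'): 60}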
Finally to wrap up this section we will take a quick look at Logical Attributes. By
now we
should all know that an attribute provides extra information about the data.
Some of these attributes are used simply for display. You can have attributes like flavor, color and size; the possibilities are endless.
It is this kind of attribute that can be helpful in data selection and also in
answering
questions.
Examples of the type of questions that attributes can help answer are: what colors are most popular in abstract paintings? What flavor of ice cream do seven-year-olds prefer?
We also have time attributes, which can give us information about the time dimensions we spoke of earlier; this information can be helpful in some kinds of analysis. Such analysis can include indicating the last day or the number of days in a time period.
That pretty much wraps it up for attributes at this point. We will revisit the
topic a little
later.
Variables
Now we will consider the issue of variables. A variable is basically a value table
for data,
which is an array with a specific type of data and is indexed by a particular list
of
dimensions. Please be sure to understand that the dimensions are not stored in the
variable.
Each combination of dimension members defines a data cell. This is true whether a value for that cell is present or not. Therefore, if data is missing, the fact of the absence can either be included in or excluded from analysis.
When variables share identical dimensions, they form a logical cube. With that in mind, if you change a dimension, for example by adding time periods to the time dimension, then the variables change as well to include the new time periods, even if the other variables have no data for them.
Variables that share dimensions can be manipulated in an array of ways, including aggregation, allocation, modeling, and calculation. These are, more specifically, numeric calculations, and this is an easy and fast method in the analytic workplace. We can also use variables to store measures.
In addition to using variables to store measures, they can be used to store attributes as well, though there are major differences between the two. While measures are multi-dimensional, an attribute is tied to a single data dimension, and it gives us information about each dimension member no matter what level it inhabits.
Throughout our journey of learning about the different types of data models, I think that the multi-dimensional model is perhaps one of the most useful. It takes key aspects from other models like the relational model, the hierarchical model, and the object model, and combines those aspects into one capable database with a wide variety of possible uses.
Network Model
Oddly enough, the Network model was designed to do what the Hierarchical model could not. Though both show how data is related, the Network model allows data to have not only many children but also many parents, whereas the Hierarchical model allowed for only one parent with many children. With the Network model, data relationships must be predefined.
It was in 1971 that the Conference on Data Systems Languages, or CODASYL, formally defined the Network model. This is essentially how CODASYL defined it:
The central data modeling construct in the network model is the set. A set consists of an owner record type, a set name, and a member record type.
A member record type can have the same role in more than one set, which is how the multi-parent concept is supported. An owner record type can also be a member or owner in another set.
The data model is an uncomplicated system; link and intersection record types (often referred to as junction records) may also exist, as well as additional sets between them.
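A rough sketch of the set idea, with invented set and record names: because a record type can be a member of more than one set, a single record occurrence effectively gains several 'parents'.

sets = {
    # set name:     (owner record type, member record type)
    "DEPT-EMP":     ("Department", "Employee"),
    "PROJECT-EMP":  ("Project", "Employee"),   # Employee is a member in two sets
}

# One record occurrence linked into an occurrence of each set, giving it
# effectively two parents (a department and a project).
employee_42 = {
    "name": "Michelle",
    "DEPT-EMP": "Accounts",        # owner occurrence in the first set
    "PROJECT-EMP": "Toy Launch",   # owner occurrence in the second set
}
print(employee_42)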
The most notable advantage of the Network model is that in comparison with the
Hierarchical model it allows for a more natural avenue to modeling relationships
between
information. Though the model has been widely used it has failed to dominate the
world of
data modeling.
For a while, the performance benefits of the lower-level navigational interfaces used with the Hierarchical and Network models were well suited to most large applications.
Yet as hardware advanced and became faster the added productivity and flexibility
of the
newer models proved to be better equipped for the data needs.
Soon the Hierarchical and Network models were all but forgotten in relation to
corporate
enterprise usage.
The Open Systems Interconnection or OSI model was created to serve as a tool for describing the various hardware and software components that can be found in a network system.
Over the years we have learned that this is particularly useful for educational purposes, and for expressing the full details of what needs to occur for a network application to be successful.
This particular model consists of seven separate layers, with the hardware placed
at the
very bottom, and the software located at the top.
This process could easily be compared to that of reading an email. Imagine Column
#1 and
#2 as computers when exploring the figure below:
[Figure: http://www.learn.geekinterview.com/images/dm12.png]
The first layer, which is clearly labeled as the Physical layer, is used to describe components such as internal voltage levels; it is also used to define the timing for the transmission of individual bits.
The next layer is the Data Link layer, the second listed in the example above. It relates to the sending of small amounts of data, often a byte at a time, and is also often responsible for error correction.
The Network layer follows the Data Link layer and defines how to transport the message through and within the network. If you stop a moment and think of this layer as one working with an internet connection, it is easy to imagine that it would be used to add the correct network address.
Next we have the Transport layer. This layer is designed to divide data into smaller segments, or, if needed, to recombine them into a larger, more complete set. The Transport layer also deals with data integrity; this process often involves a checksum.
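As a toy illustration of that checksum idea (not any real transport protocol), the sender can attach a simple byte sum and the receiver can recompute it to detect corruption:

def attach_checksum(payload: bytes) -> bytes:
    # Append one checksum byte: the sum of all payload bytes modulo 256.
    return payload + bytes([sum(payload) % 256])

def is_intact(segment: bytes) -> bool:
    payload, checksum = segment[:-1], segment[-1]
    return sum(payload) % 256 == checksum

segment = attach_checksum(b"hello")
assert is_intact(segment)
assert not is_intact(segment[:-1] + b"\x00")   # a corrupted segment is detected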
Following the Transport layer we find the Session layer, which is related to issues that go further than, or are more complicated than, a single set of data. More to the point, the layer is meant to address resuming transmissions that have been prematurely interrupted or somehow corrupted by some kind of outside influence. This layer also often maintains long-term connections to other remote machines.
Following the Session layer we find the Presentation layer. This layer acts as an application interface so that syntax formats and codes are consistent between two networked or connected machines.
The Presentation layer is also designed to provide subroutines that the user may call on to access network functions and to perform operations like encrypting or compressing data.
Finally we have the Application layer. This layer is where the actual user programs
can be
found. In a computer this could be as simple as a web browser surprisingly enough,
or it
could serve as a ladder logic program on a PLC.
After reading this article it is not hard to see the big differences between the
Hierarchical
Model and the Network Model. The network model is by far more complicated and deals
with
larger amounts of information that can be related in various and complicated ways.
This model is more useful because the data can have many-to-many relationships instead of being restricted to a single-parent structure, which is how the Hierarchical Model works with data.
Though the Network model has been officially replaced by the more accommodating
Relational Model, for me it is not hard to imagine how it can still be used today,
and may
very well still be being used by PCs around the globe when I think of the Network
Model in
relation to how we email one another.
After reviewing the information and investigating the facts of the Network model, I have come to the conclusion that it is a sound and relatively helpful model, if a bit complicated. Its one major downfall is that the data relationships must be predefined; this adds restrictions and is why a more suitable model was needed for more advanced data. Ultimately this one restriction led to the model's untimely replacement within the world of data analysis.
The Relational Model is a clean and simple model that uses the concept of a relation represented as a table rather than a graph or shapes. The information is put into a grid-like structure that consists of columns running up and down and rows running from left to right, where information can be categorized and sorted.
The columns contain attributes such as name, age, and so on. Each row contains all the data for a single instance of the table, such as a person named Michelle.
In the Relational Model, every row must have a unique identification, or key, used to locate its data. Often, keys are used to join data from two or more relations based on matching values.
Social Security Number | Name     | Date of Birth   | Annual Income | Dependents | Mother's SSN  | Father's SSN
000-00-0001            | Michael  |                 | 78,510        |            |               |
000-00-0002            | Grehetta | March 5th, 1952 |               | 0          |               |
000-00-0003            | Michelle |                 | 39,000        |            | M-000-00-0002 | F-000-00-0001
The Relational Model also often includes a concept commonly known as foreign keys. Foreign keys are primary keys from one relation that are kept in another relation to allow for the joining of data.
An example of foreign keys is storing your mother's and father's social security
number in
the row that represents you. Your parents' social security numbers are keys for the
rows
that represent them and are also foreign keys in the row that represents you. Now
we can
begin to understand how the Relational Model works.
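As a rough illustration of that family example, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names, and the assumption that Grehetta is the mother and Michael the father are choices made for the sketch, not something the original spells out.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    ssn         TEXT PRIMARY KEY,                 -- unique key for each row
    name        TEXT,
    mother_ssn  TEXT REFERENCES person(ssn),      -- foreign keys back into
    father_ssn  TEXT REFERENCES person(ssn)       -- the same relation
);
INSERT INTO person VALUES ('000-00-0001', 'Michael',  NULL, NULL);
INSERT INTO person VALUES ('000-00-0002', 'Grehetta', NULL, NULL);
INSERT INTO person VALUES ('000-00-0003', 'Michelle', '000-00-0002', '000-00-0001');
""")

# Join the relation to itself to find Michelle's parents.
rows = conn.execute("""
    SELECT child.name, mother.name, father.name
    FROM person AS child
    JOIN person AS mother ON child.mother_ssn = mother.ssn
    JOIN person AS father ON child.father_ssn = father.ssn
""").fetchall()
print(rows)   # [('Michelle', 'Grehetta', 'Michael')]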
Like most other things, the Relational Model was born of someone's need. In 1969 Dr. Edgar F. Codd published the first description of the relational model; though it was meant to be no more than an internal report for IBM, it swept across data analysis unlike anything before it.
Codd's paper was primarily concerned with what later came to be called the
structural part
of the relational model; that is, it discusses relations per se (and briefly
mentions keys), but
it does not get into the relational operations at all (what later came to be called
the
manipulative part of the model).
Codd's creation was a breath of fresh air for those digging through data banks, trying to categorize and define data. When he invented this model he truly may not have foreseen what an incredible impact it would have on the world of data.
Some believe there is a great deal of room for improvement where the Relational Model is concerned, and it may be a surprise to find that not everyone supported the relational model. There have been claims that the rectangular tables do not allow for large amounts of data to be recorded.
Take the example of apples and oranges: both are fruits and therefore related in that way, but apples have different attributes than oranges. At times a user may want to see only one or the other; at other times they may want to view both. Handling this type of data with the relational model can be very tricky.
We are beginning to hear more and more about the need for a better model, a more adequate structure, yet no one has been able to design something that can truly hold its own against the Relational Model.
True, the model could use a bit of tweaking and leaves a little to be desired, but what would the perfect model be? What could we use that would apply to as many situations as the Relational Model and still surpass its usefulness?
The Relational Model has survived through the years; though there are those who are always trying to construct something more efficient, it has managed to come out the victor thus far. One reason may be that the structure is big enough to be worthy of optimizing. Another notable reason is that the relational operations work on sets of data objects, which seems to make it a reasonably adequate model for remote access. Finally, it is a clean and concise model that does not encourage design extravagance, or, as it is sometimes phrased, "design cuteness."
Some prefer the clean and simple style that the Relational Model offers; they can easily do without colorful shapes and stylish layouts, wanting nothing more than the clear-cut facts and relevant information.
Here are a few of the more obvious and noted advantages of the Relational Model:
. Allows for data independence. This helps to provide a sharp and clear boundary between the logical and physical aspects of database management.
. Simplicity. The model provides a simpler structure than those that came before it. A simple structure is easy to communicate to users and programmers, and a wide variety of users in an enterprise can interact with a simple model.
. A good theoretical background. The model provides a theoretical foundation for the database management field.
Do not be surprised to find that these are nearly the very same advantages that Dr. Codd listed at the debut of this model. It is obvious that he was right, and these advantages have been restated again and again since the first publication of his report.
There has been no other model brought into view that has had the advantages of the
Relational Model, though there have been hybrids of the model, some of which we
will
discuss in later articles.
We began with the Hierarchical Model. This model allowed us to arrange our data in terms of relations, somewhat like a hierarchy, showing a parent/child type of relation. Its one big downfall is that each "child" could have only one parent, while a parent could have many children.
This model served us well in its time of glory, and there are still systems using it now, though they trust their heftier loads of data to better-equipped models.
Following the Hierarchical Model we investigated the Network Model. This model was closely akin to the Hierarchical Model in that it too allowed for a parent/child view of data; its main advantage over the previous model was that it allowed for many-to-many relationships between data sets.
Still, the data had to be predefined. Though some forms of this model are still used today, it has become somewhat obsolete.
Finally we come to our current model, the Relational Model. Like those before it, it too expresses the relationships between data, only it allows for larger input and does not have to be predefined. This model allows users to record and relate large amounts of data.
This model also allows for multi-level relationships between data sets, meaning they can be related in many ways, or in only one way. It is easy to understand how this model has managed to outlive those that came shortly before it: it is versatile, simple, clean in structure, and applicable to nearly every type of data we use.
The semi-structured data model is a data model where the information that would normally be connected to a schema is instead contained within the data; this is often referred to as a self-describing model.
With this type of database there is no clear separation between the data and the schema, and the degree to which it is structured depends on the application being used. Certain forms of semi-structured data have no separate schema at all, while others have a separate schema that places only loose restrictions on the data.
It is natural to model semi-structured data as graphs whose labels give semantics to the underlying structure. Databases of this type subsume the modeling power of extensions of flat relational databases, of nested databases that enable the encapsulation of entities, and of object databases, which also allow cyclic references between objects.
Semi-structured data has only recently come into view as an important area of study, for various reasons. One reason is that there are data sources, like the World Wide Web, which we often treat as a database but which cannot be constrained by a schema. Another is that it can be advantageous to have a very flexible format for data exchange between dissimilar databases. Finally, even when dealing with structured data it may still be helpful to view it as semi-structured for the purpose of browsing.
We are familiar with structured data, which is data that has been clearly formed, formatted, modeled, and organized into conventions that are easy for us to work with and manage. We are also familiar with unstructured data.
Unstructured data comprises the bulk of information that does not fit neatly into a database. The most easily recognized form of unstructured data is the text of a document, like this article.
What you may not have known is that there is a middle ground for data; this is the data we refer to as semi-structured. These are data sets in which some implied structure is usually followed, but which are still not structured consistently enough to meet the criteria for the kinds of management and automation normally applied to structured data.
We deal with semi-structured data every day, in both technical and non-technical environments. Web pages follow certain distinctive forms, and the content embedded within the HTML usually carries some amount of metadata within the tags. Details about the data are implied immediately when using this information. This is why semi-structured data is so intriguing: though there is no set formatting rule, there is still enough consistency that interesting information can be extracted.
Typical uses of the semi-structured model include:
. Representation of information about data sources that normally cannot be constrained by a schema.
. A flexible format for data exchange among dissimilar kinds of databases.
The most important trade-off in using a semi-structured database model is quite possibly that queries cannot be executed as efficiently as in more constrained structures, such as the relational model.
Normally the records in a semi-structured database are stored with unique IDs that are referenced with pointers to their location on disk. Because of this, navigational or path-based queries are very efficient, but searches across large numbers of records are less practical, because the system is forced to seek across various regions of the disk by following the pointers.
We can clearly see that there are some disadvantages to the semi-structured data model, as there are with all other models, so let's take a moment to outline a few of them.
Storage: Transfer formats like XML are universally text-based, typically Unicode; they are prime candidates for transferring data, yet not so well suited for storing it. The representations are instead stored by underlying, accessible systems that support such standards.
In short, much of the academic, open source and other attention directed at these particular issues has stayed at a surface level of resolving representations, definitions, or even units.
Conclusion
We have researched many areas of the semi-structured data model, including the differences between structured data, unstructured data, and semi-structured data. We have also explored the various uses for the model.
After looking at the advantages and the disadvantages, we are now educated enough about the semi-structured model to make a decision regarding its usefulness. This model is worthy of more research and deeper contemplation; the flexibility and diversity that it offers are more than praiseworthy.
After researching, one can see many conventional and non-conventional uses for this model in our systems. An example of a semi-structured data model is depicted below.
http://www.learn.geekinterview.com/images/dm13.png
At the end of each arrow you can find the corresponding information. This model example expresses information about this article: the title of the article, which is The Semi-Structured Data Model; the year in which the article was written, which is 2008; and finally who the author is. As you can see from the example, this data model is easy to follow and useful when dealing with semi-structured information like web pages.
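As a rough sketch of the same idea in code, the records below carry their own field names, so the "schema" travels with the data itself; the field names are illustrative, and the author is left as a placeholder because the article does not name one.

# Each record describes itself; no external schema is consulted.
article = {
    "title": "The Semi-Structured Data Model",
    "year": 2008,
    "author": "(author name)",          # placeholder; the source gives no name
}

# A second record need not share the same fields -- that is the point.
another = {
    "title": "Some Other Article",
    "keywords": ["semi-structured", "data model"],
}

for record in (article, another):
    print(sorted(record.keys()))        # the record itself tells us its fields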
Star Schema
What is the Star Schema?
The Star Schema is basically the simplest form of data warehouse schema. It is made up of fact tables and dimension tables. We have covered dimension tables in previous articles, but the concept of fact tables is fairly new.
The two kinds of tables differ only in the way they are used in the schema. They actually share the same structure, and the same SQL syntax is used to create them.
Interestingly enough, in some schemas a fact table can also play the role of a dimension table under certain conditions, and vice versa. Though they may be physically alike, it is vital that we also understand the differences between fact tables and dimension tables.
A fact table in a sales database, used with the star schema, could deal with the revenue for an organization's products from each customer in each market over a period of time. A dimension table in the same database would define the organization's customers, the markets, the products, and the time periods that are found in the fact tables.
When a schema is designed well it will offer dimension tables that enable the user to leaf through the database and get comfortable with the information it contains. This helps the user when they need to write queries with constraints, so that the information that satisfies those constraints is returned from the database.
Star Schema Important Issues
As with any other schema, performance is a big deal with the Star Schema. Decision support is particularly important; users utilize such systems to query large quantities of data, and star schemas happen to perform best for decision support applications.
Another issue that is important to mention is the roles that fact and dimension tables play in a schema. When considering the physical database, the fact table is essentially a referencing table, whereas the dimension table plays the role of a referenced table.
http://www.learn.geekinterview.com/images/dm14a.png
We can correctly come to the conclusion that a fact table has foreign keys that reference other tables, and a dimension table is referenced by the foreign keys of one or more tables. Tables that reference or are referenced by other tables have what is known as a primary key. A primary key is a column or set of columns whose contents uniquely identify the rows. With simple star schemas, the fact table's primary key is typically composed of multiple foreign keys.
A foreign key is a column or group of columns in one table whose values are identified by the primary key of another table. When a database is developed, the statements used to create the tables should specify the columns that are meant to form the primary keys as well as the foreign keys. Below is an example of a Star Schema.
. Bold italic column names indicate a primary key that is also a foreign key to another table.
Let's point out a few things about the Star Schema above:
. Items listed in the boxes are columns in the tables with the same names as the box names.
. The foreign key columns are in italic text (you can see that the primary key from the green Dimension box is also a foreign key in the orange box, and the primary key from the turquoise box is also a foreign key in the orange box).
. Columns that are part of both the primary key and a foreign key are labeled in bold italic text, like key 1 in the orange box.
. The foreign key relationships are identified by the lines that connect the boxes representing the tables.
Even though a primary key value must be unique among the rows of a dimension table, that value can occur many times in a foreign key column of a fact table, as in a many-to-one relationship. This many-to-one relationship exists between the foreign keys of the fact table and the primary keys they refer to in the dimension tables.
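A minimal sketch of such a schema, using Python's sqlite3 module and the sales example from earlier, might look like the following; all table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, sale_date     TEXT);

CREATE TABLE fact_sales (
    customer_id    INTEGER REFERENCES dim_customer(customer_id),
    product_id     INTEGER REFERENCES dim_product(product_id),
    time_id        INTEGER REFERENCES dim_time(time_id),
    sales_quantity INTEGER,                          -- the measure
    revenue        REAL,
    PRIMARY KEY (customer_id, product_id, time_id)   -- compound key of foreign keys
);
""")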
The star schema can hold many fact tables as well. Multiple fact tables are present because they hold unrelated facts, like invoices and sales. In some situations multiple fact tables are present simply to support performance. You can see multiple fact tables serving this purpose when they are used to support levels of summary data, more specifically when the amount of data is large, as with daily sales data.
Referencing tables are also used to define many-to-many relationships between dimensions. Such a table is usually referred to as an associative table or a cross-reference table. This can be seen at work in the sales database as well: each product belongs to one or more groups, and each of those groups also contains many products.
A rough example of this would be, yet again, the sales database: as we said before, each product is in one or more groups and each of those groups has multiple products, which is a many-to-many relationship.
When designing a schema for a database we must keep in mind that the design affects
the
way in which it can be used as well as the performance.
Due to this fact, it is vital to make a preliminary investment of time and research in the design, so that the database is beneficial to the needs of its users. Let's wrap things up with a few suggestions about things to consider when designing a schema:
. What is the function of the organization? Identify what the main processes are
for the
organization; it may be sales, product orders, or even product assembly, to name a
few.
This is a vital step; the processes must be identified in order to create a useful
database.
. What is meant to be accomplished? As with all databases, a schema should reflect the organization, in what it measures as well as what it tracks.
. Where is the data coming from? It is imperative to consider the projected input data; its sources will reveal whether the existing data can support the projected schema.
. Will there be dimensions that may change in time? If the organization contains dimensions that change often, then it is better to treat them as facts rather than store them as dimensions.
. What is the level of detail of the facts? Each row should contain the same kind of data. Differing data would be addressed with a multiple fact table design, or by modifying the single table so that a flag identifying the differences can be stored with the data. You also want to consider the amount of data, the space, and the performance needs.
. If there are changes, how will they be addressed, and how significant is historical information?
XML Database
The XML database is most commonly described as a data persistence system that
enables
data to be accessed, exported, and imported. XML stands for Extensible Markup
Language.
XML itself is a meta-markup language that was developed by the W3C to handle the inadequacies of HTML. The HTML language had begun to evolve quickly as more functionality was added to it. Soon there was a need for a domain-specific markup language that was not weighed down by the unnecessary baggage of HTML, and thus XML was brought to life.
XML and HTML are indeed very different. The biggest way in which they differ is that where in HTML the semantics and syntax of tags are fixed, in XML the creator of the document is able to produce tags whose syntax and semantics are particular to the intended application. The semantics of a tag in XML depend on the application that processes the document. Another difference between XML and HTML is that an XML document has to be well formed.
XML's original purpose may have been to mark up content, but it did not take long for users to realize that XML also gave them a way to describe structured data, which in turn made XML significant as a data storage and exchange format as well.
Here are a few of the advantages that the XML data format has:
. Built-in support for internationalization, thanks to its use of Unicode.
. A human-readable format that makes it easier for developers to trace and repair errors than with preceding data storage formats.
. A great quantity of off-the-shelf tools for processing XML documents are already available.
A Native XML Database, or NXD, defines a model for an XML document rather than for the data in the document; it stores and retrieves documents according to that model. At the very least the model will consist of elements, attributes, and document order. The NXD has the XML document as its fundamental unit of storage.
The database is also not obligated to have any specific underlying physical storage model. It can be built on a hierarchical, relational, or even an object-oriented database, all of which we have explored in detail. It can also use a proprietary storage format such as indexed or compressed files. So we can gather from this information that the database is unique in storing XML data: it stores all aspects of the XML model without breaking it down.
We have also learned that NXDs are not always independent databases, and they are not meant to replace actual databases; they are a tool used to aid the developer by providing robust storage and management of XML documents.
Not all such databases are the same, yet there are enough similar features between them to give us a rough idea of the basic structure. Before we continue, let us note that the native XML database is still evolving and will continue to do so for some time.
One feature of the database is XML storage. It stores documents as a unit and creates models that are closely related to XML or a related technology such as the DOM, covering both document content and semi-structured data. Mapping is used to ensure that XML's unique model of the data is preserved.
After the data is stored, the user will need to continue to use the NXD tools. Trying to access the data tables using SQL is not as useful as one would think; this is because the data viewed would be the model of an XML document, not the entities that the data depicts.
It is important to note that the business entity model lives within the XML document domain, not the storage system; in order to work with the actual data you will have to work with it as XML.
Another feature of the database worth mentioning is queries. Currently XPath is the query language of choice. To function as a database query language, XPath is extended somewhat to allow queries across collections of documents. On a negative note, XPath was not created to be a database query language, so it is not a perfect fit for that role.
In order to improve the performance of queries, NXDs support the creation of indexes on the data stored in the collections. The indexes can be used to improve the speed of query execution. The fine points of what can be indexed and how the index is created vary between products.
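As a rough illustration of XPath-style querying, here is a minimal sketch using Python's xml.etree.ElementTree module, which supports only a subset of XPath; the document and its tags are invented for the example.

import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><item sku="A1" qty="2"/></order>
  <order id="2"><item sku="B7" qty="1"/></order>
</orders>
"""

root = ET.fromstring(doc)

# Find every item element anywhere under the root.
for item in root.findall(".//item"):
    print(item.get("sku"), item.get("qty"))

# Filter on an attribute value, XPath-style.
print(root.findall(".//item[@sku='A1']"))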
What kind of data types are supported by XML?
You might be surprised to hear that XML does not actually support any data types. The XML document is almost always text, even if that text happens to represent another data type, such as a date or an integer. Usually the data exchange software converts the data from its text form in the XML document to other forms within the database, and vice versa.
Two methods are most common for determining which conversion to do. The first is that the software determines the data type from the database schema, which works out well because the schema is always available at run time. The other common method is that the user explicitly provides the data type, for example in the mapping information.
This can be recorded by the user or even generated without human intervention from
a
database schema or even an XML schema.
When it is generated automatically, the data types can be taken from database
schemas as
well as from certain types of XML schemas.
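A minimal sketch of the second method, in which the user supplies the data types as mapping information and the exchange software performs the conversion, might look like this; the element names and the mapping itself are invented for the example.

from datetime import date

# Hypothetical mapping information: element name -> conversion to apply.
mapping = {
    "order_id": int,
    "amount": float,
    "shipped": date.fromisoformat,
}

# Text values as they would arrive from an XML document.
xml_values = {"order_id": "42", "amount": "19.95", "shipped": "2008-06-01"}

typed = {name: mapping[name](text) for name, text in xml_values.items()}
print(typed)   # {'order_id': 42, 'amount': 19.95, 'shipped': datetime.date(2008, 6, 1)}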
There is another issue related to conversions as well; this has to do largely with which text formats are recognized when data is converted from XML and which formats are produced when converting back.
Concluding statements
XML may seem confusing; however, it is beneficial and even a bit less complicated than HTML. Yet when you begin taking the step toward understanding XML after having spent much time working with HTML, the process can be a bit distressing. Never fear: once you have completed that step, XML is definitely a dominant format. It is also used in almost all of the models we have discussed, making it a vital area to explore in more detail.
It is fairly obvious that product names would not be hard-coded as the names of the columns in a table. Instead, a department's product descriptions live in a product table, and purchases or sales of an individual item are recorded in another table as separate rows that reference the product by its product ID.
Perhaps the most notable example of the EAV model is in the production databases we see in clinical work. This includes clinical past history, present clinical complaints, physical examinations, lab tests, special investigations, and diagnoses: basically all of the aspects that could apply to a patient. When we take into account all of the specialties of medicine, this information can consist of hundreds of thousands of units of data.
However, most people who visit a health care provider have few findings. Physicians simply do not have the time to ask a patient about every possible thing; this is just not the way in which patients are examined. Rather than using a process of elimination against thousands of possibilities, the health care provider focuses on the primary complaints of the patient and then asks questions related to those complaints.
Now let's consider how someone would attempt to represent a general-purpose clinical record in a database like those we discussed earlier. Creating a table, or even a set of tables, with thousands of columns would not be the best course of action: the vast majority of the columns would be empty, and the user interface would be unusable without extremely elaborate logic to hide groups of columns based on the data entered in previous columns.
To complicate things further, the patient record and medical findings continue to grow. The Entity-Attribute-Value data model is a natural solution for this perplexing issue, and you shouldn't be surprised to find that larger clinical data repositories do use this model.
Earlier we covered the fact that the EAV table consists of three columns in which data is recorded. Those columns are the entity, the attribute, and the value. Now we will talk a little more in-depth about each column.
. The Entity: sticking to the scenario of clinical findings, the entity would be the patient event. This would contain at the very least a patient ID and the date and time of the examination.
Entity-Attribute-Value Database
This database is most commonly called the EAV database; this is a database where a
large
portion of data is modeled as EAV. Yet, you may still find some traditional
relational tables
within this type of database.
. We stated earlier what EAV modeling does for certain categories of data, such as clinical findings, where the possible attributes are numerous but only a few apply to any one record. Where these specific conditions do not apply, we can use a traditional relational model instead. Using EAV has nothing to do with leaving the common sense and principles of the relational model behind.
. The EAV database is basically un-maintainable without the support of many tables that store supportive metadata. These metadata tables usually outnumber the EAV tables by a factor of about three or more, and they are normally traditional relational tables.
. The Entity in clinical data is usually a Clinical Event, as we have discussed above. For more general purposes, however, the entity is a key into an Objects table that is used to record common information about all of the objects in the database. The use of an Objects table does not require EAV; traditional tables can be used to store the category-specific details of each object.
. The Value: coercing all values into strings, as in the EAV example above, results in a simple but not very scalable structure. Larger systems use separate EAV tables for each of their data types, including binary large objects, with the metadata for a specific attribute identifying the EAV table in which the data will be stored.
. The Attribute: in the EAV table this is no more than an attribute ID; there are normally multiple metadata tables that contain the attribute-related information.
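Putting the three columns together, a minimal sketch of an EAV layout backed by a small attribute-metadata table might look like the following; the table names, attributes and values are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute_meta (                -- metadata: one row per attribute
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT,
    datatype     TEXT
);
CREATE TABLE eav (
    entity_id    INTEGER,                    -- the patient event
    attribute_id INTEGER REFERENCES attribute_meta(attribute_id),
    value        TEXT                        -- every value stored as text here
);
INSERT INTO attribute_meta VALUES (1, 'temperature_c', 'numeric');
INSERT INTO attribute_meta VALUES (2, 'chief_complaint', 'text');
INSERT INTO eav VALUES (1001, 1, '38.2');
INSERT INTO eav VALUES (1001, 2, 'headache');
""")

# Reassemble one patient event by joining the values to their metadata.
for name, value in conn.execute("""
    SELECT m.name, e.value
    FROM eav e JOIN attribute_meta m ON m.attribute_id = e.attribute_id
    WHERE e.entity_id = 1001
"""):
    print(name, "=", value)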
There have been a number of issues with the Entity-Attribute-Value model brought to
light
throughout its lifetime. We will briefly discuss those now. It is important that we
clarify first
that these issues arise when metadata is not used with the EAV model, for metadata
is vital
for its functionality.
Much of the technology of relational databases becomes inaccessible and has to be recreated by the development team; this could include system tables, graphical query tools, fine-grained data security, incremental backup and restore, exception handling, partitioned tables, and clustered indexes, all of which are effectively unavailable.
The EAV format is also not supported well by DBMS internals. Standard SQL query optimizers do not handle EAV-formatted data very well, and a lot of time will need to be dedicated to performance tuning to reach an acceptable production-quality application.
As you can see, there are still a few issues that need to be addressed by developers in order to make EAV optimal. Regardless of those issues, we have also learned that if we use metadata with EAV we can avoid many, if not all, of them.
Entity Relation Diagram
There are three general graphical symbols used in an entity relation diagram: the box, the diamond, and the oval. The box is commonly used to represent the entities in the database, the diamond is typically used to represent the relationships, and finally the oval is used to represent the attributes.
In many other entity relation diagrams, the rectangle symbol is used to represent
entity
sets while the ellipse symbol is used to represent attributes. The line is
generally used for
linking attributes to entity sets and entity sets to relationship sets.
The entity relation diagram is used to represent the entire information system for easy management of resources. The diagram helps the people concerned easily identify the concepts or entities which exist in the whole information system, as well as the entire business structure and the complex interrelationships between them.
It may present a very industry-specific theoretical overview of the major entities and relationships needed for managing the industry's resources, whatever they may be. It may assist in the design process of the database for an e-resource management system but may not necessarily identify every table which would be used.
In representing cardinality, the "Crow's Foot" notation uses three symbols: the ring represents zero, the dash represents one, and the crow's foot represents many.
This diagram scheme may not be as famous and widely used as the symbols above but
it is
fast gaining notice especially now that it is used with Oracle texts and in some
visual
diagram and flowcharting tools such as Visio and PowerDesigner.
Those who prefer the "Crow's Foot" notation say that this technique gives better clarity in identifying the many, or child, side of the relationship as compared to other techniques. This scheme also gives a more concise notation for identifying a mandatory relationship, with the use of a perpendicular bar, or an optional relationship, with an open circle.
There are many tools for entity relation diagrams available on the market or the internet today. The proprietary tools include Oracle Designer, SILVERRUN ModelSphere, SmartDraw, CA ERwin Data Modeler, DB Visual ARCHITECT, Microsoft Visio, PowerDesigner and ER/Studio. For those who want free tools, the choices include MySQL Workbench, Open System Architect, DBDesigner and Ferret.
Entity Structure Chart
An entity structure chart draws the graphical structure for any data within the enterprise's common data structure. This graphical drawing is intended to help data and document analysts, database administrators, information technology staff, and all other staff of the organization visualize the data structures and the information system design. Entity relation diagrams make use of entity structure charts in order to provide a complete representation of the logical data structure.
Without the aid of an entity structure chart, data entities and all attributes and
relations
pertaining to them, would all be defined in bullet format with details in paragraph
format.
This can be straining to the eyes as well as difficult to analyze because one will
have to dig
through all those words and sentences.
With an entity structure chart, it becomes easy to have an analysis that shows the
position
of any selected entity in the structures that have been defined. An entity may
exist in many
structure charts. This chart would also give a great benefit in validating the
position or
absence of an entity within one or more specific structures.
For instance, there might be a need for identifying where a specific set of
departmental data
exists within the structure of the entire organization. With an entity structure,
it would be
very easy to spot and point to the location of the departmental data being looked
for.
The data entities which are represented in the entity structure chart may include
resource
data, roles of the entity, data from each organizational or departmental unit, data
location,
and any other business items which may apply.
An entity represents any real-world object, and the entity structure is essentially the formalism used in a structural knowledge representation scheme that systematically organizes a family of possible structures of a system. Such an entity structure chart illustrates decomposition, coupling, and taxonomic relationships among entities.
The decomposition of an entity is concerned with how an entity may be broken down into sub-entities, down to its atomic parts. Coupling pertains to the specifications detailing how sub-entities may be coupled together to reconstitute the entity.
An entity structure chart directly supports the entity relation diagram. That diagram is the visual and graphical representation of all of the interrelationships between entities and attributes in the database. And while the entity relation diagram uses graphical symbols such as boxes, diamonds, ovals and lines to represent the entities and relationships in the database, the entity structure chart may or may not use the same symbols in trying to visualize the structure of the entity for the database or the enterprise information system.
It could be said that an entity structure chart is a breakdown of the components of the entity relation diagram. In the entity relation diagram only the entities and the table relations are specified, but in the entity structure chart the visual illustration may include symbols for how the entity is structurally represented.
For instance, let us say the entity is a CUSTOMER. The entity structure details everything about the customer, including name, age, and so on, its relationships with products and other entities, and at the same time the data types describing how the CUSTOMER and its attributes will be physically stored in the database.
External Schema
The word schema as defined in the dictionary means a plan, diagram, scheme or underlying organizational structure. Therefore, very briefly put, an external schema is a plan for how to structure data so that it seamlessly integrates with any information system that needs it.
It also means that the data needs to integrate with the business schema of the implementing organization. An external schema is a schema that represents the structure of data as used by applications.
Each external schema describes the part of the information which is appropriate to the group of users at whom the schema is addressed. Each external schema is derived from the conceptual schema.
The external schema definitions are all based on a data dictionary. The universe of
discourse of the data dictionary is all information in the use and management of
the
database system.
The external schema definition system is the means by which the external schemas are defined. The external schema must contain information which is derivable from the conceptual schema.
In systems based on the object-oriented paradigm, this does not necessarily mean that the classes included in the schema have to have been previously defined in the conceptual schema. An external schema may include classes that have been defined in the conceptual schema, and it may also contain derived classes. A derived class is any class which has been directly or indirectly defined on the basis of the conceptual schema classes and has been defined and included in the data dictionary.
The typical external schema definition for the object oriented paradigm includes
three steps
which are: definition of the necessary derived classes; selection of the set of
classes that
will constitute the external schema; and generation of the output external schema.
A general external schema explicitly defines data structure and content in terms of a data model that addresses the structure, integrity and manipulation of the data. As such, the external data schema includes the data vocabulary, which defines the element and attribute names; the content model, which holds the definition of relationships and the corresponding structure; and the data types.
Some of these data types are integer, string, decimal, boolean, double, float, hexBinary, base64Binary, QName, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth, NOTATION and all others which may be applicable or appropriate for representing the entities in the information system.
As this defines the structure of data used in applications, it makes it easy for the information system to deal with all processes, optimizations and troubleshooting. This schema will also ensure that the data used in the information system adheres to the rules of the business and follows the framework of the data architecture.
In data modeling, the external schema may refer to the collection of data
structures that
are being used in creating databases representing all objects and entities which
are modeled
by the database.
The same external schema stores the definition of the collection of governing rules and constraints placed on the data structure in order to maintain structural integrity by preventing orphaned records and disconnected data sets. In the same external schema can also be found the collection of operators which are applicable to the data structures, such as the update, insert and query operations of the database.
Four-Schema Concept
The four schemas are:
. a physical schema,
. a logical schema,
. a data view schema, and
. a business schema.
The four-schema concept is taken great advantage of in the implementation of service-oriented business process integration. It helps to resolve problems with the three-schema concept.
Today there is a widely recognized trend in the marketing and business environment, and this trend is the push towards a scenario wherein companies and business enterprises are networked together in order to gain higher profit from collaborative efforts and improve operational flexibility while reducing operational cost.
In order for integration to take place, the underlying architecture must be resolved first so that smooth and nearly seamless integration can happen.
One of the most important aspects, not just of enterprise data integration but of computer science in general, is data modeling.
This is the process of creating a data model with the use of model theory (the formal description of data) so that a data model instance can be created. In the four-schema concept the instance spans a physical schema, a logical schema, a data view schema, and a business schema.
A typical data model may be an instance of any one of these schemas. The conceptual or business schema is used for describing the semantics pertaining to the business organization. It is in this schema that the entity classes, which represent things of significance to the organization, and the entity relationships, which are the assertions about associations between pairs of entity classes, are defined.
The logical schema contains the descriptions of the semantics as represented by a particular data manipulation technology, including descriptions of tables and columns, object-oriented classes, and XML tags, among other things.
The physical schema describes the physical means by which data are stored, along with other concerns pertaining to partitions, CPUs, tablespaces, and so on.
The addition of the data view schema is what makes the four schema concept complete
and
distinct. The data view schema details how enterprises can offer the information
that they
wish to share with others as well as request what they want.
In real-life business data management implementations, there really are many types of data model being developed using many different notations and methods. Some of the data models are biased towards physical implementation while others are biased toward understanding business data, and still a few are biased towards business managers.
The four-schema concept, as mentioned, is biased towards service-oriented business process integration. Some good practices in this area tend to rely on homogeneous semantics, which can be difficult to achieve for independent databases owned by independent enterprises. The new model provides a four-schema architecture which allows for the management of information sharing.
Logical Data Model
The logical data model elaborates the representation of all data pertaining to the organization and organizes the enterprise data in data management technology terms.
In 1975, when the American National Standards Institute (ANSI) first introduced the idea of a logical schema for data modeling, there were only two choices at that time, which were the hierarchical and network models.
Today there are three choices for the logical data model, and these choices are relational, object oriented and Extensible Markup Language (XML). The relational option defines the data model in terms of tables and columns, the object-oriented option defines data in terms of classes, attributes and associations, and the XML option defines data in terms of tags.
The logical data model is based closely on the conceptual data model which
describes all
business semantic in natural language without pointing any specific means of
technical
implementation such as the use of hardware, software or network technologies.
The process of logical data modeling can be labor intensive, depending on the size of the enterprise the data model will be used for. The resulting logical data model represents all the definitions, characteristics, and relationships of data in a business, technical, or conceptual environment. In short, logical data modeling is about describing end-user data to systems and end-user staff.
The very core of the logical data model is the definition of the three types of data objects that are the building blocks of the data model: entities, attributes, and relationships. Entities refer to persons, places, events or things which are of particular interest to the company.
Some examples of entities are Employees, States, Orders, and Time Sheets. Attributes refer to the properties of the entities. Examples of attributes for the Employee entity are first name, birthday, gender, address, age and many others. Lastly, relationships refer to the way in which the entities relate to each other. An example relationship would be "customers purchase products" or "students enroll in classes".
The example above is a logical data model using the Entity-Relationship (ER) model, which identifies entities, relationships, and attributes and normalizes the data.
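As a rough sketch, the "customers purchase products" example can be written down as plain data structures listing the entities, their attributes, and the relationship between them; the attribute names are illustrative.

# An illustrative, hand-written logical model for "customers purchase products".
logical_model = {
    "entities": {
        "Customer": ["customer_id", "first_name", "last_name"],
        "Product":  ["product_id", "product_name", "unit_price"],
    },
    "relationships": [
        # (entity A, verb phrase, entity B, cardinality)
        ("Customer", "purchases", "Product", "many-to-many"),
    ],
}

for a, verb, b, cardinality in logical_model["relationships"]:
    print(f"{a} {verb} {b} ({cardinality})")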
A logical data model should be carefully designed because it will have tremendous
impact on
the actual physical implementation of the database and the larger data warehouse.
A logical data model influences the design of data movement standards and techniques such as the heavily used extract, transform and load (ETL) process in data warehousing and enterprise application integration (EAI), the degree of normalization, the use of surrogate keys, and cardinality.
Likewise, it will determine the efficiency in data referencing and in managing all
the
business and technical metadata and the metadata repository. Several pre-packaged
third
party business solutions like enterprise resource planning (ERP) or HR systems have
their
own logical data model and when they are integrated into the overall existing
enterprise
model with a well designed logical data model, the implementation may turn out to
be
easier and less time consuming resulting in saving of money for the company.
Enterprise Data Model
With the Enterprise Data Model being a data architecture framework, the business enterprise has a starting point for all data system designs. Its theoretical blueprint can provide provisions, rules and guidance in the planning, building and implementation of data systems.
In the area of enterprise information systems, the Operational Data Store (ODS) and the Data Warehouse (DW) are two of the largest components that need a carefully designed enterprise data model, because data integration is the fundamental principle underlying any such effort and a good model can facilitate data integration, diminishing the data silos inherent in legacy systems.
As the name implies, the very core of an Enterprise Data Model is the data, regardless of where the data comes from and how it will finally be used. The model is meant primarily to give clear definitions on how to come up with efficient initiatives in the areas of Data Quality, Data Ownership, Data System Extensibility, Industry Data Integration, Integration of Packaged Applications and Strategic Systems Planning.
The process of making an enterprise model typically utilizes a combined top-down and bottom-up approach for all data system designs, including the operational data store, data marts, the data warehouse and applications. The enterprise data model is built in three levels of decomposition and forms a pyramid shape.
The first to be created is the Subject Area Model, which sits at the top of the pyramid. It expands down to create the Enterprise Conceptual Model, and finally the Enterprise Conceptual Entity Model is created and occupies the base of the pyramid. The three models are interrelated, but each of them has its own unique purpose and identity.
The Enterprise Conceptual Model, the second level in the pyramid, identifies and defines the major business concepts of each of the subject areas. This is a high-level data model having an average of several concepts for every subject area. These concepts carry finer detail than the subject areas. This model also defines the relationships between concepts.
The Enterprise Conceptual Entity Model represents all things which are important to each business area from the perspective of the entire enterprise. This is the detailed level of the enterprise data model, in which each concept is expanded within each subject area. It is also at this level that the business and its data rules, rather than existing systems, are examined so as to create the major data entities and the corresponding business keys, relationships and attributes.
Star Schema
The star schema, which is sometimes called a star join schema, is one of the simplest styles of data warehouse schema. It consists of one or a few fact tables that reference any number of dimension tables. The fact tables hold the main data, with the typically smaller dimension tables describing each individual value of a dimension.
Its main advantage is simplicity. Normalization is not a goal of star schema design. Star schemas are usually divided into fact tables and dimensional tables, where the dimensional tables supply supporting information. A fact table contains a compound primary key made up of the relevant dimension foreign keys.
Making a star schema for a database may be relatively easy, but it is still very important to invest some time and research, because the schema's effect on the usability and performance of the database matters a great deal in the long run.
In a data warehouse implementation, the creation of the star schema database is one of the most important, and often the final, process in implementing the data warehouse. A star schema also has significant importance for business intelligence processes such as on-line transaction processing (OLTP) systems and on-line analytical processing (OLAP).
Consider an OLTP scenario: several call center agents continuously take calls and enter orders, typically involving numerous items, which must be stored immediately in the database. This makes the scenario very time-critical, and the speed of inserts, updates and deletes should be maximized. In order to optimize performance, the database should hold as few records as possible at any given time.
On the other hand, On-line Analytical Processing, though it may mean many different things to different people, is mainly about analyzing corporate data. In some cases the terms OLAP and star schema are used interchangeably, but a more precise way of thinking is to regard a star schema database as one kind of OLAP system, which can be any system of read-only, historical, aggregated data.
Both OLAP and OLTP can be optimized with a star schema in a data warehouse implementation. Since a data warehouse is the main repository of a company's historical data, it naturally contains very high volumes of data which can be used for analysis with OLAP. Querying this data may take a long time, but with the help of a star schema the access can be made faster and more efficient.
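As a rough illustration of the kind of read-only, aggregated query a star schema serves well, the sketch below reuses the illustrative fact and dimension tables from the earlier star schema sketch; it is not tied to any particular warehouse product.

# Revenue and quantity per product per day, the sort of question OLAP asks.
query = """
SELECT p.product_name,
       t.sale_date,
       SUM(f.sales_quantity) AS total_quantity,
       SUM(f.revenue)        AS total_revenue
FROM   fact_sales  f
JOIN   dim_product p ON p.product_id = f.product_id
JOIN   dim_time    t ON t.time_id    = f.time_id
GROUP  BY p.product_name, t.sale_date
ORDER  BY total_revenue DESC;
"""
# Run against the in-memory database from the earlier sketch:
#   rows = conn.execute(query).fetchall()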
Reverse Data Modeling
Because of some of the nuances associated with different database systems and development platforms, reverse data modeling is generally a difficult task to accomplish. But today, there are software vendors offering solutions to make reverse data modeling relatively easier to do. These reverse data modeling software solutions can take snapshots of existing databases and produce physical models from which IT staff can begin to verify table and column names.
While it can generally be relatively easy to document table and column names by using such tools, or by reviewing database creation scripts, the thing that is really complicated and difficult is dealing with the relationships between tables in the database.
While hard skills like coding and software engineering are very important in reverse data modeling, soft skills like documentation, employee know-how and training can also be invaluable in figuring out exactly how data is being used by the system in question.
One of the biggest factors in successful reverse data modeling is learning as much as possible about the application; only by using the application can you determine how the existing data structures are being used and how this compares with the original design.
Reverse data modeling is a technology that represents a big opportunity for some legacy reporting systems to have their useful life extended, but it should not be construed as a permanent replacement for the data warehousing technology stack.
Reverse data modeling also has a very big role in helping improve the efficiency of enterprise data management.
Some commercial software offers solutions that can enable true enterprise
application
engineering by storing and documenting data, processes, business requirements, and
objects that can be shared by application developers throughout an organization.
With reverse data modeling and the help of such software, the business organization
implementing an enterprise data management system can easily design and control
enterprise software for quality, consistency, and reusability in business
applications through
the managed sharing of meta-data.
Like all other reverse engineering technologies, regardless of whether they are for software, hardware or non-IT implementations, reverse data modeling is very useful in times when there is a deep-seated need to troubleshoot systems with very complicated problems.
Reverse data modeling can give IT staff a deeper insight into and understanding of the data models, thereby empowering them to come up with actions to improve the management of the system.
There are three basic styles of data models: conceptual data model, logical data
model and
physical data model. The conceptual data model is sometimes called the domain model
and
it is typically used for exploring domain concepts in an enterprise with
stakeholders of the
project.
The logical model is used for exploring the domain concepts as well as their
relationships.
This model depicts the logical entity types, typically referred to simply as entity
types, the
data attributes describing those entities, and the relationships between the
entities.
The physical data model is used in the design of the database's internal schema and, as such, it depicts the data tables, the data columns of those tables, and the relationships between the tables. This model represents the data design taking into account the facilities and constraints of any given database management system. The physical data model is often derived from the logical data model, although it can also be reverse engineered from an existing database implementation.
A detailed physical data model contains all the artifacts a database requires for creating relationships between tables or achieving performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters. This model is also often used for calculating estimates of data storage and sometimes includes details on storage allocation for a database implementation.
The physical data model is basically the output of physical data modeling which is
conceptually similar to design class modeling whose main goal is to design the
internal
schema of a database, depicting the data tables, the data columns of those tables,
and the
relationships between the tables.
In a physical data model, the tables where data will be stored in the database are identified first. For instance, a university database may contain the Student table to store data about students. There may also be a Course table, a Professors table, and other related tables to contain related information. The tables are then normalized.
Data normalization is the process wherein the data attributes in a data model are organized to reduce data redundancy, increase data integrity, increase the cohesion of tables, and reduce the coupling between tables.
After the tables are normalized, the columns are identified. A column is the database equivalent of an attribute, and each table will have one or more columns. In our example, the university database may have columns in the Student table such as FirstName, LastName and StudentNumber.
The stored procedures are then identified. Conceptually, a stored procedure is like a global method of the database. An example of a stored procedure would be code to compute a student's average mark, a student's payables, or the number of students enrolled and allowed in a certain course. Relationships are also identified in a physical data model.
Keys are also assigned to the tables. A key is one or more data attributes that uniquely identify a table row, eliminating data redundancy and increasing data integrity.
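A minimal sketch of a physical-model fragment for the university example might look like the following. SQLite has no stored procedures, so an ordinary Python function stands in for the "compute a student's average mark" procedure described above; all names and values are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (
    StudentNumber INTEGER PRIMARY KEY,    -- the key that makes each row unique
    FirstName     TEXT NOT NULL,
    LastName      TEXT NOT NULL
);
CREATE TABLE Mark (
    StudentNumber INTEGER REFERENCES Student(StudentNumber),
    Course        TEXT,
    Mark          REAL
);
INSERT INTO Student VALUES (1, 'Ana', 'Lopez');
INSERT INTO Mark VALUES (1, 'Databases', 88), (1, 'Networks', 76);
""")

def average_mark(student_number: int) -> float:
    # Stand-in for a stored procedure computing a student's average mark.
    (avg,) = conn.execute(
        "SELECT AVG(Mark) FROM Mark WHERE StudentNumber = ?",
        (student_number,),
    ).fetchone()
    return avg

print(average_mark(1))   # 82.0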
Database Concepts
In the 1960s, the System Development Corporation, one of the world's first computer software companies and a significant military technology contractor, first used the term "data base" to describe a system to manage United States Air Force personnel. The term "databank" had also been used in the 1960s to describe similar systems, but the public seemed less accepting of that term and eventually adopted the word "database", which is universally used today. A number of corporations, notably with IBM and Rockwell at the
universally used today. A number of corporations, notably with IBM and Rockwell at
the
forefront, developed database software throughout the 1960s and early 1970s. MUMPS
(also known as M), developed by a team at Massachusetts General Hospital in the
late
1960s, was the first programming language developed specifically to make use of
database
technology. In 1970, the relational database model was born. Although this model
was
more theoretical than practical at the time, it took hold in the database community
as soon
as the necessary processing power was available to implement such systems.
Perhaps you wish you had kept a simple, personal record of these movies and books,
so you
could have a quick look at it and easily identify the title that has been bothering
you all this
time? A database would be perfect for that!
The purest idea of a database is nothing more than a record of things that have
happened.
Granted, most professionals use databases for much more than storing their favorite
movies
and books, but at the most basic level, those professionals are simply recording
events, too.
Every time you join a web site, your new account information ends up in a database.
Have
you ever rented movies from Netflix? Their entire web site is essentially one big
database, a
record of every movie available to rent along with the locations of their copies of
that movie
among their many warehouses and customers' homes. The list goes on! Databases are
everywhere, and they assist us in performing many essential tasks throughout our
daily
lives.
You can easily see how databases affect all aspects of our modern lives, since
everything
you do, from calls you make on your mobile phone to transactions you make at the
bank to
the times you drive through a toll plaza and use your toll tag to pay, is recorded
in a
database somewhere.
If these databases did not exist, our lives would surely be much less convenient.
In the 21st
century, we are so accustomed to using credit cards and printing airline boarding
passes at
home that, if databases were to suddenly disappear, it would almost seem like we
were
cavemen again.
Fortunately, we have databases, and there are many people who are skilled at using
them
and developing software to use alongside them. These people take great pride in
their work,
as database programming is difficult but nonetheless very rewarding.
Consider, for example, the database team at Amazon.com. They built an enormous
database to contain information about books, book reviews, products that are not
books at
all, customers, customers' preferences, and tons of other things.
It must have taken them months to get the database just right and ready for
customers to
use! But, once it started to work well and Amazon.com went live back in 1995, can
you
imagine the sense of pride those developers had as millions of potential customers
poured
onto the web site and began to interact with the product they spent so much time
perfecting? That must have been an incredible feeling for the database team!
What is a Database?
When you are in a big electronics store buying the latest edition of the iPod, how does that store's inventory tracking system know you just bought an iPod and not, for example, a car stereo or a television?
Let's walk through the process of buying an iPod and consider all the implications
this has
on the inventory database that sits far underneath all the shiny, new gadgets on
the sales
floor.
When you hand the iPod box to the cashier, a barcode scanner reads the label on the
box,
which has a product identification number. In barcode language, this number might
be
something like 885909054336. The barcode representing this number can be seen in
Figure
1.
Figure 1. A sample barcode
The barcode acts as a unique identifier for the product; in this case, all iPods
that are the
same model as the one passing across the barcode reader have the same exact
barcode.
The barcode scanner relays the number represented by the barcode to the register at
the
cashier's station, which sends a request (or a query) to the store's inventory
database. This
database could be in the same store as the register or somewhere across the country
or
even around the world, thanks to the speed and reliability of the Internet.
The register asks the database, "What are the name and price of the product that has this barcode?" To which the database responds, "That product is an iPod, and it costs $200."
You, the customer, pay your $200 and head home with a new toy. Your work in the
store is
finished, but the inventory management system still needs to reconcile your
purchase with
the database!
When the sale is complete, the register needs to tell the database that the iPod
was sold.
The ensuing conversation goes something like the following.
Register: "How many products with this barcode are in our inventory?"
Database: "1,472."
Register: "Now, 1,471 products with this barcode are in our inventory."
Database: "OK."
Data Retrieval
Of course, this is not the whole story. Much more happens behind the scenes than
simple
conversational requests and acknowledgements.
The first interaction the register had with the database occurred when the request
for the
product name and price was processed. Let's take a look at how that request was
really
handled.
If the database is an SQL database, like MySQL or PostgreSQL or many others, then
the
request would be transmitted in the standard Structured Query Language (SQL). The
software running on the register would send a query to the database that looks
similar to
the following.
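The figure that held the original query is not reproduced here, but based on the description that follows, the request would look roughly like the sketch below. The products table and its id, name and price columns are taken from the text; the data types and the CREATE TABLE and INSERT statements are assumptions added so the example is self-contained.

-- A small products table, standing in for the table shown in Figure 2.
CREATE TABLE products (
    id    BIGINT PRIMARY KEY,      -- the barcode number serves as the unique identifier
    name  VARCHAR(100),
    price DECIMAL(10, 2)
);

INSERT INTO products (id, name, price) VALUES (885909054336, 'iPod', 200.00);

-- The kind of query the register sends to look up the product's name and price.
SELECT name, price
FROM products
WHERE id = 885909054336;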
Every database may contain multiple tables, and every table may contain multiple
rows, so
specifying the name of the table and the row's unique identifier is very important to this
to this
query. To illustrate this, an example of a small products table is shown in Figure
2.
When the database has successfully found the table and the row with the specified
id, it
looks for the values in the name and price columns in that row. In our example,
those
values would be "iPod" and "200.00", as seen in Figure 2. The execution of the
previous
SELECT statement, which extracts those values from the table, is shown in Figure 3.
The database then sends a message back to the register containing the product's name and
name and
price, which the register interprets and displays on the screen for the cashier to
see.
Data Modification
The second time the register interacts with the database, when the inventory number
is
updated, requires a little more work than simply asking the database for a couple
numbers.
Now, in addition to requesting the inventory number with a SELECT statement, an
UPDATE
statement is used to change the value of the number.
First, the register asks the database how many iPods are in the inventory (or "on hand").
The database returns the number of products on hand, the register decrements that
number
by one to represent the iPod that was just sold, and then the register updates the
database
with the new inventory number.
In Figure 4, the database responds to the UPDATE query with UPDATE 1, which simply
means one record was updated successfully.
Now that the number of iPods on hand has been changed, how does one verify the new
number? With another SELECT query, of course! This is shown in Figure 5.
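Figures 4 and 5 themselves are not reproduced here, but the exchange they illustrate could be sketched as follows, assuming the products table from the earlier sketch also carries an inventory column (the column name on_hand is an assumption).

-- Add an inventory column to the illustrative products table and seed it.
ALTER TABLE products ADD COLUMN on_hand INTEGER;
UPDATE products SET on_hand = 1472 WHERE id = 885909054336;

-- 1. The register asks how many iPods are on hand.
SELECT on_hand FROM products WHERE id = 885909054336;        -- returns 1472

-- 2. The register decrements the number and writes the new value back;
--    the database answers with something like "UPDATE 1".
UPDATE products SET on_hand = 1471 WHERE id = 885909054336;

-- 3. The new value is verified with another SELECT.
SELECT on_hand FROM products WHERE id = 885909054336;        -- returns 1471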
Now, the register has updated the database to reflect the iPod you just purchased
and
verified the new number of iPods on hand. That was pretty simple, wasn't it?
More on Databases
You now know databases are made of tables, which are, in turn, made of records.
Each
record has values for specific columns, and in many cases, a record can be uniquely identified by one of those values.
In our example, the barcode number uniquely identified the iPod, which cost $200,
in the
products table. You have also seen that values in a database can be modified. In
this case,
the number of iPods on hand was changed from 1,472 to 1,471.
Database Systems
Early Databases
In the 1960s, the System Development Corporation, one of the world's first computer software companies and a significant military technology contractor, first used the term "data base" to describe a system to manage United States Air Force personnel. The term "databank" had also been used in the 1960s to describe similar systems, but the public seemed less accepting of that term and eventually adopted the word "database", which is universally used today.
A number of corporations, notably with IBM and Rockwell at the forefront, developed
database software throughout the 1960s and early 1970s. MUMPS (also known as M),
developed by a team at Massachusetts General Hospital in the late 1960s, was the
first
programming language developed specifically to make use of database technology.
In 1970, the relational database model was born. Although this model was more
theoretical
than practical at the time, it took hold in the database community as soon as the
necessary
processing power was available to implement such systems.
The advent of the relational model paved the way for Ingres and System R, which
were
developed at the University of California at Berkeley and IBM, respectively, in
1976. These
two database systems and the fundamental ideas upon which they were built evolved
into
the databases we use today. Oracle and DB2, two other very popular database
platforms,
followed in the footsteps of Ingres and System R in the early 1980s.
Modern Databases
The Ingres system developed at Berkeley spawned some of the professional database
systems we see today, such as Sybase, Microsoft SQL Server, and PostgreSQL.
Now, PostgreSQL is arguably the most advanced and fastest free database system
available,
and it is widely used for generic and specific database applications alike. MySQL
is another
free database system used in roughly the same scope of applications as PostgreSQL.
While
MySQL is owned and developed by a single company, MySQL AB in Sweden, PostgreSQL
has
no central development scheme, and its development relies on the contributions of
software
developers around the world.
IBM's System R database was the first to use the Structured Query Language (SQL), which
which
is also widely used today. System R, itself, however, was all but abandoned by IBM
in favor
of focusing on more powerful database systems like DB2 and, eventually, Informix.
These
products are now generally used in large-scale database applications. For example,
the Wal-
Mart chain of large department stores has been a customer of both DB2 and Informix
for
many years.
The other major player in the database game, Oracle, has been available under a
proprietary license since it was released as Oracle V2 in 1979. It has undergone a
number
of major revisions since then and, in 2007, was released as Oracle 11g. Like DB2
and
Informix, Oracle is mostly used for very large databases, such as those of global
chain
stores, technology companies, governments, and so forth. Because of the similar
client
bases enjoyed by IBM and Oracle, the companies tend to be mutually cooperative in
database and middleware application development.
In order of market share by net revenue in 2006, the leading database platform providers were Oracle, with the greatest share; IBM; and Microsoft.
While the database systems with the greatest market shares use SQL as their query
language, other languages are used to interact with a handful of other relatively
popular
databases. Most developers will never encounter these languages in their daily
work, but for completeness, some of these languages are IBM Business System 12,
EJB-QL,
Quel, Object Query Language, LINQ, SQLf, FSQL, and Datalog. Of particular note is IBM Business System 12, an early relational language developed as an alternative to SQL at a time when SQL was criticized as relationally incomplete.
Today, organizations with large database projects tend to choose Oracle, DB2,
Informix,
Sybase, or Microsoft SQL Server for their database platforms because of the
comprehensive
support contracts offered in conjunction with those products. Smaller organizations
or
organizations with technology-heavy staff might choose PostgreSQL or MySQL because
they
are free and offer good, community-based support.
Terminology
Most modern database systems employ the idea of the relational database, and they are properly called Relational Database Management Systems (RDBMS). The distinction between a DBMS and an RDBMS, unless critical to the understanding of a specific topic, is not made in these articles.
Database Interaction
Efficient interaction, efficient storage, and efficient processing are the three
key properties
of a successful database platform. In this article, we explore the first: efficient
interaction.
Many database platforms are shipped with a simple command line utility that allows
the
user to interact with the database. PostgreSQL ships with psql, which gives the
user
extensive control over the operation of the database and over the tables and schema
in the
database. Oracle's SQL*Plus and MySQL's mysql client are similar utilities. Collectively, these are also called SQL shells.
The final method for interacting with a database is through an application. This
indirect
interaction might occur, for example, when a bank customer is withdrawing money
from an
ATM. The customer only presses a few buttons and walks away with cash, but the
software
running on the ATM is communicating with the bank's database to execute the customer's
transaction. Applications that need to interact with databases can be written in
nearly all
programming languages, and almost all database platforms support this form of
interaction.
A command line client usually provides the most robust functionality for
interacting with a
database. And, because they are usually developed by the same people who developed
the
database platform, command line clients are typically also the most reliable. On
the other
hand, effectively using a command line client to its full extent requires expert
database skill.
The "help" features of command line clients are often not comprehensive, so
figuring out
how to perform a complex operation may require extensive study and reference on the
part
of the user. Some basic usage of the PostgreSQL command line client is shown in
Figure 1.
All command line clients operate in a similar manner to that shown in Figure 1. For
users
with extensive knowledge of SQL, these clients are used frequently.
One typically accesses an SQL command line client by logging into the database
server and
running it from the shell prompt of a UNIX-like operating system. Logging into
the
server may be achieved via telnet or, preferably, SSH. In a large company, the
Information
Technology department may have a preferred application for these purposes.
GUI Clients
The simplest way to think about a GUI client is to consider it to be a
sophisticated, flashy
wrapper around a command line client. Really, it falls into the third category of
interaction,
application development, but since the only purpose of this application is to
interface with
the database, we can refer to it separately as a GUI client.
The GUI client gives the user an easy-to-use, point-and-click interface to the
internals of the
database. The user may browse databases, schemas, tables, keys, sequences, and,
essentially, everything else the user could possibly want to know about a database.
In most
cases, the GUI client also has a direct interface to a simulated command line, so
the user
can enter raw SQL code, in addition to browsing through the database. Figure 2
shows the
object browser in pgAdmin III, a free, cross-platform GUI client for PostgreSQL.
Figure 2. The object browser in pgAdmin III
With an easy tree format to identify every element of the database and access to
even more
information with a few simple clicks, the GUI client is an excellent choice for
database
interaction for many users.
Figure 3 shows the Server Information page of MySQL Administrator, the standard GUI
tool
for MySQL databases.
For example, the software running on an ATM at a bank needs to access the bank's central database to retrieve information about a customer's account and then update that
database to retrieve information about a customer�s account and then update that
information while the transaction is being performed.
Many database access extensions for modern programming languages exist, and they
all
have their advantages and caveats. The expert database programmer will learn these
caveats, however, and eventually become comfortable and quite skilled at
manipulating
database objects within application code.
Figure 4 and Figure 5 show the code for a simple database application written in
Perl and its
output, respectively.
With all the features of modern programming languages, extremely complex database
applications can be written. This example merely glosses over the connection,
query, and
disconnection parts of a database application.
Database Overview
You have been using databases for a few years, and you think you are at the top of
your
game. Or, perhaps, you have been interested in databases for a while, and you think
you
would like to pursue a career using them, but you do not know where to start. What is
the next
step in terms of finding more rewarding education and employment?
There are two routes people normally take in order to make themselves more marketable
and, at
the same time, advance their database skills. The first, earning an IT or computer
science
degree, requires more effort and time than the second, which is completing a
certification
program.
If you do not have a science, engineering, or IT degree yet and you want to keep
working
with databases and IT for at least a few years, the degree would probably be worth
the
time. For that matter, if you already have an undergraduate degree, then perhaps a
master's degree would be the right choice for you? Master's degrees typically only
require
three semesters of study, and they can really brighten up a resume. An MBA is a
good
option, too, if the idea of managing people instead of doing technical work suits
your fancy.
Your employees would probably let you touch the databases once in a while, too!
Many universities offer evening classes for students who work during the day, and
the
content of those classes is often focused on professional topics, rather than
abstract or
theoretical ideas that one would not regularly use while working in the IT field.
Online
universities like the University of Phoenix also offer IT degrees, and many busy
professionals have been earning their degrees that way for years now.
Certifications, while quite popular and useful in the late 1990s and early 2000s,
seem to be
waning in their ability to make one marketable. That said, getting a certification
is much
quicker than earning a degree and requires a minimal amount of study if the
certificate
candidate already works in the relevant field.
The certification will also highlight a job applicant's desire to "get ahead" and "stay ahead"
in the field. It may not bump the applicant up the corporate food chain like an MBA
might,
but it could easily increase the dollar amount on the paychecks by five to ten
percent or
more.
If you feel like you could be making more money based on your knowledge of
databases,
exploring the degree and certification avenues of continuing education may be one
of the
best things you do for your career.
Relational Databases
Popular, modern databases are built on top of an idea called "relational algebra", which defines how "relations" (e.g. tables and sequences in databases) interact within the entire "set" of relations. This set of relations includes all the relations in a single database.
Knowing how to use relational algebra is not particularly important when using
databases;
however, one must understand the implications certain parts of relational algebra
have on
database design.
Relational algebra is part of the study of logic and may be simply defined as "a set of relations closed under operators". This means that if an operation is performed on one or more members of a set, another member of that same set is produced as a result. Mathematicians and logicians refer to this concept as "closure".
Integers
Consider the set of integers, for example. The numbers 2 and 6 are integers. If you
add 2 to
6, the result is 8, which is also an integer. Because this works for all integers,
it can be said
that the set of integers is closed under addition. Indeed, the set of integers is
closed under
addition, subtraction, and multiplication. It is not closed under division,
however. This can
be easily seen by the division of 1 by 2, which yields one half, a rational number
that is not
an integer.
Database Relations
Using the integer example as a starting point, we can abstract the idea of closure
to
relations. In a relational database, a set of relations exists. For the purposes of
initially
understanding relational databases, it is probably best to simply think of a
relation as being
a table, even though anything in a database that stores data is, in fact, a
relation.
Performing an operation on one or more of these relations must always yield another
relation. If one uses the JOIN operator on two tables, for example, a third table
is always
produced. This resulting table is another relation in the database, so we can see
relations
are closed under the JOIN operator.
Relations are closed under all SQL operators, and this is precisely why databases of this nature can be called relational databases.
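As a small illustration of this closure, the sketch below joins two invented tables; the result of the JOIN is itself a relation (a table of rows and columns) to which further operators could be applied.

-- Two relations, invented purely for this illustration.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    total       DECIMAL(10, 2)
);

-- The JOIN of two relations produces another relation, so the result below
-- could itself be filtered, aggregated or joined again.
SELECT c.name, o.total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id;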
Concurrency and reliability have long been "hot topics" of discussion among developers and
developers and
users of distributed systems. The fundamental problem can be seen in a simple
example, as
follows.
Suppose two users are working on the same part of a database at the same time. They
both
UPDATE the same row in the same table, but they provide different values in the
UPDATE.
The UPDATE commands are sent to the database precisely at the same time. What does
the
database system do about this, and what are the rules governing its decision?
ACID
When discussing concurrency and reliability, developers often talk about the
components of
ACID: atomicity, consistency, isolation, and durability. Together, these properties
guarantee
that a database transaction is processed in a reliable, predictable manner. A
transaction, in
this case, can be defined as any set of operations that changes the state of the
database. It
could be something as simple as reading a value, deciding how to manipulate that
value
based on what was read, and then updating the value.
Atomicity
Atomicity guarantees that, if any of the operations involved in a transaction fail, the entire transaction fails, and the database is restored to the
state before the transaction began. This prevents a transaction from being
partially
completed.
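In SQL terms, atomicity is what a transaction block provides: either every statement between BEGIN and COMMIT takes effect, or none of them do. The accounts table and the transfer below are invented for the sketch.

-- Invented accounts table for the sketch.
CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY,
    balance    DECIMAL(10, 2) NOT NULL
);
INSERT INTO accounts VALUES (1, 500.00), (2, 100.00);

-- Transfer 50.00 from account 1 to account 2 as a single transaction.
BEGIN;
UPDATE accounts SET balance = balance - 50.00 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 50.00 WHERE account_id = 2;
COMMIT;
-- Had either UPDATE failed, issuing ROLLBACK instead of COMMIT would restore
-- the state from before BEGIN, so the transfer could never be half-applied.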
Database Consistency
Consistency is probably the most fundamental of the four ACID components. As such,
it is
arguably the most important in many cases. In its most basic form, consistency
tells us that
no part of a transaction is allowed to break the rules of a database.
Consider an INSERT that tries to add a row without supplying a value for a column declared NOT NULL. If consistency were not upheld, the NULL value itself would still not be added, but the remaining parts of the row would be; since no value would have been specified for the NOT NULL column, that column would end up NULL anyway and violate the rules of the database. The subtleties of consistency go far beyond an obvious conflict between NOT NULL columns and NULL values, but this example is a clear illustration of a simple violation of consistency. In Figure 1, we can see that no part of a row is added when we try to violate the NOT NULL constraint.
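A minimal sketch of that situation follows; the table and column names are invented. The INSERT is rejected as a whole because it supplies no value for a NOT NULL column, so no part of the row is stored.

-- Invented table with a NOT NULL rule on last_name.
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    last_name   VARCHAR(100) NOT NULL
);

-- This statement violates the NOT NULL constraint, so the database rejects it
-- and the employees table remains empty: no partial row is ever added.
INSERT INTO employees (employee_id) VALUES (42);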
Isolation
Durability
Distributed Databases
Suppose you created a database for a web application a few years ago. It started
with a
handful of users but steadily grew, and now its growth is far outpacing the
server�s
relatively limited resources. You could upgrade the server, but that would only
stem the
effects of the growth for a year or two. Also, now that you have thousands of
users, you are
worried not only about scalability but also about reliability.
If that one database server fails, a few sleepless nights and many hours of
downtime might
be required to get a brand new server configured to host the database again. No
one-time
solution is going to scale infinitely or be perfectly reliable, but there are a
number of ways
to distribute a database across multiple servers that will increase the scalability
and
reliability of the entire system.
Put simply, we want to have multiple servers hosting the same database. This will
prevent a
single failure from taking the entire database down, and it will spread the
database across a
large resource pool.
A distributed database is divided into sections called nodes. Each node typically
runs on a
different computer, or at least a different processor, but this is not true in all
cases.
Horizontal Fragments
One of the usual reasons for distributing a database across multiple nodes is to
more
optimally manage the size of the database.
Each server might be responsible for customers with certain ZIP codes. Since ZIP
codes are
generally arranged from lowest to highest as they progress westward across the
country,
the actual limits on the ZIP codes might be 00000 through 33332, 33333 through
66665,
and 66666 through 99999, respectively.
In this case, each node would be responsible for approximately one third of the
data for
which a single, non-distributed node would be. If each of these three nodes
approached its
own storage limit, another node or two nodes might be added to the database, and
the ZIP
codes for which they are responsible could be altered appropriately. More
"intelligent"
configurations could be imagined, as well, wherein, for example, population density
is
considered, and larger metropolitan areas like New York City would be grouped with
fewer
other cities and towns.
In either arrangement, the application directs requests for certain ZIP codes to the node responsible for them. Regardless of which approach is taken, the
distribution of
the database must remain transparent to the user of the application. That is, the
user
should not realize that separate databases might handle different transactions.
Reducing node storage size in this manner is an example of using horizontal
fragments to
distribute a database. This means that each node contains a subset of the larger
database�s
rows.
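One simple way to express the ZIP-code split described above is to give each node its own fragment of the customer table, constrained to its range. The layout below is an invented sketch; a real deployment would also need a routing layer to send each request to the right node.

-- Fragment held by node 1, covering ZIP codes 00000 through 33332.
-- Nodes 2 and 3 would hold identically structured tables with the
-- ranges 33333-66665 and 66666-99999 in their CHECK constraints.
CREATE TABLE customers_node1 (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    zip_code    CHAR(5) NOT NULL,
    CHECK (zip_code BETWEEN '00000' AND '33332')
);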
Vertical Fragments
A situation that calls for vertical fragments might arise if a table contains
information that is
pertinent, separately, to multiple applications. Using the previous example of a
database
that stores customer information, we might imagine an airline�s frequent flyer
program.
These sets of data have different applications: the customer information is used
when
mailing tickets and other correspondence, and the mileage information is used when
deciding how many complimentary flights a customer may purchase or whether the
customer has flown enough miles to obtain "elite" status in the program. Since the two sets
two sets
of data are generally not accessed at the same time, they can easily be separated
and
stored on different nodes.
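A vertical split of the frequent flyer data might be sketched as two tables that share the customer key and live on different nodes; the column names here are assumptions.

-- Node A: contact details used when mailing tickets and correspondence.
CREATE TABLE customer_contact (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    address     VARCHAR(200)
);

-- Node B: mileage details used for complimentary flights and elite status.
CREATE TABLE customer_mileage (
    customer_id INTEGER PRIMARY KEY,   -- same key as customer_contact
    miles_flown INTEGER NOT NULL DEFAULT 0
);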
Since airlines typically have a large number of customers, this distribution could
be made
even more efficient by incorporating both horizontal fragmentation and vertical
fragmentation. This combined fragmentation is often called the mixed fragment
approach.
A database can be broken up into many smaller pieces, and a large number of methods
for
doing this have been developed. A simple web search for something like "distributed databases" would probably prove fruitful for further exploration into other, more
complex,
methods of implementing a distributed database. However, there are two more terms
with
which the reader should be familiar with respect to database fragmentation.
The first is homogeneous distribution, which simply means that each node in a
distributed database is running the same software with the same extensions and so
forth. In
this case, the only logical differences among the nodes are the sets of data stored
at each
one. This is normally the condition under which distributed databases run.
However, one could imagine a case in which multiple database systems might be
appropriate for managing different subsets of a database. This is called
heterogeneous
distribution and allows the incorporation of different database software programs
into one
big database. Systems like this are useful when multiple databases provide
different feature
sets, each of which could be used to improve the performance, reliability, and/or
scalability
of the database system.
Replication
Replication has a time cost because each operation performed on one node's database must be performed on every other node's database as well. Before the operation can be said to be committed, each of the other nodes must verify that the operation on its own copy of the database succeeded. This can take a lot of time and produce considerable lag in the
interface to the
database.
Replication also has a data cost because every time the database is replicated,
another
hard drive or two or more fills up with data pertaining to the database. Then,
every time
one node gets a request to update that data, it must transmit just as many requests
as
there are other nodes. And, confirmations of those updates must be sent back to the
node
that requested the update. That means a lot of data is flying around among the
database
nodes, which, in turn, means ample bandwidth must be available to handle it.
Many of the more popular databases support some sort of native replication. MySQL, for example, ships with built-in replication: a replica is authorized with the REPLICATION SLAVE privilege (granted via GRANT) and then configured to follow the primary server. PostgreSQL, on the other hand, has traditionally relied on external software for replication, usually in the form of Slony-I, a comprehensive replication suite. Each database platform has a different method for initiating replication services, so it is best to consult that platform's manual before implementing a replication solution.
Considerations
When implementing a distributed database, one must take care to properly weigh the
advantages and disadvantages of the distribution. Distributing a database is a
complicated
and sometimes expensive task, and it may not be the best solution for every
project. On the
other hand, with some spare equipment and a passionate database developer,
distributing a
database could be a relatively simple and straightforward task.
The most important thing to consider is how extensively your database system
supports
distribution. PostgreSQL, MySQL, and Oracle, for example, have a number of native
and
external methods for distributing their databases, but not all database systems are
so
robust or so focused on providing a distributed service. Research must be performed
to
determine whether the database system supports the sort of distribution required.
Business data refers to the information about people, places, things, business
rules, and
events in relation to operating a business.
Knowing things about people and their buying behaviors can help a company generate very important business data. For instance, statisticians and market researchers know that certain age groups have unique buying habits. People of different races and from different demographic locations also have buying patterns of their own, so gathering this information in one business database can be a good basis for future targeted marketing.
In terms of production, business data about where to get raw materials, how much they cost, what the customs and importation policies of the raw materials' country of origin are, and other such information is also very important.
There are many software applications that manage business data for easy statistical analysis and reporting.
In many companies, a business data warehouse is maintained where data from several sources are collected and integrated every few minutes. These repositories of business data can supply the information needed to generate reports and recommendations in an intelligent manner. Hence the term Business Intelligence is already widely used in the business industry today.
For processing billions of business data in the data warehouse for business
intelligence,
companies use high powered and secure computer systems that are installed with
different
levels of security access.
Several software applications and tools have been developed to gather and analyze
large
amounts of unstructured business data ranging from sales statistics, production
metrics to
employee attendance and customer relations. Business intelligence software
applications
vary depending on the vendor, but the common attribute in most of them is that they
can be
customized based on the needs and requirements of the business company. Many
companies have in-house developers to take care of business data as the company
continues to evolve.
Some examples of business intelligence tools used to process business data include scorecarding, business activity monitoring, business performance management and performance measurement, enterprise management systems, and supply chain management/demand chain management. Free and open source business intelligence products include Pentaho, RapidMiner, SpagoBI and Palo, an OLAP database.
Business data is the core of the science of Analytics. Analytics is the study of
business data
that uses statistical analysis in knowing and understanding patterns and trends in
order to
foresee or predict business performance. Analytics is commonly associated with data about business activities: acquiring or capturing those data, and maintaining them in the data resource.
Every day, billions of pieces of data and information get carried across different communications media. The number one medium is, of course, the internet. Other data communications media are television and mobile phones.
A non-business individual or entity may not find any real significance in these data. They are merely there because it is innate in people to communicate and get connected with each other.
But individuals or organizations who think about business take advantage of these
data in a
business-driven approach. They try to collect, aggregate, summarize and
statistically
analyze data so they know what products people may want to see and who among the
people
will be their target market.
Many people who started from scratch but were driven by passion for business have
created
multinational corporations. Traditional vendors started from the sidewalk and took
a
business-driven approach to move their wares from the streets to the internet
where
millions of potential buyers converge.
Today's businesses are very closely dependent on the use of technology. More and
more
transactions are going online. With the possibility of high security money
transfers, people
can buy items from continents away using just their credit cards. They can also
maintain
online accounts and make fund transfers in seconds at one click of the mouse.
Software application developers and vendors are coming up with innovative tools to
help
businesses optimize their performance. Large data warehouses, those repositories of all sorts of data getting bulkier every single minute, are being set up, and companies are investing in high-powered computers connected to local area networks and the internet.
To manage these data warehouses, database administrators use relational database
technology with the aid of tools like Enterprise Resource Planning (ERP), Customer
Resource
Management (CRM), Advance Planner and Optimizer (APO), Supply Chain Management
(SCM), Business Information Warehouse (BIW), Supplier Relationship Management
(SRM),
Human Resource Management System (HRMS) and Product Lifecycle Management (PLM)
among thousands of others.
To stay stable and credible, especially when dealing with a global market over the internet, a business needs to ensure the security of critical data and information. Keeping these data safe is a responsibility which requires coordinated, continuous and very focused efforts. Companies should invest in infrastructures that shield critical resources.
To keep the business afloat in all situations, it has to see to it that procurement
and
contract management is well maintained and coordinated. It has to wield substantial
purchasing power to cope with new and highly competitive contracting methods.
A business needs to constantly evaluate enterprise resource planning implementation
and
should have a distinct department to manage finance and human resources with
principles
that are sound.
Business drivers are the people, information, and tasks that support the
fulfillment of a
business objective. They lead the company, steering it away from pitfalls and turning unforeseen mistakes into good lessons for future success and sustainability.
Technology is fast evolving and businesses that do not evolve with technology will
suffer
tremendously. The world of business is a competitive sink or swim arena. Every
single day,
a business is exposed to risks and challenges but this is just a natural thing.
The foremost key business drivers, of course, are the staff and talents: the people. Being the literal brains behind every business, people set the objectives, execute critical decisions and drive the constant innovation that moves the business forward. Human resources departments scout for the best people in the business industry every day. Different people have different aptitudes, so it is important to employ only those with business stamina. While most companies prefer people with advanced degrees, specifically master's degrees and doctorates in business administration and the like, there are many people who have not even finished college but have innate skills for running a business effectively.
The fast rise in popularity of open source and open standards has also been a significant business driver in recent years. Open standards make possible the sharing of data in non-proprietary formats, so portability is no longer an issue. In the past, integration of a company's internal systems was expensive because different computing platforms could not effectively communicate with each other. Now, with ever larger volumes of data being transferred from one data warehouse to another, open standards are making things a lot faster and more efficient.
Open source, on the other hand, gives companies free software or software with a minimal fee, and thus saves the company a lot of money that can be used for other investments. Open source software applications are in no way inferior to commercial applications. With open source, anybody can make additions to the source code, so this can be a good way to customize software to the business architecture.
Business Architecture
Business architects should also determine the scope of the business. How can the company grow and branch out to new areas or countries? What is the expected annual growth based on product manufacture and sales revenues?
There are many more considerations to account for, and once all these are in one place, a business architect starts drawing the design. The design must cater to all aspects of the business. There is no trivial aspect in business, as tiny details can create a huge impact on the organization as a whole.
As one of the layers of the IT architecture, the business architecture is the very framework that the other layers, information, application and technology, are based on. Business data constitute information which may be used by business software applications, which in turn are executed by hardware technology. All these other layers operate within the business architecture framework.
The nature of business is fast evolving. Many traditional businesses have evolved from a local presence to global proportions. One mobile phone company has evolved from wood pulp mills to rubber works to what is today perhaps one of the global leaders in technological innovation. Some insurance companies also consider themselves banks and vice versa. These complex business evolutions need to be tracked so that appropriate changes in the business architecture can be taken care of by the information systems architects.
Business experts are people who thoroughly understand the business and the data
supporting the business. They know the specific business rules and processes.
In a typical company setup, a person has a long way to climb up the corporate ladder before he or she lands a top management position. During the climb, he or she learns valuable lessons about the company's objectives, decision-making guidelines, internal and external policies, business strategy and many other aspects of the business.
The chief executive officer (CEO) is the business organization's highest ranking
executive.
He is responsible for running the business, carrying out policies of the board of
directors and
making decisions that can highly impact the company. The CEO must not only be a
business expert in general but must also be a business expert in particular about
all the
details of the business he is running. He can be considered the image of the
company and
his relationship both internally and externally with other companies is very vital
for the
success of the business.
Accountants are key business experts responsible for recording, auditing and inspecting the financial records of a business and for preparing financial and tax reports. Most accountants also give recommendations by laying out projected sales, income, revenue and other figures.
Marketing people, whether marketing managers or staff, are constantly on the lookout for marketing trends. They gather different statistical and demographic data so they know the target market for goods and services. They work closely with advertising people.
Business software developers are among the top notch information technology
professionals. Aside from mastering the technical aspect of IT like computer
languages and
IT infrastructure, they must also know the very framework of the business
architecture. Their applications are made to automate business tasks like
transaction
processing and all kinds of business related reporting.
Customer relations specialists take care of the needs of clients, especially current and loyal ones. They make sure that clients are satisfied with the products and services. They also act like marketing staff by recommending other products and services to clients. Their main responsibility is keeping the clients happy, satisfied and wanting more.
Human resource staff take care of hiring the best minds suited for the business.
Since
businesses offer different products and services, the human resource staff is
responsible for
screening potential employees to handle the business operations. In order for the
human
resource staff to match the potential candidate, they (the HR staff) should also
know the ins
and outs of the business they are dealing with.
There are also business experts who do not want to be employed by other companies; instead, they want to have their own business and be their own boss. These people are called entrepreneurs. Entrepreneurs invest their own money in the business of their choice. They carry out feasibility studies before putting in their money, or they may hire the services of a business consultant.
A business consultant is a seasoned business expert who has a lot of business success under his belt. Most business consultants are not fully attached to one company alone. Business consultants know all aspects of the business. A consultant recommends actions by studying the financial status and transaction history of the company he is offering his services to. Many companies offer business consultancy as their main line of service.
Corporate lawyers focus on laws pertaining to business. They take charge of contracts and represent companies during times of legal disputes with other entities. In all undertakings, whether traditional or new, corporate lawyers have to ensure that the company does not violate any law of a given country.
Business rules are very important for defining a business schema. A business rule can influence and guide behavior, lend support to policies, and shape the response to environmental events and situations. Rules may be the primary means whereby a business organization directs its movement, defines its objectives and performs appropriate actions.
The terms in a business schema are the precise definitions of words used in the business rules. Order, Product Type and Line Item are terms that refer to entity classes, which are things of great significance to the business. Attributes are terms that describe an entity class. For example, total number and total value are attributes of an Order. Attributes of Product Type may include manufacturer, unit price and materials used. Quantity and extended value are attributes of the Line Item entity class.
Facts, another element of the schema, describe a thing, such as the role the thing plays, and other characteristics. A data model has three kinds of facts: relationships, attributes and supertypes/subtypes.
A derivation is any attribute that is derived from other attributes or system variables. For example, the extended value attribute of the Line Item entity class can be determined by multiplying the quantity of the line item by the unit price of the product type.
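Expressed in SQL, that derivation might look like the sketch below, with the Product Type and Line Item entity classes as tables; the attribute names follow the text, while the key columns and data types are assumptions.

-- Entity classes from the schema discussion, sketched as tables.
CREATE TABLE product_type (
    product_type_id INTEGER PRIMARY KEY,
    manufacturer    VARCHAR(100),
    unit_price      DECIMAL(10, 2) NOT NULL
);

CREATE TABLE line_item (
    line_item_id    INTEGER PRIMARY KEY,
    product_type_id INTEGER REFERENCES product_type (product_type_id),
    quantity        INTEGER NOT NULL
);

-- Derivation: extended value = quantity of the line item * unit price of the product type.
SELECT li.line_item_id,
       li.quantity * pt.unit_price AS extended_value
FROM line_item AS li
JOIN product_type AS pt ON pt.product_type_id = li.product_type_id;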
Many companies hire consultants to document and consolidate their standards, rules, policies and practices. This documentation is then handed to IT and database consultants so that it can be transformed into database rules that follow the business schema.
Many independent or third party consultancy firms, research institutions and
software
application vendors and developers offer rich business schema solutions. Even if
business
rules are constantly changing and software applications are already in final form,
these
software applications can be customized to be in sync with the constantly changing
business
rules in particular and the schema in general.
A business rules engine is a widely used type of software application that manages and automates business rules so that they follow legal regulations. For example, the software ensures that the rule stating "An employee can be fired for any reason or no reason but not for an illegal reason" is followed.
The rules engine's most significant function is to help classify, register and manage business and legal rules and verify them for consistency. Likewise, it can infer some rules based on other existing rules and relate them to IT applications. Rules in IT applications can automatically detect unusual situations arising from operations.
With a business schema clearly defined, there is little room for mistake in running
a
business successfully.
Business-driven data distribution refers to the situation where the business need
for data at
a specific location drives the company to develop a data site and the distribution
of data to
the said newly developed site. This data distribution is independent of a
telecommunications
network.
Having a data warehouse can have tremendous positive impact on a company because
the
data warehouse can allow people in the business to get timely answers to many
business
related questions and problems. Aside from answers, data warehouses can also help
the
company spot trends and patterns in the spending lifestyles of both existing clients and potential customers.
Building a data warehouse can be expensive. With today's businesses relying heavily
on
data, it is not uncommon for companies to invest millions of dollars in IT infrastructure alone. Aside from that, companies still need to spend on hiring IT staff and consultants.
For many companies, the best way to grow is to utilize the internet as the main
market
place of their goods and services. With the internet, a company can be exposed to
the
global market place. People from different places can have easy access to goods or
make
inquiries when they are interested. Customer relationships can be sealed a lot faster without staff having to travel to far-off places. The company can have a presence in virtually all
countries
around the world.
But with a wide market arena comes greater demand for data. One data warehouse location alone may not be able to handle the bulk of data that needs to be gathered, aggregated, analyzed and reported.
This is where business-driven data distribution comes in. For instance, a company
that has
mainly operated in New York and London has one data warehouse to serve both
offices. But
since many people on the internet have come to know of the company, availed themselves of its services and have since become loyal clients, many different locations are starting to generate bulk data. Say Tokyo has an exponentially increasing pool of clients; the company should then consider creating a new data warehouse in that location. This
data
warehouse can be the source of data to distribute to other data warehouses located
in other
areas but maintained by the same company.
If all these questions are coming up with positive answers, then the next logical
step would
be to design and implement the new data warehouse that will become the new
business-
driven data distribution center. As in most data warehouses, the appropriate high
powered
computers and servers will have to be bought and installed. Relational database
software
applications and other user friendly tools should also be acquired.
As time progresses, a company may have several locations with data warehouses acting as business-driven data distribution centers. These separate warehouses continually communicate with each other, transferring data updates every minute and synchronizing and aggregating data
so business staff can get the most recent trends and patterns in the market.
With this setup, a company can have a very competitive advantage over its competitors.
They can formulate new policies, strategies and business rules based on the demand
created from the reports of the many business driven data distribution centers
around the
world.
What is Business Activity?
Business activities utilize all data resources and platform resources in order to perform the specific tasks and duties of the company.
For a company to survive in the business, it must have a competitive edge in the
arena
where multitudes of competitors exist. Competitive advantages can be had by having
rigorous and highly comprehensive methods for current and future rules, policies
and
behaviors for the processes within the organization.
Business activities are the bases for setting management procedures that support the company in achieving the best financial quality and value in IT operations and services.
Although these broad categories may apply to many businesses, companies have
differing
needs within those categories. For instance, a law firm may not need to acquire raw
materials as
do furniture companies. Hotels need intensive marketing while an environmental non
profit
organization may not need marketing at all.
Several vendors sell IT solutions that take care of business activities and expedite manual work by automating it through computers.
An Enterprise Resource Planning (ERP) system is a software-based application that integrates all the data processes of a company into one efficient and fast system. Generally, ERP uses several software and hardware components to achieve this integration. ERP systems are highly dependent on relational databases and involve huge data requirements. Companies set up data warehouses to feed the ERP.
Supply Chain Management (SCM) covers the planning, implementation and control of operations related to the storage of raw materials, work-in-process inventory, and the movement of finished products from point of origin to point of consumption.
A supply management system takes care of the methods and processes involved in institutional or corporate buying. Corporate buying may include purchasing raw materials or already finished goods to be resold.
Data Tracking
Data tracking involves tracing data from the point of its origin to its final state. Data tracking is very useful in a data warehouse implementation. As is well known, a data warehouse is a very complex system that involves disparate data coming from various operational data sources and data marts.
Hence, data keeps traveling from one server to another. Data tracking helps develop
data
collection proficiency at each site when proper management actions are being taken
to
ensure data integrity.
Some servers handle data processes by archiving raw, computed and test data automatically. Raw data are stored exactly in the format in which they were received, including the header, footer and all other information about the data.
Data tracking can be employed to improve the quality of transactions. For example, suppose I make a withdrawal at an automated teller machine (ATM) and something unexpected happens: the machine does not dispense any money but deducts the amount from my balance, as reflected in the receipt it prints. When I report this incident to the bank, the staff can easily trace the series of events by tracking the data I entered into the machine and the activities that both I and the ATM performed. Because the data is tracked, they can easily spot the pattern that led to the problem and immediately take action to improve the service.
Data tracking can also be used in cases where fraud has been committed. Inside the company, if there is an erring person, the data tracking process may involve inspecting the audit trail logs.
Some websites offer aggregated data through data tracking by acquiring certain
fields of
data using remote connection technologies. For instance, a web application can track a company's sales, refreshed regularly, so that while a staff member is on business travel or simply on holiday, he can still see what is happening in the business operation.
In another example, a website may offer a service such as an hourly update of the
global
weather, and this can be done by tracking data from data sites in different geographical locations.
But despite these uses, there are also issues associated with data tracking. Security and privacy are among the biggest areas of concern. Many websites try to install small pieces of code in the web browser so that they can track the return visits of an internet user and bring up the preferences he has specified during his previous visits.
This information about preferences is tracked from the small piece of code copied onto the computer. This code is called a "cookie". Cookies in themselves are intended for good use, but many coders have exploited them by making them track sensitive information and steal it for bad purposes.
There are many specialized tools for data tracking available as free downloads or commercially for a fee. These data tracking tools have easy-to-use graphical dashboards and very efficient back-end programs that can give users the data they need to track on a real-time basis.
Graphical data tracking applications like these make perfect monitoring tools for database or data warehouse administrators who want to keep track of the data traveling from one data mart or operational data source to another. The graphical presentation lets the administrator easily spot erring data and have an instant idea of where that data is currently located.
With today's fast paced business environment, data tracking tools and devices can
greatly
enhance the information system of any organization in order for them to make wise
decisions and corresponding moves in the face of a challenging situation.
Intelligent Agent
There are two general types of intelligent agents: the physical agent and the
temporal
agent. The physical agent refers to an agent which uses sensors and other less
abstract and
more tangible means to do its job. On the other hand, the temporal agent may be
purely
codes that use time based stored information which are triggered depending on
configuration.
From those two general types of intelligent agents, there may be five classes of
intelligent
agents based on the degree of their functionalities and capabilities. These five
are simple
reflex agents, model-based reflex agents, goal-based agents, utility-based agents
and
learning agents.
The simple reflex agent functions on the basis of its most current perception and follows a condition-action rule such as "if condition then action". The success of the agent's job depends on how fully observable the environment is.
The model-based agent stores its current state inside the agent, maintaining structures that describe the parts of the world that are unseen, and this kind of agent can handle environments which are only partially observable. The behavior of this kind of agent requires information about the way the world works and behaves, and thus it is sometimes considered to have a model of the world.
The goal-based agent is actually a model-based agent, but it also stores information about which situations and circumstances are more desirable, allowing the agent to make good choices from among many possibilities.
The utility-based agent uses a function that maps a state to a measure of the utility of that state.
Finally, a learning agent is a self-governing intelligent agent that can learn and adapt to constantly changing situations. It can quickly learn even from large amounts of data, and its learning can be online and in real time.
Intelligent agents have become the new paradigm for software development. The concept behind intelligent agents has been hailed as "the next significant breakthrough in software development". Today, intelligent agents are used in an increasingly wide variety of applications.
What is an Operational Database
An operational database, as the name implies, is a database that is currently and progressively in use, capturing real time data and supplying data for real time computations and other analyzing processes.
For example, an operational database is the one used for taking orders and fulfilling them in a store, whether it is a traditional store or an online store. Other areas of business that use an operational database include catalog fulfillment systems and Point of Sale systems in retail stores. An operational database is used for keeping track of payments and inventory. It takes information and amounts from credit cards, and accountants rely on the operational database because it must balance up to the last penny.
An operational database is also used to support IRS tax filings and regulations, which is why it is sometimes managed by IT for the finance and operations groups in a business organization. Companies can seldom run successfully without an operational database, as this database holds accounts and transactions.
Because of the very dynamic nature of an operational database, there are certain issues that need to be addressed appropriately. An operational database can grow very fast in size and bulk, so database administrators and IT analysts must invest in high powered computer hardware and top notch database management systems.
Most business organizations have regulations and requirements that dictate storing data for longer periods of time. This can create an even more complex setup in relation to database performance and usability. With ever increasing operational data volume, operational databases come under additional stress when processing transactions, and things slow down. As a general trend, the more data there is in the operational database, the less efficient the transactions running against the database tend to be.
There are several reasons for this. One of the most obvious is that table scans need to reference more pages of data to produce results. Indexes also grow in size to support larger data volumes, and with this increase, access by the index can degrade because there are more levels to traverse. Some IT professionals address this problem with solutions that offload older data to archive data stores.
Operational databases are just part of the entire enterprise data management and
some of
the data that need to be archived go directly to the data warehouse.
An operational data store (ODS) is basically a database used as an interim area for a data warehouse. As such, its primary purpose is to handle data which is progressively in use, such as transactions, inventory and data collected from Point of Sale systems. It works with a data warehouse, but unlike a data warehouse, an operational data store does not contain static data. Instead, an operational data store contains data which is constantly updated through the course of business operations.
An ODS is specially designed to quickly perform relatively simple queries on smaller volumes of data, such as finding the orders of a customer or looking for available items in the retail store. This is in contrast to the structure of a data warehouse, where one needs to perform complex queries on high volumes of data. As a simple analogy, the operational data store is a company's short term memory, storing only the most recent information, while the data warehouse is the long term memory, serving as the company's historical data repository whose stored data is relatively permanent.
The history of the operational data store goes back to as early as 1990, when the original ODS systems were developed and used as reporting tools for administrative purposes. But even then, the ODS was already dynamic in nature and was usually updated every day as it provided reports about daily business transactions such as sales totals or orders being filled.
The ODS of that time is now referred to as a Class III ODS. As information technology evolved, so did the ODS, with the coming of the Class II ODS, which was capable of tracking more complex information such as product and location codes and of updating the database more frequently (perhaps hourly) to reflect changes. Then came the Class I ODS systems, which grew out of the development of customer relationship management (CRM).
For many years, IT professionals had great problems integrating legacy applications, as the process consumed many resources for maintenance, and other efforts did little to take care of the needs of the legacy environments. After experimentation and the development of new technologies, there was little left of the company's IT resources. As IT people experienced with legacy applications, the legacy environment had become the child consuming its parent.
There were many approaches to responding to the problems associated with legacy systems. One approach was to model data and apply information engineering, but this proved slow in delivering tangible results. With the growth of legacy systems came growth in complexity, as well as in the data model.
Another response to legacy system problems was the establishment of a data warehouse, and this has proven to be beneficial, but a data warehouse only addresses the informational aspect of the company.
The development of the operational data store has greatly addressed the problems associated with legacy systems. Much like with a data warehouse, data from legacy systems is transformed and integrated into the operational data store; once there, the data ages and is then passed into the data warehouse. One of the main roles of the ODS is to present a collective, integrated view of the up-to-the-second operations of the company. It is very useful for corporate-wide, mission-critical applications.
What is OLAP
OLAP is part of a wide category of business intelligence that also includes ETL (extract, transform, load), relational reporting and data mining. Some critical areas of a business enterprise where OLAP is greatly used include business reporting for sales and marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.
Databases that are planned and configured for use with OLAP use a multidimensional data model which enables complex analytical and ad-hoc queries with a rapid execution time. Outputs from an OLAP query are displayed in a matrix or pivot format, with dimensions forming the rows and columns of the matrix and the measures forming the values.
OLAP is a function of business intelligence software that enables end users to easily and selectively extract data and view it from different points of view. In many cases, OLAP has aspects designed for managers who want to make sense of enterprise information and see how well the company fares against the competition in a certain industry. OLAP tools structure data in a hierarchical manner, which is exactly the way many business managers think of their enterprises. But OLAP also allows business analysts to rotate data and change relationships and perspectives, so they get deeper insights into corporate information and can analyze historical as well as future trends and patterns.
The OLAP cube is found at the core of any OLAP system. The OLAP cube is also referred to as the multidimensional cube or hypercube. This cube contains numeric facts, called measures, and these measures are categorized by several dimensions. The metadata of the cube is often made from a snowflake schema or star schema of tables in a relational database: the measures are derived from the records in the fact table, and the dimensions are derived from the dimension tables.
It is claimed that an OLAP cube can answer complex queries in around 0.1% of the time needed for the same query on OLTP relational data. Aggregation is the key to this performance: in OLAP, aggregations are built from the fact table by changing the granularity on specific dimensions and aggregating the data up along these dimensions.
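As a minimal sketch of that roll-up idea (the fact_sales table, its columns, and the assumption that dates are stored as YYYY-MM-DD text are all illustrative, not taken from any particular product), an aggregate at the product-by-month grain could be built like this:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse database
    # Roll the detailed sales fact up to one row per product per month.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS agg_sales_product_month AS
        SELECT product_key,
               substr(sale_date, 1, 7) AS sale_month,   -- coarser granularity on the date dimension
               SUM(sales_qty)          AS total_qty,
               SUM(sales_amount)       AS total_amount
        FROM   fact_sales
        GROUP  BY product_key, substr(sale_date, 1, 7)
    """)
    conn.commit()
    conn.close()

Queries that only need monthly totals can then read the much smaller aggregate table instead of scanning the full fact table.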
There are different kinds of OLAP, such as Multidimensional OLAP (MOLAP), which uses database structures that are optimal for attributes such as time period, location, product or account code; Relational OLAP (ROLAP), where the base data and the dimension tables are stored as relational tables and new tables are created to hold aggregated information; and Hybrid OLAP (HOLAP), which is a combination of OLAP types.
What is OLTP
OLTP (online transaction processing) relies on client-server processing and brokering software applications that enable transactions to run on various computer platforms within a network.
Today, with the ubiquity of the internet, more and more people, even those in remote areas, are now doing transactions online in an e-commerce environment. The term transaction processing is often associated with the process wherein an online shop or e-commerce website accepts and processes payments through a customer's credit or debit card in real time in return for purchased goods and services.
There are also many OLTP brokering programs that can distribute transaction processing among multiple computers on a network, enhancing the functioning of an OLTP system working on a more demanding decentralized database. Service oriented architectures and web services are now commonly integrated with OLTP.
The two main benefits of using OLTP are simplicity and efficiency. OLTP helps simplify a business operation by reducing paper trails and helping draw faster and more accurate forecasts of revenues and expenses. OLTP helps provide a concrete foundation through timely updating of corporate data. For an enterprise's customers, OLTP allows more choices in how they want to pay, giving them more flexibility and enticing them to make more transactions. Most OLTP systems offer services to customers 24 hours a day, seven days a week.
But despite the great benefits that OLTP can give to companies and their customers, there are certain issues that it needs to address. The main issues pertaining to OLTP are security and economic cost. Because an OLTP implementation is exposed on a network, more specifically the internet, the database may be susceptible to hackers and intruders who may be waiting on the side to get sensitive information about people and their bank and credit card accounts.
What is Aggregation
In the broadest sense of the word, aggregation means collecting and combining data horizontally, vertically and chronologically and then expressing it in summary form to be used for statistical analysis. In the more technical sense, aggregation is a special kind of association that specifies a part-whole relationship between a component part and the whole.
Data aggregation can be used in personal data aggregation services to offer a user one point of collection for his personal information from other websites. The user can use a single master personal identification number (PIN) to access a variety of other accounts, such as airlines, clubs and financial institutions. This kind of aggregation is often called "screen scraping".
Over time, large amounts of aggregated account data are transferred from provider to server and may develop into a comprehensive database of user profiles with details of balances, securities transactions, credit information and other information. Privacy and security then become major issues, but there are independent companies offering these related services.
Because of the possibility of liabilities arising from activities related to data aggregation, such as security issues and infringement of intellectual property rights, aggregators may agree on a data feed arrangement at the discretion of the end user or customer. This may involve using the Open Financial Exchange (OFX) standard in requesting and delivering the information to the customer. Such an agreement provides an opportunity for institutions to protect their customers' interests and for aggregators to come up with more robust services. Screen scraping without the content provider's consent, in contrast, can allow subscribers to see any account opened through a single website.
Automatic Data Partitioning
Automatic data partitioning is the process of breaking down large chunks of data and metadata at a specific data site into partitions according to the request specification of the client.
Data sites contain multitudes of varied data which can be extremely useful as a statistical basis for determining many trends in business. Because data in the data sites can grow at a very fast rate, the demand for internet traffic also increases, so good software with partitioning capability should be employed to manage the data warehouse. Many software applications handling data also have advanced functions like traffic shaping and policing so that sufficient bandwidth can be maintained.
Relational database management systems (RDBMS) effectively manage data sites. This kind of database system follows the relational model introduced by E. F. Codd, in which data is stored in tables while the relationships among data are stored in other tables. This is in contrast to flat files, where all data is stored in one contiguous area.
Since RDBMS data is not stored in one contiguous area but instead broken down into tables, it becomes easy to partition the data, whether manually or automatically, for easy sharing and distribution.
The biggest advantage of data partitioning is that one can divide large tables and indexes into smaller parts; as a result, the system's performance can be greatly improved while contention is reduced and data availability and distribution are increased. Automatic data partitioning makes the job of the database administrator a lot easier, especially in labor intensive jobs such as doing backups, loading data, recovering and processing a query.
Vertical partitioning is another technique wherein tables are created with fewer columns, with additional separate tables to store the remaining columns. Usually, the process involves the use of different physical storage.
Some of the partitioning methods used as criteria include range partitioning, list
partitioning, hash partitioning and composite partitioning.
Hash partitioning uses the value returned by a hash function. For instance, if there are four partitions, the value returned by the function could be from 0 to 3.
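As a rough sketch of the idea, assuming Python's built-in hash function and an invented customer row layout purely for illustration, rows could be routed to one of four partitions like this:

    NUM_PARTITIONS = 4

    def partition_for(key):
        # Map the partitioning key to a partition number between 0 and 3.
        return hash(key) % NUM_PARTITIONS

    partitions = [[] for _ in range(NUM_PARTITIONS)]
    rows = [("C-1001", "Ramesh"), ("C-1002", "Anita"), ("C-1003", "John")]
    for customer_id, name in rows:
        partitions[partition_for(customer_id)].append((customer_id, name))

A real RDBMS would apply its own hash function and store each partition separately, but the routing principle is the same.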
What is Cache
A cache is a type of dynamic, high speed memory used to supplement the central processing unit and the physical disk storage. The cache acts as a buffer when the CPU tries to access data from the disk, so that the data traveling between the CPU and the physical disks can have synchronized speed. It serves as temporary storage where the most frequently accessed data is kept for fast processing. In future processing, the CPU may access this duplicated copy instead of fetching it from the physical disk storage, which is slower and would hurt performance.
Other memory caches are built directly into the body of the microprocessor. For instance, the old Intel 80486 microprocessor had 8K of memory cache, while the Pentium had 16K. This kind of cache is also called the Level 1 (L1) cache. Modern computers also come with external cache memory, called the Level 2 (L2) cache, which sits between the CPU and the DRAM.
Disk caching, on the other hand, is similar to memory caching except that the disk cache uses conventional main memory instead of high speed SRAM. Frequently accessed data from the disk storage device is stored in a memory buffer, and a program first checks whether the data is in the disk cache before getting it from the hard disk. This method significantly increases performance because access speed in RAM can be thousands of times faster than access speed on hard disks.
A cache hit is the term used when requested data is found in the cache, and the cache's effectiveness is determined by its hit rate. A technique known as smart caching is used by many cache systems as well; it recognizes certain types of data that are frequently used and automatically caches them.
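The sketch below is a purely illustrative, in-memory picture of a cache and its hit rate; the slow_fetch function stands in for a slow disk or database read:

    cache = {}
    hits = 0
    misses = 0

    def slow_fetch(key):
        # Stand-in for a slow disk or database read.
        return "value-for-" + key

    def cached_fetch(key):
        global hits, misses
        if key in cache:          # cache hit: serve from fast memory
            hits += 1
            return cache[key]
        misses += 1               # cache miss: do the slow read, then remember it
        value = slow_fetch(key)
        cache[key] = value
        return value

    for key in ["a", "b", "a", "a", "c", "b"]:
        cached_fetch(key)
    print("hit rate:", hits / (hits + misses))   # 3 hits out of 6 lookups -> 0.5
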
Another kind of cache is used by the BIND DNS daemon, which maps domain names to IP addresses. Caching recent lookups makes it faster to match numeric IP addresses with their corresponding domain names.
Web browsers also employ a caching system for recently viewed web pages. With this caching system, a user does not have to wait for data from remote servers because the latest pages are in his computer's web cache. A lot of internet service providers use proxy caches for their clients to save on bandwidth in their networks.
Some search engines keep indexed pages in their cache, so when links to these web pages are shown in the search results and the actual website is temporarily offline or inaccessible, the search engine can give the cached pages to the user.
What is Access Path Selection
A query may request at least one variable to be filled with one or more values. A query may look like this (the table and column names are only illustrative):

    SELECT * FROM customer WHERE last_name = 'Smith'

The query tells the computer to select all 'Smith' family names, which may number a few thousand from among tens of thousands of rows in the database table. The database management system will have to estimate a filter factor using the value supplied for the variable, and an access path will have to be determined to get to the data.
Access path selection can make a tremendous impact on the overall performance of the system. The query mentioned above is a very simple query, with one variable matched against values from one table only. A more complex query may involve many variables matched against many different records in separate tables, and some of these variables may have conditions such as being greater than or less than some value. Many relational database makers have their own algorithms for choosing access paths while minimizing total access cost.
Optimization of access path selection may be gauged using cost formulas, with I/O and CPU utilization usually weighed. Generally, query optimizers evaluate the available paths to data retrieval and estimate the cost of executing the statement along each candidate path. In choosing an access path, the RDBMS optimizer examines the WHERE clause and the FROM clause. It then lays out possible execution plans using the determined paths and, with the use of statistics for the columns, indexes and tables accessible to the statement, estimates the cost of executing each plan.
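As a very rough illustration of that kind of weighing (the page counts, index depth and selectivity below are invented inputs, not any vendor's actual cost formula), an optimizer might compare a full table scan against an index scan like this:

    # Hypothetical inputs describing the table and the predicate.
    table_pages = 10_000        # pages read by a full table scan
    index_levels = 3            # B-tree levels traversed by an index scan
    matching_rows = 2_500       # rows expected to satisfy the predicate
    rows_per_page = 50

    full_scan_cost = table_pages
    index_scan_cost = index_levels + (matching_rows / rows_per_page)

    # Pick the cheaper access path, as a real optimizer would after weighing
    # I/O and CPU costs for every candidate plan.
    chosen = "index scan" if index_scan_cost < full_scan_cost else "full table scan"
    print(chosen, full_scan_cost, index_scan_cost)
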
Access path selection for joins, where data is taken from more than one table, is basically done using the nested loop and merging scan techniques. Because joins are more complex, there are some additional considerations in determining access paths for them.
In general, the most common access path choices include the following:
Full Table Scan - The RDBMS software scans all rows in the table and filters out those that do not match the criteria in the query.
Row ID Scan - This is the fastest retrieval method for a single row. The row identifier (rowid) gives the exact location of the row in the database.
Index Scan - With this method, the RDBMS retrieves rows by traversing the index using the indexed column values required by the query statement. There are many types of index scans, which may include Index Unique Scans, Index Range Scans, Index Skip Scans, Full Scans, Fast Full Index Scans, Index Joins and Bitmap Indexes.
Cluster Access Scan - This is used to retrieve all rows that have the same cluster key value. The rows come from a table stored in an indexed cluster.
Hash Access Scan - This method locates rows in a hash cluster based on a hash value. All rows with the same hash value are stored within the same data block.
Another approach to optimizing access path selection defines an index scan and a segment scan, the two types of scans available to SQL statements. Before the tuples are returned, simple predicates called search arguments (SARGS) are applied against the indexes. There are many more techniques under research by RDBMS vendors such as Microsoft (SQL Server), Oracle, MySQL, PostgreSQL and many others.
What is an Ad-Hoc Query
An ad-hoc query is a query that cannot be determined prior to the moment it is issued. It is created in order to get information as the need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. This is in contrast to a query that is predefined and performed routinely.
The term ad hoc comes from Latin and means "for this purpose". It generally refers to anything that has been designed to answer a very specific problem. An ad hoc committee, for instance, is created to deal with a particular undertaking, and after the undertaking is finished the committee is disbanded. In the same manner, an ad hoc query does not reside in the computer or the database manager but is dynamically created depending on the need of the moment.
In the past, for users to analyze various kinds of data, multiple sets of queries had to be constructed. These queries were predefined under the management of a database or system administrator, so a barrier existed between the users' needs and the canned information. As a result, the end user got a bombardment of unrelated data in his query results, and IT resources also took a heavy toll, since a user might have to execute several different queries in any given period.
Today's widely used active data warehouses accelerate retrieval of vital information to answer interactive queries in mission critical applications.
Most users of data are in fact non-technical people. Every day, they retrieve seemingly unrelated data from different tables and database sources. Many ad hoc query tools exist so that non-technical users can execute very complex queries without having to know what happens at the back end. Ad hoc query tools include features to support all types of query relationships, including one-to-many, many-to-one and many-to-many. End users can easily construct complex queries using graphical user interface (GUI) navigation through object structures in a drag and drop manner.
Ad hoc queries are used intensively on the internet. Search engines process millions of queries every second from different data sources. Keywords typed by an internet user dynamically generate an ad hoc query against virtually any database back end. Since the basic structure of an SQL statement consists of SELECT keyword FROM table WHERE conditions, an ad hoc query dynamically supplies the keywords, data source and conditions without the user knowing it.
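A minimal sketch of how such a query might be assembled at run time, assuming a hypothetical SQLite database with a product table (and binding user input as parameters rather than pasting it into the SQL string, which is the usual safe practice):

    import sqlite3

    def ad_hoc_search(conn, keyword, category=None):
        # Assemble the SQL at the moment the request arrives.
        sql = "SELECT name, category FROM product WHERE name LIKE ?"
        params = ["%" + keyword + "%"]
        if category is not None:            # the user may or may not supply a condition
            sql += " AND category = ?"
            params.append(category)
        return conn.execute(sql, params).fetchall()

    conn = sqlite3.connect("shop.db")        # hypothetical database file
    rows = ad_hoc_search(conn, "rice", category="grocery")

The WHERE clause grows or shrinks depending on which conditions the user actually supplied, which is exactly what makes the query ad hoc.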
Issuing an ad hoc query against a database may be more efficient in terms of computer resources, because relying on predefined queries may force one to issue more than one query; however, an ad hoc query may also have a heavy resource impact, depending on the number of variables it needs to answer.
To reduce the memory impact of ad hoc queries, the computer should have a large amount of memory and very fast devices for temporary disk storage, and the database manager should prevent ad hoc queries with very high memory usage from being executed. Some database managers anticipate huge sort requirements by keeping exact-match pre-calculated result sets.
Data warehouses are very large repositories of all sorts of data, and these data may be of different formats.
To make these data useful to the company, the database running the data warehouse has to be configured so that it obeys the business data model, which reflects all aspects of the business operation. These aspects include all business rules pertaining to transactions, products, raw materials management, human resource management, customer relations and everything else, down to the tiniest of details.
A database administrator is typically needed to manage the size of the data warehouse and the bulk of data being processed. The database administrator has many responsibilities. Among the important duties are creating and testing backups for database recoverability; verifying and ensuring data integrity by making sure tables, relationships and data access are constantly monitored; maintaining database security by defining and implementing access controls on the databases; ensuring that the database is always available and uptime is at a maximum; performance tuning of the database system's software and hardware components; and offering end user support as well as coordinating with programmers and IT engineers.
Many database administrators who manage large data warehouses set certain policies so that data gathering, transformation, extraction, loading and sharing flow with as few problems as possible.
A data access policy defines who has access privileges to what particular data. Certain data, such as administrative data, can contain confidential information, so access is restricted from the general public. Other data is freely available to all staff members of the company so that they can view particular trends in the industry and be guided in coming up with needed innovations in their products or services.
A data usage policy is a guide so that data will not be misused or distributed freely when it is not intended to be. The use of data falls into several categories, such as update, read only or external dissemination.
Most data warehouses have a data integration policy in which all data is represented within a single logical data model from which the physical data models take their data. The database administrator should be extra keen on details when developing such a model and all corresponding data structures and domains. All the enterprise's data needs are considered in the development and future modification of domains, values and data structures.
Data application control policies are implemented to ensure that all data taken from the warehouse is handled with care to ensure integrity, security and availability. It is not uncommon in business organizations for staff to save data locally on division computers, discs and networks, often in different kinds of applications like Word and Excel. Database administrators are responsible for access controls over these different data formats so that resources are not overloaded and data warehouses are protected from unauthorized modification or disclosure, which may result in data loss or loss of integrity.
Relational databases, which are the most widely used kind of database today, especially in very large warehouses, have very complex structures and need to be carefully monitored. They need to strictly obey the business rules of the enterprise as well as stay in sync with all the data models. Investing in robust software applications, which are available from a wide array of software developers and vendors, and hiring a highly trained and skillful database administrator can certainly be a company asset.
Aggregation can be made from different data occurrences within the same data subject, from business transactions and a de-normalized database, and between the real world and the detailed data resource design within the common data architecture.
Reporting and data analysis applications that tie company data users and data warehouses closely together need to overcome problems with database performance. Every single day, the amount of data collected increases at exponential proportions, and along with this increase, the demands for more detailed reporting and analysis tools also grow.
In a competitive business environment, the areas given the most focus to gain a competitive edge include timely financial reporting, real time disclosure so that the company can meet compliance regulations, and accurate sales and marketing data so the company can grow a larger customer base and thus increase profitability.
Data aggregation helps company data warehouses piece together different kinds of data within the data warehouse so that they gain a meaning that is useful for analysis and reporting. But data aggregation, when not implemented well using good algorithms and tools, can lead to inaccurate reporting; ineffective data aggregation is one of the major factors that can limit the performance of database queries.
Statistics have shown that 90 percent of all business related reports contain aggregate information, making it essential to implement data aggregation solutions proactively, so that the data warehouse can generate data with significant performance benefits and open many opportunities for the company to have enhanced analysis and reporting capabilities.
But while such approaches have been proven and tested, they may have disadvantages in the long run; in fact, they have already been lumped among the traditional techniques by some database and data warehouse professionals.
Top data warehouse experts recommend that a well defined, enterprise class solution architected to support dynamic business environments has more long term benefits for data aggregation. Enterprise class solutions provide good methods to ensure that the data warehouse has high availability and easy maintenance.
Having a flexible architecture also allows for future growth, and most business trends nowadays lean toward exponential growth. The data architecture of data warehouses should use standard industry models so it can support complex aggregation needs, and it should be able to support all kinds of reports and reporting environments. One way to test whether the data warehouse is optimized is whether it can combine pre-aggregation with aggregation on the fly.
Data warehouses should be scalable, as the amount of data will definitely grow very fast. Especially now that new technologies like RFID allow the gathering of even more transactional data, scalability will be important for the future data needs of the company.
Data aggregation can grow into a complex process over time. It is always good to plan the business architecture so that data stays in sync between real activities and the data model simulating the real scenario. IT decision makers need to choose software applications carefully, as there are hundreds of choices that can be bought from software vendors and developers around the world.
What is Data Collection Frequency
Data collection frequency, just as the name suggests, refers to how often data is collected, at regular intervals, whatever the time of day or year and whatever the length of the period. A data warehouse regularly runs processes to extract, transform and load data onto the storage system, and along with these processes there can be a potentially large number of data consumers simultaneously accessing the data warehouse to get aggregated data reports for statistical analysis of both company and industry trends.
Having a log of the data collection frequency of a data warehouse is very important
for a lot
of reasons.
For one, knowledge about data collection frequency is extremely important in monitoring data warehouse activities. These activities are monitored by the data warehouse system, and the data collection frequency is useful in analyzing many things, such as whether transactions were legal or illegal, along with other related information.
In a company data warehouse, data collection solutions are important because they enable the business organization to have real time information and visibility into the supply chain. This can greatly improve decision making processes, the accuracy of customer information, product or service sales, material availability, and the reporting of data warehouse operations. Tracking data collection frequency can also help increase return on investment (ROI) through improved equipment and labor productivity.
Data collection frequency is also a great help in advertising and marketing, for instance in determining the media exposure of an e-commerce website. An e-commerce website needs intensive media exposure: on the internet where e-commerce takes place, there are thousands of competitors who will do all they can to get top exposure to internet users and buyers, for example by getting top ranks in search engines. A record of data collection frequency is a good determinant of the media exposure of the e-commerce site and the products and services it offers. The record of how frequently data is collected can be used in calculating the number of prospects reached with different media vehicles at different levels of frequency of exposure.
Sometimes data warehouses experience problems that are both hardware and software in nature. To troubleshoot them, IT professionals generally look at the logs to see at which point the system encountered the problem. Having a record of data collection frequency can give the troubleshooter hints about the problem; for instance, at some point data collection may have been so heavy that processing became intensive to the point of hardware breakdown.
Business intelligence is fast evolving and has long been a critical component of a company's daily operations. As it continues to evolve, the need for real time data warehouses that can provide data consumers with rapid updates becomes even more pressing.
Many companies are finding that they need to refresh their data warehouses on a more frequent basis because business intelligence tools are being used more and more often for operational decision making. According to many data warehouse specialists, a data warehouse is not just about loading data for business analysts to forecast; it is increasingly about daily decisions. With real time data collection, database managers and data warehouse specialists will surely make more room for recording data collection frequency.
What is Data Completeness
Data completeness is an indication of whether or not all the data necessary to meet the current and future business information demand is available in the data resource. It deals with determining the data needed to meet the business information demand and ensuring that those data are captured and maintained in the data resource so they are available when needed.
A data warehouse has several main processes, which should be carefully carried out by the data warehouse administrator in order to achieve data completeness. They include the following:
- Data Extraction - The data in the warehouse can come from many sources and in multiple data formats and types, which may be incompatible from system to system. The process of data extraction includes formatting the disparate data types into one type understood by the warehouse. The process also includes compressing the data and handling encryption wherever this applies.
- Data Loading - After the preceding processes, the data is ready to be optimally stored in the data warehouse.
- Job Control - This is the constant job of the data warehouse administrator and his staff. It includes job definition, time and event based job scheduling, logging, monitoring, error handling, exception handling and notification.
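A minimal sketch of the extraction and loading steps described above, assuming a hypothetical CSV export named daily_sales.csv and an SQLite file standing in for the warehouse (real warehouse loads would of course use far more robust tooling):

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")   # hypothetical warehouse database
    conn.execute("CREATE TABLE IF NOT EXISTS stg_sales (sale_date TEXT, product TEXT, qty INTEGER)")

    # Extract: read the source export; Transform: coerce the quantity to an integer;
    # Load: insert the cleaned rows into the staging table.
    with open("daily_sales.csv", newline="") as f:
        for row in csv.DictReader(f):
            conn.execute(
                "INSERT INTO stg_sales (sale_date, product, qty) VALUES (?, ?, ?)",
                (row["sale_date"], row["product"], int(row["qty"])),
            )
    conn.commit()
    conn.close()
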
In most cases, data warehouses are available twenty four hours a day, seven days a week. So that comprehensive data is gathered, extracted, loaded and shared within the data warehouse, regular updates should be done. Parallel and distributed servers target worldwide availability of data, so data completeness can be supported by investing in high powered servers and robust software applications. Data warehouses are also designed for customer level analysis, aside from organizational level analysis and reporting, so flexible tools should be implemented in the data warehouse database to accommodate new data sources and to support metadata. Reliability can be achieved when all of these are considered.
Having complete data gives accurate guidance to the business organization's decision makers. With complete data, the statistical reports that are generated will reflect the accurate status of the company, how it is faring against the trends and patterns in the industry, and how to make innovative moves to gain competitive advantage over the competitors.
What is Data Compression
Data compression is a method by which the storage space required for storing data is reduced with the help of mathematical techniques. Data compression is also referred to as source coding: it is the process of encoding information using fewer bits than the unencoded data would need.
As a real life, non-digital analogy, the word "development" could be compressed to "dev't" or "dev". Despite using fewer letters, all three forms convey the same meaning to a person, with the benefit of saving storage space on the computer and saving paper and ink in printing.
In the more technical and mathematical sense, data compression means applying algorithms that reduce the number of bits in a data file. Most software applications for compressing data use a variation of the LZ adaptive dictionary-based algorithm to reduce file sizes without changing the meaning of the data. "LZ" refers to Lempel and Ziv, the creators of the algorithm.
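As a small illustration of lossless, dictionary-style compression, Python's standard zlib module (whose DEFLATE format builds on LZ77 plus Huffman coding) can shrink a repetitive byte string and recover it exactly; the sample data here is invented:

    import zlib

    original = b"sales,sales,sales,sales,sales,sales,sales,sales" * 100
    compressed = zlib.compress(original)

    print(len(original), len(compressed))            # the repetitive input shrinks dramatically
    assert zlib.decompress(compressed) == original   # lossless: the original is recovered exactly
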
Data compression is very useful in two main areas: resource management and data transmission. With data compression, consumption of expensive resources like hard disk space can be greatly reduced. The downside is that compressed data often needs extra processing for decompression, so extra hardware may be needed. In terms of transmission, compressed data helps save bandwidth, so a company may not need to spend extra money on it. But as with any communication, a protocol needs to exist between the sender and the receiver to get the message across.
There are two main types of data compression, namely lossless compression and lossy compression. As the name implies, lossy compression permanently discards some of the original information, while lossless compression removes no information at all; it simply re-encodes the data so that it demands fewer bits.
Lossless compression lets one recreate the original file exactly from the compressed form, while lossy compression keeps only an approximation: less important detail is thrown away so the file becomes much smaller for storage and transmission, and what is reconstructed at the target site is close to, but not identical with, the original.
To illustrate, a picture of a nice blue sky may have a big file size that the user wants to reduce. If the sky contains long runs of pixels with exactly the same shade of blue, a lossless compressor can record that blue value once and simply note how many consecutive pixels share it, so the redundancy is reduced without changing a single pixel. If the program instead merges several slightly different shades of blue into one picked value to save even more space, some information is permanently lost and the compression becomes lossy, even though the sky may still look acceptable.
Lossy compression is very useful for internet applications, where files are broken into packets for transmission and some loss of detail is acceptable; wavelet compression, for example, is commonly used for lossy image compression. One problem with lossy compression is that the result depends on the receiving application's interpretation of the compression used at the source. Data that needs to be reproduced exactly, such as databases, cannot use lossy compression, but its benefit is the big reduction in file size.
What is Data Concurrency
Data concurrency ensures that both the official data source and replicated data values are consistent: whenever data values in the official data source are updated, the corresponding replicated data values must also be updated, via synchronization, in order to maintain consistency. Allowing more than one application or other data consumer to access the same data simultaneously, while maintaining data integrity and database consistency, is the main essence of data concurrency.
Because transactions are isolated from each other, data will inevitably be replicated. For example, suppose I and two other friends are simultaneously buying the same item from the same e-commerce site, along with a thousand other people from different parts of the globe. Technically we are all doing the same transaction, but unbeknown to us, our transactions are processed in isolation in the back-end data warehouse. Yet the database treats all of us as using the same data simultaneously.
When multiple users attempt to modify data at the same time, some level of control should be established so that one user's modification does not adversely affect another's. The process of managing this is called concurrency control. There are three common ways that databases manage data concurrency, and they are as follows:
1. Pessimistic concurrency control - in this method, a row is unavailable to other users from the time the record is fetched until it is updated in the database.
2. Optimistic concurrency control - a row is not locked while a user reads it; when the user later attempts an update, the system checks whether another user has changed the row since it was read, and if so the update is rejected (a minimal sketch of this check is shown after this list).
3. Last in wins - a row is unavailable to other users only while the data is actually being updated, but no effort is made to compare the update with the original record; the record is simply written out, potentially overwriting any changes made by other concurrent users since the last refresh of the record.
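Below is a minimal sketch of the optimistic check described in item 2, using SQLite and a hypothetical account table that carries a version column (all table, column and file names are assumptions made for illustration):

    import sqlite3

    def update_balance(conn, account_id, new_balance, version_read):
        # Only apply the update if nobody changed the row since we read it.
        cur = conn.execute(
            "UPDATE account SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_balance, account_id, version_read),
        )
        conn.commit()
        if cur.rowcount == 0:
            raise RuntimeError("row was modified by another user; re-read and retry")

    conn = sqlite3.connect("warehouse.db")   # hypothetical database file
    row = conn.execute("SELECT balance, version FROM account WHERE id = ?", (42,)).fetchone()
    if row is not None:
        balance, version = row
        update_balance(conn, 42, balance + 100, version)

If another session has already bumped the version, no row matches the WHERE clause, rowcount is zero, and the caller knows to re-read the row and try again.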
What is Data Conversion
Data conversion, as the name implies, deals with the changes required to move or convert data from one physical environment or format to another, such as moving data from one electronic medium or database product onto another format. Every day, data is shared from one computer to another, and this is a very common occurrence.
Data conversion is the technical process of changing the bits contained in the data from one format to another for the purpose of interoperability between computers. The simplest example of data conversion is a text file converted from one character encoding to another. More complex conversions involve office file formats and audio, video and image file formats, which need to take into account the different software applications that play or display them.
Data conversion can be a difficult and painstaking process. While it may be easy for a computer to discard information, it is difficult to add information, and adding information is not simply padding bits; sometimes it involves human judgment. Upsampling, the process of converting data to make it more feature rich, is not about adding data: it is about making room for additions, a process which also needs human judgment.
To illustrate, a true color image is easy to convert to grayscale, but not the other way around. A Unix text file can be converted to a Microsoft text file by simply adding CR bytes, but adding color information to a grayscale image cannot be done programmatically, because only human judgment can know which colors are appropriate for each section of the image; this is not a rule based task that a computer can easily perform.
Despite the fact that data conversion can be done directly from one format to another desired format, many applications use a pivot encoding when converting data files. For instance, converting Cyrillic text files from KOI8-R to Windows-1251 is possible with direct conversion using a lookup table between the two encodings, but more often the KOI8-R file is first converted to Unicode and only then to Windows-1251, because of manageability benefits. Conversion between character encodings is a lot easier this way, since maintaining lookup tables for every permutation of character encodings would involve hundreds of tables.
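A minimal sketch of that pivot approach in Python, where decoding produces a Unicode string that is then re-encoded (the file names are invented for the example):

    # Read the KOI8-R bytes, pivot through Unicode, and write Windows-1251 bytes.
    with open("report_koi8r.txt", "rb") as f:
        raw = f.read()

    text = raw.decode("koi8_r")          # KOI8-R bytes -> Unicode string (the pivot)
    converted = text.encode("cp1251")    # Unicode string -> Windows-1251 bytes

    with open("report_cp1251.txt", "wb") as f:
        f.write(converted)
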
Inexactitude can also result from data conversion, meaning that the result of the conversion can be conceptually different from the source file. An example is the WYSIWYG paradigm found in word processors and desktop publishing applications, compared with the structural, descriptive paradigm found in XML and SGML.
It is important to know the workings of both the source and target formats when converting data. If the format specifications are unknown, reverse engineering can be applied to carry out the conversion; this can attain a close approximation of the original specification, although there is no assurance that the result is free of errors or inconsistencies. In any case, there are applications that can detect errors so appropriate actions can be taken.
Elements which are specific to the company or organisation are defined in the data architecture schema. For instance, the administrative structure should be designed according to the real life undertakings of the company's administrative department, so that data resources can be managed to mirror that department.
A company trying to acquire another business or merging with another company may experience difficulty with data fragmentation. For instance, in manufacturing and retail, multiple order entry or order fulfillment systems may be poorly integrated, resulting in fragmented data stored in different storage systems in different locations. In another instance, poor integration of delivery services over multiple channels such as the web, retail offices and call centers can also result in data fragmentation.
The fundamental cause of data fragmentation often lies in the complexity of the IT infrastructure, especially when there is no integrated architectural foundation, which is essential for the interoperation of big volumes of heterogeneous data from various applications and business data accumulated over many years. It is not uncommon for business organizations to undergo significant changes in business rules, so the IT infrastructure should evolve as well, and this means the company has to invest more.
In many companies, more than 50 percent of the budget for IT operations is focused on building and maintaining points of integration, especially when dealing with legacy systems dedicated to the supply chain, finance, customer relations management and other mission critical functions.
Problems related to data fragmentation can be serious and relatively difficult and costly to address, but they can be prevented with a good data architecture and an IT infrastructure design that takes into consideration the future growth of the business organization.
The data architecture phase of information system planning, when properly and carefully executed down to the tiniest detail, forces a company to specify and draw a line between internal and external flows of information. A company should be keen to see the patterns that have developed over the years and the trends for the future. From this stage, it is highly possible that a company can already identify costly pitfalls and shortfalls related to information, disconnections between departments and branches, and disconnections between current and future business endeavors. At this stage alone, more than half of the problems stemming from data fragmentation can be prevented.
Data Flow Diagram
The Data Flow Diagram (DFD) is commonly used for the visualization of structured design data processing: the normal flow of data is represented graphically. A designer typically draws a context level DFD first, showing the interaction between the system and the outside entities. This context level DFD is then exploded in order to show more details of the system being modeled.
Larry Constantine invented the first data flow diagrams based on Martin and
Estrin's data
flow graph model of computation.
A DFD is one of the three essential perspectives of the Structured Systems Analysis and Design Method (SSADM). In this method, both the project sponsors and the end users need to collaborate closely throughout all stages of the evolution of the system. Having a DFD makes the collaboration easier, because the end users can visualize the operation of the system, gain a better perspective on what the system will accomplish, and see how the whole project will be implemented.
A DFD is built from the following components:
External Entities / Terminators - These refer to the parts outside the system being developed or modeled. Terminators, depending on whether data flows into or out of the system, are often called sinks or sources. They represent the information wherever it comes from or wherever it goes.
Processes - These modify the inputs and produce the corresponding outputs.
Data Stores - Any place or area of storage where data is placed, whether temporarily or permanently.
Data Flows - The ways data is transferred from one terminator to another, or through processes and data stores.
As a general rule, every page of a DFD should not contain more than 10 components. So if there are more than 10 components in one process, one or more components should be combined and another DFD created on a separate page to detail the combination.
Each component needs to be numbered, and the same goes for each subcomponent, so that the diagram is easy to follow visually. For example, a top level DFD may have components numbered 1, 2, 3, 4, 5 and the next level of subcomponents (for instance under number 2) numbered 2.1, 2.2, 2.3 and so on.
There are two approaches to developing a DFD. The first is the Top Down Approach, where the DFD starts with a context level diagram and the system is then slowly decomposed until the graphical detail reaches a primitive level.
The other approach, the Event Partitioning Approach, was described by Edward Yourdon in Just Enough Structured Analysis. In the Event Partitioning Approach, a detailed DFD is constructed after a list of all events is made. For every event, a process is constructed, and each process is then linked with other processes through data stores. Each process' reaction to a given event is modeled by an outgoing data flow.
There are many DFD tools available on the market today.
It is a known fact that telephone numbers are written down in different ways by different people. With a data dictionary, the format of the telephone number within the whole organization will always be the same, and hence consistency is maintained.
Most data dictionaries contain different kinds of information about the data used in the enterprise. In terms of the database representation of the data, the data dictionary defines all schema objects, including views, tables, clusters, indexes, sequences, synonyms, procedures, packages, functions, triggers and many more. This ensures that all these objects follow one standard defined in the dictionary. The data dictionary also records how much space has been allocated to the schema objects. Data dictionary implementations also include default values for database columns, the names of the database users, the users' privileges and limitations, database integrity constraint information, and other general information.
A data dictionary, then, is a collection of information about data. It is typically structured in tables and views just like other data in a database. Most data dictionaries are central to a database and are a very important tool for all kinds of users, from data consumers to application designers to database developers and administrators. A data dictionary is consulted when looking up information about users, objects, schemas and other database structures.
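As a rough illustration of reading a database's own dictionary, the Python sketch below lists schema objects and column types from a SQLite catalogue. SQLite and the customer table here are assumptions made purely for the example, not part of the original text.

```python
# A minimal sketch of reading a database's built-in data dictionary, using SQLite
# purely as an illustration; the table and column names below are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# sqlite_master is SQLite's catalogue of schema objects (tables, indexes, views).
for name, obj_type in conn.execute("SELECT name, type FROM sqlite_master"):
    print(obj_type, name)

# PRAGMA table_info lists each column with its declared data type,
# which is exactly the kind of information a data dictionary standardizes.
for cid, col, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(customer)"):
    print(col, col_type)
```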
Organizations that are trying to develop an enterprise-wide data dictionary need to have representational definitions for data elements as well as semantics. Semantics refers to the aspects of meaning expressed in language; in the same manner, the semantics component of an enterprise-wide data dictionary focuses on creating a precise meaning for each data element. The representational definition, on the other hand, defines the way data elements are stored in the computer, such as data types (string, integer, float, double) and data formats.
Glossaries are similar to data dictionaries except that glossaries are less precise and contain only terms and definitions, not detailed representations of data structures. Data dictionaries may initially start as a simple collection of data columns and definitions of the meanings of the columns' content, and may then grow at a high rate.
Data dictionaries should not be confused with data models because the latter
usually include
more complex relationships between elements of data.
To be effective in a data-driven operation, the data that is the basis for statistical results and trending should be accurate and timely. In order to achieve timeliness, a company should invest in top-of-the-line server hardware, which includes fast computers and network equipment, a task which is relatively easy to do as long as there is money.
But in order to achieve accuracy, the data warehouse should be based on a carefully
planned data architecture based on real life business rules. This process is not
just
expensive but it also takes so much time and careful attention to the tiniest of
details so
that the data architecture reflects the real life business operations.
There are three main functions of a data dimension which are filtering, grouping
and
labeling. For instance, in a company data warehouse, each person, regardless of whether that person is a client, a staff member, or a company official, is categorized according to gender: male, female or unknown. If a data consumer wants a report by gender category, say all males, the data warehouse will have a fast and efficient means of sifting through the bulk of data it holds.
In general, each dimension found in the data warehouse could have one or more
hierarchies. For example, the "Date" dimension may contain several hierarchies like
Day >
Month > Year; or Week > Year. It is up to the design of the data warehouse how the
hierarchy in data dimension will be laid out.
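To make the hierarchy idea concrete, here is a minimal Python sketch that builds a small "Date" dimension carrying Day > Month > Year (and Week > Year) attributes; the column names are illustrative assumptions.

```python
# A minimal sketch of a "Date" dimension with a Day > Month > Year hierarchy,
# built in plain Python; the column names are illustrative assumptions.
from datetime import date, timedelta

def build_date_dimension(start: date, end: date):
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current.isoformat(),   # key that flows into fact tables
            "day": current.day,
            "month": current.month,
            "year": current.year,
            "week": current.isocalendar()[1],  # supports the alternative Week > Year hierarchy
        })
        current += timedelta(days=1)
    return rows

dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 1, 7))
print(dim_date[0])
```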
Applications sharing the same database can also recycle data dimensions. For example, in the "Date" dimension again, the same "Date" can be used for "Date of Delivery" as well as "Date of Sale" or "Date of Hire". This helps the database save storage space.
A dimension table is used in a data warehouse as one of a set of companion tables to a fact table (which, of course, contains the business facts). A dimension table contains attributes or fields which are used to constrain and group data when performing a query.
Another related term used in data warehousing is the degenerate dimension. This is a dimension derived from a fact table that does not have its own dimension table. It is generally used where the grain of the fact table represents transactional-level data and a user wants to maintain specific system identifiers like invoice or order numbers. When one wants to provide a direct reference back to a transactional system without the overhead cost of maintaining a separate dimension table, a degenerate dimension is the way to go.
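A minimal Python sketch of that idea is shown below: the invoice number lives directly in the fact row as a degenerate dimension. The table layout and field names are assumptions for illustration only.

```python
# A minimal sketch of a fact table at transaction grain that keeps the invoice
# number as a degenerate dimension: it sits in the fact row itself and has no
# dimension table of its own. All names here are illustrative assumptions.
fact_sales = [
    {"date_key": "2024-01-05", "product_key": 101, "customer_key": 7,   # real dimension keys
     "invoice_number": "INV-98231",                                     # degenerate dimension
     "sales_quantity": 3},                                              # measure
]

# The invoice number gives a direct reference back to the transactional system
# without the overhead of maintaining a separate invoice dimension table.
for row in fact_sales:
    print(row["invoice_number"], row["sales_quantity"])
```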
The best example of dissemination is the ubiquitous internet. Every single second throughout the year, data gets disseminated to millions of users around the world. Data may sit on millions of servers located in scattered geographical locations.
Using the internet, there are several ways data can be disseminated. The world wide web is an interlinked system where documents, images and other multimedia content can be accessed via the internet using web browsers. It uses a markup language called Hypertext Markup Language (HTML) to format disparate data in the web browser.
Email (electronic mail) is also one of the most widely used systems for data dissemination, using the internet and electronic media to store and forward messages. Email is based on the Simple Mail Transfer Protocol (SMTP) and can also be used by companies within an intranet so that staff can communicate with each other.
The more traditional means of data dissemination which are still in wide use today are telephone systems, which include fax systems as well. They provide fast and efficient ways to communicate in real time, and some telephone systems have even been simulated on the internet.
Of course, the use of non-digital materials for data dissemination can never be totally eliminated despite the meteoric rise of electronic communication media. Paper memos are still widely used to disseminate data, and the newspaper is still in wide circulation, communicating vital everyday information in news and feature items.
Despite the efficiency of electronic means of data dissemination, there are still drawbacks which may take a long time to overcome, if they can be overcome at all. Privacy is one of the most common problems with electronic data dissemination: the internet has thousands of loopholes through which people can peep into the private lives of others. Security is a related problem; every year, millions of dollars are lost to electronic theft and fraud, and every time a solution is found for one security problem, other malicious programs spring up somewhere around the globe.
Data warehousing involves a process called ETL, which stands for extract, transform and load. During the extraction phase, multitudes of data come into the data warehouse from several sources, and the system behind the warehouse consolidates the data so that each separate system's format can be read consistently by the data consumers of the warehouse.
Despite all these countermeasures against data duplication and despite the best efforts at cleaning data, the reality remains that data duplication will never be totally eliminated. It is therefore extremely important to understand its impact on the quality of a data warehouse implementation. In particular, the presence of data duplication may potentially skew content distribution.
Some application systems have duplication detection functions. These work by calculating a unique hash value for a piece of data or a group of data such as a document. Each document, for instance, is examined for duplication by comparing its hash against values held in an in-memory hash table or a persistent lookup system. Some of the most commonly used hash functions include MD2, MD5 and SHA. These are preferred due to their desirable properties: they are easily calculated over data or documents of arbitrary length, and they have a low collision probability.
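A minimal Python sketch of this hash-based approach is shown below, using the standard hashlib module; SHA-256 is chosen here simply as one readily available hash with the properties described.

```python
# A minimal sketch of hash-based duplicate detection over documents,
# following the idea described above; hashlib provides MD5 and SHA variants.
import hashlib

def fingerprint(document: str) -> str:
    # SHA-256 is easy to compute over arbitrary-length input and has a very
    # low collision probability, which is why hash fingerprints work well here.
    return hashlib.sha256(document.encode("utf-8")).hexdigest()

seen = {}  # in-memory lookup of fingerprints already loaded into the warehouse

def is_duplicate(doc_id: str, document: str) -> bool:
    h = fingerprint(document)
    if h in seen:
        return True
    seen[h] = doc_id
    return False

print(is_duplicate("a", "20kg of rice sold"))   # False, first occurrence
print(is_duplicate("b", "20kg of rice sold"))   # True, exact duplicate content
```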
Data duplication is also related to problems like plagiarism and clustering. A case of plagiarism could be either exact data duplication or just plain similarity to certain documents; documents considered plagiarized may copy the abstract idea rather than the word-for-word content. Clustering, on the other hand, is a method used to group data with broadly similar characteristics, and it is used for fast retrieval of relevant information from a database.
Spatial Data is a kind of data that reflects a real world which has become too complex for the direct and immediate understanding of data consumers. Spatial Data is used to create models of reality that are designed to have some similarity with selected aspects of the real world, including the status and nature of that reality.
For example a layer representing a real life landscape may have only stream
segments or
may have streams, coastlines, lakes and swamps.
The entity set to be included in a Data Layer actually depends on the system as
well as the
database model although in some cases, the database may have already been built by
combining all the entities into a single Data Layer.
The basic elements of a Spatial Database are actually the same as those of a regular database. It also has entities, where an entity refers to "a phenomenon of interest in reality that is not further subdivided into phenomena of the same kind", such as a "city" entity which could otherwise be broken down into component parts, or a "forest" entity which could otherwise be subdivided into smaller forests.
Accessing the Spatial Data within the DBMS may be possible through a generic Application Programming Interface (API). The API can encapsulate any internal differences among database systems. The API can also map spatial data types onto specific DBMS implementations using spatial indexing or their built-in optimization facilities.
It is common nowadays for Data Warehouses to have database systems which can integrate spatial data types into object-relational database management systems. New advancements will make this setup even more popular in the future with the development of GIS technology: companies can generate reports not just as traditional tables but as graphical maps reflecting data about the company as well.
In data warehousing, there is a common term called ETL, which stands for Extract, Transform and Load.
In general, before data can be loaded, the database and the tables to load must already have been created. There are many utility programs available which can build databases and define the user tables with SQL CREATE TABLE statements. When the load process begins, the database system typically builds primary key indexes for each of the tables that have a primary key; user-defined indexes are also built.
In some really large databases, especially those used in data warehouses, it is common to encounter several stages during data loading. It is also common in Data Warehouse implementations to have data loaded into the database from an input file.
A Data Warehouse typically employs automated data loading. During the input stage of this loading process, the database validates syntax and control statements. It then inputs records and monitors progress and status, which is reported by the error handling and cleanup functions.
At the conversion stage, input records are transformed into row format. Data is then validated and checked for referential integrity. All arithmetic and conditional expressions defined within each input column specification are applied. Finally, the data is written into the rows of the table.
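The sketch below walks through those stages (input, conversion, validation, write, error handling) on a tiny CSV sample in Python; the file layout, column names and validation rule are assumptions for illustration.

```python
# A minimal sketch of the load stages described above: read input records,
# convert them to row format, validate them, then "write" them to a table.
import csv, io

input_file = io.StringIO("product,quantity\nRice,20\nWheat,not_a_number\n")

table_rows, rejects = [], []
for record in csv.DictReader(input_file):                 # input stage
    try:
        row = {"product": record["product"].strip(),      # conversion stage
               "quantity": int(record["quantity"])}
        if row["quantity"] < 0:                           # validation stage
            raise ValueError("negative quantity")
        table_rows.append(row)                            # write stage
    except (ValueError, KeyError) as err:                 # error handling / cleanup
        rejects.append((record, str(err)))

print(len(table_rows), "rows loaded,", len(rejects), "rejected")
```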
Data Loading can become a critical process when the design and implementation of a Data Warehouse is not done well or not performed in a controlled environment. Contingency measures must be prepared during the data loading process in case of an administration failure. When such a failure occurs, the administrator should be ready with knowledge of the structure of the processes, of the database in particular, and of the Data Warehouse in general. All traces of the processes being executed should be tracked down.
In fact, due to the complexity of the Data Warehouse loading process, there are a lot of specialized Extraction, Transformation, Loading (ETL) software applications which can be bought on the market today. The most important benefits that can be derived from these tools include easy identification of relevant information inside the data source, easy extraction of the resulting data set based on defined Business Rules, and efficient propagation of data to the Data Warehouse or Data Marts.
Data Loading is part of a larger and more complex component of the Data Warehouse architecture called Data Staging. Complex programming is often involved in data staging. This component also often involves analysis of data quality and filters which can identify certain patterns and data structures within the existing operational data. But whether a database administrator uses data loading tools or writes his own code, one of the most effective ways to manage a Data Warehouse is to develop a good data loading strategy.
A database can be a vast shared collection of logically related data. Businesses rely heavily on data, and databases are used for managing day-to-day business tasks, so Data Collection happens every single day.
Collection of data may seem a simple and trivial task, but databases have gone a long way from simply being able to define, create, maintain and control data access. Today, most complex applications cannot function without data and database managers, and Data Collection is one of the most critical tasks handled by companies and their IT staff.
Two popular approaches to constructing database management systems emerged in the 1970s. The first approach, exemplified by IBM, involved a data model requiring that all data records be assembled into collections called trees. As a consequence, some records were roots while all others had unique parent records. An application programmer could query and navigate from the root to the record of interest one record at a time. This process was rather slow, but at the time records were stored on serial storage devices, particularly magnetic tapes.
The other approach at the time was the Integrated Data Store (IDS), developed at General Electric. This approach led to the development of a new kind of database system called the Network Database Management System (DBMS). This database was designed to represent more complex data relationships than those represented by hierarchical database systems like IBM's. But still, query navigation involved moving from a specific entry point to the record of interest.
Today's dominant databases, if not all of them, are based on the relational database model proposed by E. F. Codd. This design tried to overcome the shortcomings of the previous databases, such as their inefficient data retrieval schemes.
With relational databases, data is represented in table structures called relations, and access to the data is through a high-level, non-procedural query language used in a declarative manner. The problem with earlier databases, whose algorithms obtained the desired records one at a time, is overcome by specifying only a predicate that identifies the desired records or combination of records.
Big, competitive companies invest money in Data Collection systems that can incorporate advanced numeric and text string searches, table handling methods, relational navigation through pages, and user-defined rules to help spot relationships between data elements.
In a company, a database contains millions of pieces of atomic data. Atomic data are data items that cannot be further broken down. For example, a product name is atomic because it can no longer be broken down, whereas a product's raw material can be broken down further into raw components depending on the good. An individual product sale is another piece of atomic data.
But business organizations are not just interested in the minute details; they are also interested in the bigger picture. So atomic data are combined and aggregated. When this is done, the company can determine regional or total sales, total cost of goods, selling, general and administrative expenses, operating income, receivables, inventories, depreciation, amortization, debt, taxes and other figures.
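A minimal Python sketch of this roll-up from atomic sales records to regional and total figures follows; the sample records are made up for illustration.

```python
# A minimal sketch of rolling atomic data (individual product sales) up into
# regional and total figures; the sample records are illustrative assumptions.
from collections import defaultdict

atomic_sales = [
    {"region": "North", "product": "Rice",  "amount": 120.0},
    {"region": "North", "product": "Wheat", "amount":  80.0},
    {"region": "South", "product": "Rice",  "amount": 200.0},
]

regional_totals = defaultdict(float)
for sale in atomic_sales:                 # each record is atomic: one sale
    regional_totals[sale["region"]] += sale["amount"]

total_sales = sum(regional_totals.values())
print(dict(regional_totals), total_sales)
```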
Data Mining, or extracting knowledge from the vast repository of the data warehouse, uses combined data intensively. Software applications, in conjunction with a good relational database management system, have been developed to provide efficient ways to store and access such combined data.
One of the biggest problems with Data Mining is the level of Data Aggregation. For example, in an online survey by a private organization on the smoking trends of one region, one data set may contain records of those who currently smoke, another of those who have quit smoking, and yet another of those who have never smoked at all.
The collection within each data set continuously grows as data from other sources keeps coming in. The traditional ways to combine these data are either an ad hoc method or fitting each data set to a certain model and then combining the models.
Newer methods have been developed to efficiently combine data from various sources. Data coming from various tables and databases can now be combined into a single information table. One method is a likelihood procedure which provides an estimation technique to address identifiability problems with data aggregated from tables related to other tables.
OLAP can serve complex ad hoc and analytical queries on a database configured for OLAP use, and execution can be very fast even though a server needs to answer many users at a time from different geographical locations. OLAP combines data into a matrix output format, with dimensions forming the rows and columns and cells representing the values and measures.
Combined data is also heavily used in Data Farming, a process in which high-performance computers or computing grids run simulations billions of times across a large parameter and value space to produce a landscape of output data used for analyzing trends, insights and anomalies across many dimensions. It can be compared to a real plant farm, where a harvest of data comes after some time.
Change Data Capture refers to the process of capturing changes made to a production data source. Change Data Capture is typically performed by reading the logs of the source database management software. Some of its main features are described below.
If Change Data Capture were not implemented, extracting business data from a database would be an extremely difficult and cumbersome process. It would involve moving the entire contents of tables into flat files and loading the files into the data warehouse. This is not just cumbersome but also expensive.
Change Data Capture does not depend on intermediate flat files to temporarily hold data outside the relational database. Changed data resulting from INSERT, UPDATE and DELETE operations is captured and then stored in a database object called a change table. The changed data is then made available, in a controlled manner, to any applications that need it.
Some of the terms describing Change Data Capture components include the following:
Source System - This refers to the production database containing the source table from which Change Data Capture captures changes.
Source Table - The table in the database which contains the data the user wants to capture. Any changes made to the source table are reflected in the change table.
Change Table - The database table which contains the changed data resulting from DML statements against a single source table. This table consists of the change data itself and system metadata: the change data is stored in the table's rows, while the system metadata is needed for maintaining the change table.
Change tables need to be managed so that their size does not grow without limit. This is done by managing the data in the change tables and automatically purging change data which is no longer needed. A procedure can be scheduled to run periodically to remove data from the change table that is no longer required.
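The Python sketch below imitates this behaviour: changes are appended to a change table together with system metadata, and a purge routine removes records older than a retention window. It is a simplified, assumption-laden model, not any vendor's actual Change Data Capture implementation.

```python
# A minimal sketch of a change table fed by INSERT/UPDATE/DELETE events and
# periodically purged; the structure here is an illustrative assumption.
from datetime import datetime, timedelta

change_table = []  # each entry holds the changed data plus system metadata

def capture_change(operation: str, row: dict):
    change_table.append({
        "operation": operation,              # INSERT, UPDATE or DELETE
        "row": row,                          # the changed data itself
        "captured_at": datetime.utcnow(),    # system metadata for maintenance
    })

def purge_changes(older_than: timedelta):
    # Remove change records no longer needed so the table does not grow without limit.
    cutoff = datetime.utcnow() - older_than
    change_table[:] = [c for c in change_table if c["captured_at"] >= cutoff]

capture_change("INSERT", {"customer_id": 7, "name": "Ramesh"})
capture_change("UPDATE", {"customer_id": 7, "name": "Ramesh K"})
purge_changes(older_than=timedelta(days=7))
print(len(change_table), "change records retained")
```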
Security is imposed in the Change Data Capture process by having data subscribers (any user application that wants to get data) register with the database management system. They then specify their interest in one or more source tables, and the database manager or administrator grants the subscribers the desired permissions, privileges or access.
The Change Data Capture environment is very dynamic. The data publisher can add and remove change tables at any time. Depending on the database application, subscribers may not get explicit notification when the publisher changes a table, but views can be used by the subscriber to check. There are many more mechanisms employed so that subscribers can always adjust to changes in the database where their subscription is active.
In a real business environment, the data warehouse is the main repository of the company's historical data, as well as data subscribed from other sources, so that the company can come up with statistical analyses to better understand the patterns and trends in the industry in which it operates. With a clear understanding of industry trends, the company can adjust its business rules and policies, as well as come up with innovations in its products and services, to gain competitive advantage over other companies within the same industry.
In the Classic Data Warehouse Development, the first step is to define the
enterprise
business model.
During this phase, all real-life business activities are gathered and listed. A case model for the entire business is drawn, including the interaction between the business and its external stakeholders. For an enterprise business model to be consistent, business requirements are identified using a very systematic approach.
Some enterprise business modelers do not base the functions on the organizational structure, as it is prone to change over time with fast-changing business trends and potential growth. What is essential is that a consistent framework for the business is defined that can last a long period.
An enterprise business model shows how the business workers and other entities work together to realize business processes. The object model can be made from the aggregate collection of all the processes, people and events involved.
When the enterprise business model is in place, the next step would be to create a
system
data model. This data model is actually an abstract data model describing how data
is used.
This data model represents entities, business events, transactions and other real
life
activities defined by the enterprise business model.
In a technical sense, the system data model would be used in the actual
implementation of
the database. The system data models are the technical counterparts of the entities
created
in the enterprise business model.
The next step is defining the data warehouse architecture. When all is set, planned and documented, it is time to set up the physical database. The demands of the data warehouse specify the need for a physical database system, and computer hardware is one of the biggest considerations in setting it up.
The processing power of the computer should be able to handle labor intensive
processing.
The storage devices should be able to hold large bulks of data which get updated
every few
minutes. Networking support should be fast and efficient.
When the physical database is set up, measures and dimensions have already been laid out. Measures are individual facts, and dimensions describe how facts are broken down. For example, the data warehouse for a grocery may have dimensions for customers, managers and branches, and measures for revenue and costs. The next step in classic data warehouse development is to populate the fact and dimension tables with appropriate data. The populated warehouse can then help answer questions such as:
• Based on product gross margin and customer cost of service, who are the most profitable customers today?
• Were these the same customers who were also the most profitable last month or last year?
• Which products are selling best and which ones have the weakest sales performance?
• Which particular products are popular among a certain age bracket?
• Which products need to be reinvented, and what new products can be derived to cater to the taste of the market aged 21-30?
People on the internet cruise from one site to another, read articles, register on their favorite websites and make purchases online through websites that they trust. As they do these activities, they are actually giving information about themselves.
It may surprise some people that when they open certain web pages, the ads that
appear
are often related to their taste and interest.
The capture and use of consumer profiles based on web activities has been a great fueling force in e-commerce. Many online websites set up separate databases to function exclusively as recommendation engines.
Using consumer profiles for e-commerce sites can be a very complicated activity.
People do
not just stick to one website. And online companies will have to make sure that
they
recommend the appropriate product to the right market. But consumer profiles are
constantly changing.
There can be several reasons why consumer profiles get out of date. It could be that the consumer has been away from the site, or that preferences have simply changed. Another reason profiles appear to change is not that the consumers themselves caused the change but that poor algorithms on the servers could not come up with correct analytical processes.
Most data warehouses for e-commerce sites have engines that observe behavioral
activities
on the site. These engines track purchases, registrations, visited product reviews
and all
other activities which they may get information about. This way, consumer profiles
are
constantly updated.
Getting consumer profiles can be a heavy workload on the data warehouse servers.
Servers
need to weed out irrelevant data. More sophisticated data warehouse setups use a
complex
combination of content, age, frequency and other unique factors to deliver the best
possible
way of targeting advertising to the right markets.
There are several software applications on the market that specifically deal with consumer profiles. Some business intelligence solutions are composed of a suite of solutions for many aspects of business, and consumer profiling is among them.
Critical Success Factors are areas of activity in which favorable results are necessary for a company to reach its goals. Critical Success Factors are intensively used in business organizations as essential guides for a company or project to achieve its mission and goals.
For example, one of the Critical Success Factors of a company involved in developing information technology solutions is user involvement. Some general critical success factors include money factors such as positive cash flow, profit margins and revenue growth; customer satisfaction factors; product development factors; and many others.
D. Ronald Daniel first presented the idea of Critical Success Factors in the 1960s.
A decade
later, John F. Rockart of MIT's Sloan School of Management popularized the idea and
since
then, the idea has been extensively used in helping business organizations
implement
projects and industry strategies.
Today, there are already different ways in which the concept of Critical Success Factors is being implemented, and these will probably continue to evolve.
To illustrate the concept of Critical Success Factors, let us say someone wants to set up a bookstore. The person defines his mission as "To be the number one bookstore in town by offering the widest selection of books and sustaining a customer satisfaction rating of 90%." From the objectives, the activities needed to realize them would then be listed and laid out in a very clear perspective. This gives the company better focus on becoming good at these activities.
In the bookstore case mentioned above, we can already identify some information needs in a few minutes, although identifying these needs in detail would take time and participation from different staff of the company. To have a wide array of books, the information needed would be where to find book suppliers, how to build strong and stable relationships with publishers, how to come up with a fast and efficient shipment system, and so on.
To sustain a 90% customer rating, the needed information would be which topics buyers like most and what promotional activities the bookstore will undertake. For bookstore expansion, the needed information would be which locations the bookstore will expand to, what the physical setup will be like, and what IT needs will be taken into consideration.
Knowing the critical success factors in the operation of the business can really strengthen management strategy. The risk management process can be more focused, many issues will be corrected, and the probability of failure is greatly reduced. Every single activity within the organization will be directed towards achieving the overall success of the company.
What is Crosstab?
A Crosstab (cross tabulation) displays the joint distribution of two or more variables in a table. It should never be mistaken for a frequency distribution, because the latter provides the distribution of one variable only; a cross table shows in each cell the number of respondents who gave a particular combination of replies.
Cross tabulations are popular choices for statistical reporting because they are very easy to understand and are laid out in a clear format. They can be used with any level of data, whether ordinal, nominal, interval or ratio, because the Crosstab treats all of them as if they were nominal. Crosstab tables provide more detailed insight than a single statistic in a simple way, and they solve the problem of empty or sparse cells.
Since cross tabulation is widely used in statistics, there are many statistical processes and terms closely associated with it. Most of these are methods for testing the strength of Crosstab associations, which is needed to maintain consistency and produce accurate results, because the data laid out in Crosstabs may come from a wide variety of sources. The Lambda coefficient tests the strength of association of Crosstabs when the variables are measured at the nominal level. Cramer's V is another test of the strength of Crosstab associations which adjusts for the number of rows and columns. Other ways to test the strength of Crosstab associations include Chi-square, the Contingency coefficient, the Phi coefficient and Kendall's tau.
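As a brief illustration, the sketch below builds a crosstab and applies the Chi-square test mentioned above, assuming the pandas and scipy libraries are available; the survey data is invented.

```python
# A minimal sketch of a crosstab and a chi-square test of association,
# assuming pandas and scipy are available; the survey data is made up.
import pandas as pd
from scipy.stats import chi2_contingency

responses = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male", "female"],
    "smokes": ["yes",  "no",     "no",   "no",     "yes",  "yes"],
})

# Each cell counts respondents giving a particular combination of replies.
table = pd.crosstab(responses["gender"], responses["smokes"])
print(table)

# Chi-square is one of the strength-of-association tests mentioned above.
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
```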
Companies find the services of a data warehouse indispensable. But inside the data warehouse can be found billions of data items, many of them unrelated. Without the aid of tools, these data will not make any sense to the company. These data are not homogeneous: they may come from various sources, often from other data suppliers and other warehouses located in other branches and other geographical locations.
Relational database applications have a Crosstab query function. This function can transform rows of data into columns for statistical reporting. With a Crosstab query, one can send a command to the database server and the server will aggregate the data, for example breaking down reports by month, regional sales, product shipments and more.
Many advanced database systems have dynamic Crosstab features. This is very useful when dealing with columns whose number is not static. Crosstabs are heavily used in quantitative marketing research.
Throughout computing history, different methods and languages have been used for data access, and these varied depending on the type of data warehouse. The data warehouse contains a rich repository of data pertaining to organizational business rules, policies, events and histories, and these warehouses store data in different and incompatible formats, so several data access tools have been developed to overcome the problems of data incompatibility.
Recent advancements in information technology have brought about new and innovative software applications with more standardized languages, formats and methods to serve as interfaces among different data formats. Some of the more popular standards include SQL, ODBC, ADO.NET, JDBC, XML, XPath, XQuery and Web Services.
JDBC, which stands for Java Database Connectivity, is to some degree the equivalent of ODBC for the Java programming language.
ADO.NET is a Microsoft proprietary software component for accessing data and data
services. This is part of the Microsoft .Net framework. ADO stands for ActiveX Data
Object.
XML, which stands for Extensible Markup Language, is primarily a general-purpose markup language. It is used to tag data so that structured data can be shared among disparate systems across the internet or any network. This makes data of any format portable among different computer systems, making XML one of the most used technologies in data warehousing.
XML data can be queried using XQuery, which is semantically quite close to SQL. The XML Path Language (XPath) is used to address portions of an XML document or to compute values such as strings, Booleans and numbers based on an XML document.
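A small Python sketch of XPath-style addressing follows, using the limited XPath support in the standard library's ElementTree module; the XML document is invented for the example.

```python
# A minimal sketch of addressing parts of an XML document with XPath-style
# expressions, using the limited XPath support in Python's standard library.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<sales>
  <sale region="North"><product>Rice</product><quantity>20</quantity></sale>
  <sale region="South"><product>Wheat</product><quantity>15</quantity></sale>
</sales>
""")

# Select every quantity element, then only products sold in a given region.
for qty in doc.findall(".//quantity"):
    print(qty.text)

north_products = doc.findall(".//sale[@region='North']/product")
print([p.text for p in north_products])
```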
Web services are software components that make machine-to-machine interoperability possible over the internet. They are also commonly known as Web APIs that are accessed over the internet and executed on a remote system.
Many software vendors develop applications with graphical user interface (GUI) tools so that even non-programmers and non-database administrators can build queries by just clicking the mouse. These GUI data access tools give users access via a data access designer and a data access viewer. With the data access designer, an end user can create complex databases even without an intensive technical background.
Templates complete with design frameworks and sample data are available ready-made. With the data access viewer, the user can run queries, enter data, make changes and modifications, and graphically see the results of commands without having to care about the complex processes happening in the background.
Data access tools make the tasks of database administrators a lot easier, especially if the database being managed is a large data warehouse. Having a graphical interface for data access gives the administrator a clearer view of the status of the database, because most programmatic query languages look cryptic on the command line interface.
In simple but technical terms, metadata is data that describes other data. It can be any item describing an individual datum or a collection of multiple content items.
Common Metadata is the basis for sharing data within an enterprise. It refers to a common definition of data items, Common Data Names and Common Integrity Rules. With these commonalities come Common Transformations for all master data items, including customer, employee, product, location and many others. This also includes Common Transformations for all business transaction data and all business intelligence system metrics.
Common Warehouse Metamodels are also useful in enabling users to trace data lineage, as they provide objects that describe where the data came from and how and when it was created. Instances of the metamodel are exchanged through XML Metadata Interchange documents.
Today's business trends are heading towards the internet as the main highway to
gather
and share data. But the internet is full of all sorts of data. This includes
different data
formats, different applications using and sharing data and different server
systems.
Problems can arise in terms of hardware and software portability.
The use of Common Metadata tries to melt this boundary away, because the format in which common data is packaged can be read by disparate systems. So, whether the shared data comes from a relational database or from Excel flat files, the processing server within the data warehouse will know how to deal with the data.
Many modern business organizations are striving towards a common goal of uniting business and data applications in order to increase productivity and efficiency. Such goals have been translated into recent trends in data management such as Enterprise Information Integration (EII) and Enterprise Application Integration (EAI). These technologies try to answer the question of how organizations can integrate data meaningfully from many disparate systems so that companies can better execute and understand the very nature of their business.
In order to interconnect businesses more efficiently, many companies need to map data and translate it between the many kinds of data types and presentation formats in wide use today.
There are several ways to do data mapping: using procedural code, using XSLT transformations, or using tools with graphical interfaces. Newer methods of data mapping involve evaluating the actual data values in two data sources and automatically discovering complex mappings between the sets. Semantic data mapping can be achieved by consulting a metadata registry to look up synonyms for data elements.
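A minimal sketch of semantic mapping through synonym lookup is shown below in Python; the registry contents and field names are purely illustrative assumptions.

```python
# A minimal sketch of semantic data mapping through a synonym lookup, in the
# spirit of the metadata-registry approach described above.
synonym_registry = {
    "cust_no":   "customer_id",
    "custid":    "customer_id",
    "tel":       "telephone_number",
    "phone_nbr": "telephone_number",
}

def map_record(source_record: dict) -> dict:
    # Translate each source field name to its canonical data element name.
    return {synonym_registry.get(field, field): value
            for field, value in source_record.items()}

print(map_record({"cust_no": 42, "tel": "555-0100"}))
# {'customer_id': 42, 'telephone_number': '555-0100'}
```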
Today's enterprise data stored in data warehouses is high-volume data kept in relational databases and XML-based applications. Both of these generally cannot provide an attractive way of presenting data to company data consumers, customers and partners.
In order to address these problems, several XML-based single-source strategies have been developed. But some of these supposed solutions still lack multiple transformation stylesheets for each desired final output, and in many cases the need to publish content from relational databases is not met.
HTML-to-XML mapping software can access data stored in HTML format and convert it to XML without losing the document's style. The conversion results in an XML schema reflecting the content model, while an instance of the XML document contains the actual content and an XSLT stylesheet takes care of the presentation style.
The extensive use of data mapping, together with a company's robust data warehouse architecture and business intelligence system, can definitely result in orderly, efficient and fast day-to-day business operations.
What is Conceptual Schema?
In any data warehouse implementation, there are many different considerations which should be in place before the final physical setup. This is to avoid problems related to data quality and consistency of data processes.
A conceptual schema, although it largely represents the data warehouse and the common structure of data, is not a database design. It exists at different levels of abstraction, and these abstractions are the basis for the implementation of a physical database.
A conceptual schema is expressed in a human-oriented natural language, which is used to define elementary facts. The conceptual schema is totally independent of any implementation, whether a database or a non-IT implementation.
The data model and query design of a business architecture should be performed at the conceptual level and then mapped to other levels. This means that at the conceptual schema level everything should be gotten right in the first place; then, as the business grows and evolves, changes can be made later. Many keen data architects design the data model for scalability, meaning that business growth and evolution are already taken into consideration at the conceptual schema level.
Making the conceptual schema commonly involves close coordination between the domain expert and the data modeler. The domain expert best understands the application domain: the scope of the enterprise's activities, including the individual roles of the staff and the clients, and the scope of the products and services involved. The task of the data modeler, on the other hand, is to formalize the informal knowledge of the domain expert.
As the case should be, the communication between the domain expert and the data
modeler
involves verbalizing fact instances from data use cases, verbalizing fact types in
natural
language, validating rules in natural language and validating rules using sample
populations.
With close coordination between the domain expert and the data modeler, the expected output is a conceptual schema that has its data expressed as elementary facts in plain English sentences (or in whatever language is appropriate for the users). The schema also lays out how the facts are grouped into structures.
What is Connectivity?
Computer networks are the main connectivity mechanism for passing data in an
electronic
environment. A network is composed of several computers connected by a wired or
wireless
medium so data and other resources can pass through for sharing.
Computer networks may also be classified according to the hardware technology used to connect the devices. Classifications include Ethernet, wireless LAN, HomePNA and power line communication.
The arrangement of computers in a network can also vary. The network topology refers to the geometric form of the network connectivity; it can also describe the way computers see each other in relation to their logical order. Examples of network topologies are mesh, ring, star, bus, star-bus combination, tree and hierarchical topologies. It is good to note that although topology implies form, network topology is really independent of the physical placement or layout of the computers. For instance, a star topology does not literally mean that the computers form a star; it means that the computers are connected to a hub which has many points, implying a star form.
Perhaps the biggest aspect of computer connectivity is the use of communications protocols. In a network, different formats of data are shared by different computer systems which may have different hardware and software specifications. Communications protocols try to break down this disparity so that data can be shared and appropriately processed.
Communications protocols are the sets of rules and standards by which data is represented, signaled, authenticated and corrected before or after being sent over the communication channel. For example, in voice communication, as when a radio dispatcher talks to mobile stations, the parties follow a standard set of rules on how to exchange communication. Similarly, before exchanging data, devices perform handshaking, a process of trying to find out whether the other device exists.
On the internet, the largest arena for computer and data connectivity, protocols are assigned by the Internet Engineering Task Force (IETF) in close coordination with the W3C and ISO/IEC standards bodies. These bodies deal mainly with standards of the TCP/IP and Internet protocol suite.
Some of the major protocol stacks and open standards for connectivity include the Internet protocol suite (TCP/IP), the File Transfer Protocol (FTP), Open Systems Interconnection (OSI), iSCSI, the Network File System (NFS) and Universal Plug and Play (UPnP).
Data Derivation refers to the process of creating a data value from one or more
contributing
data values through a data derivation algorithm.
Almost all business organizations in today's environment are becoming more and more dependent on the data produced by data warehouses and information systems to support their operations. Since data accuracy is important, knowledge of how data is derived is vital.
As systems evolve, the bulk of data increases too, especially as more people and businesses move to the internet for what used to be offline transactions. With the evolution of information systems, functionality also grows more complex, so associated documentation for data derivation becomes ever more indispensable.
Data derivation applies to all real-life activities represented in the data model and aggregated by processes within the information systems or data warehouse. For instance, in a database that keeps records of wild migratory birds, there are records pertaining to a variable called "Population Size". The basic question would be: "How was the population size of migratory birds derived?" The answer may be that the data is derived from recorded observations, estimation, inference or a combination of all of these, and then taking some sort of average or applying another formula.
It is a known fact that proper data derivation is the key to an accurate understanding of the core content of any output, as this is the process of making new and more meaningful data from the aggregation of the raw data collected in the database.
Derived data can be any variable. For example, in a database that computes a person's age when the record only stores his birthday, the age is computed using a formula that derives age from the birthday.
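One possible derivation formula is sketched below in Python; it is only one of the several ways the same variable could be derived, which is exactly why the derivation should be documented.

```python
# A minimal sketch of deriving age from a stored birthday; one possible
# derivation algorithm among several.
from datetime import date
from typing import Optional

def derive_age(birthday: date, as_of: Optional[date] = None) -> int:
    as_of = as_of or date.today()
    # Subtract one if the birthday has not yet occurred this year.
    before_birthday = (as_of.month, as_of.day) < (birthday.month, birthday.day)
    return as_of.year - birthday.year - int(before_birthday)

print(derive_age(date(1990, 4, 5), as_of=date(2024, 4, 4)))  # 33
print(derive_age(date(1990, 4, 5), as_of=date(2024, 4, 5)))  # 34
```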
As can be seen, the same variable can be derived in several ways. It is therefore extremely important to have a data dictionary so that users can be guided about which data derivation they are using, and so they stick to one algorithm if they want consistency.
Problems arising from data derivation can be hard to find. Therefore, data derivation formulas should always be carefully planned and documented so that the flow of day-to-day operations remains smooth and efficient.
Data Partitioning is the formal process of determining which data subjects, data
occurrence
groups, and data characteristics are needed at each data site. It is an orderly
process for
allocating data to data sites that is done within the same common data
architecture.
Data Partitioning is also the process of logically and/or physically partitioning data into segments that are more easily maintained or accessed. Current RDBMS systems provide built-in support for this kind of partitioning.
Data Partitioning can be of great help in facilitating the efficient and effective management of a highly available relational data warehouse. But data partitioning can be a complex process, with several factors affecting partitioning strategies as well as design, implementation and management considerations in a data warehousing environment.
Since data warehouses need to manage and handle high volumes of regularly updated data, careful long-term planning is beneficial. Some of the factors to consider in the long-term planning of a data warehouse include data volume, the data loading window, the index maintenance window, workload characteristics, the data aging strategy, the archive and backup strategy, and hardware characteristics.
There are many benefits to implementing a relational data warehouse using the data partitioning approach. The single biggest benefit is easy yet efficient maintenance. As an organization grows, so will the data in the database, and the need for high availability of critical data while accommodating a small database maintenance window becomes indispensable. Data partitioning can answer the need for a small maintenance window in a very large business organization. With data partitioning, the big issues around supporting large tables can be addressed by having the database decompose large chunks of data into smaller partitions, resulting in better management. Data partitioning also results in faster data loading, easier monitoring of aging data and a more efficient data retrieval system.
There are many ways in which data partitioning can be implemented. In a typical implementation, the database administrator defines a partition function with boundary values, a partition scheme with filegroup mappings, and the tables that are mapped to the partition scheme.
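The Python sketch below mimics a partition function over date boundary values: each row is routed to the partition whose range covers it. The boundary dates and partition names are illustrative assumptions, not a specific product's syntax.

```python
# A minimal sketch of a partition function with boundary values: a row's date
# determines which partition it belongs to.
from bisect import bisect_right
from datetime import date

boundaries = [date(2023, 1, 1), date(2024, 1, 1)]           # partition boundary values
partitions = ["sales_archive", "sales_2023", "sales_2024"]  # one more partition than boundaries

def partition_for(sale_date: date) -> str:
    return partitions[bisect_right(boundaries, sale_date)]

print(partition_for(date(2022, 6, 1)))   # sales_archive
print(partition_for(date(2023, 6, 1)))   # sales_2023
print(partition_for(date(2024, 2, 1)))   # sales_2024
```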
Some of the key components of data warehousing are Decision Support Systems (DSS)
and
Data Mining (DM).
Data volumes in a data warehouse can grow at an exponential rate, so there should be a way to handle this tremendous growth. With respect to storage requirements, the critical needs that must be seriously considered in a data warehouse are high availability, high data volume, high performance and scalability, simplification and usability, and easy management.
Partitioning data into logical, or in some cases physical, Data Repositories can greatly help meet the requirements arising from the exponential growth of data volumes in the data warehouse. If all the data in the data warehouse were not partitioned into several Data Repositories, there would be profound disadvantages in terms of performance and efficiency.
For one, if the central server fails, the system comes to a halt. This is because the data is located in one monolithic system, and when the hardware fails there is no backup of any sort. It may take some time to get the server up, depending on the nature of the problem, but in a business even a few minutes of stoppage can translate into thousands of dollars of potential losses.
When the Data Repository approach is employed in the data warehouse, the load can be distributed across many databases or even many servers. For instance, instead of having one computer handle the database related to customers, several databases could handle the different aspects of customers.
In a very large company, such as one with several branches around the country, instead of having all the customers in one database, several databases may handle the different branch customer databases in a data repository. Or, as mentioned earlier, the departmental databases may be broken down into various Data Repositories, such as one data repository supporting the several financial databases (revenues, expenses) that serve the finance department.
A Data Repository offers easier and faster access because related information is, to some degree, lumped or clustered together. For instance, in the example of the financial Data Repository, anybody from the financial department, or any other data user wanting information related to financials, will not have to dig through the entire volume of data in the data warehouse.
For database administrators, employing a Data Repository makes maintaining the data warehouse system a lot easier because of its compartmentalized nature. When there is a problem within the system, it may be easier to trace the cause without having to take a top-down approach across the whole data warehouse. In most companies, one database manager or administrator is usually assigned to each data repository to ensure data reliability for the whole system.
A Data Scheme is a graphical representation of how data is structured and displayed. It can be a complex diagram with all sorts of geometric figures illustrating data structures and their relationships to one another in the relational database within the data warehouse.
As an example, let us take a generic website and illustrate its Data Scheme. One of the general data categories in a website Data Scheme is User Accounts, Privileges and Watchlist. The Data Scheme may draw one big box for User Accounts, Privileges and Watchlist. Within this big category are four smaller data category boxes named User, Watchlist, User Group and User New Talk.
The User box contains basic account information about users such as name, password, preferences, settings, email address and others. The Watchlist box contains registered users and the pages each user watches, the namespace number, the notification timestamp and others. The User Group box maps users to their groups with defined privileges. The User New Talk box stores notifications of user talk page changes for the display of the "You have new messages" box.
Within each of the four boxes, the data items are defined with their names and data types. The Watchlist box, for example, holds fields for the watching user, the watched page title, the namespace number and the notification timestamp, each with a corresponding name and type.
The same structure goes for the other tables as well. It is very clear that all data structures are defined with names and data types. In a real data scheme diagram, there could be hundreds of boxes, data names and types, and crossing lines connecting one entity to another.
The graphical look of a data scheme has some similarities to a flowchart, which is a schematic representation of an algorithm or a process. But while a flowchart allows business analysts and programmers to locate the responsibility for performing an action or making a decision, and shows the relationships between the different organizational units responsible for a single process, a data scheme is merely a graphical representation of data structure. There is no mention of any process whatsoever.
It is worth noting that the Data Scheme may or may not represent the real layout of the database; it may be just a structural representation of the physical database. To a certain degree, the data scheme is a graphical representation of the logical schema of the database.
Data Schemes are also highly useful in troubleshooting databases. If some parts of the database are faulty, Data Schemes help pinpoint the cause of the error. Some errors in databases and programming languages which are not related to syntax can be very hard to trace; logic errors and errors related to data can be very hard to pin down, but with the help of a graphical Data Scheme such errors may be easier to spot.
Data Store
A data store is a very important aspect of a data warehouse in that it supports the company's need for up-to-the-second, operational, integrated, collective information. It is a place where data such as databases and flat files are saved and stored. Data stores are great feeders of data to the data warehouse.
In a broad sense, a data store is a place where data is integrated from a variety of different sources in order to facilitate operations, analysis and reporting. It can be considered an intermediate data warehouse for databases, despite the fact that a data store also includes flat files.
Some data warehouses are designed to have data loaded from a data store consisting of tables from a number of databases which support administrative functions (financial, human resources, etc.).
In some cases, the data store is contained in one single database, while in other cases it is scattered across different databases in order to allow tuning to support many different roles.
Those who prefer not to keep a data store in a single database argue that tuning choices should be based on the very nature of the data and not on database design, that access to large volumes of data would otherwise be negatively affected to a certain degree, and that it also matters in terms of the politics of getting everyone's concurrence.
The data store, being an integral part of the data warehouse architecture, is the first stop for data on its way to the warehouse. The data store is the place where data is collected and integrated, and where its completeness and accuracy are ensured.
In a lot of data warehousing implementations, the data transformations cannot be completed without a full set of data being available. So, if data arrives at a high rate, it can be captured in the data store without having to constantly change data in the warehouse.
In general, data stores are normalized structures which integrate data based on certain subject areas rather than on specific applications. For instance, a business organization may have more than 50 premium applications.
A premium subject-area data store collects data using feeds from these different applications, providing near real-time, enterprise-wide data. The data store is constantly refreshed in order to stay current, and the history is then sent on to the data warehouse.
A data store can be a great reporting database for line-of-business managers and service representatives who require an integrated picture of the enterprise's operational data. Some important aspects of business operation, such as operational-level reports and queries on small amounts of data, can be made more efficient by data from the data store.
For instance, if one wants specific data for only one calendar quarter, it may be
wise to just
query the data store. It would be much faster because querying the data warehouse
will
involve sifting through data for several years.
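The following is a minimal sketch of that idea: a small query restricted to one quarter of data in a hypothetical operational data store table. The table name (ods_sales), its columns and the use of Python's sqlite3 module are illustrative assumptions, not systems named in this text.

# Query a hypothetical operational data store for a single calendar quarter
# instead of scanning years of warehouse history.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_sales (product_id TEXT, order_date TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO ods_sales VALUES (?, ?, ?)",
    [("WIDGET", "2024-02-10", 20), ("WIDGET", "2024-05-03", 5), ("GADGET", "2024-03-21", 7)],
)

# Restrict the scan to Q1 2024 rather than the full history.
for product_id, total_qty in conn.execute(
        """SELECT product_id, SUM(quantity)
           FROM ods_sales
           WHERE order_date >= '2024-01-01' AND order_date < '2024-04-01'
           GROUP BY product_id"""):
    print(product_id, total_qty)
conn.close()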
Data Thesaurus
A data thesaurus really consists of metadata. Metadata is any kind of data which describes other data. The literal meaning of thesaurus, according to dictionary.com, is "an index to information stored in a computer, consisting of a comprehensive list of subjects concerning which information may be retrieved by using the proper key terms."
The data thesaurus, as part of the whole data warehouse system, is implemented in line with the business rules and the enterprise data architecture. The terms within it all pertain to business vocabulary, because these terms are chosen and assigned as subject terms.
For example, in a medical data thesaurus, the word "infant" may be the preferred term instead of the word "baby", despite the fact that the two are commonly used interchangeably in the real world.
Non-preferred terms are of course the opposite of preferred terms, but they have their own considerations too. In the event that there are two or more words which can be used to express the same concept, the data thesaurus specifies which one to use as the preferred term while listing the others as non-preferred terms.
Specifiers are used when two or more words are needed to express a concept. An example would be "chief executive officer". The data thesaurus will then cross-reference the specifier against a combination of preferred terms so the system knows how to represent the group of words. In many data thesauri, specifiers are also written in italics and are typically followed by a + sign, e.g. chief executive officer+.
Indicators are similar to non-preferred terms, but they point to a selection of possible preferred terms in case no exact match can be found between the concept and a single preferred term.
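A small sketch of how such term resolution might look in code is given below; the sample terms and the dictionary layout are illustrative assumptions rather than part of any thesaurus standard.

# Resolve words against a toy data thesaurus that maps non-preferred terms
# (and specifiers) to their preferred terms.
THESAURUS = {
    "baby": "infant",
    "newborn": "infant",
    "ceo": "chief executive officer+",  # specifier, marked with a trailing +
}

def resolve(term: str) -> str:
    # Return the preferred term, or the word itself if it is already preferred.
    return THESAURUS.get(term.lower(), term.lower())

for word in ("Baby", "infant", "CEO"):
    print(word, "->", resolve(word))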
ISO 2788 sets out the guidelines for the establishment and development of monolingual thesauri. This standard defines all aspects of a data thesaurus, including:
- Scope and field of application
- References
- Definitions
- Compound terms (General, Terms that should be retained as compounds, Terms that should be syntactically factored, Order of words in compound terms)
- Basic relationships in a thesaurus (General, The equivalence relationship, The hierarchical relationship, The associative relationship)
Data warehouse engines handle the storage, querying and load mechanisms of large databases. It is an indisputable fact that implementing a data warehouse is a very challenging task. It becomes even more challenging when we take into consideration the diversity of both the operational data sources and the target data warehouse engines. Target and source engines may be totally different when it comes to semantics, such as considerations regarding their core data models.
They may also be totally different in aspects of infrastructure, such as the operational details of data extraction and importation. When there is no common, sharable description of the structures of the data sources and of the target data warehouse engines, the result is often the acquisition of yet more data warehousing tools.
Research has shown roughly fifty percent growth every year in the amount of data that business organizations retain for analytic purposes. In some industries, such as e-commerce, the web, telecommunications, retail and government, the growth rate may be even higher. These trends show that there is a need for more powerful data warehouse engines.
Just a couple of years ago, the data needed to power business intelligence was stored in a central warehouse and a few other data sources within departments; now there are countless ways to deal with high volumes of data from a multitude of data sources spread across wide geographical locations.
There are many kinds of data warehouse engines. Some are specific to particular relational database implementations, while others are open and can be used by any database software.
The Micro-Kernel Database Engine is used by the Btrieve database developed by Pervasive. This engine uses a modular approach to separate the back end of a database from the interface used by developers. Core operations, such as updating, writing and deleting records, are separated from the Btrieve and Scalable SQL modules. By doing so, programmers can use several methods of accessing the database simultaneously.
Microsoft uses the Jet Database Engine in many of its products. Jet, which stands for Joint Engine Technology, had its first version developed in 1992 and consisted of three modules for manipulating a database. Jet is used for databases dealing with lower volumes of data.
For larger volumes of data processing, Microsoft provided the Microsoft Desktop Engine (MSDE). This was later followed by SQL Server Express Edition and, more recently, by SQL Server Compact Edition. A Jet database can also be upgraded to SQL Server.
InnoDB is a storage engine used by MySQL and is included in the current binaries distributed by MySQL AB. It features ACID-compliant support for transactions as well as declarative referential integrity. When Oracle acquired Innobase Oy, InnoDB became a product of Oracle Corporation, but InnoDB is dual-licensed, as it is also distributed under the GNU General Public License.
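To make the idea of ACID-compliant transactions concrete, here is a minimal sketch in which two updates either commit together or not at all. Python's sqlite3 module is used only as a convenient stand-in engine (not InnoDB or MySQL), and the account table and amounts are illustrative assumptions.

# Two updates wrapped in one transaction: both are applied, or neither is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    print("transfer rolled back")  # in that case neither update is applied

print(conn.execute("SELECT * FROM account ORDER BY id").fetchall())
conn.close()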
Data warehouse engines vary depending on the needs of the organization, but it is common today to acquire data warehouse engines that can handle the needs of very big, terabyte-scale business intelligence applications. This gives organizations faster access to information and helps them compete successfully.
The pre-data warehouse is a staging area where the designers need to determine which data contains business value for insertion. Some of the infrastructure found in this area includes online transaction processing (OLTP) databases, which store operational data, along with Enterprise Resource Planning (ERP) and other management software. OLTP databases need very fast transactional speeds and up-to-the-point accuracy.
Metadata application servers can also be found within this area. Metadata, which means data about data in computer speak, helps make sure that data entering the data lifecycle process are accurate and clean. It also makes sure that they are well defined, because metadata can help speed up searches in the future.
During data cleansing, data undergoes a collective process referred to as ETL, which stands for extract, transform, and load. Data are extracted from outside sources like those mentioned in the pre-warehouse stage. Since these data may come in different formats from disparate sources, they are transformed to fit business needs and requirements before they are loaded into the data warehouse.
Tools at this phase include software applications created with almost any programming language. These tools can be very complex, and many companies prefer to buy them instead of building them with in-house programmers. One requirement of a good ETL tool is that it can efficiently communicate with many different relational databases. It should also be able to read various file formats from different computer platforms.
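As a bare-bones illustration of the extract, transform and load steps described above, the sketch below reads rows from a small CSV file, normalizes them and writes them into a warehouse table. The file name, the column layout and the sqlite3 target are illustrative assumptions, not any particular ETL tool.

# A bare-bones ETL sketch: extract from CSV, transform, load into a table.
import csv
import sqlite3

# Write a tiny example source file so the sketch is self-contained.
with open("sales_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "product", "amount"])
    writer.writerow([" C001 ", "widget", "19.99"])
    writer.writerow(["C002", "gadget", "5.00"])

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    # Normalize disparate source formats into the warehouse's conventions.
    return (row["customer_id"].strip(), row["product"].upper(), float(row["amount"]))

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer_id TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load((transform(r) for r in extract("sales_extract.csv")), conn)
print(conn.execute("SELECT * FROM fact_sales").fetchall())
conn.close()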
At the data repository phase, data are stored in the corresponding databases. This is also the phase where active data of high business value to an organization are given priority and special treatment. Data repositories may be implemented as a data mart or an operational data store (ODS).
A data mart is smaller than a data warehouse and more specific, in that it is built at a departmental level instead of a company-wide level. An ODS is a sort of resting place for data; it holds recent data before they are migrated to the data warehouse. Whether a data warehouse implements both or not, the tools at this stage are all related to databases and database servers.
Front-end analysis may be considered the last and most critical stage of the data warehouse cycle. This is the stage where data consumers interact with the data warehouse to get the information they need. Some of the tools used in this area are data mining applications, which are used to discover meaningful patterns in a chaotic system or repository.
Another tool is Online Analytical Processing (OLAP), used to analyze the organization's historical data and slice out the required business information. Other tools are generic reporting or data visualization tools that let end users see the information in visually appealing layouts.
Data Value
Data values are what actually populate the data variables set aside by the data entities and their attributes. They consist of the facts and figures of data items, data attributes and data characteristics.
From the data model's structural part, which includes the collection of data structures used to create the objects and entities modeled by the database, to its integrity part, which defines the rules governing the constraints placed on the data structures, to its manipulation part, which defines the collection of operations that can be applied to the data structures for updates and queries, data values are the concrete instances of all those abstract models.
For example, a database may have a table called "employee" with attributes such as first name, family name, address, age, email address, marital status, job title, monthly salary and many others. All these terms are simply descriptions of the entity, and they are the building blocks of the whole database table structure.
They do not have value until somebody inserts real values into them. In the next step, an end user may add a record for a new employee, so the following data values may be entered into the database table:
JOHN (first name),
SMITH (family name),
15 OAK AVENUE, BRONX, NEW YORK, USA (address; in most cases, the address is broken down into several separate fields).
For instance, if a column is defined to accept only integer data values, it can never accept a letter or a string of letters. In the above example, the age attribute may be defined to be of integer data type, which generally accepts a range of 0-255 for an 8-bit unsigned integer. So no user can enter the value "thirty two" into the age field.
For example, let us take the case of US Social Security numbers. In the database table, the data type for the Social Security number may be character or VARCHAR(11), aside from the structural and semantic restrictions on the data. The structural restriction may take the form of 3 digits (0-9) followed by a hyphen (-), followed by 2 digits, a hyphen, then 4 digits. The semantic restrictions, on the other hand, specify rules about the number itself.
In actual implementations, the first 3 digits would refer to the state or area. The next 2 digits would refer to the group number, which is issued in a given order such as odd numbers from 01 through 09, followed by even numbers from 10 through 98.
All these definitions and restrictions ensure that the data values entered into the database table are always structurally correct and consistent. The only problems that could arise would come from data entry itself, but the structure will always be correct.
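A small validation sketch for restrictions like these is shown below; the 0-255 age range and the NNN-NN-NNNN pattern come from the examples above, while the function names are illustrative assumptions.

# Enforce the structural restrictions described above: an integer age and a
# Social Security number of the form NNN-NN-NNNN.
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def validate_age(value):
    age = int(value)                # raises ValueError for "thirty two"
    if not 0 <= age <= 255:         # range of an 8-bit unsigned integer
        raise ValueError("age out of range")
    return age

def validate_ssn(value):
    if not SSN_PATTERN.match(value):
        raise ValueError("SSN must look like 123-45-6789")
    return value

print(validate_age("32"))
print(validate_ssn("123-45-6789"))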
Data values need to be clean at all times, as they are the source of the information that gives an organization a better picture of itself and lets it make wise decisions and moves.
Database
A database is accessed and manipulated through queries, sets of computer codes translated into a language that the database system can understand. The computer program that is employed to manage and query a database is known as a database management system (DBMS). Databases can be organized according to several different models.
The flat model, also called the table model, is made up of a single, two-dimensional array of data elements. All members of a given column are assumed to contain similar values, while all the members of a row are assumed to be related to one another. Flat models are no longer popular today because they are hard to manage as the volume of data rises.
The hierarchical model organizes data into a tree-like structure. The model has a single upward link in each record to describe the nesting, and a sort field used to keep the records in a particular order in each list at the same level.
The network model, as the name implies, stores records linked to other records. Pointers, which can be node numbers or disk addresses, are used to track all the associations within the database.
The term relational refers to the fact that the various tables in the database relate to other tables, and programmatic algorithms make it easy to insert, update, delete and perform all other operations on different tables without sacrificing data quality and integrity. Databases implemented using the relational model are managed by a relational database management system (RDBMS).
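The sketch below shows the relational idea in miniature: two tables related through a foreign key and queried with a join. The table and column names are illustrative assumptions, and sqlite3 is used only as a convenient stand-in RDBMS.

# Two related tables and a join that follows the relation between them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id    INTEGER PRIMARY KEY,
        emp_name  TEXT NOT NULL,
        dept_id   INTEGER REFERENCES department(dept_id)
    );
""")
conn.execute("INSERT INTO department VALUES (1, 'Finance')")
conn.execute("INSERT INTO employee VALUES (10, 'JOHN SMITH', 1)")

for row in conn.execute("""
        SELECT e.emp_name, d.dept_name
        FROM employee e JOIN department d ON e.dept_id = d.dept_id"""):
    print(row)
conn.close()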
Object database models are newer types of models based on the object-oriented paradigm. This model attempts to bring together the database world and the world of application programming.
The object database model tries to avoid computation overhead by creating reusable
objects
based on a class or template. This model attempts to introduce into the database
world
some of the main concepts of object oriented programming such as encapsulation and
polymorphism.
Some database-internal considerations include aspects related to storage, among other things.
Databases are ubiquitous in the world of computing, and they are used in applications that span virtually the entire range of computer software. They are the preferred data storage method for large multi-user applications and for environments where large chunks of data are being dealt with. Databases have become an integral part of many web server implementations, and with the fast rise of e-commerce websites today, they are becoming indispensable tools for internet business.
Large enterprise data warehouses cannot run without the use of databases. This
sophisticated repository of data needs to be managed effectively by a database
management software application.
Decentralized Warehouse
In the case of mergers and the selling of business units, many large business organizations end up restructuring activities and data in ways that can benefit from implementing a decentralized warehouse.
A decentralized data warehouse separates data management across many key areas of the business enterprise. For instance, a warehouse management system can be a stand-alone decentralized system or can be operated alongside a centrally operated enterprise resource planning (ERP) system.
A really large business enterprise may have the following implementation: Goods Movement in the Decentralized Warehouse; Goods Receipt in a Decentralized Warehouse; Goods Issue in a Decentralized Warehouse; Stock Transfers in a Decentralized Warehouse; Posting Changes in a Decentralized Warehouse.
Today it is a fact that most business organizations, from small and medium-sized firms to large multinational corporations, can hardly operate without relying on information. The term "data-driven" is already in wide use and has become all too real in cutting-edge business operations.
More and more decentralized database solutions are coming into the market. In the past, it was common to have a central database serve all of the organization's needs.
Many information system designers and architects have held the belief that central control is better for database management. From this standpoint, a centralized server handling all data in one logical and physical system is good for data integrity and less expensive economically, because of the costs associated with redundant systems.
But with the arrival of more advanced hardware that is relatively cheap once speed and efficiency are accounted for, a decentralized database has become a better choice.
If all of the data output from these areas is handled by one central database, the possibility of failure is high. And when a failure occurs, the business process stops. Any stoppage, no matter how long, can mean a loss of revenue and income for the company.
With a decentralized setup, data integrity may be maintained more securely because there is a better sense of responsibility in each department. If something goes wrong, it is easy to pinpoint which department caused the problem, and a specific person or group can take responsibility.
A decentralized database could also significantly boost the access and processing speed of the whole system. In a centralized database setup, when one data consumer wants to view, say, a particular sales report, the database has to scan through the whole central database, which can slow the entire system. With a decentralized database, the system can immediately direct the data consumer's query to the specific department where the sales report is stored.
Finally, with a decentralized database, there is little reason for the whole system to go down, or for the business to be temporarily halted, because the data are spread across different departments within the organization. This means less potential for revenue loss.
End User Data can either be data provided by a data warehouse or the data created
by end
users for query processing.
The technical world of computing and computers has always been divided into two general realms. In one realm are the high priests, the knowledgeable group of people who know the ins and outs of computers in their most complex details. These people shape the computer codes and programs and enable computer behaviors which are rich and valuable.
In the other realm are the novice users, who are at the mercy of the high priests of computing and who can be denied or granted access to knowledge, information or education from computers.
End user data, in its very essence, is data entered or supplied by the other realm, the novices. But this is not always the case, because the high priests can be suppliers of end user data too. So the very essence of end user data is the data supplied to any process written or developed by the programmers (the high priests) in order to produce the desired output.
In some cases, end user data can also refer to data generated for the end users as the result of querying a database for specific information. For example, if a user wants to know how many people there are in a specific company department, the answer the database gives to that query could be considered end user data.
While end user data entry may be done in many different ways, there are many graphical software tools available for it. In a graphical interface there are text boxes, radio buttons, combo boxes, list boxes and check boxes through which end user data can be collected easily. The text box can accept any character or string data into the system. Radio buttons collect data pertaining to a selection, but in most cases a radio button group accepts only one choice from among many.
The checkbox is similar to the radio button in that it presents a set of options, but it can accept more than one choice. The list box and the combo box present the choices as a list. These small components typically compose an end user data entry interface.
End user data can be said to be the lifeblood of an information system. They are the very data on which processes run to come up with the output information that will be used by the organization.
End user data may also refer to data not just from human end users but from various data sources as well. As is common in an enterprise information system such as a data warehouse implementation, several physical computers each run as servers doing their own computation processes.
The output of one data source may serve as end user input to another, and so on. As a system of many data sources grows in complexity, end user data may come from several sources. One piece of user data may be fed into a process on another data source computer, and then the process output may be sent over the network to be used by yet another computer as further end user data.
The mixed use and exchange of end user data is an indication of how complex an information system is. The more data sources there are, the more complex the data exchange becomes. The internet is composed of various servers, each communicating not just with the others but with end users using browsers as well. One can never imagine the amount of end user data traversing the internet every single second of the day!
Metadata Synchronization
Metadata are data about data; each piece of metadata describes an individual data item, a content item or a collection of data comprising multiple content items. Metadata synchronization consolidates related data from different systems and synchronizes them for easier access.
Metadata are very important components of any data warehouse implementation because
they are of great help in facilitating the understanding, use and management of
data. The
metadata which are required for efficient data management may vary depending on the
type of data and the context where these metadata are being used.
For instance, in a library database system the data collection may involve the book titles being stocked, so the metadata would be about each title and would often include a description of the content, the book's author, the date of publication and the physical location.
In the context of a camera, the metadata would describe data such as the photographic image, the date the photograph was taken, and other details pertaining to the settings of the camera. Within the context of an information system, the data would pertain to the contents of the computers' files, and so the metadata may include the individual data item, the name of a field and its length, and many other aspects of the file.
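The short sketch below shows how metadata records for these three contexts might be written down; the field names are illustrative assumptions rather than any standard metadata schema.

# Toy metadata records for the three contexts mentioned above.
book_metadata = {
    "title": "Data Warehousing Basics",
    "author": "J. Smith",
    "published": "2005-03-01",
    "shelf_location": "Aisle 4, Shelf B",
}
photo_metadata = {
    "taken_on": "2024-06-15",
    "camera_model": "ExampleCam X100",
    "exposure": "1/250 s",
}
file_metadata = {
    "field_name": "customer_id",
    "field_length": 10,
    "owner": "finance_dept",
}

for record in (book_metadata, photo_metadata, file_metadata):
    print(record)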
Each department of the company comes up with its own sets of data, but even if these data are primarily needed to run and manage the department itself, the departmental data are also needed by the entire business enterprise in order to produce the statistical analysis and reports which become the basis for future decisions to move the company forward and compete in the industry.
To efficiently manage the information system in general and the data warehouse in particular, there has to be a way to "iron out" the disparity of all sorts of data, including the metadata itself. In fact, since the very structure of an enterprise information system involves disparate data sources giving out disparate data, there is a widely used set of processes called ETL (extract, transform and load) which takes in all sorts of data from disparate systems and platforms and transforms them into a unified format that the business enterprise can understand and process efficiently.
The same is true of metadata. They need to be synchronized so that the information system knows which metadata comes from and is needed by which department, and within what period the metadata will be delivered.
Given the different geographical locations of the company's branches, metadata should also be synchronized so that the entire system behaves as a barrier-free information system.
Synchronization is not a process just for metadata; it applies to other types of computing processes as well. It has to ensure that end users are getting up-to-date, relevant and accurate data so that their decisions may be based on facts.
Information Consumer
Information consumers are everywhere, and it has become a fact of life that data and information are driving forces in almost all aspects of our daily operations. With the ubiquity of internet connections, today's information consumers include people of all ages and all walks of life, and even non-humans such as artificial intelligence technologies are fast becoming major information consumers.
Some information systems give certain access privileges to different kinds of information consumers. For instance, administrative staff may only gain access to information in administration-related databases, while sales staff may only access sales-related data.
The best way to imagine information consumers which are not human is to take the scenario of a data warehousing environment. This environment is a place with many computer servers running database management systems and other application programs that produce data, whether in flat files or other digital formats.
As the whole system operates, each of the data stores, as well as the central data warehouse itself, takes its turn being an information consumer of the others, while they also take turns being distributors.
The ubiquity of e-commerce websites has also produced more information consumers with very high demands. Most customers of e-commerce websites make their transactions on the internet, and most of these transactions need to deal with very sensitive information such as bank account details and credit card numbers.
Information systems processing these sensitive data at the back end need to implement strong security features to keep out the "bad and unwelcome" information consumers who may be lurking in dark corners of the internet, waiting to fish for sensitive information.
There are many programs or tools that just sit in one corner of the server as information consumers, waiting for data to come in and be processed accordingly. Some of these tools are called middleware; they act as an interface between an application and the server, so in effect a piece of middleware acts both as an information consumer and an information distributor.
The process of unifying disparate data is referred to as ETL, which stands for extract, transform and load. The extract and transform steps are mostly done in the operational data store before the transformed data is "loaded" into the data warehouse. With this picture, in which the data warehouse only gets the loading part, many people get the impression that the data warehouse is a mere static repository that does not do much except accept data for storage.
In fact, the concept of the data warehouse is taken from the analogy with real-life warehouses, where goods are put away until the need arises to retrieve them. So it is with data: the operational data store goes to the data warehouse to get the data and processes them in the operational data store area. Hence the term operational, because it refers to the data currently being operated on or manipulated.
But modern data warehouses are no longer as static as they seem or look. Data
warehouses
today are already managed by software application tools that have the functionality
that
allows the data warehouse itself to track data and perform all sorts of analysis
related to the
movement of data from the warehouse to the other data stores and back.
There are many companies specifically offering data warehousing software solutions which come with sophisticated, proprietary, intuitive functions. Many of these vendors even offer integrated solutions that combine data warehousing functions with such complex features as data transformation, management, analytics and delivery components.
Imagine a data warehouse whose database is the repository of all of the company's historical data; the data warehouse is the corporate memory. Then there is Online Analytical Processing (OLAP), which handles all sorts of data so that the analysis can be the basis for wise and sound corporate decisions. And then there is Online Transaction Processing (OLTP), which handles online, real-time transactions like those of an automated teller machine or a retail point of sale. In short, enterprise data management handles very high volumes of data every single minute, all year round, as long as the business is operating.
An enterprise data management information system has a data store that is a dynamic place for data coming from different data sources and delivering disparate data from different platforms. This is where the disparate data are processed in a series of activities called ETL (extract, transform, load) so that they can be put into a unified format before being processed further.
Speaking of the data store, the data that periodically arrive in it come from the data sources.
For instance, let us take the case of the United States Environmental Protection Agency, which implements the Envirofacts Data Warehouse, as an example of data sources and of where the notion of a primary data source applies. This agency is very large and naturally deals with large volumes of data, so its data handling is broken down into many individual EPA databases, which are administered by program system offices.
Sometimes industry is required to report information to the state where it operates, and sometimes the information is collected at the federal level.
So the data sources of the Envirofacts Data Warehouse provide information that
makes it
easy to trace the origin of the information. Some of these data sources are:
Superfund Data Source - This data comes from Superfund sites, the uncontrolled hazardous waste sites designated by the federal government to be cleaned up. Information about these sites is stored in the Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS), which has been integrated into Envirofacts.
Safe Drinking Water Information Data Source - This database stores information related to drinking water programs.
Master Chemical Integrator Data Source - This database integrates the various chemical identifications used in four program system components.
Other data sources of the Envirofacts Data Warehouse are Hazardous Waste Data, Toxics Release Inventory, Facility Registry System, Water Discharge Permits, Drinking Water Microbial and Disinfection Byproduct Information and the National Drinking Water Contaminant Occurrence Database.
All these data sources contribute seemingly unrelated data which may come in disparate file formats. The data may also come from different geographical locations and different government offices within the United States. The data they share finally converge in a central data warehouse, which manages them so they become meaningful and relevant enough to be redistributed or shared with anybody who needs them.
Each of these sources may or may not act as a primary data source. For example, if the data originating from the Safe Drinking Water Information Data Source actually comes from yet another source, then the Safe Drinking Water Information Data Source is not a primary data source. If the data really comes from the raw activity of the department where the actual paperwork took place, then that department may be a primary data source.
Primary Key
Also known as a primary keyword or a unique identifier, a primary key is a key used in a relational database to uniquely identify each record. It is a set of one or more data characteristics whose value uniquely identifies each data occurrence in a data subject. It can be any unique identifier in a database table's records, such as a driver's license number, a social security number, or a vehicle identification number. There can only be one primary key per table in a relational database. Typically, primary keys appear as columns in relational database tables.
The administrator chooses the primary key in a relational database, although it is quite possible to change the primary key for a given database when the specific needs of the users change. For instance, in some areas and applications it may be more convenient to uniquely identify people by their telephone numbers than by their driver's license numbers.
There are many kinds of keys in a database implementation, but a primary key is a special case of a unique key. One of the biggest distinctions between a primary key and other unique keys is that an implicit NOT NULL constraint is automatically enforced, unlike the case for other unique keys. With this restriction enforced, the primary key will never contain a NULL value. Another main distinction of primary keys is that they must be defined using a certain syntax, as in the sketch below.
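The sketch below shows the typical primary key syntax and the uniqueness it enforces; sqlite3 is used only as a stand-in RDBMS, and the table and column names are illustrative assumptions.

# A primary key column rejects duplicate values.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,   -- the table's single primary key
        ssn      TEXT UNIQUE,           -- another unique key
        emp_name TEXT
    )
""")
conn.execute("INSERT INTO employee VALUES (1, '123-45-6789', 'JOHN SMITH')")

try:
    # A second row with the same primary key value is rejected.
    conn.execute("INSERT INTO employee VALUES (1, '987-65-4321', 'JANE DOE')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
conn.close()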
As expressed through relational calculus and relational algebra, the relational model does not distinguish between primary keys and other kinds of keys. Primary keys were added to the SQL standard mainly as a convenience to the application programmer.
One of the most important steps in designing and implementing a good database is choosing the primary key. Each table needs a primary key so it can ensure row-level accessibility. When an appropriate primary key has been chosen, one can specify a primary key value that lets a person query each table row individually and modify a row without altering other rows in the same table. The values composing a primary key column are unique, so no two values will ever be the same.
Each database table has one and only one primary key, which can consist of one or many columns, and this is very important. It is also possible to have a concatenated primary key comprised of two or more columns. There may be several columns or groups of columns in a single table that could serve as a primary key; these are called candidate keys. A table can have more than one candidate key, but only one candidate key can become the primary key for the table.
In some cases of database design, the natural key that uniquely identifies a table in a relation is difficult to use for software development, for example when it involves multiple columns or large text fields. This difficulty may be addressed by employing what is called a surrogate key, which can then be used as the primary key. In other cases there may be more than one candidate key for a relation, with no candidate key obviously preferred. This again can be addressed by using a surrogate key as the primary key, to avoid giving one candidate key artificial primacy over the others. A sketch of a surrogate key follows.
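In the sketch below, an automatically generated integer stands in for an awkward multi-column natural key; sqlite3, the table and the column names are illustrative assumptions.

# A surrogate key replaces a multi-column natural key as the primary key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        full_name   TEXT NOT NULL,
        address     TEXT NOT NULL,
        UNIQUE (full_name, address)                     -- the natural key
    )
""")
conn.execute(
    "INSERT INTO customer (full_name, address) VALUES (?, ?)",
    ("JOHN SMITH", "15 OAK AVENUE, BRONX, NEW YORK, USA"),
)
# The generated surrogate key is what other tables would reference.
print(conn.execute("SELECT customer_sk, full_name FROM customer").fetchall())
conn.close()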
Central Data Warehouse
A Central Data Warehouse is a single physical database which contains business data for a specific functional area, department, branch, division or the whole enterprise. The choice of a central data warehouse is commonly based on where there is the largest common need for informational data and where the largest number of end users are already hooked up to a central computer or network.
A Central Data Warehouse may contain all sorts of data for any given period. Typically, a central data warehouse contains data from multiple operational systems. It is built on an advanced relational database management system or some form of multi-dimensional informational database server.
A central data warehouse employs the computing style of having all the information systems located and managed from one physical location, even if there are many data sources spread around the globe.
For a few decades now, most companies' survival has depended on being able to plan, analyze and react to fast and constantly changing business conditions. In order to keep up with rapid change, business analysts, managers and decision makers in the company need more and more information.
Information technology itself has also rapidly evolved with the changing business environment, and today more innovative IT solutions are springing up like mushrooms on the internet. With these, business executives and other critical decision makers have found ways to make better use of business data.
Every single day, billions of data items are created, moved and extracted from various sources, whether a company's local area network, a wide area network or the internet. These data come in different formats, attributes and contents. For the most part, the data may be locked up in disparate computer systems and can be extremely difficult and complicated to make use of.
Central data warehouses are created by installing a set of data access, data directory and process management facilities. A copy of all operational data should be built from a single operational system so that the data warehouse can support a series of information tools.
Perhaps the optimal data warehousing strategy is to select a user population based on enterprise value. From there, the company can perform issue analysis.
Based on the needs discovered, a data warehouse prototype is built and populated for end users to experiment with and modify as appropriate. Once agreement on the needs is reached, data can be acquired from current operational systems across the company or from external data sources and loaded into the warehouse.
A central data warehouse is what the company depends on for business analysis and decision making. It should have the following attributes:
Completeness - the data warehouse should have a data model whose scope includes even the most minute and seemingly trivial details about the company.
Flexibility - the data warehouse should be able to manage all sorts of data from heterogeneous sources and satisfy a wide array of requirements from end users as well as from data sources.
Timeliness - data should be submitted on a scheduled time basis so the company can get the latest updates on trends and patterns in the industry.
Subject-oriented means that the data captured is organized so that similar data are linked together. Time-variant means that data changes are recorded and tracked so that change patterns can be determined over time. Non-volatile means that once data is stored and committed, it is read-only and never deleted, so it can be compared with newer data.
An active data warehouse has a feature that can integrate data changes while
maintaining
batch or scheduled cycle refreshes.
The early data warehouses were stored in separate computer databases designed specifically for the purpose of management information and analysis. Data came from the company's operational systems.
As technology evolved, data warehousing methods improved along with the greater demands of company users. Data warehouses went through several stages of evolution. At the early stage, data is copied from an operational system database into an offline database server, where processing requirements do not affect the performance of the operational system. The offline data warehouse regularly updates data from the operational systems and stores the data in an integrated data structure.
A real-time data warehouse updates data at actual transaction time in the operational system. An integrated data warehouse generates transaction events which are passed back to the operational systems for workers' daily use.
Online transaction processing (OLTP) is the storage system often used for active data warehousing. OLTP is a relational database design that breaks down complex information into simple data tables. It is very efficient at capturing billions of transactional records and reporting them in a user-friendly format. It can also be tuned to maximize computing power, although data warehousing professionals recommend keeping a separate reporting database on another computer, given that millions of records may be processed by the OLTP database every second.
Active data warehouse professionals are often called Data Warehouse Architects. They are primarily top-notch database administrators who are tasked with handling a huge amount of complex data from different sources, sometimes coming from different countries around the world.
Much money has been spent on tools such as user-friendly query applications and near real-time updating. To some degree, companies now have data which are accurate, accessible and timely. Yet according to one survey report, fewer than 40 percent of respondents say they have produced accurate automated reports. This was often because there was no metadata warehouse.
Most companies use an active data warehouse to capture transaction data from many different sources. Since millions of transaction records may be processed in any given second in a data warehouse, data storage is commonly kept on a computer separate from the company's operational system. This ensures optimal resource management of the data warehouse server. Having a separate active metadata warehouse also significantly speeds up searching, analyzing and reporting on data from the data warehouse.
Metadata has various advantages. In its widest usage, it is useful when we want to speed up searches. Search queries over metadata expedite the process, especially when performing very complex filter operations. Many web applications locally cache metadata by downloading it automatically, and thus improve the speed of file access and searches. Locally, metadata can be associated with files, as in the case of scanned documents: when the files are stored digitally, the user may open a file using a viewer application which reads the document's key values and stores them in a metadata warehouse or a similar repository.
Bridging a semantic gap is one of the notable uses of metadata. For example, a search engine may understand that Edison is an American scientist, and so when one queries on "American Scientists" it may provide a hyperlink to pages on Edison even if the query did not mention "Edison". This approach is associated with the semantic web and artificial intelligence.
Metadata is also very useful for multimedia files. For example, it is used to optimize lossy compression algorithms: a video can use metadata to tell the computer the foreground from the background, achieving a better compression rate without losing much quality to the lossy compression algorithm.
Metadata can be stored either internally, meaning it is found within the file itself, or externally, in a separate file that points to the file it describes. Storing metadata externally is more efficient for searching, as in database queries.
There are two general types of metadata: structural (or control) metadata and guide metadata. Structural metadata describes database structures such as columns, tables and indexes. Guide metadata, on the other hand, is used to help users look for specific things, for example through natural language searches.
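As a small illustration of structural metadata, the sketch below reads column names, types and constraints out of a database catalog; sqlite3's PRAGMA table_info is used here only as a stand-in for a real warehouse's metadata repository, and the table itself is an assumption.

# Read structural metadata (column name, type, nullability, key role)
# out of the database catalog.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id   INTEGER PRIMARY KEY,
        product   TEXT NOT NULL,
        quantity  INTEGER,
        sale_date TEXT
    )
""")

# Each row describes one column: (cid, name, type, notnull, default, pk).
for cid, name, col_type, notnull, default, pk in conn.execute(
        "PRAGMA table_info(fact_sales)"):
    print(f"{name}: type={col_type}, not_null={bool(notnull)}, primary_key={bool(pk)}")
conn.close()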
An Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse is in essence a large repository of the historical and current transaction data of an organization. An Enterprise Data Warehouse is a specialized data warehouse which may have several interpretations.
Many terms in information technology have been used by so many different vendors, IT workers and marketing campaigns that people are left confused about what the term Enterprise Data Warehouse really means and what makes it different from a general data warehouse.
In order to give a clear picture of an Enterprise Data Warehouse and how it differs from an ordinary data warehouse, five attributes are considered. The list is not exclusive, but these attributes bring people closer to a focused meaning of the Enterprise Data Warehouse from among the many interpretations of the term. They mainly pertain to the overall philosophy as well as the underlying infrastructure of an Enterprise Data Warehouse.
The first attribute of an Enterprise Data Warehouse is that it should have a single version of the truth: the entire goal of the warehouse's design is to come up with a definitive representation of the organization's business data as well as the corresponding rules. Given the number and variety of systems and silos of company data that exist within any business organization, many business warehouses may not qualify as an Enterprise Data Warehouse.
The second attribute is that an Enterprise Data Warehouse should have multiple subject areas. In order to have a unified version of the truth for an organization, an Enterprise Data Warehouse should contain all subject areas related to the enterprise, such as marketing, sales, finance, human resources and others.
The third attribute is that an Enterprise Data Warehouse should have a normalized design. This may be an arguable attribute, as both normalized and denormalized databases have their own advantages for a data warehouse. In fact, many data warehouse designers have used denormalized models such as star or snowflake schemas for implementing data marts. But many also go for normalized databases for an Enterprise Data Warehouse, considering flexibility first and performance second.
Because of the fast evolution of information technology, many business rules have
been
changed or broken to make way for rules which are data driven. Processes may
fluctuate
from simple to complex and data may shrink or grow in the constantly changing
enterprise
environment. Hence, a real Enterprise Data Warehouse should scale to these changes.
Today's business environment is very data driven, and more companies hope to create a competitive advantage over their competitors by creating a system whereby they can assess the current status of their operations at any given moment and, at the same time, analyze trends and patterns within the company's operation and its relation to the trends and patterns of the industry in a truly up-to-date fashion.
The establishment of one large data warehouse addresses the demand for up-to-date information reflecting the trends and patterns of the business operations and their relation to the larger world of the industry in which the company does business. A data warehouse is not just a repository of the historical and current transactional data of a business enterprise; it also serves as an analytical tool (in conjunction with a business intelligence system) to give a fairly accurate picture of the company.
Companies vary in structure. Some are composed of only a few departments focused on core business functions such as finance, administration and human resources. Other companies are big, and the scope of their operation is very wide, which may include manufacturing, raw materials purchasing and many other functions.
As a company grows, so will its need for data. While a data warehouse is itself an expensive investment, it is not uncommon to see one big organization implementing several different functional data warehouses working together in a large information system to function as an Enterprise Data Warehouse.
In large business organizations with several departmental divisions, the Enterprise Data Warehouse is broken down into Functional Data Warehouses. Depending on the size of the company and its financial capability, a Functional Data Warehouse may serve one department or several. There are also companies with branches in many different geographic locations around the globe, and their Enterprise Data Warehouse may be set up differently, with different clustering of Functional Data Warehouses.
Despite the breaking down of the Enterprise Data Warehouse into several Functional Data Warehouses, each of these warehouses is basically the same. Each is still defined as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process, or a collection of decision support technologies aimed at enabling the knowledge worker to make better and faster decisions".
Breaking down the Enterprise Data Warehouse into several Functional Data Warehouses can have big benefits. Since the organization, as a data-driven enterprise, deals with very high volumes of data, having separate Functional Data Warehouses distributes the load and compartmentalizes the processes. With this setup, the whole information system is far less likely to break down: if there is a glitch in one of the functional data warehouses, only that point has to be temporarily halted while it is fixed. In a monolithic data warehouse setup, by contrast, if the central database breaks down, the whole system suffers.
Having Functional Data Warehouses will also ensure that data integrity and security are maintained, because each department or group of departments represented by a Functional Data Warehouse will have a sense of ownership and responsibility. This also means that if there is a problem with a Functional Data Warehouse, it will be easy to pinpoint the responsible department, or the individual representing the department, maintaining that Functional Data Warehouse.
Operational Metadata
Operational metadata are metadata about operational data. Metadata is basically a kind of data that describes other data, a content item or a collection of data which includes multiple content items. Its main purpose is to facilitate better understanding, use and management of data.
The use and requirements of metadata vary depending on the context where it is being used. For example, when metadata are employed in a library information system, the metadata used would describe book contents, titles, dates of publication, the location of each book on the shelf and other related information. If metadata are employed in a photography system, the metadata would involve information about the cameras, camera brands, camera models and other details.
When used with an information system, the metadata to be used would involve data
files,
name of the field, length, date of creation, owner of the file and other related
information
about the data.
Operational metadata describe operational data, which are a subject-oriented, integrated, time-current, volatile collection of data that supports an organization's daily business activities and the outputs of the operational data stores.
They are just as important as the operational data itself, because an enterprise information and data management system can be greatly enhanced in efficiency when operational metadata are employed.
Let us take an example from enterprise resource planning (ERP). Metadata greatly
helps
in building a data warehouse in an ERP environment. An enterprise data management
system involves Decision Support Systems (DSS) metadata, operational metadata and
data
warehouse metadata. The DSS metadata is primarily used by data end users. The data
warehouse metadata is primarily used for archiving data in the data warehouse. The
operational metadata is primarily for use by developers and programmers.
Since operational metadata describe operational data, they are also very dynamic in nature. Operational data are data currently in use by the business, so they change constantly as long as transactions are happening, and even beyond that, for example during inventories. New transactional data are added and removed at any given time, and the operational metadata needs to keep up with these changes.
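One simple way to picture operational metadata is as a record written alongside every load of operational data; the sketch below does exactly that. The table names, columns and the sqlite3 backend are illustrative assumptions.

# Record operational metadata (what was loaded, from where, how many rows,
# and when) next to the operational data itself.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_orders (order_id INTEGER, amount REAL);
    CREATE TABLE load_metadata (
        table_name TEXT, source TEXT, row_count INTEGER, loaded_at TEXT
    );
""")

rows = [(1, 19.99), (2, 5.00), (3, 42.50)]  # pretend extract from a source system
conn.executemany("INSERT INTO ods_orders VALUES (?, ?)", rows)
conn.execute(
    "INSERT INTO load_metadata VALUES (?, ?, ?, ?)",
    ("ods_orders", "orders.csv", len(rows), datetime.now(timezone.utc).isoformat()),
)
conn.commit()
print(conn.execute("SELECT * FROM load_metadata").fetchall())
conn.close()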
Today, many vendors offer implementations of operational metadata in relation to a data warehouse as well as to the general setup of an enterprise data management system. Many software implementations of operational metadata help give business as well as technical users better control when accessing and exploring metadata across all aspects of the business operation and its IT implementation. Some software applications can even visually depict the interrelationships of data sources and users, and provide data consumers with data lineage back to the source system.
Today, having information means having power. In many aspects of daily living, having relevant information makes our activities easier. This is made manifest by the use of the internet. Because of the information that can be obtained every day, the number of internet users grows by the day, and the time people spend on the internet is getting longer as web services become more and more sophisticated, with applications that can gather and aggregate billions of disparate data items into useful information.
Businesses are the biggest users of data, whether on the internet or within the corporate IT infrastructure. Many companies implement a business data repository called a data warehouse.
In a typical business setting where a data warehouse is wanted, data architects try to define real-world business activities in information technology terms. Real business activities, persons, transactions and events are defined as entities of data representation in a database system. As soon as the entities are all defined, IT professionals develop programmatic algorithms to represent business rules, policies, best practices and other undertakings within the company.
These data and algorithms are then synthesized into one system in the data warehouse, so the whole system can simulate real-world activities with much greater speed, better efficiency and fewer errors.
It is not uncommon these days for a company to have a presence in different geographic locations. This setup takes advantage of advances in information technology that have broken down boundaries and made communication easy and fast.
Companies have the option of running several data warehouses in different locations. These data warehouses communicate with each other and extract, transform, load and send data for statistical analysis. Each warehouse typically has a database administrator to manage the data and overcome compatibility problems.
Data security is a critical issue when data warehouses in several locations send and receive data and are constantly in contact with each other. Communication lines can be open to sniffers, and malicious hackers and crackers may be able to steal important information and breach privacy. Securing a network is expensive, so companies have to spend more on appropriate security technology.
Having a centralized data warehouse has its own advantages. The company only has to invest in a central IT team, which is responsible for defining and publishing corporate dimensions. This is especially valuable if the company has multiple lines of business to be combined in one robust framework. The team is also responsible for providing cross-divisional applications. Software and database tools only need to be purchased for the central data warehouse, and cross-divisional applications are fairly easy to implement.
The main disadvantage of a centralized data warehouse is that if the warehouse breaks, operations may temporarily stop. This can be overcome by investing in additional computers and other hardware for backup, but these machines must be very powerful because they deal with billions of complicated processes at the central location of the organization.
Metadata Warehouse
A Metadata Warehouse is a database that contains the common metadata and client-friendly search routines that help people fully understand and utilize the data resource. It contains common metadata about the data resource of a single organization, or of an integrated data resource that crosses multiple disciplines and multiple jurisdictions. It contains a history of the data resource: what the data initially represented, and what they represent now.
A metadata warehouse is just like any data warehouse in that it stores all kinds of metadata to be used by the information system. Since today's data-driven business environments rely heavily on data, there needs to be separate storage for both data and metadata in order for the enterprise data management system to function efficiently.
In the not so distant past, metadata was always treated as a "second class citizen" in the database and data warehouse world, perhaps because the primary purpose of a data warehouse was seen as storing the data itself rather than describing it. When implemented and used properly, however, a metadata warehouse can provide the business organization with tremendous value, so companies need to understand what a metadata warehouse can and cannot do.
Many large business organizations nowadays have had some experience with data warehousing implementations. Today, data warehouses often take the form of data-mart-style implementations in many different departmental focus areas, such as financial analysis or customer-focused systems that assist business units.
But this approach has led many companies to a legacy-data Tower of Babel, and some areas of the business have begun showing signs of stress in the implementation. Both data and metadata in this approach are spread across multiple data warehouse systems, and administrators are becoming stressed at coordinating and managing the dispersed metadata.
There needs to be consistency in the business rules when they change as a result of corporate reorganizations, regulatory changes, or other changes in business practices. Likewise, there should be a way to handle the case where an application wants to change a technical definition.
One of the significant steps toward handling these needs is coordinating metadata across multiple data warehouses, and the way to achieve this is to have a metadata warehouse.
This also allows all data users to share common data structures, data definitions and business rule definitions from one system to another across the business organization. The metadata warehouse can efficiently facilitate consistency and maintainability, as it provides a common understanding across warehouse efforts and promotes sharing and reuse. This can result in better exchange of key information between business decision managers and technical users.
Aggregate Data
Aggregate data is data that results from a process that combines data elements from different sources. Aggregate data is usually presented collectively or in summary form. In relational database technology, aggregate data refers to the values returned when one issues an aggregate function.
Such a query examines the data in a table to reflect the properties of groups of rows rather than of individual rows. For example, one might want to find the average amount of money that customers pay for something, or how many professors are employed in a certain university department.
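To make this concrete, here is a minimal sketch of such an aggregate query using Python's built-in sqlite3 module; the table and column names (payments, customer, amount) are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
    cur.executemany("INSERT INTO payments VALUES (?, ?)",
                    [("Ann", 120.0), ("Ann", 80.0), ("Ben", 200.0)])

    # AVG() is an aggregate function: one result row per group of rows,
    # not one row per individual payment.
    cur.execute("SELECT customer, AVG(amount) FROM payments GROUP BY customer")
    print(cur.fetchall())   # [('Ann', 100.0), ('Ben', 200.0)]
    conn.close()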
On a larger scale, as in a data warehouse, aggregation gathers information from a wide variety of sources, such as databases around the world, to generate useful reports that can spot patterns and trends. A company may have separate tables for customer information, products, prices, sales, employees and branches. Each of these tables may generate reports based on its own records; for example, the products table may generate a report of all products, the sales table may generate a sales report, and so on.
But managers and decision makers need more than that. They may need sales reports from different branches, so data from two sources, sales and branches, may be queried to get aggregate data. In the same manner, a manager may also want a report on which particular product is the top seller for a particular employee in a particular branch in, say, France. In this case, several tables may have to be queried.
Aggregate data is used intensively not just in business but in all forms of statistics as well, whether in governance, biodiversity sampling, pharmaceutical laboratories or weather monitoring. Many governments rely on aggregate data taken from statistical surveys and empirical data to assess the economy and give assistance to less privileged areas. Weather stations share aggregate data to spot patterns in the constantly changing weather.
In global business, where the internet has become the main conduit, the corporate data warehouse is becoming ubiquitous. These data warehouses aggregate diverse data from different sources, and when the results are analyzed with electronic tools they can give remarkable insights into corporate operations and the behavior of the buying public.
Gathering and aggregating data is computationally intensive, especially if the data warehouse is updated very frequently. High-speed computers are employed as stand-alone servers just for the purpose of aggregating data with a relational database management system.
Atomic data are data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items sold would be atomic data, while roll-ups such as invoice totals and summary totals across invoices are aggregate data.
The term atomic data is based on the atom, which in chemistry and physics is the smallest particle that characterizes a chemical element, and which in natural philosophy is the indestructible building block of the universe. In the same light, atomic data is the smallest unit of data whose details still carry a complete meaning.
Atomic data types include the integer, a whole number with no fractional component; the floating-point type, which can contain a decimal point; and the character, which refers to any readable text. Another atomic data type is the Boolean, which takes only two values: on or off, yes or no, true or false.
Atomic data types have a common set of properties, which include the class name; the total data size; the byte order, referring to how the bytes are arranged as they reside in memory; the precision, which refers to the significant part of the data; the offset, or the location of the significant data within the entire datum; and the padding, which identifies the data that is not significant. In another language such as Lisp, an atomic data type refers to the basic indivisible unit of code that can be evaluated.
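As a small, hedged illustration of properties such as total data size and byte order, the sketch below uses Python's standard struct module to pack the same 32-bit integer under two different byte orders.

    import struct

    value = 1024
    little = struct.pack("<i", value)   # 32-bit signed integer, little-endian
    big = struct.pack(">i", value)      # same value, big-endian byte order

    print(struct.calcsize("i"))         # total data size in bytes: 4
    print(little.hex(), big.hex())      # 00040000 versus 00000400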
Relational databases are perhaps the best example of how atomic data are stored and retrieved to form larger sets of aggregate data. In all computer systems there is a need to manage and access data that describe or represent properties of objects, whether real or imaginary.
A database, which in its very essence is a record-keeping system, is one example where objects are described in terms of items of information. An object could be a client or a corporation having many characteristics. Data inside the database are structured into separate atomic data items, each containing a relevant piece of information.
The database has a structure called a relationship, over which queries are executed to combine atomic data into aggregate data; reports are then generated for statistical analysis so that an organization can draw a profile of many different aspects of its business.
Atomic data can come from several sources. They can come from the same table or from different tables within the same database. The internet is teeming with atomic data traversing the information superhighway every single second. Search engines use special programs called crawlers or spiders to index these atomic data, which are later used for ranking pages when a user types keywords into a search engine.
A data source, as the name implies, provides data via a data site. A data site in turn stores an organization's databases and data files, including non-automated data. Companies implement a data warehouse because they want a repository of all enterprise-related data as well as a main repository of the business organization's historical data.
But such data warehouses need to process high volumes of data with complex queries and analysis, so a mechanism has to be applied to the data warehouse system in order to prevent it from slowing down the operational system.
A data warehouse is designed to periodically get all sorts of data from various
data sources.
One of the main objectives of the data warehouse is to bring together data from
many
different data sources and databases in order to support reporting needs and
management
decisions.
Let us take the United States Environmental Protection Agency, which implements the Envirofacts Data Warehouse. Because this is such a large agency dealing with large volumes of data, the Envirofacts database is designed as a system composed of many individual EPA databases, administered by program system offices.
The data sources of the Envirofacts Data Warehouse thus provide information that makes it easy to trace the origin of the information. Some of these data sources are:
Superfund Data Source - This data source comes from Superfund sites, the uncontrolled hazardous waste sites designated by the federal government to be cleaned up. It stores information about these sites in the Comprehensive Environmental Response, Compensation, and Liability Information System (CERCLIS), which has been integrated into Envirofacts.
Safe Drinking Water Information Data Source - This database stores information related to drinking water programs.
Master Chemical Integrator Data Source - This database integrates the various chemical identifications used in four program system components.
Other data sources of the Envirofacts Data Warehouse are Hazardous Waste Data, Toxics Release Inventory, Facility Registry System, Water Discharge Permits, Drinking Water Microbial and Disinfection Byproduct Information, and the National Drinking Water Contaminant Occurrence Database.
Now, all these data sources contribute seemingly unrelated data, which may come in disparate file formats and from different geographical locations and different federal offices within the United States.
The data that they share finally converge in a central data warehouse, which manages them so they become more meaningful and relevant before being redistributed or shared with anybody who needs them.
For really big companies operating in various geographical locations around the country or around the world, data may come from even more sources, and the data sources may be organized in hierarchical fashion.
For instance, the data source for one geographical branch may be broken down into various data sources coming from the different departments within the branch. In the overall global data warehouse system, the data sources from the atomic departments become like twigs in the global data warehouse tree structure.
A data warehouse system with different data sources is also easier to manage, because when one of the data sources breaks down, the whole system does not halt its operations.
Data Type
Data Type describes how data should be represented and interpreted, how values should be structured, and how objects are stored in the memory of the computer. It refers to the form of a data value and the constraints placed on its interpretation. The form of a data value varies and can include types such as date, number, string, float, packed decimal and double precision.
The type system uses data type information to check the correctness of computer programs that access or manipulate the data.
A group of eight bits (each either on or off, 0 or 1) is called a byte and is the smallest addressable unit on a storage device. A "word" is the unit processed by machine code and is typically composed of 32 or 64 bits. The binary system can represent both signed and unsigned integer values (values that may be negative, or only non-negative).
For instance, a 32-bit word can be used to represent unsigned integer values from 0 to 2^32 - 1, or signed integer values from -2^31 to 2^31 - 1. A specific set of arithmetic instructions is used for interpreting a different kind of data type called a floating-point number.
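The integer ranges quoted above can be verified with a few lines of Python:

    bits = 32
    unsigned_max = 2**bits - 1          # 4294967295
    signed_min = -(2**(bits - 1))       # -2147483648
    signed_max = 2**(bits - 1) - 1      # 2147483647
    print(unsigned_max, signed_min, signed_max)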
Different languages may represent the same data type differently. For example, an integer may have a slightly different range in the C language than in Visual Basic, and a string data type may have different length limits in the Access and Oracle relational databases.
These names are given only for the sake of example, and the products mentioned may not be the exact applications that differ in their data type interpretations, but such variation is a common occurrence in information systems. It does not change the essence of the data type. For this purpose, the primitive data types are described below.
Integer data types - An integer can hold a whole number but not fractions. A signed integer can also hold negative values, while an unsigned integer holds only non-negative values. The typical sizes of integers are:
. Byte (composed of 8 bits, with a range of -128 to +127 when signed and 0 to 255 when unsigned)
Booleans - Data types conceptually composed of a single bit, signifying true (1) or false (0).
Floating-point - This data type represents a real number, which may contain a fractional part. Floating-point values are stored internally in a form of scientific notation.
Characters and strings - A character is typically denoted as "char" and can contain a single letter, digit, punctuation mark, or control character. A group of characters is called a string.
There are many other data types, such as composite, abstract and pointer data types, but they are very specific to the implementing software.
Demographic Data
Demographic data are the output of demography, the study of human populations. Like geographic data, demographic data can be related to the Earth; they usually represent geographical location or identification, or describe populations.
This field of science and research can be applied to anything about the dynamic
nature of
the human population including how it changes over time and what factors are
affecting the
changes. This study also covers aspects of human population such as the size,
structure,
distribution, spatial and temporal changes in response to birth, death, aging or
migration.
Demographic data which are most commonly used include crude birth rate, general
fertility
rate, age-specific fertility rates, crude death rate, infant mortality rate, life
expectancy, total
fertility rate, gross reproduction rate and net reproduction ratio.
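As a hedged sketch, two of these rates are commonly computed per 1,000 of the relevant population; the Python snippet below uses made-up figures purely for illustration.

    births = 1200
    mid_year_population = 85000
    women_aged_15_to_49 = 21000

    crude_birth_rate = births / mid_year_population * 1000        # births per 1,000 people
    general_fertility_rate = births / women_aged_15_to_49 * 1000  # births per 1,000 women aged 15-49

    print(round(crude_birth_rate, 1), round(general_fertility_rate, 1))   # 14.1 57.1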
Demographic data can be used in analyzing certain patterns and trends related to
human
religion, nationality, education and ethnicity. These data are also the basis for
certain
branches of studies like sociology and economics.
Collection of demographic data can be broadly categorized into two methods: direct
and
indirect. Direct demographic data collection is the process of collecting data
straight from
statistics registries which are responsible for tracking all birth and death
records and also
records pertaining to marital status and migration.
Perhaps the most common and popular method of direct collection of demographic data is the census. The census is commonly performed by a government agency, and the methodology used is individual or household enumeration.
The interval between two census surveys may vary depending on the government conducting them. In some countries a census is conducted once a year or once every two years, while others conduct a census once every 10 years. Once all the collected data are in place, information can be derived about individuals and households.
The indirect method of demographic data collection may involve only certain people or informants in trying to obtain data about the entire population. For instance, one of the indirect methods is the sister method, in which researchers ask women how many of their sisters have died, or have had children who died, and at what age they died.
From the collected data, the researchers draw their analysis and conclusions based on indirect estimates of birth and death rates, and then apply mathematical formulas to estimate trends representing the whole population. Other indirect methods of demographic data collection include collecting existing data from various organizations that have already carried out survey research, and collating these data sources in order to determine trends and patterns.
There are many demographic methods for modeling population processes. Some
of these models are population projections (Lee Carter, the Leslie Matrix),
population
momentum (Keyfitz), fertility (Hernes model, Coale-Trussell models, parity
progression
ratios), marriage (Singulate Mean at Marriage, Page model) and disability
(Sullivan's
method, multistate life tables).
In fact, it is now a lot easier to get demographic data that covers the whole planet, while data users can drill down deep into the database to get more demographic data pertaining to a very specific geographical area. With the popularity of the internet, looking for demographic data with corresponding analyses has become a lot easier and faster.
Legacy Data
Legacy data comes from virtually everywhere within the information systems that support legacy applications. The many sources of legacy data include databases, often relational but also hierarchical, network, object, XML and object/relational databases. Legacy data is another term used for disparate data.
Files such as XML documents, or "flat files" such as configuration files and comma-delimited text files, may also be sources of legacy data. But the biggest sources of legacy data are the old, outdated and antiquated legacy systems themselves.
These systems are usually large, and companies have invested so much money in implementing legacy systems in the past that, despite the potential problems identified by IT professionals, many still want to keep them for several reasons.
One of the main problems with legacy systems is that they often run on very slow and obsolete hardware for which replacements are very difficult to find when it breaks. Because of the general lack of understanding of these old technologies, they are often very hard to maintain, improve and expand. And because they are old and obsolete, chances are the operations manuals and other documentation have been lost.
One of the biggest reasons for keeping them is that legacy systems were implemented as large, monolithic systems, so a one-time redesign and reimplementation would be very costly and complicated. If legacy systems were taken out in a single moment, the whole business process would be halted for some time because of their monolithic and centralized nature.
Most companies cannot afford any business stoppage, especially in today's fast-paced, data-driven business environment. What worsens the situation even more is that legacy systems are not well understood by younger IT professionals, so redesigning them to adopt newer technologies would require a long time and intensive planning.
That is why it is very common nowadays to see data warehouses that are a combination of new and legacy systems. The effect is legacy data that is largely incompatible with the data coming from data sources that use newer technologies.
In fact, vendors of newer technologies encounter various data-disparity problems when working with legacy systems. IBM alone has enumerated some typical legacy data problems, which include among others:
. Missing data
. Missing columns
. Additional columns
. The purpose of a column is determined by the value of one or more other columns
. Data values that stray from their field descriptions and business rules
Legacy data, and the data-disparity problems they bring to a data warehouse, can be addressed by the process of ETL (extract, transform, load). ETL is a mechanism for converting disparate data, not just from legacy systems but from all other disparate data sources as well, before they are loaded into the data warehouse.
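Here is a minimal sketch of that ETL idea in Python: rows are pulled from a legacy comma-delimited file, cleaned up, and loaded into a warehouse table. The file name, column names and transformation rules are all hypothetical.

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, amount REAL)")

    with open("legacy_sales.csv", newline="") as f:               # extract
        for row in csv.DictReader(f):
            if not row.get("amount"):                             # transform: skip missing data
                continue
            amount = float(row["amount"].replace(",", ""))        # transform: normalize the format
            conn.execute("INSERT INTO sales VALUES (?, ?)",       # load
                         (row["sale_date"], amount))

    conn.commit()
    conn.close()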
Foredata
Foredata is a very new term derived from "fore", meaning beforehand, up front, at or near the front. Not many people are aware of the existence of the word, but the underlying function of foredata has always been there, existing for as long as database systems have existed. Foredata are all data about objects and events, including both praedata and paradata.
In a data warehouse implementation, foredata are what the data consumer directly interacts with: the upfront data used for describing a data architecture's objects and events. They are also used for tracking and managing those objects and events, since they represent them in the real world.
But foredata are no different from the data inside the data warehouse or from its various data sources. Foredata is only a term for the way data are being used; structurally they are the very same data circulating from one data source to another, or the same data stored in the data warehouse until someone queries them for specific information.
To some degree, foredata could be considered a kind of replica of the data being used in the backend processing of the system. For instance, following the definition that "foredata are the upfront data that an organization sees about objects and events", any report generated by the company's data consumers is foredata, in that its momentary purpose is to present the data rather than to serve as input for backend processing.
Since the term foredata is really very new, there are others from the IT profession
who
differentiate foredata from the other kinds of data in that foredata refers to the
data which
is current, revolving and active. This is in contrast to the data warehouse data
which are
dormant or in some sort of archived state.
Foredata in general refer to the data after all the disparate data coming from various data sources have undergone the process of extract, transform and load (ETL). The raw data before ETL may have come from other sources and have not yet been stripped of their attached formatting and other information.
Once the data enter the first ETL stage, the extract, they are stripped to the core. After that they are transformed, which can be done by adding XML tags and other attributes that make them fit the business rules and data architecture in which they are intended to be used.
If we go back to the definition above, "foredata are all data about the objects and events, including both praedata and paradata, and they are the upfront data that an organization sees about objects and events", these extracted and transformed data fit the description.
This also distinguishes foredata from the other data within the data warehouse and enterprise information system, such as flat files, networking-protocol data and multimedia data.
Foredata constitute the elements for reporting, which is the most essential purpose of a data warehouse. Companies need to make sure that they get the latest trends and patterns so that they can evaluate the efficiency of their business operation strategy, reformulate some policies or revise some product management, in order for the company to gain competitive advantage.
In some cases, departments may be using flat files or legacy data. With the help of an integration software tool, problems arising from data disparity can at least be minimized, and at best eliminated.
Since data come from various disparate sources, data arriving at the operational data store are integrated: cleaned, resolved for redundancy and enforced with the corresponding business rules. This is where most of the data used in current operations are located before they are transferred to the data warehouse for temporary or long-term storage and archiving.
The operational data store is a very busy area. Every single minute, data come in from various sources that are progressively handling other transactions, and go out to other databases and information systems that need the transformed data. Since this area hosts labor-intensive processing, a mechanism is needed so as not to overload the operational data store and cause it to break down from handling and concurrently processing large quantities of data.
Every once in a while, at a regular interval, data that are not momentarily needed by the operation should be moved elsewhere. For the sake of clarity, the area is called an operational data store because it is the repository of currently operational data, the data used for current operations.
So, when the operational data store sees that data are not needed at the moment, and it has already transformed the data by adding the corresponding format in line with the data architecture, the data are then moved to the data warehouse, or more specifically, to an integrated historical data portion of the data warehouse.
Integrated historical data are just like any other enterprise data that have passed through the process of extract, transform, load (ETL). Since they are basically the same data, they are also contained in a database inside the data warehouse. The only thing that differentiates them from operational data is that they are more at rest.
The term historical data is apt because these data belong to the relative past: they have already served their intended purpose. They are placed in one area, or in different areas connected by application tools, so that they can be easily retrieved when a need arises.
Historical data are very important, especially in the area of statistical analysis. For instance, if the company wants to know the sales trend of the last few months, operational data alone cannot address this need, and the business intelligence system has to get data from the integrated historical database.
Redundant Data
Redundant data, as the name suggests, is data duplication: the same data of a single organization stored at multiple data sites.
Dealing with redundant data means that a company has to spend a lot of time, money and energy. When redundant data are unknown to the organization, they can creep into the system and produce unwanted and unexpected results, such as slowing down the entire system, giving inaccurate output and badly affecting data integrity. Redundant data also create a risk to information quality if the different databases are not updated concurrently.
The problems associated with redundant data can be addressed by data normalization. Normalized tables generally contain no redundant data, because each attribute appears in only one table. Normalized tables also do not contain derived data; instead, such values are computed from existing attributes as expressions based on those attributes.
Normalized tables can also greatly reduce the amount of disk space used in the implementation while making updates very easy to perform. But with normalized tables one can be forced to use joins and aggregate functions, which can sometimes be time-consuming to process. An alternative to normalization is to add columns that contain redundant data, as long as the trade-offs involved are fully understood.
A correctly designed data model can avoid data redundancy by keeping each attribute only in the table for the entity it describes. If the attribute's data is needed from a different perspective, a join can be used, although a join takes time. If the join really does affect performance badly, it can be eliminated by duplicating the joined data in another table.
But despite all the negative effects and impressions associated with redundant data, there is also some positive impact that redundant data may bring. Redundant data can be useful, and may even be required, in order to satisfy service-level goals for performance, availability and accessibility.
The different representations of the same data in data warehouses, operational data stores and business intelligence systems show that redundant data can be essential in providing new information. The important thing to know is that in some cases, when redundant data is managed well, it can benefit the entire information system.
There is a lot of specialized hardware on the market today designed for handling redundant data. Redundant data storage hardware can help decrease system downtime by removing single points of failure. Some storage arrays provide a system with redundant power supplies, cooling units, drives, data paths and controllers, while servers attached to the redundant storage can include multiple connections, providing path-failover capabilities.
Integrated Operational Data are the output of the operational data store. An
operational
data store is an architectural construct which is part of the larger enterprise
data
management system. It is subject-oriented, integrated and time-current.
An integrated operational data store answers the problems related to system disparity. For the sake of clarity and simplicity, the term operational data store literally means a store for all data currently used in operations. Perhaps the easiest opposite we can find for the term operational is the term archived.
There are also large customers of a bank that hold many accounts. Managing the status changes across such a complex array of customers is a process that an operational data store handles.
An integrated operational data store works closely with the data warehouse. In fact, the data warehouse itself is a data store. But while the operational data store deals with current data, the data warehouse usually keeps data for long-term storage and archiving. They are both database systems, with the operational data store acting as an interim area for the data warehouse. The operational data store contains dynamic data constantly updated through the course of business operations, while the data warehouse generally contains relatively static data.
In a large business enterprise, data demand is very high, so it is not uncommon to find an information system with several operational data stores, each designed for very quick performance of relatively simple queries on small amounts of data, such as finding the tracking status of shipments. Several operational data stores share the load of enterprise data processing, and an integrated operational data store connects all the scattered operational data stores with a software tool so that they work as one large, efficient system.
Operational data stores need to be implemented with top-of-the-line, robust computer hardware and sophisticated software tools, because their processes involve complex computations over very high quantities of data coming from various sources. They need to work very fast because they exist to give very up-to-date information to data consumers.
Spatial Data
In short, spatial data is any data referenced to a multidimensional frame. The frame may be engineering drawings referenced to a mechanical object, medical images referenced to the human body, or architectural drawings referenced to a building.
But spatial data is most widely used in geographical databases. In fact geo-information, which is short for geographical information, is specialized data with its own field of study. Geographic information is created by manipulating spatial data in a computerized system, which may include computers and networks as well as standards and protocols for data use and exchange between users within a range of different applications.
Many fields of science use spatial data, including land registration, hydrology, cadastral mapping, land evaluation, and planning or environmental studies.
Spatial data are stored in a spatial database, which is of a special kind because extensions are needed for it to be capable of storing, handling and manipulating spatial data. The geoinformation output is processed with a special kind of computer program called a geographic information system (GIS), which has become very popular these days with the rising ubiquity of the likes of Google Maps. A spatial information system is an environment where a GIS is operated along with machines, computers, network peripherals and people.
A spatial database describes the location and shape of geographic features, and their spatial relationships to other features. The information in the spatial database consists of data in the form of digital co-ordinates, which describe the spatial features. The information can pertain to points (for instance the locations of museums), lines (representing roads) or polygons (which may represent district boundaries). The information is typically held as different sets of data in separate layers, which can later be combined in various ways for analysis or for the production of maps.
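As a small sketch of working with point data outside a full GIS, the Python function below computes the great-circle distance between two latitude/longitude co-ordinates using the haversine formula, with the Earth's radius approximated as 6371 km.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Convert degrees to radians, then apply the haversine formula.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    # Roughly the Paris-to-London distance, about 340 km.
    print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))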
Spatial analysis is not a new technical method; it may have arisen during early attempts at surveying and cartography. Many other fields of science have contributed to its development. Biology has contributed through botanical studies of global plant distributions and local plant locations, studies of the movement of animals, studies of vegetation blocks, studies of spatial population dynamics, and the study of bio-geography.
During a cholera outbreak in the past, disease cases were mapped, so epidemiology also contributed to the development of spatial analysis. Statistics has contributed as well, especially through the field of spatial statistics, and so has economics through spatial econometrics.
Today, computer science and mathematics are some of the biggest users and developers of spatial data. Many of today's business organizations use spatial data to map out the progress of their business operations. Maps and GIS images are becoming ubiquitous on the internet, and mashup applications are readily available, so that even personal websites can offer useful and fancy spatial functionality.
Data Structure
There are two general categories of algorithms used in data clustering: hierarchical and partitional. Hierarchical algorithms work by finding successive clusters using previously established clusters, and can be further subcategorized as agglomerative ("bottom-up") or divisive ("top-down"). Partitional algorithms, on the other hand, determine all clusters at once.
Agglomerative vs. Divisive: This distinction refers to the algorithmic structure and operation of the clustering. The agglomerative approach starts with each pattern in a distinct cluster (a singleton) and then successively merges clusters until a certain condition is met. The divisive approach starts with all patterns in a single cluster and splits them until a condition is satisfied.
Monothetic vs. Polythetic: This distinction refers to the sequential or simultaneous use of features in the clustering process. Most data clustering algorithms are polythetic, meaning that all features are used in the computation of distances between patterns. The monothetic approach is simpler: it considers features sequentially and divides the given group of patterns accordingly.
Hard vs. Fuzzy: Hard clustering allocates each pattern to exactly one cluster during the clustering operation and in its final output. Fuzzy clustering, on the other hand, assigns each input pattern degrees of membership in several clusters. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster in which it has the largest measure of membership.
The advent of data mining, where relevant data need to be extracted from billions of disparate data items within one or more repositories, has furthered the development of clustering algorithms designed to minimize the number of scans and therefore place less load on servers. Incremental clustering is based on the assumption that patterns can be considered one at a time and assigned to existing clusters.
The process of data clustering is sometimes closely associated with such terms as
cluster
analysis, automatic classification and numerical taxonomy.
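A bare-bones sketch of the agglomerative ("bottom-up") approach described above, written in plain Python: every pattern starts as a singleton cluster, and the closest pair of clusters (single linkage, one-dimensional data) is repeatedly merged until the requested number of clusters remains.

    def agglomerative(points, k):
        clusters = [[p] for p in points]             # each pattern starts in its own cluster
        while len(clusters) > k:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # Single-linkage distance between the two clusters.
                    d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] += clusters[j]               # merge the closest pair
            del clusters[j]
        return clusters

    print(agglomerative([1.0, 1.2, 1.1, 8.0, 8.3, 15.0], k=3))
    # [[1.0, 1.1, 1.2], [8.0, 8.3], [15.0]]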
In the logical data model, a data modeler needs to describe all data in the most detailed way possible, regardless of how the physical database will be implemented. The logical data model includes the identification of all entities and the relationships among them. It also lists all the attributes of each entity being specified.
Each entity type will have one or more data attributes. In logical data modeling, data attributes should always be cohesive from the perspective of the domain; this is often a judgment call for the data modeler or data architect. Getting to the deepest level of detail can have a significant impact on development and maintenance efforts over the life of the implementation.
Data attributes will always exist for an entity, regardless of what the entity represents in the real business situation. For instance, a logical data model may have a Customer entity. The data attributes of the Customer entity may include, but are not limited to, first name, middle name, last name, address, age, gender, profession and many more.
Data processing is ultimately about data attribute values. These values represent the most tangible, least abstract part of data processing, and they are the core of any information management system.
The first normal form (1NF) states that an entity type is in the first normal form when it contains no repeating or redundant groups of data.
The second normal form (2NF) states that an entity type is in the second normal form when it is in 1NF and all of its non-key attributes are fully dependent on the primary key.
The third normal form (3NF) states that an entity type is in the third normal form when it is in 2NF and all of its non-key attributes are directly (not transitively) dependent on the primary key.
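As a hedged sketch of what these normal forms look like in practice (table and column names are purely illustrative), each attribute below lives in exactly one table and the non-key columns depend only on their table's primary key; the schema is expressed in plain SQL through Python's sqlite3 module.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT, city TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        product_id  INTEGER REFERENCES product(product_id),
        quantity    INTEGER
    );  -- no repeating groups (1NF), no partial or transitive dependencies (2NF/3NF)
    """)
    conn.close()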
As can be seen, knowing the correct data attributes of an entity, how to arrange them in tables, and how to define the correct relationships can greatly improve database performance.
At its most basic level, a logical data model defines the things about which data is kept, such as people, places and events; these are technically known in database terms as entities. The word relationship is another database term used in a logical data model, meaning the connection between the entities. Finally, the term attribute refers to the characteristics of the entities.
There are certain rules to follow in using attributes in logical data modeling, including the following:
1. An attribute should possess a unique name, and the same meaning must be consistently applied to that name.
2. An entity may have one or more attributes. Every attribute is owned by exactly one entity in a key-based or fully attributed model. This is also referred to as the Single Owner Rule.
5. An attribute which is not part of a primary key can be null, meaning not known. In the past this was forbidden by the No-Null Rule, which is no longer required; data modelers then refused to accept a non-key attribute that could be set to null.
6. Models should not contain two distinctly named attributes whose names are synonymous. Two names are synonymous if they are aliases for one another, whether directly or indirectly, or if there is a third name for which both names are aliases.
Attributes may be multi-valued, composite or derived. A multi-valued attribute can have more than one value for at least one instance of the entity. As an example, an Application entity for a piece of software may have a multi-valued attribute called Platform, because different instances of the same application may run on different platforms; the application may be a document processor that runs on Microsoft, Apple and Unix platforms.
A derived attribute is an attribute whose value is taken from other data, and may be the result of a formula. A person's age can be derived from another attribute, the birthday. Derived attributes are very common in business data warehouses, where atomic data are aggregated heavily to produce reports about the company profile.
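A quick sketch of a derived attribute in Python: the age is computed from the stored birthday rather than kept as a column of its own.

    from datetime import date

    def age(birthday: date, today: date) -> int:
        years = today.year - birthday.year
        # Subtract one year if the birthday has not yet occurred this year.
        if (today.month, today.day) < (birthday.month, birthday.day):
            years -= 1
        return years

    print(age(date(1990, 6, 15), date(2024, 3, 1)))   # 33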
There are actually three types of data cardinality each dealing with columnar value
sets.
These types are high-cardinality, normal-cardinality, and low-cardinality.
High data cardinality refers to columns whose values are very uncommon. For example, a column of social security numbers should always be unique for each person; this is an example of very high cardinality, and the same goes for email addresses and user names. Automatically generated numbers also have very high cardinality: a column named USER-ID, for instance, would contain a value that automatically increments every time a new user is added.
Normal data cardinality refers to columns whose values are somewhat uncommon but not unique. For example, a CLIENT table with a LAST_NAME column can be said to have normal data cardinality, as there may be several entries with the same last name, like Jones, alongside many other varied names in one column. On close inspection of the LAST_NAME column, one can see clumps of repeated last names side by side with unique last names.
Low data cardinality refers to columns whose values are not very unusual. Some table columns take very limited values: Boolean values, for instance, can only take 0 or 1, yes or no, true or false. Other low-cardinality columns are status flags, and yet another example is a gender attribute that can take only two values, male or female.
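One rough way to judge column cardinality is the ratio of distinct values to total rows; the short Python sketch below uses made-up column contents.

    def cardinality_ratio(values):
        return len(set(values)) / len(values)

    user_ids = [1, 2, 3, 4, 5, 6]                                   # high cardinality: all unique
    last_names = ["Jones", "Lee", "Jones", "Patel", "Lee", "Diaz"]  # normal cardinality
    genders = ["F", "M", "F", "F", "M", "M"]                        # low cardinality

    for name, column in [("USER-ID", user_ids), ("LAST_NAME", last_names), ("GENDER", genders)]:
        print(name, round(cardinality_ratio(column), 2))            # 1.0, 0.67, 0.33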
A data table's cardinality with respect to another data table is one of the most critical aspects of database design. For instance, a hospital database may have separate tables to keep track of patients and doctors, so a many-to-one relationship should be considered by the database designer. If the data cardinality and relationships are not designed well, the performance of the database will greatly suffer.
Data modeling is also a process of structuring and organizing data so that the resulting data structures become the basis for the implementation of a database management system. In addition to organizing and defining data, the data modeling process also implicitly or explicitly imposes limitations and constraints on the data within the structure.
Data models are based on business rules. Business rules are abstract, intangible concepts that the database management system, which is basically a computer system, cannot understand. Converting business rules to data models puts the data into a format that the computer can finally understand and implement.
Here is an example of a draft business rule that will be the basis of a data model:
F. Typical inquiries may include the typical selling price for a certain number of products
It is apparent from the draft business rule that data characteristics are
everywhere. As
mentioned earlier, data characteristics can either be developed directly through
measurement or indirectly through derivation, from a feature of an object or event.
For instance, if a product is a T-shirt, the data characteristics that are
developed by direct
measurement are
(1) material composition of the T-shirt,
(2) size range of available T-shirts,
(3) style of the T-shirt and
(4) supplier of the T-shirts.
On the other hand, some of the data characteristics which can be developed through
derivation may include
(1) bulk price of the T-shirts which can be derived depending on the number of
orders and
(2) shipment price of the T-shirts which also depends on the number of orders.
Data characteristics are very important in an area of data modeling called the entity-relationship model (ERM). The ERM is a representation of structured data whose final product is an entity-relationship diagram (ERD).
From the data models where the data characteristics are defined, relationships among the data items, more technically known as "entities", are defined. Data characteristics are known as "attributes" in data modeling jargon. A relationship could be as simple as "an employee entity may have an attribute which is a social security number".
There are various types of relationships in ERM: many-to-one, one-to-many, many-to-many and one-to-one. Database implementers need to give ERM design very careful consideration, because any slight failure can result in weak data integrity, and the resulting flaw can be hard to trace. As recommended by database experts, it is always good to draft a database plan in plain English, and data characteristics should be defined likewise.
In a logical data model, the conceptual data model, which is based on the business semantics, is elaborated. Entities and relationships, the corresponding table and column designs, object-oriented classes and XML tags, among other things, are laid out regardless of how the database will be physically implemented.
A data file is a physical file. This means that it is a file represented as real bits in the storage hardware of the computer.
Dealing with data files in a large data warehouse is not as simple as dealing with them on a stand-alone computer. Large data warehouses are managed by relational database management systems. In relational databases, entities refer to any things of interest about which data is kept, and these entities have attributes.
For example, a CUSTOMER is an entity in the database. The customer could have attributes such as First Name, Middle Name, Last Name, Customer Number, SSS Number and many more. When an entity is entered into the database, the database management system connects the entity with its attributes in a structure called a relation. An entity may also have multi-valued attributes, such as the number of places a customer has lived in his life. All of this information is saved as data files by the database management system.
Today's data warehouses also make intensive use of the extensible markup language (XML), which is a general-purpose markup language. XML is primarily used to facilitate the sharing of structured data across several information systems that may run on disparate servers, such as over the internet. XML is also used to encode documents and to serialize data so they are easier to process.
Since XML allows user-defined markup, data warehouses can use an XML data file to store information about an entity and use the information later. XML data files may reside anywhere within the computer's storage. When information about an entity is needed from an XML data file, the XML needs to be processed using a programming language in conjunction with either a SAX API or a DOM API.
A transformation engine or a filter can also be used. Newer techniques for XML processing include push parsing, data binding and non-extractive XML processing APIs.
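A minimal sketch of reading entity data from an XML data file with Python's standard, DOM-like ElementTree API (the xml.sax module provides the event-driven alternative); the XML layout and element names here are hypothetical.

    import xml.etree.ElementTree as ET

    xml_text = """
    <customers>
      <customer number="C001">
        <first_name>Maria</first_name>
        <last_name>Lopez</last_name>
      </customer>
    </customers>
    """

    root = ET.fromstring(xml_text)
    for customer in root.findall("customer"):
        print(customer.get("number"), customer.findtext("last_name"))   # C001 Lopez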
An entity can also be represented by manual data files. In fact, there are many instances where manual data filing is used instead of a database system or XML. For example, document files such as a last will and testament or long contract files may have to be stored separately as manual data files.
Also, large video or photo files pertaining to a person need to be stored as data
files too.
But there has to be a mechanism to reference these manual files so they relate in
ownership
to a data entity. Both the database management system and the XML technique can be
used to do the referencing.
A data element has a definition recorded in metadata. The definition may be a human-readable phrase or a sentence associated with the data element within the data dictionary. Having a good data element definition adds great benefit to processes such as mapping one set of data onto another.
Data is the main component of a data warehouse. Most businesses today are heavily
reliant
on information from a data warehouse. The term data-driven business is very much in
use
today with the ubiquity of the internet.
Within a common data architecture, data resources are integrated, and the data within the architecture are in turn based on the common data model.
Data modeling is the process of turning data into representations of the real life
entities,
events and objects that are of interest to the organization. So that the data
warehouse can
come up with consistent data, a data dictionary should also be set up.
The data dictionary, aside from containing the definitions of all data elements, also contains usernames with their corresponding roles and privileges, schema objects, integrity constraints, stored procedures and triggers, and information about the general database structure and space allocations.
The way data elements are stored within the database may vary depending on the database design and the relational database management software. But data elements are always the same in that they are atomic units of data carrying identifying information such as the data element name. A data element should have a clear definition and a representation in one or more terms.
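A hedged sketch of how a single data element entry might be represented in Python; the fields (name, definition, representation, length) are chosen for illustration only.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DataElement:
        name: str                     # data element name
        definition: str               # human-readable definition
        representation: str           # representation term, e.g. "string"
        length: Optional[int] = None  # maximum length, if applicable

    customer_last_name = DataElement(
        name="CUSTOMER_LAST_NAME",
        definition="Family name of the customer as printed on the invoice.",
        representation="string",
        length=60,
    )
    print(customer_last_name)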
How data elements are used depends on the application employing them, but their usage can be discovered by inspecting the applications, or the data files of the applications, through Application Discovery and Understanding, which can be done manually or automatically.
In Data Management, the design of the Data Structure determines the smooth and successful implementation of the database that will power the Data Warehouse.
Since a Data Warehouse is a rich repository of all sorts of data, from company data history to data from outside data sources, it is always a good idea to classify these data in order to get at the relevant information and generate statistical reports for the company. One example of such a classification is an Employee Subject Area, which contains all entities and other data attributes pertaining to employees.
Despite the segregation of the high volume of disparate data coming from various data sources in a Data Warehouse, there is still no assurance of effectively finding the most relevant data without help from tools and other retrieval techniques. Using Data Keys for data subjects can greatly help in the retrieval of important and relevant data from the database within the Data Warehouse.
Since Data Keys uniquely identify data occurrences in each data subject within the data resource, a data consumer looking for data is no longer forced to sift through the heavy volume; instead, the search is narrowed down by the use of a key.
For example, without the help of a Data Key, a data consumer who wants to look at the buying trend of, say, people within the age range of 20-30 years old may be confronted with data from all kinds of customers in the database.
But when the data consumer uses a Data Key that identifies the people who are 20-30 years old, the search is definitely narrowed down. As a result, the search is a lot faster and the server is not burdened with an intensive processing load.
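A small sketch of that idea with Python's sqlite3 module: an indexed key column lets the query touch only the matching rows instead of scanning every customer (table, column names and values are illustrative).

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                     [(1, 24, "Lyon"), (2, 41, "Paris"), (3, 28, "Nice")])
    conn.execute("CREATE INDEX idx_customer_age ON customer(age)")   # the data key

    rows = conn.execute(
        "SELECT customer_id, city FROM customer WHERE age BETWEEN 20 AND 30").fetchall()
    print(rows)   # [(1, 'Lyon'), (3, 'Nice')]
    conn.close()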
There are several layers in a Data Warehouse architecture, and one of them covers the extraction, cleansing and transformation of source information. This is the layer where Data Keys are attached to certain data so that it is easy to find only the relevant data in the warehouse.
The Data Key is one of the important aspects of data structures used generally in the information technology field. In cryptography, a data key is a variable value that is applied to a block of text or a string, and the encrypted data can only be opened with the key.
A Data Key can also be an actual physical object that stores digital information and is required to gain access to data. A key analyzer is an associated program or mechanism that enables a computer to process the data.
What is Cardinality
The Link Cardinality is a 0:0 relationship, defined as neither side needing the other in order to exist. For example, in a person and parking space relationship, it denotes that I do not need a person to have a parking space, and I do not need a parking space to have a person either. It also denotes that a person can occupy only one parking space. This relation needs one entity nominated as the dominant table, with programs or triggers used to limit the number of related records stored in the other table of the relation.
The Sub-type Cardinality is a 1:0 relationship, defined as having one optional side only. An example would be a person and programmer relation: a person can be a programmer, but a programmer must always be a person. The mandatory side of the relation, in this case the programmer side, is dominant in the relationship. Triggers and programs are again used to control the database.
A nullable foreign key column in the phone table is used to reference the person in its table.
The Child Cardinality is a 1:M mandatory relationship and is one of the most common cardinalities.
These data may come from other warehouse data sources, be entered fresh by staff
within various departments, or be data coming from company subscriptions.
These data will very likely come in different formats, but the purpose of having a
data warehouse is to give the company a clear statistical report of industry trends
and patterns, so data warehouses should have a mechanism for coming up with analysis
and reporting tools.
For a business to have an intelligent system that relies on the data supplied by the
data warehouse, a well defined business data architecture is a very important
consideration. Just as when we build a house, a good plan or blueprint is essential to
facilitate the smooth flow of construction and to make sure that all the materials,
interior setup, design and other specifications are accounted for, so that carpenters,
masons, electricians and other building professionals with different areas of
specialization can agree on one standard and the house will not go into disarray.
The same is true within a business organization. Different companies can have
different perspectives on the word transaction. For instance, for an organization
offering flower shop services, the word transaction means something entirely different
than it does for an organization offering computer services.
The fact is that in the design of software programs, the choice of Data Structures is
the top design consideration. The experience of many IT professionals in building
large systems has shown that the degree of difficulty in software implementation, and
the performance and quality of the final output, are heavily dependent on choosing the
best Data Structure. So as early as the planning stage, the definition of a Data
Structure is already given much time and attention.
In today's data warehouses, where distributed systems are common, a Common Data
Structure can make it easy to share information between servers in distributed
systems. Distributed systems are composed of many computer servers, each processing
business events and sharing the results to be aggregated and used as statistical
reports for the company.
A Data Entity represents a data subject from the common data model that is used in the
logical data model.
A Data Model has three theoretical components. The first is the structural component,
a collection of data structures that will be used to represent entities or objects in
the database. The second is the integrity component, referring to the collection of
governing rules and constraints on data structures. The third is the manipulation
component, referring to a collection of operators that can be applied to data.
A Data Entity is one of the components defined in a logical data model. A logical data
model is a representation of all of the organization's data, organized in terms of a
data management technology. In today's database technology, there are choices of
logical data model relating to relational, object oriented or XML approaches. In the
relational approach, the Data Entity is described in terms of tables and columns in
the database. An object oriented Data Entity is described in terms of classes,
attributes and associations. An XML Data Entity is described in terms of tags, similar
to the web's HTML.
An entity defined in the conceptual data model may be any kind of thing about which a
company wants to keep information, together with attributes pertaining to that
information and relationships among the entities. For example, a company may have a
person entity stored as a "CUSTOMER" Data Entity in its database. Or it could be the
other way around, with an abstract entity "PERSON" representing different real life
entities such as customers, vendors, managers, suppliers and many more, all defined in
the conceptual data model. The conceptual data model is about the definition of
abstract entities, and these definitions are done in a natural language.
The Data Entity is actually defined in the logical data model, which is the underlying
layer for the physical implementation of a database. Based on the abstract entities
defined in natural language in the conceptual data model, the placement of data
entities in columns is specified. The logical data model allows an end user to access
and manipulate a relational database without having to know the structure of the
relational database itself. As such, the first part of creating a logical data model
is in fact specifying which tables are available and then defining the relationships
between the columns of those tables. It is in the logical data model where the
structure of the data fields of the Data Entity, with a master level and a plurality
of detail levels, is defined.
The table is the basic structure of the relational model, and this is where
information about a Data Entity, for instance an employee, is represented in columns
and rows. The values of a named column are called attributes of the Data Entity. The
term relation refers to the tables within the database. For example, the columns in
the employee table enumerate the attributes pertaining to the employee, such as name,
gender, age, address, marital status and others. A row is an instance of the entity
represented in the relation.
What is Data Restructuring
Data Restructuring is the process of restructuring the source data into the target
data during data transformation. Data Restructuring is an integral part of data
warehousing. A very common set of processes is used in running large data warehouses;
this set of processes is called Extract, Transform, and Load (ETL).
The general flow of ETL involves extracting data from outside sources, then
transforming it based on business rules and requirements so that the data fit the
business needs, and finally loading the data into the data warehouse.
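The following is a minimal, hypothetical Python sketch of that extract-transform-load
flow. The source rows, the business rules and the target list are all stand-ins; a
real implementation would read from source systems and bulk load into the warehouse.

# A minimal, hypothetical sketch of the extract-transform-load flow described above.
def extract():
    # Pretend these rows came from two disparate source systems.
    return [
        {"customer": "ramesh", "amount": "120.50", "currency": "usd"},
        {"customer": "ANITA", "amount": "75", "currency": "USD"},
    ]

def transform(rows):
    # Apply business rules: standardize names, cast amounts to numbers.
    return [
        {
            "customer": row["customer"].title(),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        }
        for row in rows
    ]

def load(rows, warehouse):
    # In a real system this would be a bulk insert into the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)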
If one looks closely at the process, the data restructuring part comes before the
loading. This is extremely necessary. For one, in a data warehouse environment, high
volumes of data come into the data warehouse, usually at very short intervals. In most
cases, the data could come from disparate sources; this means that the servers where
the data comes from may be run by different software platforms, so the data may be in
different formats, or the sources may be based on different data architectures that
are not compatible with the data architecture of the receiving data warehouse.
When all the data come in from the different sources, there is a need for the data to
be restructured so they comply with all the business rules as well as the overall data
architecture of the data warehouse. Data restructuring makes the data structures more
sensible to the database behind the data warehouse.
Data structure analysis includes making sure that all the components of the data
structures are closely related, that closely related data are not in separate
structures, and that the best type of data structure is being used. The data may be a
lot easier to manage and understand when it is represented in a way that abstracts its
relevant similarities.
Often, in data warehouses, data restructuring involves changing some aspects of the
way in which the database is logically or physically arranged. There are many reasons
why data restructuring should be performed. For instance, data restructuring is done
to make a database more desirable by improving performance and storage utilization, or
to make an application more useful in supporting decision making or data processing.
. Trimming
. Flattening
. Stretching
. Grafting
In trimming, the extracted data from the input is placed in the output without any
change in the hierarchical relationships, but some unwanted components of the data are
removed.
In flattening, the hierarchy of a branch is collapsed by extracting all information at
the level of the values of the basic attributes of the branch.
The stretching operation can produce a data structure output that has more
hierarchical levels than the input.
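As a small, hypothetical illustration of two of these operations, the Python sketch
below trims an unwanted component out of a nested record and then flattens the same
record down to its basic attribute values. The record layout and field names are
invented for the example.

# Hypothetical sketch of two of the restructuring operations named above:
# trimming removes unwanted components, flattening collapses the hierarchy.
record = {
    "order_id": 1001,
    "customer": {"name": "Ramesh", "internal_notes": "do not export"},
    "lines": [{"product": "Rice", "qty": 20}],
}

def trim(rec):
    # Keep the hierarchy but drop an unwanted component.
    trimmed = dict(rec)
    trimmed["customer"] = {k: v for k, v in rec["customer"].items()
                           if k != "internal_notes"}
    return trimmed

def flatten(rec):
    # Collapse the hierarchy down to the basic attribute values.
    flat = {"order_id": rec["order_id"],
            "customer_name": rec["customer"]["name"]}
    for i, line in enumerate(rec["lines"]):
        flat[f"line{i}_product"] = line["product"]
        flat[f"line{i}_qty"] = line["qty"]
    return flat

print(trim(record))
print(flatten(record))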
One of the most important roles that data restructuring plays is in the field of
information processing applications. When data is extracted from the data sources and
new fields are created and placed in the output, the data structure of the resulting
output sometimes does not resemble that of the input.
Sometimes, query facilities that are designed for simple retrievals are not adequate
to handle many real world scenarios, so some programming may be required. But
programming may not be for everyone, even for database administrators. Making the most
of data restructuring may actually help eliminate some of the need for programming.
With properly restructured data within a relational database, simple queries may
actually be enough even for retrieving relatively complex and aggregated data
structures.
A Data Structure represents both the physical and logical data contained in a common
data architecture. It includes data entities, data subjects, their contents,
relationships and arrangement.
A data component refers to a component of the metadata warehouse that contains the
structure of data within the common data architecture.
In general,
. data structures define what and how data will be stored in the database or data
warehouse
A real enterprise wide data warehouse has a very complex data architecture. A data
warehouse is a repository of all enterprise related data coming from various data
sources, such as those coming from different departments (i.e. Finance,
Administrative, Human Resource departments).
Some of the high volumes of data will be stored in large legacy or packaged systems
whose data structure may be unknown. Other enterprise data may be contained in
spreadsheets and smaller personal databases such as Microsoft's Access, and these data
may not be known to the IT department.
Some of the key information may reside in external information systems that are
maintained by third party service providers or business partners.
Without a well defined data architecture and data structure, there can be very little
control over the realization of high level business data concepts, while data will
likely be highly dispersed and of poor quality.
Another negative effect is that much of the data may be redundant across the system,
which may result in conflicts in the organizational and business processes.
Data structures can be defined with high level data models that describe the business
data from a logical perspective, independent of any actual system. This model may be
composed of a canonical class model of the business entities and their corresponding
relationships, plus the semantics, syntax and constraints of a superset of business
attributes. These high level data models defining data structures often exclude class
methods, but in most cases the methods may be summarized into one business data object
that is responsible for managing the structure.
Data structures also refer to the way that data are stored in terms of relational
tables and columns, as well as how they can be converted into objects in
object-oriented classes and how they can be structured with XML tags.
Metadata is basically any data about other data, and it is very useful in facilitating
a better understanding, management and use of data. It is therefore a common practice
in many data warehouse implementations to include a metadata warehouse to enhance the
performance of the whole system.
A metadata warehouse usually acts as the interface in the data exchange between the
data warehouse and the business intelligence system. Since the metadata warehouse does
not really contain the full data but just a description of it, the data structure is
contained in the data component. Data components are very useful in data warehouses
implemented in distributed heterogeneous environments.
Data Structure Integrity
Data integrity in general is the measure of how well the data is maintained within the
data resource after it has been created or captured. Data Structure Integrity is the
subset of data integrity that governs data relations. To say that data has high
integrity means that the data functions in the way it was intended to.
A data structure integrity rule defines the specification of a data cardinality for a
data relation in circumstances where no exceptions apply. This rule makes the data
structure a lot easier to understand.
A conditional data structure integrity rule is slightly different in that it only
applies to a data relation when there are conditions or exceptions that apply. This
data structure integrity rule also provides a way to express options for coded data
values, which are typically difficult to show on an entity relation diagram.
To better illustrate the use and benefits of data structure integrity, let us say that
we have two tables within the database. The first table, let's call it the "Persons"
table, contains a list of names. Let us have another table and call it "PhoneNumbers".
In the real world, people may have one, more than one, or no telephone at all. In
database terms, the two tables would then have three kinds of relationships: one to
one, one to zero, and one to many. This literally means that every person in the
"Persons" table may have one, zero or many phone numbers in the "PhoneNumbers" table.
It is worth noting that every phone number within the "PhoneNumbers" table belongs to
one and only one person within the "Persons" table. By not applying the data structure
integrity rule, the two tables may end up with data mixed up in complicated
relationships. It could result in data redundancy, which can significantly slow down
the whole system and lead to data inconsistencies.
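A minimal sketch of this Persons / PhoneNumbers example follows, using Python's
built-in sqlite3 module. The column names are invented; the point is that a foreign
key constraint is one concrete way a DBMS enforces the structural integrity rule that
every phone row must point at exactly one existing person.

# A minimal sketch of the Persons / PhoneNumbers example using sqlite3; the
# foreign key constraint enforces the data structure integrity rule.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite enforces FKs only when asked
conn.execute("CREATE TABLE Persons (person_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE PhoneNumbers (
        phone_id  INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES Persons(person_id),
        number    TEXT
    )""")
conn.execute("INSERT INTO Persons VALUES (1, 'Joseph Smith')")
conn.execute("INSERT INTO PhoneNumbers VALUES (1, 1, '555-0100')")  # valid: person 1 exists
try:
    # This violates the rule: phone rows must reference an existing person.
    conn.execute("INSERT INTO PhoneNumbers VALUES (2, 99, '555-0199')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)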
Database normalization is also one of the biggest factors that can help an
implementation achieve data structure integrity. In database normalization, the tables
are checked to make sure there is no redundant data, and desirable properties of the
tables are selected from a logically equivalent set of alternatives.
Referential integrity rules in relational databases make sure that data is always
valid. There can be any number of referential integrity rules, depending on the needs
or requirements of the data model and the business rules. The only thing to take
careful notice of is that, in the end, the data structure integrity is always
maintained in the effort of meeting all the requirements.
The use of very precise rules for data integrity solves a lot of the problems
pertaining to data quality that are very prominent in many data warehousing
implementations in both public and private sector organizations.
Precise rules for data integrity reduce the impact of bad information and allow many
organizations to devote their limited resources to more value added undertakings. They
also help many business organizations in their quest for identifying accountability in
the area of data warehouse management, which is as important as other areas like human
resources, finance and sales.
With careful planning, from the data architecture to the physical implementation, data
structure integrity can surely be achieved to give the company quality data as the
basis for sound decisions.
Data Storage
Enterprise Storage
Because of the scattered nature of data storage within the whole enterprise, a network
is installed to connect the different data storage computers. A network administrator
works in close collaboration with the enterprise storage administrator to overcome the
pressure of providing secure and resilient storage to users, groups and computer
resources within a multi-platform heterogeneous environment. Some of the network
protocols used in an enterprise storage are CIFS, NFS, HTTP/DAV, FTP and iSCSI.
An enterprise storage could very well work in conjunction with a data warehouse. But
whereas a data warehouse is by nature very dynamic, as it deals with both historical
and current transactional data, an enterprise storage may be tasked to contain data
that is not actively used at any given moment, with the data warehouse only fetching
data as the need arises. Nevertheless, companies should invest in very stable and
robust data storage hardware to prevent any problems with data integrity, security and
availability.
Enterprise System Connection Architecture (ESCON)
ESCON can offer a communication rate of about 17 MB/second over distances of up to
43 kilometers using a half duplex medium. This technological architecture was
introduced around 1990 by IBM to replace the much older and slower copper-based Bus &
Tag channel technology of 1960-1990 era mainframes.
The copper-based Bus & Tag channel technology was very unwieldy, as shielded copper
cable allowing a throughput of 4.5 MB/s (45 Mb/s, approximately equivalent to a T-3
connection) had to be installed all over the data storage and processing center.
Supplanting ESCON in turn is Fiber Connectivity (FICON), which is substantially faster
as it runs over a fiber channel.
Light-weight fiber optic cables, multimode 62.5 micron supporting distances of up to
3 kilometers and single mode 9 micron supporting up to 20 kilometers, are used for the
ESCON technology. It also uses signal regenerators such as the General Signal Networks
CD/9000 ESCON Director or CX Converter.
The ESCON system takes care of the structure of the high speed backbone network used
in a data storage and processing center. This center also serves as a gateway to other
attached networks that have lower speeds. Some of the essential configurable elements
of the ESCON network are the fiber optic links, the ESCON channels, the ESCON Director,
and the ESCON control units.
On the software side, the support functions include the ESCON Manager program for the
ESCON Director and the ESCON Dynamic Reconfiguration Management configuration control.
With switches through the ESCON Directors, customers can create a high speed, switched,
multi-point topology for dynamic connectivity of inter-data center applications.
The ESCON system was built to address some of the major concerns related to the
interconnection of systems, control units and channels: system disruption, complexity,
cable bulk, cable distance, and the need for increased data rates. The general
advantages of using ESCON are reduced cable bulk and weight, greater distance
separation between devices, more efficient use of channels and adapters, higher
availability, and a basis for growth and I/O performance.
Hence, ESCON permits operating systems and applications to run unchanged on computers,
and permits the insertion of additional control units and systems into running
configurations without having to turn off the power, thus avoiding scheduled and
unscheduled outages for installation and maintenance.
ESCON also improves the interconnection capability within data processing centers,
which embraces intersystem connections and device sharing between systems. It allows
an increased number of devices to be accessible by channels, which can be very useful
for large companies with large enterprise storage needs and big data warehouses.
Since today's companies can be said to be efficient only if they are data-driven, many
data storage facilities are placed strategically in different locations to complement
the business intelligence system. ESCON allows the extension of the distance for
direct attachment of control units and direct system-to-system interconnection in the
enterprise storage environment. It can also provide significantly higher instantaneous
data rates for simultaneously serving data consumers and gathering data from sources.
As a data warehouse or enterprise storage system grows, so will the need for computing
resources. Since ESCON uses an optical interface, it can significantly reduce the bulk
and number of cables required to interconnect the system elements of a data storage
and processing complex.
Historical Database
Dealing only with the current value of the data may benefit a company by requiring
less spending on additional software and hardware acquisition, but more often,
investing in the historical perspective of the data within the data warehouse brings
more benefit than not having a historical database.
True, having a historical database means that a business organization will have to buy
additional computer servers with more computing power, random access memory and larger
storage, because processing the historical perspective of data can really be daunting
and labor intensive work. But in the long run, the return on investment can be
surprisingly big.
The performance monitor is a very valuable tool which can help eliminate troublesome
profit destroyers such as oscillations and swings.
When a problem appears, the first step would be to check the code logic for any
logical bug. And if there is no problem with the code, the next step would be to check
the value of the data entered and compare it with the values recorded in the past.
With a problem like this, the solutions would definitely come from the historical
database.
A historical database is also very useful in spotting trends and patterns in the
operations of the company and in seeing how well the company is performing compared to
other players in the same industry.
In fact, most companies will want to see a "panoramic" view of performance, not just of
one item but of all the items and all the business events and transactions, including
the trending in sales, income, human resources, manufacturing and all other facets of
the business. So, going back to the trending in the sale of a particular item, the data
to be considered may include sales from as far back as two years. Data from the past
are already stored in the historical database so as not to overload the current
transactional database.
A Data Replicate is a set of data copied from a data site and placed at another data
site during Data Replication. It is also a set of data characteristics from a single
data subject or data occurrence group that is copied from the official data source and
placed at another data site. Data Replicates are not the same as redundant data.
Data Replication refers to a formal process of creating exact copies of a set of data
from the data site containing the official data source and placing those copies at
other data sites. Another aspect of Data Replication is the process of copying a
portion of a database from one environment to another and keeping the subsequent
copies of the data in sync with the original source. Changes made to the original
source are propagated to the copies of the data in the other environments.
Data Replication is a common occurrence in large data warehouses; it helps the system
function efficiently and guards against entire system failure. Many data warehouse
systems use Data Replication to share information in order to ensure consistency among
redundant resources like hardware and software components.
It is a case of data replication if the same data is stored on multiple storage
devices, and of computation replication if the same computing task is executed many
times. In general, a computational task may be replicated in space, such as being
executed on separate devices, or replicated in time, such as being executed repeatedly
on one device.
Data Replication is transparent to an end user. A data consumer would really not know
from which of the data sources the data he or she is using is coming, because he or
she only gets the impression of one monolithic data warehouse. Access to any
replicated data is usually uniform with access to a single, non-replicated entry.
There are in general two types of Data Replication: active and passive replication.
Active Data Replication refers to the process wherein the same request is performed at
every data replica. Passive Data Replication, on the other hand, is done with each
request being processed on one replica and the resulting state then being transferred
to the other replicas. If at any given time one master replica is designated to handle
the processing of all requests, this is what is referred to as the primary-backup
scheme (master-slave scheme). This scheme is predominantly used in high-availability
clusters.
On the other hand, if any replica can process a request and then distribute a new
state, this is what is referred to as the multi-primary scheme (called multi-master in
the database field). In this scheme, some form of distributed concurrency control,
such as a distributed lock manager, needs to be employed.
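As a toy, hypothetical illustration of the primary-backup (passive) scheme described
above, the Python sketch below has a single primary process a write and then push its
new state to the backup replicas. The class and attribute names are invented for the
example; a real system would also handle failover and network communication.

# A toy sketch of the primary-backup (passive) scheme: only the primary
# processes writes, then ships its new state to the backup replicas.
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply_state(self, state):
        # Passive replicas simply install the state pushed by the primary.
        self.state = dict(state)

class Primary(Replica):
    def __init__(self, name, backups):
        super().__init__(name)
        self.backups = backups

    def handle_write(self, key, value):
        # The primary alone executes the request...
        self.state[key] = value
        # ...then propagates the resulting state to every backup.
        for backup in self.backups:
            backup.apply_state(self.state)

backups = [Replica("backup-1"), Replica("backup-2")]
primary = Primary("primary", backups)
primary.handle_write("sales_total", 1200)
print([r.state for r in backups])   # both backups now hold the same state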
In the area of distributed systems, Data Replication is one of the oldest and most
important aspects, and a number of well known replication models have been developed
for it.
In statistics, the term official data refers to data collected in the different kinds
of surveys commissioned by an organization. It also refers to administrative sources
and registers. These data are primarily used for purposes like making policy
decisions, facilitating industry standards, and outlining business rules and best
practices, among many other things.
Non-official data, on the other hand, are data coming from external sources. In
business, these non-official data may come from other data sources that are randomly
selected by an organization. A lot of new markets today, like e-commerce and mobile
technologies, commonly use more detailed data from non-official sources.
It is also common for many business organizations to publish business data outputs on
their official websites and offer the data as freely accessible to anybody. When
another company gets these data, it is getting non-official data integrated into its
data warehouse. Problems of reliability can arise in these cases, as the data managing
expertise of one organization can vary from that of another.
Chain Data Replication involves having the non-official data set distributed among
many disks, which can provide load balancing among the servers within the data
warehouse. Blocks of data are spread across many clusters, and each cluster can
contain a complete set of replicated data. Each data block in each cluster is a unique
permutation of the data in the other clusters.
When a disk fails in one of the servers, any request for data from the failed server
is automatically redirected to the other servers whose disks contain an exact replica
of the non-official data.
During disk installation and the loading of a replica, services provided by the
existing array of disks are not affected, as there are no additional I/O requests to
the array of disks and the replicas are generated by the loading process itself. When
the loading of the replica is done, the new replica can start servicing data requests
from various sources.
In terms of load balancing, Chain Data Replication works by having multiple servers
within the data warehouse share data request processing, since the data already has
replicas on each server's disk.
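The following hypothetical Python sketch illustrates the two behaviors just described:
reads are balanced round-robin across servers that hold replicas of the same data, and
a failed server simply shifts its load to its peers. The server names and the replica
contents are made up for the example.

# Hypothetical sketch of balancing reads across servers holding replicas of the
# same data, redirecting automatically when one server's disk fails.
import itertools

class Server:
    def __init__(self, name, data):
        self.name, self.data, self.healthy = name, data, True

    def read(self, key):
        if not self.healthy:
            raise IOError(f"{self.name} is down")
        return self.data[key]

replica = {"customer:42": "Ramesh"}
servers = [Server("srv-a", dict(replica)), Server("srv-b", dict(replica))]
round_robin = itertools.cycle(servers)

def balanced_read(key, attempts=len(servers)):
    # Try each server in turn; a failed disk just shifts load to its peers.
    for _ in range(attempts):
        server = next(round_robin)
        try:
            return server.read(key)
        except IOError:
            continue
    raise IOError("no replica available")

servers[0].healthy = False            # simulate a disk failure on srv-a
print(balanced_read("customer:42"))   # still served, from srv-b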
Automatic Data Replication is the process wherein created data and metadata are
automatically replicated based on the request of a client at a specific data site. A
data site may maintain many computers working as a system to manage one or more data
warehouses. These warehouses are repositories of millions upon millions of data items,
and more are gathered, aggregated, distributed and updated every second.
Data replication makes use of redundant resources like hardware or software so that
the whole data site system can have improved reliability and performance and become
tolerant of unexpected problems arising from load intensive processes.
Active storage replication is done by having updates to data block devices distributed
to many separate physical hard disks. The file system can be replicated without any
modification, and the process is implemented either in a disk array controller in
hardware or in device driver software.
Data replication is also employed in distributed shared memory systems. In these
systems, many nodes share the same page of memory, which means data is replicated on
different nodes. This is used to boost speed and performance in large data warehouses.
Search engines, which maintain the biggest warehouses of data and metadata and index
them every second, make the most intensive use of automatic data replication as they
serve public internet users around the world.
Load balancing, despite being different from data replication, is often associated
with it, although it only distributes loads of different computations across many
machines. Backup, while it also involves making copies of data, is different from data
replication in that the saved data cannot be changed for a long period, even while the
replicas are constantly updated.
Both load balancing and backup are important processes in large data warehouses. Many
business companies invest in data warehouses with automatic data replication mainly to
take advantage of enhanced availability of specific and general data and to have
disaster recovery protection. Other benefits of having a data warehouse with automatic
data replication include tolerance of disasters, ease of use and management, and a
more robust system.
Data Quality
What is Data Denormalization
It is common that a normalized database stores different but related data in separate
logical tables. These tables are called relations. In big data warehouses, some of
these relations are physically contained in separate disk files. Thus, issuing a query
that gets information from different relations stored in separate disk files can be
slow, and it can be even slower if many relations are being joined.
To overcome this problem, it is good to keep the logical design normalized while
allowing the database management system to store separate redundant data on disk so
that query response may be optimized. While doing this, the DBMS should be responsible
for ensuring that the redundant replicas are kept consistent at all times. In some SQL
products this feature is called indexed views, while in others it is called
materialized views. Here, the term view refers to the information laid out in a format
convenient for querying, with the index ensuring that queries against the view are
optimized.
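A minimal sketch of this idea follows, using Python's built-in sqlite3 module. Since
sqlite has no native materialized views, the sketch keeps a redundant summary table
that plays that role and simply refreshes it on demand; the table names and sample
rows are invented for illustration.

# A minimal sketch of a redundant summary table playing the role of a
# materialized view; here it is refreshed on demand rather than by the DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.execute("CREATE TABLE sales_summary (product TEXT PRIMARY KEY, total_qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Rice", 20), ("Rice", 5), ("Wheat", 12)])

def refresh_summary(conn):
    # The DBMS (or, here, the application) is responsible for keeping the
    # redundant copy consistent with the normalized base table.
    conn.execute("DELETE FROM sales_summary")
    conn.execute("""INSERT INTO sales_summary (product, total_qty)
                    SELECT product, SUM(qty) FROM sales GROUP BY product""")

refresh_summary(conn)
# Queries now read precomputed totals instead of re-aggregating the base table.
print(conn.execute("SELECT * FROM sales_summary ORDER BY product").fetchall())
# [('Rice', 25), ('Wheat', 12)]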
There are in general three categories of data model: the conceptual data model, which
is used to explore domain concepts with project stakeholders; the logical data model,
which is used to explore the relationships among domain concepts; and the physical
data model, which is used to design the internal schema of the database, with a focus
on the data columns of tables and the relationships between tables.
Denormalizing the logical data with extreme care can also result in an improvement in
query response, but this comes with a cost. It will be the responsibility of the data
designer to ensure that the denormalized database does not become inconsistent. This
can be achieved by creating database rules called constraints, which specify how the
redundant copies of data are kept synchronized. The real cost of this process is the
increase in the logical complexity of the database design as well as the complexity of
the additional constraints. The key to denormalizing logical data is exerting extreme
care, as constraints can create overhead on updates, inserts and deletes, which may
cause bad performance.
Data warehouses, where a rich repository of company data may be found, are run by
database management systems that need to see homogeneous data in order to flow
smoothly and process the data into statistical reports about company trends and
patterns.
But a problem arises because data warehouses gather, extract and transform data from a
variety of sources. This means that data may come from a server with a totally
different hardware structure, and the software behind that server may format data
differently. When this data arrives at the data warehouse, it is mixed with data from
yet other servers belonging to disparate systems.
This is where data cleansing comes in. Data cleansing, also referred to as data
scrubbing, is the act of detecting and subsequently either removing or correcting a
database's dirty data. Dirty data refers to data that is out of date, incorrect,
incomplete, redundant or formatted differently. The goal of the data cleansing process
is not just to clean up the data within the database but also to bring consistency to
the different sets of data that have been lumped together from separate databases.
Although many people treat them as the same, data cleansing differs from data
validation in that validation almost invariably means that data is rejected from the
system right at entry time, whereas data scrubbing is done on batches of data.
After the data has been cleansed, the data set will be consistent and can be used
together with similar data in the system, so the database can have a standard process
for utilizing the data. It is a common experience in data warehouse implementations to
detect and remove inconsistent data that was supplied from different data dictionary
definitions of similar entities in different stores. Other data problems may be due to
errors during end user entry activities or to corruption during data transmission or
during storage after receipt from the source.
A good way to guarantee that data is correct and consistent is to have a
pre-processing period in conjunction with data cleansing. This will help ensure that
data is not ambiguous, incorrect or incomplete.
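The following small, hypothetical Python sketch shows what such a pass might look like
on a handful of customer records: formats are standardized, rows that fail a simple
validation check are rejected, and exact duplicates are dropped. The field names and
sample rows are invented for the example.

# A small, hypothetical cleansing pass: standardize formats, reject rows that
# fail validation, and de-duplicate the rest before loading.
raw_rows = [
    {"name": " joseph smith ", "email": "JOE@EXAMPLE.COM"},
    {"name": "Joseph Smith",   "email": "joe@example.com"},      # duplicate
    {"name": "",               "email": "no-name@example.com"},  # incomplete
]

def cleanse(rows):
    clean, seen, rejected = [], set(), []
    for row in rows:
        name = row["name"].strip().title()
        email = row["email"].strip().lower()
        if not name or "@" not in email:
            rejected.append(row)           # validation: reject the row
            continue
        key = (name, email)
        if key in seen:
            continue                       # scrubbing: drop exact duplicates
        seen.add(key)
        clean.append({"name": name, "email": email})
    return clean, rejected

clean, rejected = cleanse(raw_rows)
print(clean)      # one standardized Joseph Smith record
print(rejected)   # the incomplete row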
Data cleansing is an important aspect of the goal of achieving quality data in data
warehouses. Data are said to be of high quality if they fit the purposes of the
correct operations, decision making and planning of the organization implementing the
data warehouse. This means that the quality of data is gauged by how realistically
they represent the real world constructs to which they are designed to refer.
The data warehouses of an organization are filled with data that reflect all the
activities within the group. Data may come from various sources and be gathered using
routine business processes. It is imperative that the processes in the data warehouse
be precise and accurate, because the usefulness of data goes far beyond the software
applications that generate it.
Companies depend heavily on data from the business data warehouse for decision support
systems. Data are frequently integrated with many other applications and connected
with external applications over the internet, so data is continually expanding at
tremendous proportions.
Data quality has been a persistent problem for many data warehouses. Data managers or
administrators have found it a cumbersome task to fix erroneous data or to change
processes to ensure accuracy, and less important data have often been overlooked.
Business companies have taken great efforts to build data warehouses with data quality
requirements, and they make intensive assessment an integral part of any data project.
In order to achieve Data Accuracy and good quality, data professionals should
understand the fundamentals of data, which are quite simple.
The quality of data has many dimensions. These include accuracy, timeliness,
completeness, relevance, being easily understood by end users, and being trusted by
end users.
Data Decay can lead to inaccurate data. Many data values that are accurate can become
inaccurate through time; hence, data decay. For example, people's addresses, telephone
numbers, number of dependents and marital status can change, and if not updated, the
data decays into inaccuracy.
Data Accuracy is a very important aspect of data warehousing. While the problem can
still persist, companies can take measures to minimize, if not eliminate, data
inaccuracy. Investing in high powered computer systems and top of the line database
systems can have long term benefits for the company.
Consistent Data Quality refers to the state of a data resource where the quality of
existing data is thoroughly understood and the desired quality of the data resource is
known. It is a state where disparate data quality is known, and the existing data
quality is being adjusted to the level desired to meet the current and future business
information demand.
Data are said to be of high quality, according to J.M. Juran, "if they are fit for
their intended uses in operations, decision making and planning". In business
intelligence, data are of high quality if they accurately represent the real life
constructs that they refer to.
Data warehouses are the main repositories of company business data, including all
current and historical data. Business intelligence relies mainly on these data
warehouses to identify industry trends, and decision makers then act on the
information recommended by the business intelligence system.
And so companies should place a strong emphasis on having consistent data quality so
they do not get garbage information from the data warehouse. Marketing efforts
typically focus on name, address and client buying habit information, but data quality
is important in all other aspects as well. The principle behind quality data
encompasses other important aspects of enterprise management, like supply chain data
and transactional data.
The difficult part of dealing with data is that it may sometimes be very difficult,
or in extreme cases impossible, to tell which is good quality data and which is bad
quality data. Both could be reported as identical through the same application
interface. But there are some guides for improving and maintaining consistent quality
data within the business organization.
. It is important to involve the users. People are the main doers of data entry so
they can be used as the first line of defense. At the other end, people are also the
final consumers of data, so they can also be the last line of defense for consistent
data quality.
. Having somebody, or a group of skilled and dedicated staff, to monitor the business
processes is a good move for the company. Data may actually start as good data but
turn bad through time as it decays. As an example, a project prospects list will
definitely get out of date. Decayed data, data which become irrelevant through time,
are hard to detect and could cause damage and lots of monetary losses. Good business
process monitoring ensures timely and accurate updates. It is also important to
streamline processes when possible so that the number of hands touching the data is
minimized and the chances of corrupting data are greatly reduced.
. The use of good software can help maintain consistent data quality. There are many
credible software vendors from whom a company can buy applications.
There are millions and millions of data items stored in the database, and this number
continues to increase every day as a company heads for growth. A group of processes
called extract, transform, load (ETL) is periodically performed in order to manage
data within the data warehouse.
A data warehouse is a rich repository of data, most of which is historical data from
the company. But in modern data warehouses, data can come from other sources as well.
Having data from several sources greatly helps the overall business intelligence
system of a company. With diverse data sources, the company can have a broader
perspective not just on the trends and patterns within the organization but on global
industry trends as well.
Getting a view of trends and patterns based on the analytical outputs of the business
intelligence system can be a daunting task. With those millions of data items, most of
them disparate (but of course ironed out by the ETL process), it may be difficult to
generate reports.
Dealing with big volumes of data for the consistent delivery of business critical
applications can already strain the network management tools of a company. Many
companies have found that existing network management tools can hardly cope with the
great bulk of data required by the organization to monitor network and application
usage.
The existing tools can hardly capture, store and report on traffic with the speed and
granularity that are requirements for real network improvements. In order to keep the
volume down and speed up network performance for effective delivery, some network
tools discard the details. What they do is convert detailed data into hourly, daily or
weekly summaries. This is the process called data generalization or, as some database
professionals call it, rolling up data. Ensuring network manageability is just one of
the benefits of data generalization.
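A tiny, hypothetical Python sketch of this roll-up idea follows: detailed timestamped
traffic measurements are generalized to the day level and aggregated into daily
summaries. The sample measurements are made up for illustration.

# Hypothetical sketch of "rolling up" detailed records into daily summaries.
from collections import defaultdict

detail = [
    ("2024-05-01 09:15", 120),   # (timestamp, megabytes transferred)
    ("2024-05-01 17:40", 300),
    ("2024-05-02 08:05", 210),
]

daily = defaultdict(int)
for timestamp, megabytes in detail:
    day = timestamp.split(" ")[0]   # generalize the timestamp to its date
    daily[day] += megabytes         # aggregate the detail into the summary

print(dict(daily))   # {'2024-05-01': 420, '2024-05-02': 210}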
Data generalization can provide great help to Online Analytical Processing (OLAP)
technology. OLAP is used for providing quick answers to analytical queries, which are
by nature multidimensional. It is commonly used as part of the broader category of
business intelligence. Since OLAP is used mostly for business reporting, such as
reporting for sales, marketing, management, business process management and other
related areas, having a better view of trends and patterns greatly speeds up these
reports.
The evaluational database, on the other hand, supplies the data used to support OLAP.
By creating these two databases, the company can maximize the effectiveness of both
OLAP and OLTP. The two databases differ in the characteristics of the data they
contain and in how the data is used. For instance, in the "currentness" attribute of
data, operational data is current while evaluational data is historical.
. entity-attribute-class
. role-type-class
. prime-descriptor-class
. entity-adjective-class
. entity-attribute-class word
. entity-description-class
. entity keyword-descriptor-domain
Having a data naming convention is important because it is a collection of rules
which, when applied to data, result in a set of data elements described in a
standardized and logical fashion.
In the general area of computer programming, a data naming convention refers to the
set of rules followed in choosing the sequence of characters to be used as identifiers
in source code and documentation. Following a data naming convention, in contrast to
having the programmer choose any random name, makes the source code easier to read and
understand and enhances its appearance for easy tracing of bugs.
The data naming conventions described in this article focus on the database
implementation that powers a data warehouse.
The rules for developing a naming convention are described by the international
standard ISO 11179, Information Technology - Specification and Standardization of Data
Elements. These rules include standards for data classification, attribution,
definition and registration.
Data elements are the product of a development process which involves many levels of
abstraction, running from the most general to the most specific (conceptual to
physical). The objects within each level are called data element components, although
their name is often shortened simply to components. In terms of the Zachman Framework,
the highest levels of definition are contained within the business view, and
development progresses down to the implemented system level.
At each level, components are defined and combined. Each component contributes its
name, or part of its name, to the final output based on the naming conventions.
Three kinds of rules exist for a data naming convention. The semantic rules refer to
the description of data element components. The syntax rules refer to the prescribed
arrangement of components within a given name. The lexical rules refer to the language
related aspects of names.
The semantic rules state that:
1. The terms for object classes should be based on the names of object classes found
in entities and object model properties.
2. The terms used for properties should be based on property names found in attributes
and object model properties.
3. When the need arises, qualifiers may be added to describe data elements.
4. The representation of the data element's value domain may be described using the
representation term.
5. There can only be one representation term present.
The syntax rules state that:
1. Unless it is the subject of a qualifier term, the term for the object class should
occupy the leftmost position in the name.
2. A qualifier term should precede the component it qualifies.
3. The property term always follows the object class term.
4. The representation term is always placed in the rightmost position.
5. Redundant terms are deleted.
The lexical rules, as noted above, cover the language related aspects of names.
These are just general rules, and many software developers and database management
software vendors also set their own rules. But in general, these three kinds of rules
are used in a wide array of software implementations.
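As a hypothetical illustration of the syntax rules listed above, the Python sketch
below composes a data element name from an object class term, a property term and a
representation term, places any qualifier before the component it qualifies, keeps the
representation term rightmost, and drops redundant terms. The sample terms and the
underscore separator are assumptions for the example, not part of any standard.

# A hypothetical sketch of applying the syntax rules above: object class term
# first, then property term, representation term last, a qualifier before the
# component it qualifies, and redundant terms deleted.
def compose_name(object_class, prop, representation, qualifier=None):
    parts = []
    if qualifier:
        parts.append(qualifier)      # qualifier precedes what it qualifies
    parts += [object_class, prop, representation]
    deduped = []
    for term in parts:
        if term not in deduped:      # redundant terms are deleted
            deduped.append(term)
    return "_".join(deduped).lower()

print(compose_name("Employee", "Birth", "Date"))
# employee_birth_date
print(compose_name("Employee", "Name", "Name"))
# employee_name  (the redundant representation term is dropped)
print(compose_name("Employee", "Salary", "Amount", qualifier="Permanent"))
# permanent_employee_salary_amount  ("Permanent" qualifies the object class)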
Data Optimization is a process that prepares the logical schema from the data view
schema. It is the counterpart of data de-optimization. Data optimization is an
important aspect of database management in particular and of data warehouse management
in general. Data optimization helps applications fetch data from the data sources so
that the data can be used in data view tools and applications, such as those used in
statistical reporting.
In some database applications, the database management system itself is loaded with
features that make querying data views easy by directly executing the query and
immediately generating the views. Some database applications have their own flexible
language for mediating between peer schemas, extending from known integration
formalisms to more complex architectures.
Data optimization can be achieved through data mapping, an essential aspect of data
integration. This process includes data transformation or data mediation between a
data source and its destination; in this case, the data source could be the logical
schema and the destination the data view schema. Data mapping as a means of data
optimization can translate data between various kinds of data types and presentation
formats into a unified format used by different reporting tools.
Some software applications offer a graphical user interface (GUI) based tool for
designing and generating XML based queries and data views. Since data can come from a
variety of sources or from a heterogeneous data source, running queries with such a
tool can be an effective means of generating a data view. Using a graphical data view
can free a data consumer from having to focus on the intricate nature of query
languages, as the tool provides a pictorial, drag and drop mapping approach.
Being free from all the intricacies associated with query languages means that one can
focus more on the information design and the conceptual synthesis of information that
can come from many different, disparate sources. Since high level tools need to shield
end users from the back end intricacies, they need to manage the data from the back
end efficiently.
Having a graphical tool has its benefits, but its downside is that the graphics add
load on the computer memory. So graphical tools need much data optimization in order
to balance the load toll from the graphical components.
There are several modules available that are designed for data optimization. These
modules can easily be "plugged" into existing software, and the integration may be
seamless. Having these pluggable data optimization modules can definitely let database
related applications focus more on the development of graphical reporting tools for
non technical data consumers.
Data Normalization is a process to develop the conceptual schema from the external
schema. In its very essence, data normalization is the process of organizing data
inside the database in order to remove data redundancy. The presence of much redundant
data can have very undesirable results, which include significant slowing of the
entire computer processing system as well as negative effects on data integrity and
data quality.
The first normal form, denoted 1NF, is geared towards eliminating repeating groups in
individual tables. This can be done by creating a separate table for each set of
related data and attributes, and giving each such table a primary key. In this
normalization form, multiple fields should not be used in a single table to store
similar data.
For instance, in tracking an inventory item that may possibly come from two different
sources, an inventory record may contain separate fields for Vendor Code 1 and Vendor
Code 2. This is not a good practice, because when there is another vendor it is not
good to add a Vendor Code 3. Instead, a separate table should be created for all
vendors and linked to the inventory table using an item number key.
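A minimal sketch of that vendor example follows, using Python's built-in sqlite3
module. The table and column names are hypothetical; the point is that vendors live in
their own table linked back to the item by its item number, so adding a fourth or
fifth vendor never requires a new column.

# A minimal sketch of the vendor example above: vendors go into their own table
# linked back to the inventory item by its item number.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item_no INTEGER PRIMARY KEY, description TEXT)")
conn.execute("""
    CREATE TABLE item_vendors (
        item_no     INTEGER REFERENCES inventory(item_no),
        vendor_code TEXT,
        PRIMARY KEY (item_no, vendor_code)
    )""")
conn.execute("INSERT INTO inventory VALUES (100, 'Rice 20kg bag')")
# Any number of vendors can now supply item 100 without adding new columns.
conn.executemany("INSERT INTO item_vendors VALUES (?, ?)",
                 [(100, "V-001"), (100, "V-002"), (100, "V-003")])
print(conn.execute("""SELECT i.description, v.vendor_code
                      FROM inventory i JOIN item_vendors v USING (item_no)""").fetchall())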
The third normal form, indicated by 3NF, is geared towards the elimination of columns
which are not dependent on the key. If an attribute does not contribute to a
description of the key, it has to be moved to a separate table. A value in a record
which is not part of the key of the record should not belong to the table. Generally,
whenever the contents of a group of fields may apply to more than one record in a
table, it is good to place these fields in a separate table.
Other data normalization forms include Boyce-Codd Normal Form (BCNF), Fourth Normal
Form (4NF, isolating independent multiple relationships) and Fifth Normal Form (5NF,
isolating semantically related multiple relationships).
Data Quality indicates how well data in the data resource meets the business
information demand. Data Quality includes data integrity, data accuracy and data
completeness.
Data Quality is also the process wherein data is corrected, standardized and verified.
This process needs very meticulous inspection, because any mistake at this point could
send waves of errors along the way.
Data Integration is the process of matching, merging and linking data from a wide
variety of sources, which usually come from disparate platforms.
Data Augmentation is the process of enhancing data with information from internal and
external data sources.
Finally, Data Monitoring is making sure that data integrity is checked and controlled
over time.
In real life implementations, data quality is a concern for professionals involved
with a wide range of information systems. These professionals know the technical ins
and outs of a variety of business solution systems, ranging from data warehousing and
business intelligence to customer relationship management and supply chain management.
According to a study in 2002, in the United States alone the total cost of efforts
dealing with problems related to achieving and maintaining high data quality was
estimated at about US$600 billion every year. This figure shows that concern over data
quality is so serious that many companies have begun to set up data governance teams
solely dedicated to maintaining data quality.
Aside from the formation of dedicated data quality teams in many companies to address
problems related to data quality, several software developers and vendors have also
come up with tools. Many software vendors today market tools for the analysis and
repair of poor quality data. There are also many service providers specializing in
data cleaning on a contractual basis, and data consultancy firms also offer advice on
avoiding poor quality data.
But whatever approach is used in the implementation, data quality tasks should always
be integrated into the project design and planning. This is one of the most crucial
and important aspects of data warehouse implementation.
When data is poor to begin with, a chain of errors can follow down the line, and this
can cause tremendous disaster not just for the data warehouse but for the whole
business enterprise in general. Careful planning and design at the beginning of each
project can guarantee that the right data quality activities are integrated into the
entire project.
There are various types of data quality activities. Some of the most common ones are
the following:
Data Planning - In this activity, data quality is considered in setting the project
scope, schedule and deliverables. Also included are data quality deliverables in the
project charter. It is in this activity that planning for data quality control
throughout the project is specified, covering data creation, data transmission from
one database or application to another, and data setup, configuration and reference.
Data Designing - This activity involves data quality profiling tools in creating or
validating data information models. At this point, the technical people make sure that
those analyzing the data have access to the data specifications and business rules
which define the quality.
Data Deployment - After an intensive data quality assessment, when the data extracts
are confirmed to be accurate and consistent, the data is finally loaded into the
warehouse and other data stores.
In general, a data quality process involves the following generic stages: accessing of
data, interpreting of data, standardization of data, validation, matching and
identifying, consolidation, data enhancement, and data delivery or deployment.
Of course there are other stages in the data quality process, and the inclusion or
revision of some stages depends on the software application developer. But whatever
kind of software application is used to power a data warehouse, the basic thing to
remember is that good information is highly dependent on the quality of the data
input.
Data Refining is a process that refines disparate data within a common context to
increase the awareness and understanding of the data, remove data variability and
redundancy, and develop an integrated data resource. Disparate data are the raw
material and an integrated data resource is the final product.
The data refining process may be composed of many different subsets, depending on the
database or data warehousing implementation. The process of data refining is one of
the most important aspects of data warehousing, because unrefined data can wreak havoc
on the final statistical output from the data warehouse, which will then be used by
the company's business intelligence.
Data refining does not apply to one particular aspect of the data warehouse
implementation. In fact it applies to many stages, from the planning, to the data
modeling, to the final integration of systems in the data warehouse, to the
functioning of the entire business intelligence system.
Beginning with data modeling, data refining occurs when, during conceptual schema
development, the semantics of the organization are described. All abstract entity
classes and relationships are identified, and care is taken to ensure that the
entities are based on real life events and activities of the company. In this case,
data refining goes into action by eliminating items that are not of interest. The same
holds true during logical schema development, where the tables and columns, XML tags
and object oriented classes are described and data refining makes sure that the
structures that will hold the data are well defined.
In data mining, there is a process called data massaging. This process is used in
extracting value from the numbers, statistics and information found within a database
and in predicting what a customer will do next. Data mining works in several stages,
starting with the collection of data, then data refining, and finally taking action.
Data collection may be the gathering of information from website databases and logs.
Data refining involves comparing user profiles with recorded behavior and dividing the
users into groups so that behaviors can be predicted. The final stage is the
appropriate action taken by the data mining process or the data source, such as
answering a question on the fly or sending targeted online advertisements to a browser
or any other software application being used.
Given such circumstances, it is commonly necessary to refresh the data being used by a
data-using entity. These data-using entities include things such as a data display, a
database, or a dynamically generated web page, refreshed at an interval that is
appropriate for the data. Setting the interval for data refreshing should be done
carefully: a very short interval will typically result in an inefficient allocation of
network bandwidth and processor resources. On the other hand, setting the data
refreshing interval too long might result in stale data.
Some applications have data which are based on a central data warehouse. For example,
many web applications today, such as weather news and financial and stock market
sites, depend heavily on periodic refreshes of their web documents so that the end
user can be offered the latest complete information.
Sometimes the interval between data refreshes is very small, and this results in a heavy burden on the server as well as on network resources, especially if the data includes multimedia files. At other times, the updated web documents could encounter
encounter
inordinate delays making it difficult to retrieve web documents in time. In some
implementations of data refreshing for web applications and those involving
browsers and
websites, there are small scripts embedded within an internet browser. These
scripts would
then allow a user to find out if the refreshing of a multimedia web document will
be received
in time.
In one such implementation, the system is installed with a criteria monitoring tool for monitoring at least one criterion related to the refresh interval, and a special processor generates an updated data refresh interval based at least in part on the monitored criteria. Data refreshing with the system is then based on that data refresh interval.
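As a rough illustration of this idea, the sketch below is a hypothetical Python example (not any vendor's actual implementation). It adjusts a refresh interval based on a single monitored criterion: how often a refresh actually brought changed data.

# Hypothetical sketch: adapt a data refresh interval to how often the data changes.
# If most refreshes bring no new data, the interval is lengthened to save bandwidth;
# if the data changes on nearly every refresh, the interval is shortened to avoid staleness.

def updated_interval(current_interval, change_ratio,
                     min_interval=30, max_interval=3600):
    """change_ratio is the fraction of recent refreshes that returned changed data."""
    if change_ratio > 0.8:              # data almost always changed: refresh sooner
        new_interval = current_interval / 2
    elif change_ratio < 0.2:            # data rarely changed: refresh less often
        new_interval = current_interval * 2
    else:                               # change rate acceptable: keep the interval
        new_interval = current_interval
    return max(min_interval, min(max_interval, new_interval))

# Example: refreshed every 300 seconds, but only 10% of refreshes saw new data.
print(updated_interval(300, 0.10))      # 600 seconds: back off to reduce load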
Since the load is distributed but the whole data warehouse system needs to be synchronized in order to get information which reflects the real picture of the company, the servers need to constantly communicate with each other over the network and give regular updates. Part of this communication deals with data refreshing, which should be done in order to synchronize data and make sure that the final output will always be correct, consistent and timely.
The error correction used is often ECC memory or another copy of the data. Employing data scrubbing greatly reduces the possibility that single correctable errors will accumulate.
To illustrate the need for data scrubbing, consider this example. If a person is asked, "Are Joseph Smith of 32 Mark St, Buenavista, CA and Josef Smith of 32 Clarke St., Buenvenida, Canada the same person?", the person would probably answer that the two most probably are the same. But to a computer without the aid of specialized software, the two are totally different people.
Our human eyes and minds would spot that the two sets of records are really the same, and that a mistake or inconsistency took place during data entry. But since, in the end, it is the computer that handles all the data, there should be a way to make things right for the computer. Data scrubbing weeds out, fixes or discards incorrect, inconsistent, or incomplete data.
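A minimal sketch of this idea, using Python's standard difflib module; the records, threshold and decision rule are illustrative only, not a real product's matching algorithm.

# Hypothetical sketch: flag two customer records as probable duplicates by
# comparing their normalized text with a simple similarity ratio.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_1 = "Joseph Smith, 32 Mark St, Buenavista, CA"
record_2 = "Josef Smith, 32 Clarke St., Buenvenida, Canada"

score = similarity(record_1, record_2)
print(round(score, 2))                  # roughly 0.7 for these two strings
if score > 0.65:                        # illustrative threshold
    print("Probable duplicate - route to data scrubbing for review")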
Computers cannot reason. They operate along the concept of "garbage in, garbage out", so no matter how sophisticated a software application is, if the data input is not of high quality, the output data will not be of high quality either.
There are many sources of dirty data. Some of these most common sources include:
. Poor data entry, which includes misspellings, typos and transpositions, and
variations in
spelling or naming;
Today, there are hundreds of specialized software applications that are developed
for data
scrubbing. These tools have complex and sophisticated algorithms capable of
parsing,
standardizing, correcting, matching and consolidating data. Most of them have a
wide
variety of functions ranging from simple data cleansing to consolidation of high
volumes of
data in the database.
A lot of these specialized data cleansing and data scrubbing software tools also
have the
capacity to reference comprehensive data sets to be used to correct and enhance
data. For
instance, customer data for a CRM software solution could be used in referencing
and
matching to additional customer information like household income and other related
information.
Despite the fact that data hygiene is very important in getting useful results from any application, it should not be confused with data quality. Data quality is about good (valid) or bad (invalid) data; in other words, validity is the measure of the data's relevance to the analysis at hand. Data scrubbing is also often confused with data cleansing, though the two do have similarities to a certain degree.
Disparate Data
Disparate Data are heterogeneous data. They are neither similar to one another nor easily integrated with an organisation's database management system, and they differ in one or more aspects of an information system.
A data warehouse could be a prime example of a place where disparate data come together. The goal of a data warehouse is to facilitate bringing together data from a wide variety of existing databases, such as data marts and other data warehouses, so that the data warehouse can support management and reporting needs.
Now, the reality is that databases and other data sources are not implemented in the same way. Some databases may be managed by, say, Microsoft SQL Server while others may be managed by Oracle or MySQL. Some may run on different distributions of Linux, some on MacOS, others on Windows and many other platforms.
Still another cause of disparate data is the different requirements and different data available through the stages of the lifecycle: there could be less at the start and then more at the end. Different users within the company may also have different needs for data, like suppliers versus customers, operator versus planner, commercial versus government.
In a data warehouse, there is a process known as ETL which stands for extract,
transform
and load. The transform part is the part which takes care of managing disparate
data.
Once everything is in place and the data has been identified, data extraction takes place by taking the desired data from the data sources and placing them in a data depot for refining. The data depot refers to a working place or staging area where disparate data can be refined before getting loaded into the database.
The process of data extraction technically includes any conversion between database management systems. Data refining is the actual work of transforming disparate data before they are finally integrated into the data warehouse under a common data architecture. When disparate data are transformed into the data defined by the architecture, real integration begins.
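The flow described above can be pictured with a small, hypothetical Python sketch; the source formats, field names and warehouse table are invented for illustration, and a real transform step would also standardize things like date formats.

# Hypothetical ETL sketch: extract disparate records into a staging area ("data depot"),
# transform them into a common structure, then load them into a warehouse table.
import sqlite3

# Extract: two disparate sources with different field names.
source_a = [{"cust": "Ramesh", "qty": 20, "sold_on": "2024-04-05"}]
source_b = [{"customer_name": "Anita", "quantity": "15", "date": "2024-04-05"}]

def transform(record):
    """Map either source's fields onto the warehouse's common structure."""
    return {
        "customer": record.get("cust") or record.get("customer_name"),
        "quantity": int(record.get("qty") or record.get("quantity")),
        "sale_date": record.get("sold_on") or record.get("date"),
    }

staging = [transform(r) for r in source_a + source_b]   # the refined data depot

# Load: insert the refined rows into the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, quantity INTEGER, sale_date TEXT)")
conn.executemany("INSERT INTO sales VALUES (:customer, :quantity, :sale_date)", staging)
print(conn.execute("SELECT * FROM sales").fetchall())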
Disparate Databases
Both the data definition and data manipulation languages employed in a database are based on a data model, which defines the semantics of the constructs and operations supported by those languages.
From these different kinds of data sources come differences in structure, data semantics, constraints supported and query language. This can be because different databases are implemented on different data models, which provide different primitives: object oriented (OO) models support specialization and inheritance while relational models do not. For instance, a database may be using the set type in a CODASYL schema, and that set type may support insertion and retention constraints which are not captured by referential integrity alone.
Problems arising from the data of disparate databases can be remedied today with many kinds of software tools that manage data transformation. A popular process known as ETL (extract, transform, load) has become a standard in data warehouse applications; it manages disparate data from various sources and transforms them into a uniform format that the data warehouse can understand and work with.
Aside from enterprise data warehouses, the internet is the biggest example of disparate databases being handled to give relevant information. On dynamic websites, when one looks at a webpage with a browser, the viewer may not have any idea at all that the data he is looking at comes from hundreds of disparate databases.
But managing metadata is not as simple as creating it. Management becomes all the more complex when dealing with large data warehouses which get data from various data sources powered by disparate databases.
The problem of disparate metadata has been a big challenge in many areas of information technology. Technologies such as data warehousing, enterprise resource planning (ERP), supply chain management and other applications dealing with transactional systems all struggle with disparate data as well as duplication and redundancy. In the case of disparate metadata, the most common problems are missing metadata relationships, costly implementation and maintenance, and poor choice of technology platforms.
Each of these metadata management tools can support several sets of data warehouse functions, each of which uses or creates subsets of the data warehouse metadata. Consequently, the definition of metadata may be redundant across the data warehouse tool suite and this can result in disparate data. But as part of the cycle, the metadata must be integrated for it to be useful to the user of the data warehouse metadata.
The system is simply configured to collect the disparate metadata from the source tools; after integration, the metadata is disbursed back to any tools that use it. It is wise, though, to determine which tool is the most appropriate source for each metadata object, as various data warehouse tools may be capturing the same metadata at any given time.
Managing a disparate metadata cycle can really pose an incredible challenge, mainly because the only common attribute for identifying different versions of the same metadata object is its name.
If different tools have been used in recording the same metadata attribute, certain rules must be established regarding which tool maintains the master version in order to preserve consistency. So despite all these tools, reconciling disparate metadata remains a repetitive process until such integration can take place. And the same goes for disparate data: the cycle will always take place as long as the data warehouse is still in use.
Dealing with disparate operational data is a very common everyday process in a data warehouse. A single central database alone cannot handle an intensive data warehousing load, so many other physical servers need to share the load as well as store other data. For instance, in a large business organization, some data may come from the financial department, others from the human resource department, manufacturing department, sales departments (such as point of sale data from department stores) and many other data sources.
But even with the shared load, it would still be too heavy for the data warehouse system to simultaneously handle all the data. So it needs a "temporary" area for current data values to be worked on. This area is what is commonly referred to as the operational data store (ODS).
It is very common for these data sources to be of a disparate nature. The reality of a data warehousing system is that different data sources are powered by different database systems. For instance, some databases may be run by Oracle, others by MySQL or Microsoft SQL Server and many other commercial relational database vendors.
Another cause of disparate operational data does not lie in the relational database management system itself but in the overall design of the enterprise information system. For instance, a company that is implementing a database system may not have defined a single, complete, integrated inventory of all its data. Or maybe the real substance, meaning and content of all the data within the organizational data resource is not readily known or well defined. Or there may exist very high variability of data formats and contents in the company's information system.
Denormalized Data
A normalized database has one fact in one place, and all related facts about a single entity are placed together so that each column of an entity refers non-transitively only to the unique identifier of that entity. Normalized data should generally have passed through at least the first three normal forms.
However, there are cases where performance can be significantly enhanced in a denormalized physical database implementation. The process of denormalization involves putting one fact in many places in order to speed up data retrieval at the expense of the data modification process. Although denormalization is not always recommended, in real-world implementations it is sometimes very necessary. But before jumping into the process of denormalization, certain questions need to be answered.
Can the database perform well without having to depend on denormalized data?
Will the database still perform poorly even with denormalized data?
Will the system become less reliable due to the presence of denormalized data?
The answers will give you the decision whether to denormalize or not. If the answer to any of these questions is "yes," then you should avoid denormalization because any benefit that is accrued will not exceed the cost. If, after considering these issues, you decide to denormalize, be sure to adhere to the general guidelines that follow.
If you have enough time and resources, you may start actual testing by comparing a set of normalized tables with an equivalent set of denormalized ones. Load the denormalized set of tables by querying the data in the normalized set and then inserting or loading it into the denormalized set. Keep the denormalized set of tables read-only and measure the performance gained. The population process should be strictly controlled and scheduled in order to keep the denormalized and normalized tables synchronized.
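A small, hypothetical sketch of such an exercise using SQLite from Python; the tables and data are invented, and a real test would use production-sized volumes and measured timings.

# Hypothetical sketch: build a normalized pair of tables, then populate a
# denormalized, read-only copy from them and compare the retrieval queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ramesh', 'Pune'), (2, 'Anita', 'Delhi');
    INSERT INTO sale VALUES (1, 1, 200.0), (2, 1, 150.0), (3, 2, 300.0);

    -- Denormalized set, loaded by querying the normalized set (scheduled in practice).
    CREATE TABLE sale_denorm AS
        SELECT s.sale_id, c.name, c.city, s.amount
        FROM sale s JOIN customer c ON c.customer_id = s.customer_id;
""")

# Retrieval against the normalized tables needs a join ...
print(conn.execute("""SELECT c.city, SUM(s.amount) FROM sale s
                      JOIN customer c ON c.customer_id = s.customer_id
                      GROUP BY c.city""").fetchall())
# ... while the denormalized table answers the same question without one.
print(conn.execute("SELECT city, SUM(amount) FROM sale_denorm GROUP BY city").fetchall())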
Based on the exercise above, it is obvious that the main reason for denormalizing is when it is more important to have speedy retrieval than fast data updates. If the database you are implementing caters to an organization with more frequent data updates than retrievals (say, a very large data warehouse where data constantly moves from one data source to another), then normalization should be implemented without denormalized data. But when retrieval by data consumers is more frequent (say, a database that maintains less dynamic accounts but is accessed heavily because of a service exposed on the internet), then populating the database with denormalized data should be the way to go.
Other reasons for denormalization include situations where there are many repeating groups which need to be processed as a group instead of individually; where many calculations involving one or several columns must be performed before a query can be answered; where tables need to be accessed in many different ways during the same period; and where some columns are queried a large percentage of the time. Denormalized data, although many have the impression that they cause a slow system, are actually useful in certain conditions.
In the most general sense, data quality is an indication of how good the data is in terms of integrity and accuracy as it is stored in the data resource, in order to meet the demands of business information. Other indicators of good quality data pertain to completeness, timeliness and format (though format may matter less with today's advanced data cleansing and transformation tools).
But no matter how sophisticated the data warehousing system or data resources are, there can never be high quality data if the data input is not accurate in the first place. As it is aptly put in the information technology world: "Garbage in, garbage out." To avoid this, there must be a means of ensuring that the input data is clean and of high quality, and so there should be an existing set of data quality criteria.
Data documentation would define, for example, all of the data associated with a pharmaceutical organization's business rules, entities and the corresponding attributes. It should also very specifically define the conventions for name labeling of the products.
Having a set of existing data quality criteria is almost similar to keeping a data
dictionary.
But while a data dictionary is very broad as it tries to define all data and all of
its general
aspects, the existing data quality criteria is very specific. It may define how the
data will be
structured, and how it will be dealt with in terms of physical storage and network
sharing.
But like the data dictionary, the existing data quality criteria can partially overcome the problems associated with data disparity arising from the sharing of different data formats across different data source platforms. This essentially complements the process of extracting, transforming and loading, and its final effect is clean, uniform and high quality data output in the data resource.
Today, many processes are undertaken to re-engineer legacy systems or to federate distributed databases. Much work has been done in which a conceptual schema, often based on an extension of the Entity-Relationship (ER) model, is derived from a hierarchical database, a network database, or a relational database.
Reverse data denormalization is just one of the broad aspects of database reverse engineering, which has two major steps. The first step involves eliciting the data semantics from the existing system. In this step, various sources of information can be relevant for the task: the physical schema, the database extension, the application programs, and especially expert users.
The second step involves expressing the extracted semantics with a high level data model. This task consists of a schema translation activity and gives rise to several difficulties, since the concepts of the original model do not overlap with those of the target model.
Most of the methods used in database reverse engineering within the context of relational databases mainly focus on schema translation, since they assume that constraints such as functional dependencies or foreign keys are available at the beginning of the process. To be more realistic, those strong assumptions may not apply in all cases, as there are also old versions of database management systems which do not support such declarations.
Many recent reverse data denormalization works have independently proposed relaxing the aforementioned assumptions. For a Third Normal Form schema, the key idea would be to fetch the needed information from the data manipulation statements embedded in application programs, without constraining the relational schema to a consistent naming of key attributes.
The Third Normal Form requirement has remained to be one of the major limits for
the
current methods in database reverse engineering. As it has been shown, during a
database
design process, the relational schemas are sometimes either directly produced in
the First
Normal Form or in the Second Normal Form or denormalized at the end of the database
design process.
In some cases, the denormalization occurs during the implementation of the physical
database or during the maintenance phase when the attributes are added to the
database.
There are cases when it is really necessary to add redundant data because current
database
management systems that implement the relational model are doing the implementation
poorly.
Normalized Data
Normalized Data is the data in the data view schema and the external schema which
have
gone through data normalization.
Maintaining a data warehouse means dealing with millions of pieces of data, as the data warehouse itself is the main repository of a company's historical data, its corporate memory. Thus, data should be well managed, and one of the ways to effectively manage a data warehouse is by reducing data redundancy.
One of the techniques commonly employed for reducing data redundancy is database normalization. This technique is used mainly on relational database tables in order to minimize duplication of data; in doing so, the database is safeguarded against data anomalies.
Let us take the case wherein a certain piece of information has multiple instances
in a table.
This case would result in having the instances not being kept consistent when an
update is
done on the table thus leading to a loss of data integrity. When a table is
normalized, it
becomes less prone to data integrity problems.
When a database is normalized to a high degree, more tables are created to avoid data redundancy in any one table, but there is also a need for a larger number of joins, and this can result in reduced performance.
As mentioned earlier, normalized data are the data used for the data view schema. A data view schema is the logical or virtual table composed of the results of data queries on a database. A data view is unlike an ordinary table in a relational database in that it is not part of the physical schema; it is instead a dynamic, virtual table whose contents come from collated or computed data. A data view can be a subset of the data contained in a table and can join and simplify various tables into one virtual table. The data contents may be aggregated from different tables, resulting from computation operations such as averages, products and sums.
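As a small illustration, the hypothetical sketch below (SQLite driven from Python, with invented table and view names) shows a view behaving as a virtual table whose contents are computed from a query rather than stored physically.

# Hypothetical sketch: a data view as a virtual table aggregating data from a base table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (product TEXT, region TEXT, amount REAL);
    INSERT INTO sale VALUES ('Rice', 'North', 200.0), ('Rice', 'South', 120.0),
                            ('Wheat', 'North', 80.0);

    -- The view is not stored physically; its rows are computed when it is queried.
    CREATE VIEW sales_by_product AS
        SELECT product, SUM(amount) AS total_amount, AVG(amount) AS avg_amount
        FROM sale GROUP BY product;
""")
print(conn.execute("SELECT * FROM sales_by_product").fetchall())
# [('Rice', 320.0, 160.0), ('Wheat', 80.0, 80.0)]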
Also mentioned earlier is the fact that normalized data are also used for external
schema.
An external schema is designed for supporting user views of data and providing
programmers with easy access to data from a database.
The data that users see are in terms of an external view which is defined by an
external
schema of the database. The external schema basically consists of descriptions of
each of
the various types of external record in that external view as well as a definition
of the maps
and connection between the external schema and the underlying conceptual schema.
Because of the very nature of normalized data, wherein redundancies are greatly reduced or eliminated, the database as well as the entire information system greatly benefits in that the systems become a lot easier to manage and maintain. If there were much data redundancy scattered throughout the entire system, additional overhead cost in the form of additional hardware and software would be needed to make sure that data remain consistent.
Optimized Data
Although optimized data may come from different IT considerations, they are
primarily the
result of a general data optimization process which prepares the logical schema
from the
data view schema.
Data optimization, in the context of a data warehouse, is optimizing the database being used. Most data optimization in this respect is a non-specific technique used by several applications for fetching data from data sources so that the data can be used in data view tools and applications, such as those used in statistical reporting.
Since optimized data are data in the logical schema and conceptual schema, let us first look at what a logical schema is. A logical schema is a non-physically-dependent method of defining a data model of a specific domain in terms of a particular data management technology.
Optimized data adhere to the semantics described in these two schemas. They work according to the rules and specifications and do not violate them. These data can be said to have been mapped to the semantics of both the conceptual and logical schemas.
There are many vendors that manage data centers which work on optimized data. These vendors make sure that data across the entire organization are optimized while addressing issues of future scalability. They can also manage various data sources in a reliable, integral and consistent manner while delivering high data volumes with low latency to multiple applications within the enterprise.
While having a robust infrastructure that ensures data are optimized may entail a high cost, the benefits may be tremendous and long term. Developing and executing a good data optimization plan is therefore a worthwhile long-term investment.
Data Profiling
What is Data Pivot
Data Pivot is a process of rotating the view of data. In databases where there is a high volume of data, it is often very difficult to get a view of a particular piece of data or report. A pivot table helps overcome this problem by displaying the data contained in the database by means of automated calculations, which are defined in a separate column placed side by side with the data column of the requested data view.
There are several advantages to using pivot tables. One advantage is that a pivot
table
summarizes the data contained in a long list into a compact format. A pivot table
can also
help one find relationships within the data that are otherwise hard to see because
of the
amount of detail. Yet another advantage is that a pivot table organizes the data
into a
format that is very easy to convert into charts.
A pivot table also includes many functions, such as automatically sorting, counting, and totaling the data stored in a spreadsheet and creating a second table displaying the summarized data. A user can change the summary structure by graphically dragging and dropping fields; this pivoting or rotating gave the concept its name.
Typically, a pivot table contains rows, columns and data or fact fields. Most applications offering data pivot features can easily invoke the feature with a pivot table wizard in a few steps. Commonly, the first step involves specifying where the data is located and whether a chart as well as a table should be displayed.
The second step is simply about identifying the list range. It is common to have an
insertion
point so that the wizard can just define the list range in an automated way. And
then the
final step is just specifying the graphical layout of the final data pivot view.
There are generally two kinds of variables used in a pivot table layout: discrete and continuous variables. Commonly, a discrete variable has a relatively small number of unique values, and these unique values have different names.
For example, discrete variables include such values as department names, model names, or customer names in a company database. Discrete values are more suitably used as row and column variables in a data pivot table, but they can of course also be used as data fields. However, if discrete values are used as a data field, the data pivot table can only display a summary by count.
On the other hand, a continuous variable can take on a large range of values. Some
examples of continuous variables include units sold, profit margin and daily
precipitation. In
general, it may not be a good idea to use a continuous variable as a row or column
variable
in a data pivot table because the result would be an impossibly large table.
For instance, to analyze income for, say, 500 firms, using the firm income as a column variable in the data pivot table could make the software application run out of display space. A continuous variable is commonly used as the data field in a data pivot table in cases where one wants to see the sum, average or other summary calculation of its values.
For example, dates used as row or column headers can be grouped. The data pivot table is an indispensable tool in data warehouses, as disparate data of high volume need to be formatted in order to get industry and business operations related reports that reflect trends and patterns.
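The following hypothetical Python sketch, using the pandas library with invented column names, shows discrete variables used as row and column headers and a continuous variable used as the summarized data field.

# Hypothetical sketch: pivot a flat list of sales records so that discrete variables
# (department, month) become row/column headers and a continuous variable (units sold)
# becomes the summarized data field.
import pandas as pd

sales = pd.DataFrame({
    "department": ["Grocery", "Grocery", "Hardware", "Hardware", "Grocery"],
    "month":      ["Jan",     "Feb",     "Jan",      "Feb",      "Jan"],
    "units_sold": [120,       95,        40,         55,         60],
})

pivot = pd.pivot_table(sales, values="units_sold", index="department",
                       columns="month", aggfunc="sum", fill_value=0)
print(pivot)
# Roughly:
# month       Feb  Jan
# department
# Grocery      95  180
# Hardware     55   40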
But a data warehouse, as can be expected of a system that handles very large volumes of data, is often implemented with many other different databases hosted in various computer systems, which are called data sources, data stores or data marts.
All these systems give disparate data to the warehouse to be processed according to the business data architecture and business rules. As such, the data warehouse handles very intensive loads, so it needs to have a mechanism whereby it can serve its very purpose, which is to give relevant information from among the millions and millions of data items inside it.
Aside from data archives and all the systems of records and integration and
transformation
programs, the data warehouse also contains current details and summarized data.
The heart of the data warehouse is its current detail. This is the place where the biggest bulk of the data resides, and the current detail is supplied directly from the operational systems; it may be contained either as raw data or aggregated raw data.
The current details are often categorized into different subject areas which
correspond to
representations of the entire enterprise rather than a given application. The
current detail
has the lowest level in terms of data granularity from among the other data in the
warehouse.
The period represented by the current detail depends on the company data architecture, but it is common to set the current detail to cover about two to five years. The refreshing of the current detail occurs as frequently as necessary to support the requirements of the enterprise.
One of the most distinct representations of current details in particular and data
warehouse
in general is the aspect of lightly summarized data. All enterprise elements such
as region,
functions and departments do not need the same requirements for information so an
effectively designed and implemented data warehouse can supply customized lightly
summarized data for every enterprise element. Access to both detailed and
summarized
data can be had by the enterprise elements.
A data warehouse is designed and implemented such that data is stored and generated at many levels of granularity. To illustrate the different levels, let us imagine a cellular phone company that wants to implement a data warehouse in order to analyze user behavior.
At the finest granularity level are the customer records kept for every call detail record during a 30-day period. At the next level, the lightly summarized data history, statistical information by month for that customer, such as calls by hour of day, day of week, area codes of numbers called, average duration of calls and other related information, is stored.

Finally, at the highly summarized level, the next level of granularity, the records may include the number of calls made from a zip code by all customers, roaming call activity, customer churn rate and other statistics that can be used for further analysis.
Data warehouses are typically implemented with different databases handling the different levels of data granularity, such as raw data, lightly summarized data and highly summarized data, in a large information system with a federated database. Lightly summarized data have finer granularity than highly summarized data. To achieve maximum efficiency, a stable network infrastructure should be in place.
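A hypothetical sketch of these granularity levels for the cellular phone example, using pandas; the call records and column names are invented for illustration.

# Hypothetical sketch: derive lightly and highly summarized data from raw call detail records.
import pandas as pd

calls = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "zip_code": ["1001", "1001", "2002", "2002", "2002"],
    "hour":     [9, 20, 9, 13, 21],
    "duration": [3.0, 12.5, 1.0, 7.5, 2.0],   # finest granularity: one row per call
})

# Lightly summarized: statistics per customer, e.g. call counts and average duration by hour.
lightly = calls.groupby(["customer", "hour"]).agg(
    calls=("duration", "count"), avg_duration=("duration", "mean"))

# Highly summarized: e.g. total calls made from each zip code by all customers.
highly = calls.groupby("zip_code").agg(total_calls=("duration", "count"))

print(lightly)
print(highly)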
Data Migration
Data Propagation is the distribution of data from one or more source data warehouses to one or more local access databases, according to propagation rules. Data warehouses need to manage big bulks of data every day. A data warehouse may start with a small amount of data and grow day by day through constant sharing and receiving of data from various data sources.
As data sharing continues, data warehouse management becomes a big issue. Database administrators need to manage the corporate data more efficiently and in different subsets, groupings and time frames. As a company grows further, it may implement more and more data sources, especially if the company's expansion goes outside its current geographical location.
Data warehouses, data marts and operational data stores are becoming indispensable tools in today's businesses. These data resources need to be constantly updated, and the process of updating involves moving large volumes of data from one system to another and back and forth to a business intelligence system. It is common for high-volume data movement to be performed in batches within a brief period without sacrificing the performance or availability of operational applications or of data from the warehouse.
The higher the volume of data to be moved, the more challenging and complex the
process
becomes. As such, it becomes the responsibility of the data warehouse administrator
to find
means of moving bulk data more quickly and identifying and moving only the data
which
has changed since the last data warehouse update.
From these challenges, several new data propagation methods have been developed in
business enterprises resulting in data warehouses and operational data stores
evolving into
mission-critical, real-time decision support systems. Below are some of the most
common
technological methods developed to address the problems related to data sharing
through
data propagation.
Bulk Extract - In this method of data propagation, copy management tools or unload utilities are used in order to extract all or a subset of the operational relational database. Typically, the extracted data is then transported to the target database using file transfer protocol (FTP) or other similar methods. The extracted data may be transformed to the format used by the target on the host or target server.
The database management system load products are then used in order to refresh the
database target. This process is most efficient for use with small source files or
files that
have a high percentage of changes because this approach does not distinguish
changed
versus unchanged records. Conversely, it is least efficient for large files where only a few records have changed.
File Compare - This method is a variation of the bulk extract approach. It compares the newly extracted operational data to the previous version, after which a set of incremental change records is created. The processing of incremental changes is similar to the techniques used in bulk extract, except that the incremental changes are applied as updates to the target server within the scheduled process. This approach is recommended for smaller files where there are only a few record changes.
Change Data Propagation - This method captures and records changes to the file as part of the application change process. Many techniques can be used to implement Change Data Propagation, such as triggers, log exits, log post-processing or DBMS extensions. A file of incremental changes is created to contain the captured changes. After completion of the source transaction, the change records may immediately be transformed and moved to the target database. This type of data propagation is sometimes called near real time or continuous propagation, and it is used to keep the target database synchronized with the source system within a very brief period.
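A very small sketch of the idea behind the file-compare and change-propagation approaches, in hypothetical Python keyed on an invented record identifier; real tools work against database logs or extracts rather than in-memory dictionaries.

# Hypothetical sketch: compare a new extract with the previous one and produce
# only the incremental changes to apply to the target database.
previous = {"C1": {"name": "Joseph Smith", "city": "Buenavista"},
            "C2": {"name": "Anita Rao",    "city": "Pune"}}
current  = {"C1": {"name": "Joseph Smith", "city": "San Diego"},   # changed
            "C3": {"name": "Ravi Kumar",   "city": "Delhi"}}       # inserted; C2 deleted

def incremental_changes(old, new):
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key in old:
        if key not in new:
            changes.append(("delete", key, old[key]))
    return changes

for op, key, row in incremental_changes(previous, current):
    print(op, key, row)   # apply each change to the target within the scheduled process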
Data Mart
Data Mart is a subset of the data resource, usually oriented to a specific purpose
or major
data subject, that may be distributed to support business needs. The concept of a
data mart
can apply to any data whether they are operational data, evaluational data, spatial
data, or
metadata.
The data mart, like the data warehouse, can also provide a picture of a business organization's data and help the organizational staff formulate strategies based on the aggregated data, statistical analysis of industry trends and patterns, and past business experiences.
The most notable difference of a data mart from a data warehouse is that the data
mart is
created based on a very specific and predefined purpose and need for a grouping of
certain
data. A data mart is configured such that it makes access to relevant information
in a
specific area very easy and fast.
Within a single business organization, there can be more than one data mart. Each of these data marts is relevant or connected in some way to one or more business units that its design was intended for. The relationship among many data marts within a single company may or may not involve interdependency.
They may be related to other data marts if they were designed using conformed facts or dimensions. If one department has a data mart implementation, that department is considered the owner of the data mart, and it owns all aspects of the data mart including the software, hardware and the data itself. This can help manage data in a huge company through modularization: a department manipulates and develops only its own data as it sees fit, without having to alter data in other departments' data marts. When other departments need data from the data mart owned by a certain department, proper permission should be obtained first.
In other data mart implementations, where conformed dimensions are strictly used, some shared dimensions such as customers and products exist, and business ownership no longer applies.
Data marts can be designed with a star schema, snowflake schema or starflake schema. The star schema is the simplest of all the styles related to data marts and data warehousing; it consists of only a few fact tables, possibly only one, referencing any number of dimension tables. The snowflake schema is a variation of the star schema whose storage method is of a multidimensional nature. The starflake schema is a hybrid mixture of both the star and snowflake schemas.
Data marts are especially useful for making access to specific, frequently accessed data very easy. A data mart can give a collective picture of a certain aspect of the business to a specific group of users. Since data marts are smaller than a full data warehouse, response times can be shorter and the cost of implementation can also be lower.
Mini Marts
The data from the data warehouse are used by the business intelligence system as one of the main bases for company decisions. One of the critical factors that leads to the use of a data warehouse is that the company's data analysts can perform several queries and varying degrees of analysis, such as data mining, on the data and other related information without negatively affecting or slowing down the operational system.
Given the tremendous requirements for a data warehouse, some IT professionals
recommend breaking down the huge data warehouse into data mini marts.
A data mini-mart is actually a small (mini) and specialized version of a data warehouse and, as such, can contain a snapshot of operational data which is very useful to business people, who can strategize based on analysis of past trends, patterns and experiences.
The main difference between a full blown and big data warehouse and a mini mart is
that
the creation of the mini mart is predicated on a predefined and specific need for a
certain
configuration and grouping of select data and such configuration tries to give
emphasis on
easy access for relevant data and information.
In a business organization implementing a data warehouse, there may be many data
mini
marts and each one is relevant to one or more business units that the mini mart has
been
designed to serve.
In a business organization's data warehouse, the mini marts may or may not be
related or
dependent on each other. In cases where the mini marts are designed with the use of
conformed facts and dimensions, then they will be related to each other.
With mini marts, each department can use, manipulate and develop as well as
maintain
their data in any way that they see fit and without having to alter the information
inside the
other department's mini marts or the data warehouse.
Other benefits associated with implementing mini marts include very easy and fast access to data which are frequently needed. If there were no mini mart, a data consumer
would have to go through the vast repository inside a central data warehouse and
with high
volumes of data involved, querying may take a very long time and may even slow the
entire
system.
Having mini marts can also create a collective view by a group of users and this
view can be
a good way to know how the company is performing on business unit level as well as
in its
entirety. Because a mini mart is obviously smaller than an entire data warehouse,
response
time would be greatly improved while creation and manipulation of new data would be
very
easy and fast as well.
Data Management
Data Redistribution is the process of moving data replicates from one data site to
another to
meet business needs. It is a process that constantly balances data needs, data
volumes,
data usage, and the physical operating environment.
It is not uncommon to have a data warehouse serving a company but the data
warehouse
also constantly interacts with other data sources. In many cases when a company is
so
large that it not only has several departments spread out in several floors of an
office
building but also has several branches spread out across different locations as
well, it is a
good idea to break up the data within the warehouse.
These data can be data replications which are moved from a data site representing the branch or department into another data site. The advantage of this setup, which is the very essence of data redistribution, is that specific data can be located near the departments that use it, so data travel is greatly reduced, as is the need for additional networking resources.
Also, since the processing is spread across many servers, the load is balanced and the system can ensure that no central server takes so heavy a toll that it breaks down and halts the whole business operation, which relies on data and information.
Redistribution via derived data - This type of data redistribution is focused on the definition of the fine line separating core business data from data whose value is derived through mathematical formulas.
Data warehouses which implement data redistributions should make sure that the
existing
network infrastructure can handle the very large volume of data that travels across
the
network on a regular basis. A data redistribution software manager should also be
installed
to monitor the constant sharing of data replicates and make sure that data
integrity is
maintained all the time.
This software regularly communicates with each data site, following the activities of the data and making sure that everything works smoothly as different servers process data replicates before they are aggregated into a reporting function for final use by the business organization.
The IT infrastructure has many aspects. There is the network part, where IT architects need to consider networking peripherals including routers, switches, cabling, wireless connections, and other products. Then there is the server aspect, where certain powerful computers are assigned separate tasks such as database server, web server, FTP server and others.
The data resource is a separate component of this infrastructure. Like the network
and
server components, and many other components not mentioned, it is important to
carefully
plan the data resource of the IT infrastructure.
The data resource encompasses every representation of each and every piece of data available to an organization. This means that even non-automated data such as bulks of paper files on individual staff desks, confidential paper data hidden in steel cabinets, sales receipts, invoices and all other transactional paper documents constitute the Data Resource. It cannot be denied that despite the digitalization of all business processes, papers still play a large part in business operations.
A digital Data Resource provides a faster and more efficient means of managing data for the company. In today's world, Data Resource implementation does not end with the digitalization aspect.
Today, companies are finding out that the most efficient way to support the Data Resource is constantly changing and evolving with technology. In fact, the changes have been very significant over the last 20 years.
In the field of the digital Data Resource, the not so distant past saw a shift from large centralized facilities operating mainframes to a distributed collection of client-server systems and back to recentralized arrays of commodity hardware.
Today, the physical nature of Data Resource may have changed dramatically but the
same
need for scalable and reliable infrastructure has not.
It is common today to have Data Resource scattered and connected via network. This
makes so much sense today as the need for information becomes more and more
pervasive
and data consumers are now doing mobile computing: the web, mobile applications,
national or global branch offices, worldwide partner sites, or subsidiaries.
Data Resources may be scattered everywhere in the company. There may be data
resource
from finance, sales, marketing, HR, manufacturing and other company departments.
Some
data resource may come from several other departments from other geographical
branches.
In short, Data Resources may come from everywhere; they may converge on one data warehouse or send criss-crossing data from one department to another. In whatever form they come, they need to be managed carefully.
Data Synchronization
There are many data synchronization technologies available to synchronize a single set of data between two or more electronic devices such as computers, cellular phones, and other personal digital assistants. These technologies can automatically copy changes back and forth. For instance, a contact list on a user's mobile phone can be synchronized with a similar contact list on another mobile phone or on a computer.
In the past, data management used to be a scenario where data is either consistent
or
highly available but could never be both at the same time. This was what was
referred to as
the Heisenbergian dilemma. But with today's fast advancement in information
technology
especially in the field of real time processing, data synchronization is much more
efficient
than ever before.
During the time when Usenet was very popular on the internet, it was more sensible to make replications of contents across a federation of news servers. The authors of RFC 977, which specifies the standard for the stream-based transmission of news, wrote: "There are few (if any) shared or networked file systems that can offer the generality of service that stream connections using Internet TCP provide."
But today, there are already thousands of shared file systems that offer generality of service. Generality of service is what web servers provide as they serve today's dynamic webpages, including forums, blogs and wikis. These developments have led to better data synchronization techniques, cast from the model of internet infrastructures and implemented in organizational data warehouses.
Since data synchronization requires frequent communication between the data warehouse and the other data sources, problems related to network traffic management could spring up. This has to be managed carefully by a set of standard network protocols as well as a sound network infrastructure.
Data synchronization is not just about overcoming network problems, however. It works closely with the whole IT and business data architecture. One aspect of data synchronization is isolating multiple data views from the underlying model, so that updates to the data model not only alter the data view but also propagate to the synchronized instances of that data model.
There are actually hundreds of techniques for data synchronization, and different software solution vendors have different implementations of these techniques. There can never be an all-in-one solution, as needs differ from one data warehouse to another.
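As one illustration of a common technique, the hypothetical sketch below synchronizes two contact lists with a simple "latest timestamp wins" rule; real products use far more elaborate change tracking and conflict handling.

# Hypothetical sketch: two-way synchronization of contact lists using per-record
# modification timestamps; the most recently modified version of a contact wins.
def synchronize(replica_a, replica_b):
    merged = {}
    for name in set(replica_a) | set(replica_b):
        a, b = replica_a.get(name), replica_b.get(name)
        if a is None:
            merged[name] = b
        elif b is None:
            merged[name] = a
        else:
            merged[name] = a if a["modified"] >= b["modified"] else b
    # Both devices end up with the same, most recent data.
    replica_a.update(merged)
    replica_b.update(merged)

phone =  {"Ramesh": {"number": "555-0101", "modified": 10}}
laptop = {"Ramesh": {"number": "555-0199", "modified": 12},
          "Anita":  {"number": "555-0202", "modified": 5}}
synchronize(phone, laptop)
print(phone == laptop, phone["Ramesh"]["number"])   # True 555-0199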
Data Transformation
Aside from the fact that business organizations store data in relational databases, many others store some of their data in non-relational formats such as mainframes, spreadsheets, and email systems. Still other organizations have individual smaller databases, such as Microsoft Access, for each staff member.
When all these scenarios are taken together in a single company (which is still highly possible today), the organization will need to find an efficient way to operate as a single entity where all disparate data and systems can relate and interchange data among the various data stores.
Data transformation is one part of the collective process known as ETL (extract, transform, load), which is one of the most important processes in a data warehouse implementation: it is the way that data actually gets loaded into the warehouse from different data sources.
There are many tools to help data warehouses with data transformations by setting
objects
or certain utilities to automate the processes of extract, transform and load
operations to or
from a database. With these tools, data can be transformed and loaded from
heterogeneous
sources into any supported database. These tools also allow the automation of data
import
or transformation on a scheduled basis with such features as file transfer protocol
(FTP).
One notable and widely used tool for data transformation is Data Transformation Services (DTS), which contains DTS object packages and many components. This tool is packaged with Microsoft SQL Server but is also commonly used with other independent databases. When used with Microsoft products, DTS allows data to be transformed and loaded from heterogeneous sources, using OLE DB, ODBC, or text-only files, into any supported database.
The DTS packages are created with the use of DTS tools such as DTS wizards, DTS
Designer, and DTS Programming Interfaces.
The DTS Wizards, like other program wizards, automate things by offering simple clicks to accomplish complex tasks so that even non-programmers can do such complex jobs. They mostly deal with common and simple DTS tasks, including the Import/Export Wizard and the Copy Database Wizard.
DTS offers a graphical tool for building very sophisticated and complex DTS packages. The DTS Designer offers an easy way to build DTS packages with workflows and event-driven logic. It can be used for customizing and editing packages which have been created using the DTS wizards.
Other functionalities include the DTS Package execution utilities, DTS Query Designer and DTS Run Utility.
Other than unifying and transforming data into the desired format for loading into the data warehouse, data transformation is also responsible for correcting errors, for example by using a background task to periodically inspect the stored data for errors. It also minimizes or totally eliminates data redundancy.
The area of data warehouse management is very complex, as data captured from operational data sources, such as transactional business software solutions like Supply Chain Management (SCM), Point of Sale, customer service software and Enterprise Resource Planning (ERP) and management software, has to undergo the ETL (extract, transform, load) process.
To move data around the data warehouse efficiently, ETL tools should be employed. Companies may either buy third party tools or develop their own ETL tools by assigning their in-house programmers to do the job. In general, the rule of thumb is that the more complex the data transformation requirements are, the more advantageous it is to purchase third party ETL tools.
When deciding to buy a commercial data warehouse management tool, it is always good
to
consider the following aspects:
Functional capability - This refers to the way the tool handles both the "transformation" piece and the "cleansing" piece. When the tool has strong capability for both, then by all means buy it, because in general a typical data warehouse management tool has strong capability in only one of the two.
Ability to read directly from your data source - As mentioned earlier, a data warehouse gets its data from various data sources, and the ability to read directly from your data source makes processing faster and more efficient.
The IBM WebSphere DataStage is an ETL tool and part of the IBM WebSphere
Information
Integration suite and the IBM Information Server. Formerly known as Ardent
DataStage and
Ascential DataStage, this tool is very popular for its ease of use and visual
interface. It is
available in many versions including the Server Edition and the Enterprise Edition.
. Informatica PowerCenter,
. Informatica PowerExchange,
. Informatica PowerChannel (for secure and encrypted data transfer over WAN),
. DMExpress,
. ETL Integrator,
. Informatica,
. Pentaho,
. Scriptella,
. Sunopsis, among others.
Any properly designed Decision Support System features an interactive software-based system geared towards helping decision makers compile useful information from raw data, documents, personal knowledge, and business models to identify and solve problems and make decisions.
The database management system is the component which stores all kinds of data and information. These data may come from various sources, but mainly from an organization's data repositories such as databases, data marts or the data warehouse. Other sources of data may include the internet in general as well as individual users' personal insights and experiences.
The model-base management system refers to the component that takes care of all
facts,
situations and events which have been modeled in different techniques such as
optimization
model and goal seeking model.
The dialog generation and management system refers to interface management, such as the graphical windows and buttons that end users use to interact with the whole system.
There are many uses for a Decision Support System, and one of the most common is in business enterprises across all kinds of industries. The information derived from the system for decision support may include current information assets (such as legacy and relational data sources, cubes, data warehouses and data marts), sales figures for any given period, new product sales assumptions used as the basis for revenue projections, and any experience related to a given context which can serve as the basis for organizational decisions.
In another aspect which is not directly related to business enterprise, a Decision
Support
System may be used for hospitals as in the case of a clinical decision support
system used
for fast and accurate medical diagnosis.
The system could also be used by government agencies so they can have support for decisions such as how to effectively deliver basic services to those who need them most, or how best to run the economy based on a certain trend in any given period.
It could also be used in the agricultural sector so that wise decisions can be made
on what
crops best suit a particular area or which fertilizers are best for certain types
of crops.
In the education sector, the system can be used to aid in deciding which subject
areas
many students fail and how the school can revise the curriculum to make it more
relevant
and effective.
But there are certain cases, however, where having redundant data as an effect of database denormalization will give the database better performance. For instance, when implementing a database that serves mostly data retrieval, as in the case of a website which updates information infrequently but serves many views of data to internet users, it is wise to have denormalized tables.
As a general rule, when updates should be optimized at the expense of retrieval, then by all means the database should not be denormalized. But when retrieval should be optimized at the expense of updates, then by all means denormalization should be employed.
Turning to the data stores, the historical data store is a place for holding cleansed and integrated historical data. Since the data acts like an archive and retrieval therefore happens less frequently, the data store should be fully normalized, meaning it has to be normalized up to the third normal form.
The operational data store is designed for holding current data which is used for
the day to
day operation. This means that accesses and insertions happen every minute due to
the
progressive activities of business transactions.
Master and transactional data are mixed together, and the operational data store needs to efficiently facilitate the generation of operational reports. Hence, the operational data store should be implemented as a denormalized data store.
The analytical data store is designed to store both current and historical
information about
the business organization and is implemented with a dimensional structure for the
facilitation of the generation of analytical reports. In this case, denormalization
should be
employed as the current information is also fast getting updated.
The decision whether or not to denormalize a data store should not be taken lightly, as it involves administrative dedication. This dedication should be manifested in the form of documenting business processes and rules in order to ensure that data are valid, data migration is scheduled and data consumers are kept updated about the state of the data stores.
If denormalized data exists for a certain application, the whole system should be
reviewed
periodically and progressively.
As a general practice, periodically test whether the extra cost of processing with a normalized database justifies the positive effect of denormalization. This benefit should be measured in terms of the Input/Output and CPU processing time saved and the reduced complexity of the update programming.
Data stores are an integral part of the data warehouse, and when they are not optimized, the entire system's efficiency can be significantly reduced.
Massive Parallel Processing (MPP)
The idea behind MPP is really just that of general parallel computing, wherein some combination of multiple instances of programmed instructions and data is executed simultaneously on multiple processors so that results can be obtained much more efficiently and quickly.
The idea is further based on the fact that dividing a bigger problem into smaller tasks makes it easier to carry them out simultaneously with some coordination. The technique of parallel computing was first put to practical use by the ILLIAC IV in 1976, fully a decade after it was conceived.
In the not-so-distant past of information technology, before client-server computing was on the rise, distributed massive parallel processing was the holy grail of computer science. Under this architecture, various types of computers, regardless of the operating system being used, would be able to work on the same task by sharing the data involved over a network connection.
Although it quickly became possible to do MPP in many laboratory settings, such as the one at MIT, there was still a short supply of practical commercial applications for distributed massive parallel processing solutions. As a matter of fact, the only interest at that time came from academics who could hardly find enough grant money to afford time on a supercomputer. This scenario resulted in MPP becoming known as the poor man's approach to supercomputing.
Many software giants in the industry are envisioning loosely coupled servers connected over the internet, such as Microsoft's .NET strategy. Another giant, Hewlett-Packard, has leveraged its core e-speak integration engine by creating its e-services architecture.
There are various companies today that have actually started delivering products
which
incorporate massive parallel processing such as Plumtree Software, a developer of
corporate
portal software whose version 4.0 release added an MPP engine which can process
requests
for data from multiple servers that run the Plumtree portal software in parallel.
Executive Information System (EIS)
An EIS offers strong ad-hoc querying, analyzing, reporting and drill-down capabilities without the user having to worry about the complexities of the algorithms involved in the system. Senior executives can have easy access to both internal and external information relevant to meeting the strategic goals of the organization.
The system can highlight trends and patterns in the operations of the business enterprise and their relation to the business trends of the industry in which the business operates.
Today's Business Intelligence, with its high end sub areas including analytics,
reporting and
digital dashboards, could be considered by many to be the evolved form of the
Executive
Information System.
In the past, huge mainframe computers ran the executive information systems so that the enterprise data could be unified and utilized for analyzing sales performance or market research statistics for decision makers such as financial officers, marketing directors, and chief executive officers, who were then not very well versed with computers.
Many executive information systems are now run on personal computers and laptops that high-ranking company officials and CEOs can bring with them anywhere to meet the demands of a data-driven company and industry.
In fact, executive information systems are now well integrated into large data warehouses, with many computers networked together and constantly aggregating data for informative reporting.
An Executive Information System is a typical information system composed of hardware, software, and a telecommunications network. The hardware may be any computer capable of high-speed processing, but since this is a rather large system dealing with a high volume of data, the hardware should also provide very large random access memory capacity and high storage capacity.
The software component literally controls the flow and logic of the whole system.
The
software takes care of all the algorithms which translate business rules and data
models
into digital representations for the hardware to understand.
The software may have both a text base and a graphics base. It also contains the database, which manages all the data involved and closely collaborates with the algorithms for processing the data. These algorithms may specify how to do routine and special statistical, financial, and other quantitative analysis.
The telecommunications network takes care of the cables and other media used in data transmission. It also takes care of the traffic within the network and manages the system's communications with outside networks.
What is a surrogate key? Where do we use it? Explain with examples.
It is just a unique identifier or number for each row that can be used as the primary key of the table.
The only requirement for a surrogate primary key is that it is unique for each row
in the table.
It is useful because the natural primary key (i.e. Customer Number in Customer
table) can change and
this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the
primary keys
(according to the business users) but, not only can these change, indexing on a
numerical value is
probably better and you could consider creating a surrogate key called, say,
AIRPORT_ID. This would be
internal to the system and as far as the client is concerned you may display only
the AIRPORT_NAME.
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee Dimension). This employee has turnover allocated to him on Business Unit 'BU1.' But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2.' All new turnover has to belong to the new Business Unit 'BU2,' but the old turnover should still belong to Business Unit 'BU1.'
If you used the natural business key 'E1' for your employee within your
datawarehouse everything would
be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record for the
Employee 'E1' in
your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the
SID of the Employee
'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee
'E1' + 'BU2.'
You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the Employee was Employee Code 'E1,' but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the natural key enlargement process is that you might not have all parts of your new key within your fact table, so you might not be able to do the join on the newly enlarged key -> so you need another id.
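A minimal SQL sketch of this idea (table and column names are illustrative only, not a prescribed design): the surrogate key identifies each version of the employee row, while the natural key 'E1' repeats across versions.

-- Employee dimension with a surrogate key.
CREATE TABLE employee_dim (
    employee_sid   INTEGER PRIMARY KEY,   -- surrogate key, internal to the warehouse
    employee_code  VARCHAR(10),           -- natural/business key, e.g. 'E1'
    business_unit  VARCHAR(10),
    valid_from     DATE,
    valid_to       DATE
);

-- Before the 2nd of June: 'E1' belongs to 'BU1'.
INSERT INTO employee_dim VALUES (1001, 'E1', 'BU1', DATE '2002-01-01', DATE '2002-06-01');
-- From the 2nd of June: a new row (new surrogate key) for the same employee in 'BU2'.
INSERT INTO employee_dim VALUES (1002, 'E1', 'BU2', DATE '2002-06-02', NULL);

-- The fact table stores the surrogate key, so old turnover stays attached to 'BU1'.
CREATE TABLE turnover_fact (
    employee_sid    INTEGER REFERENCES employee_dim (employee_sid),
    turnover_date   DATE,
    turnover_amount DECIMAL(12,2)
);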
A surrogate key is a simple concept. The data warehouse is meant to maintain historical data for analysis, so it should be in a denormalized form. As an everyday example, a single mobile number can be reassigned to another person if it has not been in use for more than one year; keeping the history straight in such cases is possible precisely because of the surrogate key.
This Data Warehousing site aims to help people get a good high-level understanding
of what it takes to
implement a successful data warehouse project. A lot of the information is from my
personal experience
as a business intelligence professional, both as a client and as a vendor.
- Tools: The selection of business intelligence tools and the selection of the data
warehousing team.
Tools covered are:
. Database, Hardware
. ETL (Extraction, Transformation, and Loading)
. OLAP
. Reporting
. Metadata
- Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.
Database/Hardware Selection
The only choices here are what type of hardware and database to purchase, as there
is basically no way
that one can build hardware/database systems from scratch.
In making the selection of the database/hardware platform, there are several items that need to be carefully considered:
. Scalability: How can the system grow as your data storage needs grow? Which RDBMS
and hardware
platform can handle large sets of data most efficiently? To get an idea of this,
one needs to determine the
approximate amount of data that is to be kept in the data warehouse system once
it's mature, and base
any testing numbers from there.
True Case: One of the projects I worked on was with a major RDBMS provider paired with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about bugs not being fixed because the support level for the particular type of hardware the client had chosen was Level 3, which basically meant that no one in the RDBMS support organization would fix any bug particular to that hardware platform.
Popular Relational Databases
. Oracle
. Microsoft SQL Server
. IBM DB2
. Teradata
. Sybase
. MySQL
Popular OS Platforms
. Linux
. FreeBSD
. Microsoft Windows
. Complexity of the data transformation: The more complex the data transformation
is, the more
suitable it is to purchase an ETL tool.
. Data cleansing needs: Does the data need to go through a thorough cleansing
exercise before it
is suitable to be stored in the data warehouse? If so, it is best to purchase a
tool with strong data
cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL
routine from
scratch.
. Data volume. Available commercial tools typically have features that can speed up
data
movement. Therefore, buying a commercial product is a better approach if the volume
of data
transferred is large.
While the selection of a database and a hardware platform is a must, the selection
of an ETL tool is highly
recommended, but it's not a must. When you evaluate ETL tools, it pays to look for
the following
characteristics:
. Functional capability: This includes both the 'transformation' piece and the
'cleansing' piece. In
general, the typical ETL tools are either geared towards having strong
transformation capabilities or
having strong cleansing capabilities, but they are seldom very strong in both. As a
result, if you know your
data is going to be dirty coming in, make sure your ETL tool has strong cleansing
capabilities. If you know
there are going to be a lot of different data transformations, it then makes sense
to pick a tool that is
strong in transformation.
. Ability to read directly from your data source: For each organization, there is a
different set of data
sources. Make sure the ETL tool you select can connect directly to your source
data.
. Metadata support: The ETL tool plays a key role in your metadata because it maps
the source data to
the destination, which is an important piece of the metadata. In fact, some
organizations have come to
rely on the documentation of their ETL tool as their metadata source. As a result,
it is very important to
select an ETL tool that works with your overall metadata strategy.
Popular Tools
OLAP tools are geared towards slicing and dicing of the data. As such, they require
a
strong metadata layer, as well as front-end flexibility. Those are typically
difficult
features for any home-built systems to achieve. Therefore, my recommendation is
that if
OLAP analysis is part of your charter for building a data warehouse, it is best to
purchase an existing OLAP tool rather than creating one from scratch.
Before we speak about OLAP tool selection criterion, we must first distinguish
between
the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational
OLAP).
1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube,
the
ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically
comes with a 'Designer' piece, where the data warehouse administrator can specify
the
relationship between the relational tables, as well as how dimensions, attributes,
and
hierarchies map to the underlying database tables.
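As an illustration of the "smart SQL generator" idea (the star schema below is hypothetical), a user request for sales amount by product family and month might be translated by a ROLAP engine into a join-and-aggregate statement roughly like this:

SELECT pf.product_family_name,
       d.calendar_month,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact     f
JOIN   product_lookup p  ON p.product_key         = f.product_key
JOIN   product_family pf ON pf.product_family_key = p.product_family_key
JOIN   date_lookup    d  ON d.date_key            = f.date_key
GROUP BY pf.product_family_name, d.calendar_month;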
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detailed level of information, levels that the traditional cubes do not reach for performance and size reasons.
So what are the criteria for evaluating OLAP vendors? Here they are:
. Customization efforts: More and more, OLAP tools are used as an advanced
reporting tool. This is
because in many cases, especially for ROLAP implementations, OLAP tools often can
be used as a
reporting tool. In such cases, the ease of front-end customization becomes an
important factor in the
tool selection process.
. Security Features: Because OLAP tools are geared towards a number of users,
making sure people see
only what they are supposed to see is important. By and large, all established OLAP
tools have a security
layer that can interact with the common corporate login protocols. There are,
however, cases where
large corporations have developed their own user authentication mechanism and have
a "single sign-
on" policy. For these cases, having a seamless integration between the tool and the
in-house
authentication can require some work. I would recommend that you have the tool
vendor team come in
and make sure that the two are compatible.
. Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.
Popular Tools
. Business Objects
. IBM Cognos
. SQL Server Analysis Services
. MicroStrategy
. Palo OLAP Server
. Number of reports: The higher the number of reports, the more likely that buying
a
reporting tool is a good idea. This is not only because reporting tools typically
make
creating new reports easier (by offering re-usable components), but they also
already
have report management systems to make maintenance and support functions easier.
. Desired Report Distribution Mode: If the reports will only be distributed in a
single mode
(for example, email only, or over the browser only), we should then strongly
consider the
possibility of building the reporting tool from scratch. However, if users will
access the
reports through a variety of different channels, it would make sense to invest in a
third-
party reporting tool that already comes packaged with these distribution modes.
. Ad Hoc Report Creation: Will the users be able to create their own ad hoc
reports? If so,
it is a good idea to purchase a reporting tool. These tool vendors have accumulated
extensive experience and know the features that are important to users who are
creating
ad hoc reports. A second reason is that the ability to allow for ad hoc report
creation
necessarily relies on a strong metadata layer, and it is simply difficult to come
up with a
metadata model when building a reporting tool from scratch.
Data is useless if all it does is sit in the data warehouse. As a result, the
presentation
layer is of very high importance.
Most of the OLAP vendors already have a front-end presentation layer that allows
users
to call up pre-defined reports or create ad hoc reports. There are also several
report tool
vendors. Either way, pay attention to the following points when evaluating
reporting
tools:
In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, while connecting to the other type of data source becomes a difficult exercise in programming.
In a realistic data warehousing usage scenario by senior executives, all they have
time for is to
come in on Monday morning, look at the most important weekly numbers from the
previous
week (say the sales numbers), and that's how they satisfy their business
intelligence needs. All
the fancy ad hoc and drilling capabilities will not interest them, because they do
not touch these
features.
Based on the above scenario, the reporting tool must have scheduling and
distribution
capabilities. Weekly reports are scheduled to run on Monday morning, and the
resulting reports
are distributed to the senior executives either by email or web publishing. There
are claims by
various vendors that they can distribute reports through various interfaces, but
based on my
experience, the only ones that really matter are delivery via email and publishing
over the
intranet.
. Security Features: Because reporting tools, similar to OLAP tools, are geared
towards a number of
users, making sure people see only what they are supposed to see is important.
Security can reside at
the report level, folder level, column level, row level, or even individual cell
level. By and large, all
established reporting tools have these capabilities. Furthermore, they have a
security layer that can
interact with the common corporate login protocols. There are, however, cases where
large
corporations have developed their own user authentication mechanism and have a
"single sign-on"
policy. For these cases, having a seamless integration between the tool and the in-
house authentication
can require some work. I would recommend that you have the tool vendor team come in
and make sure
that the two are compatible.
. Customization
Every one of us has had the frustration over spending an inordinate amount of time
tinkering
with some office productivity tool only to make the report/presentation look good.
This is
definitely a waste of time, but unfortunately it is a necessary evil. In fact, a
lot of times, analysts
will wish to take a report directly out of the reporting tool and place it in their
presentations or
reports to their bosses. If the reporting tool offers them an easy way to pre-set
the reports to look
exactly the way that adheres to the corporate standard, it makes the analysts' jobs much easier,
and the time savings are tremendous.
. Export capabilities
The most common export needs are to Excel, to a flat file, and to PDF, and a good
report tool
must be able to export to all three formats. For Excel, if the situation warrants
it, you will want to
verify that the reporting format, not just the data itself, will be exported out to
Excel. This can
often be a time-saver.
Popular Tools
. SAP Crystal Reports
. MicroStrategy
. IBM Cognos
. Actuate
. Jaspersoft
. Pentaho
. Metadata Tool Selection
. Buy vs. Build
. Only in the rarest of cases does it make sense to build a metadata tool from
scratch. This is
because doing so requires resources that are intimately familiar with the
operational, technical,
and business aspects of the data warehouse system, and such resources are difficult
to come by.
Even when such resources are available, there are often other tasks that can
provide more value
to the organization than to build a metadata tool from scratch.
. In fact, the question is often whether any type of metadata tool is needed at
all. Although
metadata plays an extremely important role in a successful data warehousing
implementation,
this does not always mean that a tool is needed to keep all the "data about data."
It is possible to, say, keep such information in the repository of other tools used, in text documentation, or even in a presentation or a spreadsheet.
. Having said the above, though, it is the author's belief that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, a plan for how metadata in the data warehousing system will be stored.
. Metadata Tool Functionalities
. This is the most difficult tool to choose, because there is clearly no standard.
In fact, it might be
better to call this a selection of the metadata strategy. Traditionally, people
have put the data
modeling information into a tool such as ERWin and Oracle Designer, but it is
difficult to extract
information out of such data modeling tools. For example, one of the goals for your
metadata
selection is to provide information to the end users. Clearly this is a difficult
task with a data
modeling tool.
. So typically what is likely to happen is that additional efforts are spent to
create a layer of
metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at mean, it is clearly inefficient because all
that information already resides somewhere in the data warehouse system, whether it
be the ETL
tool, the data modeling tool, the OLAP tool, or the reporting tool.
. There are efforts among data warehousing tool vendors to unify on a metadata
model. In June of
2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel),
and some of the vendors such as Oracle have claimed to have implemented it. This
standard
incorporates the latest technology such as XML, UML, and SOAP, and, if accepted
widely, is truly
the best thing that can happen to the data warehousing industry. As of right now,
though, the
author has not really seen that many tools leveraging this standard, so clearly it
has not quite
caught on yet.
. So what does this mean about your metadata efforts? In the absence of everything
else, I would
recommend that whatever tool you choose for your metadata support supports XML, and
that
whatever other tool that needs to leverage the metadata also supports XML. Then it
is a matter of
defining your DTD across your data warehousing system. At the same time, there is no need to worry about criteria that are typically important for the other tools, such as performance and support for parallelism, because the size of the metadata is typically small relative to the size of the data warehouse.
Open source BI is BI software that can be distributed for free and permits users to modify the source code. Open source software is available for all BI tool categories, from data modeling to reporting to OLAP to ETL.
Because open source software is community driven, it relies on the community for
improvement. As such, new feature sets typically come from community contribution
rather than as a result of dedicated R&D efforts.
With traditional BI software, the business model typically involves a hefty startup
cost,
and then there is an annual fee for support and maintenance that is calculated as a
percentage of the initial purchase price. In this model, a company needs to spend a
substantial amount of money before any benefit is realized. With the substantial
cost
also comes the need to go through a sales cycle, from the RFP process to evaluation
to
negotiation, and multiple teams within the organization typically get involved.
These
factors mean that it's not only costly to get started with traditional BI software,
but the
amount of time it takes is also long.
With open source BI, the beginning of the project typically involves a free
download of
the software. Given this, bureaucracy can be kept to a minimum and it is very easy
and
inexpensive to get started.
Lower cost
Because of its low startup cost and the typically lower ongoing maintenance/support
cost, the cost for open source BI software is lower (sometimes much lower) than
traditional BI software.
Easy to customize
By definition, open source software means that users can access and modify the
source
code directly. That means it is possible for developers to get under the hood of
the open
source BI tool and add their own features. In contrast, it is much more difficult
to do this
with traditional BI software because there is no way to access the source code.
Traditional BI software vendors put in a lot of money and resources into R&D, and
the
result is that the product has a rich feature set. Open source BI tools, on the
other hand,
rely on community support, and hence do not have as strong a feature set.
. JasperSoft
. Eclipse BIRT Project
. Pentaho
. SpagoBI
. OpenI
Reasons to use external consultants:
1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.
Reasons to hire permanent employees:
1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.
2. They are less likely to leave. Consultants, whether they are on contract, via a Big-5 firm, or via one of the tool vendor firms, are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.
. Project Manager: This person will oversee the progress and be responsible for the
success of the data
warehousing project.
. DBA: This role is responsible for keeping the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.
. Technical Architect: This role is responsible for developing and implementing the
overall technical
architecture of the data warehouse, from the backend hardware/software to the
client desktop
configurations.
. ETL Developer: This role is responsible for planning, developing, and deploying
the extraction,
transformation, and loading routine for the data warehouse.
. Front End Developer: This person is responsible for developing the front-end,
whether it be client-
server or over the web.
. OLAP Developer: This role is responsible for the development of OLAP cubes.
. QA Group: This role is responsible for ensuring the correctness of the data in
the data warehouse.
This role is more important than it appears, because bad data quality turns away
users more than any
other reason, and often is the start of the downfall for the data warehousing
project.
The above is a list of roles, and one person does not necessarily correspond to only one role. In fact, it is very common in a data warehousing team for one person to take on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.
After the tools and team personnel selections are made, the data warehouse design
can begin. The
following are the typical steps involved in the datawarehousing project cycle.
. Requirement Gathering
. Physical Environment Setup
. Data Modeling
. ETL
. OLAP Cube Design
. Front End Development
. Report Development
. Performance Tuning
. Query Optimization
. Quality Assurance
. Rolling out to Production
. Production Maintenance
. Incremental Enhancements
Each page listed above represents a typical data warehouse design phase, and has several sections: Task Description, Time Requirement, Deliverables, and Possible Pitfalls.
Requirement Gathering
Task Description
The first thing that the project team should engage in is gathering requirements
from
end users. Because end users are typically not familiar with the data warehousing
process or concept, the help of the business sponsor is essential. Requirement
gathering can happen as one-to-one meetings or as Joint Application Development
(JAD) sessions, where multiple people are talking about the project scope in the
same
meeting.
The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period focusing on satisfying these requirements.
Time Requirement
2 - 8 weeks.
Deliverables
. A list of reports / cubes to be delivered to the end users by the end of this
current
phase.
. An updated project plan that clearly identifies resource loads and milestone delivery dates.
Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, so there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or cannot proceed in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the
sponsor
is at the CXO level, she can often exert enough influence to make sure everyone
cooperates.
Physical Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical
servers and databases. At a minimum, it is necessary to set up a development
environment and a production environment. There are also many data warehousing
projects where there are three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The
different
processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly
for
each environment.
It is best for the different environments to use distinct application and database
servers.
In other words, the development environment will have its own application server
and
database servers, and the production environment will have its own set of
application
and database servers.
. All changes can be tested and QA'd first without affecting the production
environment.
. Development and QA can occur during the time users are accessing the data
warehouse.
. When there is any question about the data, having separate environment(s) will
allow the data warehousing team to examine the data without impacting the
production environment.
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
. Hardware / Software setup document for all of the environments, including
hardware specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments. Environment separation is achieved either by a directory structure or by setting up distinct instances of the database. This is problematic because it forfeits the benefits of separate environments listed above.
Data Modeling
Task Description
This is a very important step in the data warehousing project. Indeed, it is fair
to say that
the foundation of the data warehousing system is the data model. A good data model
will allow the data warehousing system to grow easily, as well as allowing for good
performance.
In data warehousing project, the logical data model is built based on user
requirements,
and then it is translated into the physical data model. The detailed steps can be
found in
the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this were delayed until the ETL phase, rectifying it would become a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
Possible Pitfalls
ETL
Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest
to
develop, and this can easily take up to 50% of the data warehouse implementation
cycle or longer. The reason for this is that it takes time to get the source data,
understand the necessary columns, understand the business rules, and understand the logical and physical data models.
Time Requirement
1 - 6 weeks.
Deliverables
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This
can
prove suicidal to the project because end users will usually tolerate less
formatting,
longer time to run reports, less functionality (slicing and dicing), or fewer
delivered
reports; one thing that they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more
complicated than necessary. In ETL design, the primary goal should be to optimize
load
speed without sacrificing on quality. This is, however, sometimes not followed.
There
are cases where the design goal is to cover all possible future uses, whether they
are
practical or just a figment of someone's imagination. When this happens, ETL
performance suffers, and often so does the performance of the entire data
warehousing
system.
OLAP Cube Design
Task Description
Usually the design of the OLAP cube can be derived from the Requirement
Gathering phase. More often than not, however, users have some idea on what they
want, but it is difficult for them to specify the exact report / analysis they want
to see.
When this is the case, it is usually a good idea to include enough information so
that
they feel like they have gained something through the data warehouse, but not so
much
that it stretches the data warehouse scope by a mile. Remember that data
warehousing
is an iterative process - no one can ever meet all the requirements all at once.
Time Requirement
1 - 2 weeks.
Deliverables
Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.
Front End Development
Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the
users
cannot visualize the reports, the data warehouse brings zero value to them. Hence
front
end development is an important part of a data warehousing initiative.
So what are the things to look out for in selecting a front-end deployment
methodology?
The most important thing is that the reports should be deliverable over the web, so that the only thing the user needs is a standard browser. These days it is no longer desirable nor feasible to have the IT department doing program installations on end users' desktops just so that they can view reports. So, whatever strategy one pursues, the ability to deliver over the web is a must.
The front-end options range from internal front-end development using scripting languages to off-the-shelf reporting products. Whatever the choice, make sure the front end can adapt to the changing reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from
back-end structure. For example, if the enterprise decides to change from
Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible
enough
to adjust to the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For
example,
do the reports need to be published on a regular interval? Are there very specific
formatting requirements? Is there a need for a GUI interface so that each user can
customize her reports?
Time Requirement
1 - 4 weeks.
Deliverables
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.
Report Development
Task Description
Report specification typically comes directly from the requirements phase. To the
end
user, the only direct touchpoint he or she has with the data warehousing system is
the
reports they see. So, report development, although not as time consuming as some of
the other steps such as ETL and data modeling, nevertheless plays a very important
role in determining the success of the data warehousing project.
One would think that report development is an easy task. How hard can it be to just
follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing the report.
User customization: Do users need to be able to select their own metrics? And how
do
users need to be able to filter the information? The report development process
needs
to take those factors into consideration so that users can get the information they
need
in the shortest amount of time possible.
Report delivery: What report delivery methods are needed? In addition to delivering
the report to the web front end, other possibilities include delivery via email,
via text
messaging, or in some form of spreadsheet. There are reporting solutions in the
marketplace that support report delivery as a flash file. Such flash file
essentially acts as
a mini-cube, and would allow end users to slice and dice the data on the report
without
having to pull data from an external source.
Access privileges: Special attention needs to be paid to who has what access to
what
information. A sales report can show 8 metrics covering the entire company to the
company CEO, while the same report may only show 5 of the metrics covering only a
single district to a District Sales Director.
Report development does not happen only during the implementation phase. After the
system goes into production, there will certainly be requests for additional
reports.
These types of requests generally fall into two broad categories:
1. Data is already available in the data warehouse. In this case, the request can usually be satisfied with a relatively quick report development effort.
2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.
Time Requirement
1 - 2 weeks.
Deliverables
Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.
Performance Tuning
Task Description
There are three major areas where a data warehousing system can use a little
performance tuning:
. ETL - Given that the data load is usually a very time-consuming process (and hence it is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes just barely on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
. Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.
. Report Delivery - It is also possible that end users are experiencing significant
delays in
receiving their reports due to factors other than the query performance. For
example,
network traffic, server setup, and even the way that the front-end was built
sometimes
play significant roles. It is important for the data warehouse team to look into
these areas
for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
Possible Pitfalls
Make sure the development environment mimics the production environment as much
as possible - Performance enhancements seen on less powerful machines sometimes
do not materialize on the larger, production-level machines.
Query Optimization
For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.
Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step in query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use the "EXPLAIN [SQL Query]" keyword to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
The more data returned from the query, the more resources the database needs to expend to process and store that data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
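A minimal sketch of both points, using MySQL-style syntax against a hypothetical sales_fact table (the Oracle equivalent would use EXPLAIN PLAN FOR):

-- Inspect the query plan first to see which index, if any, is being used.
EXPLAIN
SELECT sales_amount              -- retrieve only the column actually needed
FROM   sales_fact                -- rather than SELECT *
WHERE  date_key = 20240101;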
Sometimes logic for a query can be quite complex. Often, it is possible to achieve
the
desired result through the use of subqueries, inline views, and UNION-type
statements.
For those cases, the intermediate results are not stored in the database, but are
immediately used within the query. This can lead to performance issues, especially
when the intermediate results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate
results in a temporary table, and break up the initial SQL statement into several
SQL
statements. In many cases, you can even build an index on the temporary table to
speed up the query performance even more. Granted, this adds a little complexity in
query management (i.e., the need to manage temporary tables), but the speedup in
query performance is often worth the trouble.
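A hedged sketch of this approach (names are hypothetical, and the exact temporary-table and index syntax varies by database): materialize the intermediate result, index it, and then run the final query against it.

-- Step 1: store the intermediate result instead of nesting it as a subquery.
CREATE TEMPORARY TABLE tmp_daily_sales AS
SELECT date_key, store_key, SUM(sales_amount) AS daily_sales
FROM   sales_fact
GROUP BY date_key, store_key;

-- Step 2: optionally index the temporary table on the join column.
CREATE INDEX idx_tmp_daily_sales ON tmp_daily_sales (store_key);

-- Step 3: the final query now works against a much smaller data set.
SELECT s.store_name, AVG(t.daily_sales) AS avg_daily_sales
FROM   tmp_daily_sales t
JOIN   store_lookup    s ON s.store_key = t.store_key
GROUP BY s.store_name;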
. Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is discussed separately.
. Aggregate Table
Pre-populating summary tables at higher levels of aggregation so that less data needs to be parsed (see the sketch after this list).
. Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a
SQL query needs to process.
. Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the
amount of data a SQL query needs to process.
. Denormalization
The process of denormalization combines multiple tables into a single table. This
speeds up query performance because fewer table joins are needed.
. Server Tuning
Each server has its own parameters, and often tuning server parameters so that
it can fully take advantage of the hardware resources can significantly speed up
query performance.
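As a sketch of the Aggregate Table and Horizontal Partitioning strategies above (table names are hypothetical, and partitioning syntax differs by database):

-- Aggregate table: pre-compute monthly totals so reports scan far fewer rows.
CREATE TABLE sales_fact_monthly AS
SELECT product_key,
       store_key,
       EXTRACT(YEAR  FROM sales_date) AS sales_year,
       EXTRACT(MONTH FROM sales_date) AS sales_month,
       SUM(sales_amount)              AS sales_amount
FROM   sales_fact
GROUP BY product_key, store_key,
         EXTRACT(YEAR FROM sales_date), EXTRACT(MONTH FROM sales_date);

-- Horizontal partitioning (Oracle-style syntax shown only as an example):
-- CREATE TABLE sales_fact ( ... )
--   PARTITION BY RANGE (sales_date) (
--     PARTITION p_2023 VALUES LESS THAN (DATE '2024-01-01'),
--     PARTITION p_2024 VALUES LESS THAN (DATE '2025-01-01'));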
Quality Assurance
Task Description
Once the development team declares that everything is ready for further testing,
the QA
team takes over. The QA team is always from the client. Usually the QA team members
will know little about data warehousing, and some of them may even resent the need
to
have to learn another tool or tools. This makes the QA process a tricky one.
Time Requirement
1 - 4 weeks.
Deliverables
. QA Test Plan
. QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data
warehousing, and some of them may even resent the need to have to learn another
tool
or tools. Make sure the QA team members get enough education so that they can
complete the testing themselves.
Production Maintenance
Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backups and crisis management become important and should be planned out.
In addition, it is very important to consistently monitor end user usage. This
serves two
purposes: 1. To capture any runaway requests so that they can be fixed before
slowing
the entire system down, and 2. To understand how much users are utilizing the data
warehouse for return-on-investment calculations and future enhancement
considerations.
Time Requirement
Ongoing.
Deliverables
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the
project, so it is
essential that proper documentation is left for those who are handling production
maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is
another
phase of the data warehouse planned, start on that as soon as possible.
Incremental Enhancements
Task Description
Once the data warehousing system goes live, there are often needs for incremental
enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.
Deliverables
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just
go
ahead and make the change in production. This is a definite no-no. Many unexpected
problems will pop up if this is done. I would very strongly recommend that the
typical
cycle of development --> QA --> Production be followed, regardless of how simple
the
change may seem.
Observations
This section lists the trends I have seen based on my experience in the data
warehousing field:
Quick delivery time
If you add up the total time required to complete the tasks from Requirement
Gathering to Rollout to Production, you'll find it takes about 9 - 29 weeks to
complete
each phase of the data warehousing efforts. The 9 weeks may sound too quick, but I
have been personally involved in a turnkey data warehousing implementation that
took
40 business days, so that is entirely possible. Furthermore, some of the tasks may
proceed in parallel, so as a rule of thumb it is reasonable to say that it
generally takes 2
- 6 months for each phase of the data warehousing implementation.
Why is this important? The main reason is that in today's business world, the
business
environment changes quickly, which means that what is important now may not be
important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue at a rapid pace. Therefore, a time-consuming data warehousing effort will very likely become obsolete by the time it is in production. It is best to finish a project quickly.
The focus on
quick delivery time does mean, however, that the scope for each phase of the data
warehousing project will necessarily be limited. In this case, the 80-20 rule
applies, and
our goal is to do the 20% of the work that will satisfy 80% of the user needs. The
rest
can come later.
Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that, now that information from all over the enterprise is conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise impossible to find. Unfortunately, this has not quite happened, due to the following reasons:
2. The ROI for data mining companies is inherently lower because by definition,
data
mining will only be performed by a few users (generally no more than 5) in the
entire
enterprise. As a result, it is hard to charge a lot of money due to the low number
of
users. In addition, developing data mining algorithms is an inherently complex
process
and requires a lot of up front investment. Finally, it is difficult for the vendor
to put a
value proposition in front of the client because quantifying the returns on a data
mining
project is next to impossible.
This is not to say, however, that data mining is not being utilized by enterprises.
In fact,
many enterprises have made excellent discoveries using data mining techniques. What
I am saying, though, is that data mining is typically not associated with a data
warehousing initiative. It seems like successful data mining projects are usually
stand-
alone projects.
Industry Consolidation
In the last several years, we have seen rapid industry consolidation, as the weaker
competitors
are gobbled up by stronger players. The most significant transactions are below
(note that the
dollar amount quoted is the value of the deal when initially announced):
For the majority of the deals, the purchase represents an effort by the buyer to
expand into other
areas of data warehousing (Hyperion's purchase of Brio also falls into this
category because,
even though both are OLAP vendors, their product lines do not overlap). This
clearly shows
vendors' strong push to be the one-stop shop, from reporting, OLAP, to ETL.
There are two levels of one-stop shop. The first level is at the corporate level.
In this case, the
vendor is essentially still selling two entirely separate products. But instead of
dealing with two
sets of sales and technology support groups, the customers only interact with one
such group.
The second level is at the product level. In this case, different products are
integrated. In data
warehousing, this essentially means that they share the same metadata layer. This
is actually a
rather difficult task, and therefore not commonly accomplished. When there is
metadata
integration, the customers not only get the benefit of only having to deal with one
vendor instead
of two (or more), but the customer will be using a single product, rather than
multiple products.
This is where the real value of industry consolidation is shown.
Just because this is often not done does not mean this is not important. Just like
a data
warehousing system aims to measure the pulse of the company, the success of the
data warehousing system itself needs to be measured. Without some type of measure
on the return on investment (ROI), how does the company know whether it made the
right choice? Whether it should continue with the data warehousing investment?
There are a number of papers out there that provide formulas on how to calculate the ROI of a data warehousing system. A simpler and more practical measure, though, is end user usage.
If the system is satisfying user needs, users will naturally use the system. If
not, users
will abandon the system, and a data warehousing system with no users is actually a
detriment to the company (since resources that can be deployed elsewhere are
required
to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much the users are accessing the data warehouse. This should not be a
problem if third-party reporting/OLAP tools are used, since they all contain this
component. If the reporting tool is built from scratch, this feature needs to be
included in
the tool. Once the system goes into production, the data warehousing team needs to
periodically check to make sure users are using the system. If usage starts to dip,
find
out why and address the reason as soon as possible. Is the data quality lacking?
Are
the reports not satisfying current needs? Is the response time slow? Whatever the
reason, take steps to address it as soon as possible, so that the data warehousing
system is serving its purpose successfully.
Business Intelligence
The terms business intelligence and data warehousing are often used interchangeably. So, exactly what is business intelligence?
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.
Tools
The most common tools used for business intelligence are as follows. They are listed in order of increasing cost, increasing functionality, and increasing business intelligence capability.
Excel
Take a guess: what's the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:
2. It's commonly used. You can easily send an Excel sheet to another person without worrying about whether the recipient has the software needed to open it.
In fact, it is still so popular that all third-party reporting / OLAP tools have an
"export to
Excel" functionality. Even for home-built solutions, the ability to export numbers
to Excel
usually needs to be built.
Excel is best used for business operations reporting and goals tracking.
Reporting tool
Business operations reporting and dashboard are the most common applications for a
reporting tool.
OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look
at
the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.
Data mining tool
Data mining tools are usually used only by very specialized users, and in an organization, even a large one, there are usually only a handful of users using data mining tools.
Data mining tools are used for finding correlation among different factors.
Concepts
Conceptual Data Model: What is a conceptual data model, its features, and an
example of this type of data model.
Logical Data Model: What is a logical data model, its features, and an example of
this
type of data model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and contrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they differ from one another, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a
different view of the role between data warehouse and data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are
real life
instances when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why
is it useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it
and why is it important.
Dimensional data model is most often used in data warehousing systems. This is
different from the 3rd normal form, commonly used for transactional (OLTP) type
systems. As you can imagine, the same data would then be stored differently in a
dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let's define some of the terms commonly
used in this type of modeling:
Fact Table: A fact table is a table that contains the measures of interest. For
example,
sales amount would be such a measure. This measure is stored in the fact table with
the appropriate granularity. For example, it can be sales amount by store by day.
In this
case, the fact table would contain three columns: A date column, a store column,
and a
sales amount column.
Lookup Table: The lookup table provides the detailed information about the
attributes.
For example, the lookup table for the Quarter attribute would include a list of all
of the
quarters available in the data warehouse. Each row (each quarter) may have several
fields, one for the unique ID that identifies the quarter, and one or more
additional fields
that specify how that particular quarter is represented on a report (for example,
first
quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to
one
or more lookup tables, but fact tables do not have direct relationships to one
another.
Dimensions and hierarchies are represented by lookup tables. Attributes are the
non-
key columns in the lookup tables.
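To make this concrete, here is a minimal sketch in SQL of one fact table and two lookup tables; the table and column names are hypothetical, chosen to match the sales-by-store-by-day and quarter examples above:

-- Lookup tables: one row per member; the non-key columns are the attributes
CREATE TABLE lkp_date (
    date_key     INTEGER PRIMARY KEY,  -- unique ID for the day
    full_date    DATE,
    quarter_name VARCHAR(10)           -- how the quarter is shown on reports, e.g. 'Q1 2001'
);

CREATE TABLE lkp_store (
    store_key  INTEGER PRIMARY KEY,
    store_name VARCHAR(50)
);

-- Fact table: the measure of interest at the store-by-day grain
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES lkp_date (date_key),
    store_key    INTEGER REFERENCES lkp_store (store_key),
    sales_amount DECIMAL(12,2)
);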
In designing data models for data warehouses / data marts, the most commonly used
schema types are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and
business needs. Personally, I am partial to snowflakes, when there is a business
case
to analyze the information at that particular level.
Granularity
The first step in designing a fact table is to determine the granularity of the
fact table.
By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps: determining which dimensions will be included, and determining where along the hierarchy of each dimension the information will be kept.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a rewards card program, where customers provide some personal information in exchange for lower prices on certain items at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.
Determining where along the hierarchy the information is stored for each dimension is a bit more tricky. This is where user requirements (both stated and anticipated) play a major role.
In the above example, will the supermarket want to do analysis at the hourly level (i.e., looking at how certain products may sell by different hours of the day)? If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity.
Since a lower level of detail means a larger amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between the level of detail of the analysis and the amount of data stored.
Note that sometimes the users will not specify certain requirements, but based on
the
industry knowledge, the data warehousing team may foresee that certain requirements
will be forthcoming that may result in the need of additional details. In such
cases, it is
prudent for the data warehousing team to design the fact table such that lower-
level
information is included. This will avoid possibly needing to re-design the fact
table in the
future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge to dump the lowest level of detail into the data warehouse, and only include what is practically needed. Sometimes this can be more of an art than a science, and prior experience becomes invaluable here.
Types of Facts
. Additive: Additive facts are facts that can be summed up through all of the
dimensions
in the fact table.
. Semi-Additive: Semi-additive facts are facts that can be summed up for some of
the
dimensions in the fact table, but not the others.
. Non-Additive: Non-additive facts are facts that cannot be summed up for any of
the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first
example
assumes that we are a retailer, and we have a fact table with the following
columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each
store
on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an
additive
fact, because you can sum up this fact along any of the three dimensions present in
the
fact table -- date, store, and product. For example, the sum of Sales_Amount for
all 7
days in a week represents the total sales amount for that week.
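As a sketch (the column names below are hypothetical; the source simply lists Date, Store, Product and Sales_Amount), being additive means the measure can be rolled up along any of the three dimensions:

-- Roll up across dates: total sales per store and product for one week
SELECT store, product, SUM(sales_amount) AS weekly_sales
FROM   sales_fact
WHERE  sale_date BETWEEN DATE '2003-01-06' AND DATE '2003-01-12'
GROUP  BY store, product;

-- Roll up across stores and products: total sales per day
SELECT sale_date, SUM(sales_amount) AS daily_sales
FROM   sales_fact
GROUP  BY sale_date;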
The second example assumes that we are a bank, and we have a fact table with the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the
end of
each day, as well as the profit margin for each account for each
day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-
additive fact, as it makes sense to add them up for all accounts (what's the total
current
balance for all accounts in the bank?), but it does not make sense to add them up
through time (adding up all current balances for a given account for each day of
the
month does not give us any useful information). Profit_Margin is a non-additive
fact, for
it does not make sense to add them up for the account level or the day level.
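A sketch of how these two kinds of facts are typically queried (the table name account_snapshot and its columns are hypothetical stand-ins for the columns listed above):

-- Semi-additive: summing Current_Balance across accounts for a single day is meaningful
SELECT snapshot_date, SUM(current_balance) AS total_balance
FROM   account_snapshot
WHERE  snapshot_date = DATE '2003-01-15'
GROUP  BY snapshot_date;

-- Summing a balance across days is not meaningful; an average (or the latest value) is used instead
SELECT account, AVG(current_balance) AS avg_daily_balance
FROM   account_snapshot
GROUP  BY account;

-- Non-additive: Profit_Margin is never summed, only averaged or recomputed
SELECT account, AVG(profit_margin) AS avg_margin
FROM   account_snapshot
GROUP  BY account;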
Based on the above classifications, there are two types of fact tables:
. Cumulative: This type of fact table describes what has happened over a period of
time.
For example, this fact table may describe the total sales by product by store by
day. The
facts for this type of fact tables are mostly additive facts. The first example
presented
here is a cumulative fact table.
. Snapshot: This type of fact table describes the state of things in a particular
instance of
time, and usually includes more semi-additive and non-additive facts. The second
example presented here is a snapshot fact table.
Star Schema
In the star schema design, a single object (the fact table) sits in the middle and
is
radially connected to other surrounding objects (dimension lookup tables) like a
star.
(Figure: Star Schema)
Each dimension is represented as a single table. The primary key in each dimension
table is related to a foreign key in the fact table.
All measures in the fact table are related to all the dimensions that fact table is
related
to. In other words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Let's look at an example: Assume our data warehouse keeps store sales data, and the
different dimensions are time, store, product, and customer. In this case, the
figure on
the left represents our star schema. The lines between two tables indicate that
there is a
primary key / foreign key relationship between the two tables. Note that different
dimensions are not related to one another.
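A typical query against this kind of star schema joins the fact table to the dimensions it needs and aggregates the measure. The following is only a sketch; the table, column and hierarchy names are assumptions:

-- Sales by product category for California stores in 2003
SELECT p.category,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
       JOIN dim_time    t ON t.time_key    = f.time_key
       JOIN dim_store   s ON s.store_key   = f.store_key
       JOIN dim_product p ON p.product_key = f.product_key
WHERE  t.year  = 2003
AND    s.state = 'California'
GROUP  BY p.category;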
Slowly Changing Dimension
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the
original
entry in the customer lookup table has the following record:
Customer Key | Name      | State
1001         | Christina | Illinois
At a later date, she moved to Los Angeles, California on January, 2003. How should
ABC Inc. now modify its customer table to reflect this change? This is the "Slowly
Changing Dimension" problem.
There are in general three ways to solve this type of problem, and they are
categorized
as follows:
Type 1: The new record replaces the original record. No trace of the old record
exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change, keeping both the original value and the current value in separate columns.
We next take a look at each of the scenarios and how the data model and the data look for each of them. Finally, we compare and contrast the three alternatives.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.
Customer Key | Name      | State
1001         | Christina | Illinois
After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:
Customer Key | Name      | State
1001         | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since
there
is no need to keep track of the old information.
Disadvantages:
- All history is lost. The new information simply overwrites the old, so there is no way to trace back and report on historical values.
Usage:
Type 1 slowly changing dimension should be used when it is not necessary for the
data
warehouse to keep track of historical changes.
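In SQL, a Type 1 change is simply an in-place update of the dimension row; customer_dim and its columns are hypothetical names for the table shown above:

-- Type 1: overwrite the attribute; no history is kept
UPDATE customer_dim
SET    state = 'California'
WHERE  customer_key = 1001;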
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present, and the new record gets its own primary key. In our example, we again start with the original record:
Customer Key | Name      | State
1001         | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a
new
row into the table:
Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of
rows
for the table is very high to start with, storage and performance can become a
concern.
Usage:
Type 2 slowly changing dimension should be used when it is necessary for the data
warehouse to track historical changes.
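A Type 2 change leaves the old row untouched and adds a new row under a new surrogate key; a minimal sketch against the same hypothetical customer_dim table:

-- Type 2: row 1001 (Illinois) remains; a new row with a new surrogate key carries the new value
INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');

In practice, effective-date or current-flag columns are often added to the dimension so that queries can pick out the row that was valid at a given point in time.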
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one
indicating the
current value. There will also be a column that indicates when the current value
becomes active.
Customer Key | Name      | State
1001         | Christina | Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following
columns:
. Customer Key
. Name
. Original State
. Current State
. Effective Date
After Christina moved from Illinois to California, the original information gets
updated,
and we have the following table (assuming the effective date of change is January
15,
2003):
Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003
Advantages:
- This does not increase the size of the table, since the new information is updated in place.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more
than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.
Usage:
Type III slowly changing dimension should only be used when it is necessary for the
data warehouse to track historical changes, and when such changes will only occur
for
a finite number of time.
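A Type 3 change updates the current-value and effective-date columns while leaving the original-value column alone; the column names below are hypothetical equivalents of the ones in the table above:

-- Type 3: original_state keeps 'Illinois'; only the current value and its effective date change
UPDATE customer_dim
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'
WHERE  customer_key = 1001;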
The three levels of data modeling, the conceptual data model, logical data model, and physical data model, were discussed in prior sections. Here we compare these three types of data models. The table below compares the different features:
Feature                 Conceptual   Logical   Physical
Entity Names                X            X
Entity Relationships        X            X
Attributes                               X
Primary Keys                             X           X
Foreign Keys                             X           X
Table Names                                           X
Column Names                                          X
(Figures: Conceptual Model Design, Logical Model Design, Physical Model Design)
Below we show the conceptual, logical, and physical versions of a single data
model.
We can see that the complexity increases from conceptual to logical to physical.
This is
why we always first start with the conceptual data model (so we understand at high
level
what are the different entities in our data and how they relate to one another),
then
move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a
data warehousing project, sometimes the conceptual data model and the logical data
model are considered as a single deliverable.
Data Integrity
Data integrity refers to the validity of data, meaning data is consistent and
correct. In the
data warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If
there is no data integrity in the data warehouse, any resulting report and analysis
will
not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data
integrity include:
Referential integrity
The relationship between the primary key of one table and the foreign key of
another
table must always be maintained. For example, a primary key cannot be deleted if
there
is still a foreign key that refers to this primary key.
Primary key / Unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely identified.
NOT NULL constraint
Columns identified as NOT NULL may not have a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can
only
have positive integers, a value of '-1' cannot be allowed.
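As a sketch, all of these database-level rules can be declared directly in the DDL; the table and column names here are hypothetical:

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,         -- primary key: unique and NOT NULL by definition
    product_name VARCHAR(100) NOT NULL        -- NOT NULL: a value is always required
);

CREATE TABLE order_fact (
    product_key INTEGER NOT NULL
        REFERENCES product_dim (product_key), -- referential integrity: no orphaned foreign keys
    quantity    INTEGER NOT NULL
        CHECK (quantity > 0)                  -- valid values: negative quantities are rejected
);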
ETL process
For each step of the ETL process, data integrity checks should be put in place to
ensure
that source data is the same as the data in the destination. Most common checks
include record counts or record sums.
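A record-count or record-sum check can be expressed as a simple reconciliation query; the staging and target table names below are hypothetical:

-- Compare row counts and a control total between the ETL source (staging) and the target
SELECT 'stg_orders' AS table_name, COUNT(*) AS row_count, SUM(order_amount) AS total_amount
FROM   stg_orders
UNION ALL
SELECT 'fact_orders', COUNT(*), SUM(order_amount)
FROM   fact_orders;
-- The two result rows should match; a mismatch signals a problem in that ETL step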
Access level
We need to ensure that data is not altered by any unauthorized means either during
the
ETL process or in the data warehouse. To do this, there needs to be safeguards
against
unauthorized access to data (including physical access to the servers), as well as
logging of all data access history. Data integrity can only be ensured if there is no unauthorized access to the data.
What Is OLAP
OLAP stands for On-Line Analytical Processing. The first attempt to provide a
definition
to OLAP was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered
that this particular white paper was sponsored by one of the OLAP tool vendors,
thus
causing it to lose objectivity. The OLAP Report has proposed the FASMI test: Fast Analysis of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's rules and the FASMI test, please visit The OLAP Report.
For people on the business side, the key feature out of the above list is
"Multidimensional." In other words, the ability to analyze metrics in different
dimensions
such as time, geography, gender, product, etc. For example, sales for the company are up. What region is most responsible for this increase? Which store in this region
is most
responsible for the increase? What particular product category or categories
contributed
the most to the increase? Answering these types of questions in order means that
you
are performing an OLAP analysis.
Depending on the underlying technology used, OLAP can be broadly divided into two different camps: MOLAP and ROLAP. A discussion of the different OLAP types can be
found in the MOLAP, ROLAP, and HOLAP section.
In the OLAP world, there are mainly two different types: Multidimensional OLAP
(MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies
that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in
proprietary
formats.
Advantages:
. Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
. Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only doable, but
they return quickly.
Disadvantages:
. Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of
data in the cube itself. This is not to say that the data in the cube cannot be
derived from a large amount of data. Indeed, this is possible. But in this case,
only summary-level information will be included in the cube itself.
. Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database
to
give the appearance of traditional OLAP's slicing and dicing functionality. In
essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL
statement.
Advantages:
. Can handle large amounts of data: The data size limitation of ROLAP technology
is the limitation on data size of the underlying relational database. In other
words,
ROLAP itself places no limitation on data amount.
. Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
. Performance can be slow: because each ROLAP report is essentially one or more SQL queries against the relational database, query time can be long when the underlying data size is large.
. Limited by SQL functionalities: it is difficult to express complex calculations purely in SQL, so ROLAP technologies are constrained by what SQL can do.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
Bill Inmon vs. Ralph Kimball
In the data warehousing field, we often hear discussions on whether a person's or an organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two.
Bill Inmon's paradigm: Data warehouse is one part of the overall business
intelligence
system. An enterprise has one data warehouse, and data marts source their
information
from the data warehouse. In the data warehouse, information is stored in 3rd normal
form.
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts
within the enterprise. Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different
data
warehousing philosophies. In reality, the data warehouse in most enterprises is closer
to Ralph Kimball's idea. This is because most data warehouses started out as a
departmental effort, and hence they originated as a data mart. Only when more data
marts are built later do they evolve into a data warehouse.
Factless Fact Table Example
A factless fact table is a fact table that contains no measures or facts, only the keys of the participating dimensions. For example, think about a record of student attendance in classes. In this case,
the fact
table would consist of 3 dimensions: the student dimension, the time dimension, and
the
class dimension. This factless fact table would look like the following:
The only measure that you can possibly attach to each combination is "1" to show
the
presence of that particular combination. However, adding a fact that always shows 1
is
redundant because we can simply use the COUNT function in SQL to answer the same
questions.
Factless fact tables offer the most flexibility in data warehouse design. For example, one can easily answer the following two questions with this factless fact table: How many students attended a particular class on a particular day? How many classes, on average, does a student attend on a given day?
Without using a factless fact table, we will need two separate fact tables to
answer the
above two questions. With the above factless fact table, it becomes the only fact
table
that's needed.
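Assuming a factless fact table named fact_attendance with student_key, date_key and class_key columns (hypothetical names), both kinds of question reduce to COUNT queries:

-- How many students attended a particular class on a particular day?
SELECT COUNT(*) AS attendance
FROM   fact_attendance
WHERE  class_key = 101
AND    date_key  = 20030115;

-- On average, how many classes does a student attend per day?
SELECT AVG(classes_per_day) AS avg_classes_per_student_day
FROM  (SELECT student_key, date_key, COUNT(*) AS classes_per_day
       FROM   fact_attendance
       GROUP  BY student_key, date_key) per_day;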
Junk Dimension
In data warehouse design, we frequently run into a situation where there are a number of yes/no indicator fields in the source data that do not naturally belong to any of the existing dimensions.
Junk dimension is the way to solve this problem. In a junk dimension, we combine
these
indicator fields into a single dimension. This way, we'll only need to build a
single
dimension table, and the number of fields in the fact table, as well as the size of
the fact
table, can be decreased. The content in the junk dimension table is the combination
of
all possible values of the individual indicator fields.
Let's look at an example. Assuming that we have the following fact table:
Fact Table Before Junk Dimension
In this example, the last 3 fields are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:
Fact Table With Junk Dimension
Note that now the number of dimensions in the fact table went from 7 to 5.
The content of the junk dimension table would look like the following:
Junk Dimension Example
In this case, we have 3 possible values for the TXN_CODE field, 2 possible values
for
the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results
in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.
By using a junk dimension to replace the 3 indicator fields, we have decreased the
number of dimensions by 2 and also decreased the number of fields in the fact table
by
2. This will result in a data warehousing environment that offers better performance and is easier to manage.
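As a sketch, the 12-row junk dimension can be populated as the cross product of the indicator values. The specific TXN_CODE values below are assumptions, and the VALUES-list syntax is PostgreSQL / SQL Server style:

-- Every combination of the three indicators becomes one junk dimension row: 3 x 2 x 2 = 12 rows
INSERT INTO junk_dim (junk_key, txn_code, coupon_ind, prepay_ind)
SELECT ROW_NUMBER() OVER (ORDER BY t.txn_code, c.coupon_ind, p.prepay_ind) AS junk_key,
       t.txn_code,
       c.coupon_ind,
       p.prepay_ind
FROM   (VALUES ('SALE'), ('RETURN'), ('VOID')) AS t (txn_code),
       (VALUES ('Y'), ('N'))                   AS c (coupon_ind),
       (VALUES ('Y'), ('N'))                   AS p (prepay_ind);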
Conformed Dimension
A conformed dimension is a dimension that has exactly the same meaning and content
when being referred from different fact tables. A conformed dimension can refer to
multiple tables in multiple data marts within the same organization. For two
dimension
tables to be considered as conformed, they must either be identical or one must be
a
subset of another. There cannot be any other type of difference between the two
tables.
For example, two dimension tables that are exactly the same except for the primary
key
are not considered conformed dimensions.
Why is conformed dimension important? This goes back to the definition of data
warehouse being "integrated." Integrated means that even if a particular entity had
different meanings and different attributes in the source systems, there must be a
single
version of this entity once the data flows into the data warehouse.
Not all conformed dimensions are as easy to produce as the time dimension. An
example is the customer dimension. In any organization with some history, there is
a
high likelihood that different customer databases exist in different parts of the
organization. To achieve a conformed customer dimension means those data must be
compared against each other, rules must be set, and data must be cleansed. In
addition, when we are doing incremental data loads into the data warehouse, we'll
need
to apply the same rules to the new values to make sure we are only adding truly new records.
Building a conformed dimension is also part of the process in master data management, or MDM. In MDM, one must not only make sure the master data dimensions are conformed, but that conformity also needs to be brought back to the source systems.
Glossary
Aggregation: One way of speeding up query performance. Facts are summed up for
selected
dimensions from the original fact table. The resulting aggregate table will have
fewer rows, thus making
queries that can use them go faster.
Conformed Dimension: A dimension that has exactly the same meaning and content when
being
referred to from different fact tables.
Data Mart: Data marts have the same definition as the data warehouse (see below),
but data marts have
a more limited audience and/or data content.
Dimension: The same category of information. For example, year, month, day, and
week are all part of
the Time Dimension.
Drill Through: Data analysis that goes from an OLAP cube into the relational
database.
ETL: Stands for Extraction, Transformation, and Loading. The movement of data from
one area to another.
Fact Table: A type of table in the dimensional model. A fact table typically
includes two
types of columns: fact columns and foreign keys to the dimensions.
Hierarchy: A hierarchy defines the navigating path for drilling up and drilling
down. All
attributes in a hierarchy belong to the same dimension.
Metadata: Data about data. For example, the number of tables in the database is a
type
of metadata.
OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users
a quick way of slicing and dicing the data.
ROLAP: Relational OLAP. ROLAP systems store data in the relational database.
Star Schema: A common form of dimensional model. In a star schema, each dimension
is represented by a single dimension table.
Master Data Management (MDM) refers to the process of creating and managing data
that an organization must have as a single master copy, called the master data.
Usually,
master data can include customers, vendors, employees, and products, but can differ
across industries and even across companies within the same industry. MDM is
important because it offers the enterprise a single version of the truth. Without a
clearly
defined master data, the enterprise runs the risk of having multiple copies of data
that
are inconsistent with one another.
MDM is typically more important in larger organizations. In fact, the bigger the
organization, the more important the discipline of MDM is, because a bigger
organization means that there are more disparate systems within the company, and
the
difficulty on providing a single source of truth, as well as the benefit of having
master
data, grows with each additional data source. A particularly big challenge to
maintaining
master data occurs when there is a merger/acquisition. Each of the organizations
will
have its own master data, and how to merge the two sets of data will be
challenging.
Let's take a look at the customer files: The two companies will likely have
different
unique identifiers for each customer. Addresses and phone numbers may not match.
One may have a person's maiden name and the other the current last name. One may
have a nickname (such as "Bill") and the other may have the full name (such as
"William"). All these contribute to the difficulty in creating and maintain in a
single set of
master data.
At the heart of the master data management program is the definition of the master
data. Therefore, it is essential that we identify who is responsible for defining
and
enforcing the definition. Due to the importance of master data, a dedicated person
or
team should be appointed. At the minimum, a data steward should be identified. The
responsible party can also be a group -- such as a data governance committee or a
data governance council.
Master Data Management vs Data Warehousing
Based on the discussions so far, it seems like Master Data Management and Data
Warehousing have a lot in common. For example, the effort of data transformation
and
cleansing is very similar to an ETL process in data warehousing, and in fact they
can
use the same ETL tools. In the real world, it is not uncommon to see MDM and data
warehousing fall into the same project. On the other hand, it is important to call
out the
main differences between the two:
1) Different Goals
The goal of master data management is to create and maintain a single, authoritative version of each master entity, and the cleansed master data is often fed back to the source system in some way. In data warehousing, solving the root cause is not always needed, as it may be enough just to have a consistent view at the data warehousing level rather than having to ensure consistency at the data source level.
2) Different Types of Data
Master Data Management is only applied to entities and not to transactional data, while a data warehouse includes data that are both transactional and non-transactional in nature. The easiest way to think about this is that MDM only affects data that exists in dimensional tables and not in fact tables, while a data warehousing environment includes both dimensional tables and fact tables.
3) Different Impact on the Source Systems
In a data warehouse, usually the only usage of this "single source of truth" is for
applications that access the data warehouse directly, or applications that access
systems that source their data straight from the data warehouse. Most of the time,
the
original data sources are not affected. In master data management, on the other
hand,
we often need to have a strategy to get a copy of the master data back to the
source
system. This poses challenges that do not exist in a data warehousing environment.
For
example, how do we sync the data back with the original source? Once a day? Once an hour? How do we handle cases where the data was modified as it went through the cleansing process? And how much modification do we need to make to the source system so it can use the master data? These questions represent some of the challenges MDM faces. Unfortunately, there is no easy answer to those questions, as the answer depends on each organization's particular situation.
Example: Consider a bank that has several branches in several countries, millions of customers, and lines of business such as savings and loans. Over the years, the application designers in each branch have made their own decisions about how an application and database should be built, so the source systems differ in naming conventions, units of measurement, encoding structures, and physical attributes of data. The following example explains how the data is integrated from source systems into target systems.
In the target data of such an example, attribute names, column names, and data types are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored in the data warehouse.
Data warehouses and data marts are built on dimensional data modeling
where fact tables are connected with dimension tables. This is most useful
for users to access data since a database can be visualized as a cube of
several dimensions. A data warehouse provides an opportunity for slicing
and dicing that cube along each of its dimensions.
Data Mart: A data mart is a subset of data warehouse that is designed for a
particular line of business, such as sales, marketing, or finance. In a
dependent data mart, data can be derived from an enterprise-wide data
warehouse. In an independent data mart, data can be collected directly from
sources.
Fact Table
A table in a star schema that contains facts and is connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation. Example of Star Schema: Figure 1.6
Fact Table
The centralized table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. In the example in Figure 1.6, "Sales Dollar" is a fact (measure) and it can be added across several dimensions. Fact tables store different types of measures: additive, non-additive and semi-additive measures.
Measure Types
A fact table might contain either detail level facts or facts that have been
aggregated (fact tables that contain aggregated facts are often instead called
summary tables). In the real world, it is possible to have a fact table that
contains no measures or facts. These tables are called factless fact tables.
Database - RDBMS: Teradata (NCR)
. Repository Manager
. Designer
. Server Manager
. Repository Manager. Use the Repository Manager to create and administer the metadata repository. You use the Repository Manager to create a repository user and group. You create a folder to store the metadata you create in the lessons.
. Repository Server Administration Console. Use the Repository Server Administration Console to administer the Repository Servers and repositories.
. Designer. Use the Designer to create mappings that contain transformation instructions for the PowerCenter Server. Before you can create mappings, you must add source and target definitions to the repository. The Designer comprises the following tools:
Informatica Server: The Informatica Server extracts the source data, performs the
data transformation and loads the
transformed data into the targets. Sources accessed by Powercenter
You can create a repository user profile for everyone working in the repository,
each with a separate user name and
password. You can also create user groups and assign each user to one or more
groups. Then, grant repository
privileges to each group, so users in the group can perform tasks within the
repository (such as use the Designer or
create workflows).
The repository user profile is not the same as the database user profile. While a
particular user might not have access
to a database as a database user, that same person can have privileges to a
repository in the database as
a repository user.
. Use Designer
. Browse Repository
. Workflow Operator
. Administer Repository
. Administer Server
. Super User
You can perform various tasks for each privilege. Privileges depend on your group membership. Every
repository user belongs to at least one group. For example, the user who
administers the repository belongs to the
Administrators group. By default, you receive the privileges assigned to your
group. While it is most common to
assign privileges by group, the repository administrator, who has either the Super
User or Administer Repository
privilege, can also grant privileges to individual users.
. Create groups.
2. Create a group called Group1. To do this, you need to log in to the repository
as the
Administrator.
To perform the following tasks, you need to connect to the repository. If you are already connected to the repository, continue with the steps below. Otherwise, ask your administrator to perform the tasks in this chapter for you.
2. Double-click your repository.
3. Enter the repository user name and password for the Administrator user. Click
Connect.
The dialog box expands to enter additional information.
4. Enter the host name and port number needed to connect to the repository
database.
5. Click Connect.
Most of the data warehouses already have existing source tables or flat files.
Before you create source
definitions, you need to create the source tables in the database. In this lesson,
you run an SQL script in the
Warehouse Designer to create sample source tables. The SQL script creates sources
with table names and data.
Note: These SQL Scripts come along with Informatica Power Center software.
When you run the SQL script, you create the following source tables:
. CUSTOMERS
. DEPARTMENT
. DISTRIBUTORS
. EMPLOYEES
. ITEMS
. ITEMS_IN_PROMOTIONS
. JOBS
. MANUFACTURERS
. ORDERS
. ORDER_ITEMS
. PROMOTIONS
. STORES
Generally, you use the Warehouse Designer to create target tables in the target
database. The Warehouse Designer
generates SQL based on the definitions in the workspace. However, we will use this
feature to generate the source
tutorial tables from the tutorial SQL scripts that ship with the product.
1. Launch the Designer, double-click the icon for your repository, and log into the
repository.
The Database Object Generation dialog box gives you several options for creating
tables.
6. Select the ODBC data source you created for connecting to the source database.
7. Enter the database user name and password and click the Connect button.
You now have an open connection to the source database. You know that you are
connected when the Disconnect
button displays and the ODBC name of the source database appears in the dialog box.
8. Make sure the Output window is open at the bottom of the Designer.
Note : The SQL file is installed in the Tutorial folder in the PowerCenter Client
installation directory.
10. Select the SQL file appropriate to the source database platform you are using.
Click Open.
Alternatively, you can enter the file name and path of the SQL file.
Platform File
Informix SMPL_INF.SQL
Oracle SMPL_ORA.SQL
Sybase SQL Server SMPL_SYB.SQL
DB2 SMPL_DB2.SQL
Teradata SMPL_TERA_SQL
The database now executes the SQL script to create the sample source database
objects and to insert values into the
source tables. While the script is running, the Output window displays the
progress.
12. When the script completes, click Disconnect, and then click Close.
Now we are ready to create the source definitions in the repository based on the
source tables created in the previous
session. The repository contains a description of source tables, not the actual
data contained in them. After you add
these source definitions to the repository, you can use them in a mapping.
Every folder contains nodes for sources, targets, schemas, mappings, mapplets, and
reusable transformations.
4. Select the ODBC data source to access the database containing the source tables.
5. Enter the user name and password to connect to this database. Also, enter the name of the database owner.
In Oracle, the owner name is the same as the user name. Make sure that the owner name is in all caps (for example, JDOE).
6. Click Connect.
7. In the Select tables list, expand the database owner and the TABLES heading.
If you click the All button, you can see all tables in the source database.
You should now see a list of all the tables you created by running the SQL script
in addition to any tables already in
the database.
. CUSTOMERS
. DEPARTMENT
. DISTRIBUTORS
. EMPLOYEES
. ITEMS
. ITEMS_IN_PROMOTIONS
. JOBS
. MANUFACTURERS
. ORDERS
. ORDER_ITEMS
. PROMOTIONS
. STORES
Tip: Hold down the Ctrl key to select multiple tables. Or, hold down the Shift key
to
select a block of tables. You may need to scroll down the list of tables to select
all tables.
You can import target definitions from existing target tables, or you can create
the definitions and then generate and
run the SQL to create the target tables. In this session, we shall create a target
definition in the Warehouse Designer,
and then create a target table based on the definition.
The next step is to create the metadata for the target tables in the repository.
The actual table that the target
definition describes does not exist yet.
Target definitions define the structure of tables in the target database, or the
structure of file targets the PowerCenter
Server creates when you run a workflow. If you add a target definition to the
repository that does not exist in a
relational database, you need to create target tables in your target database. You
do this by generating and executing
the necessary SQL code within the Warehouse Designer.
In the following steps, you will copy the EMPLOYEES source definition into the
Warehouse Designer to create the
target definition. Then, you will modify the target definition by deleting and
adding columns to create the definition
you want.
The Designer creates a new target definition, EMPLOYEES, with the same column
definitions as the EMPLOYEES
source definition and the same database type.
Note: If you need to change the database type for the target definition (for example, if your source is Oracle and your target is Teradata), you can select the correct database type when you edit the target definition.
The target column definitions are the same as the EMPLOYEES source definition.
. ADDRESS1
. ADDRESS2
. CITY
. STATE
. POSTAL_CODE
. HOME_PHONE
When you finish, the target definition should look similar to the following target
definition:
Note that the EMPLOYEE_ID column is a primary key. The primary key cannot accept
null values. The Designer
automatically selects Not Null and disables the Not Null option. You now have a
column ready to receive data from
the EMPLOYEE_ID column in the EMPLOYEES source table.
Note: If you want to add a business name for any column, scroll to the right and
enter it.
9. Choose Repository-Save.
You can use the Warehouse Designer to run an existing SQL script to create target
tables.
Note: When you use the Warehouse Designer to generate SQL, you can choose to drop
the table in the database
before creating it. To do this, select the Drop Table option. If the target
database already contains tables,
make sure it does not contain a table with the same name as the table you plan to
create. If the table exists in
the database, you lose the existing table and data.
If you installed the client software in a different location, enter the appropriate
drive letter and directory.
4. If you are connected to the source database from the previous lesson, click
Disconnect, and then click Connect.
6. Enter the necessary user name and password, and then click Connect.
7. Select the Create Table, Drop Table, and Primary Key options.
The Designer runs the DDL code needed to create T_EMPLOYEES. If you want to review
the actual code, click
Edit SQL file to open the MKT_EMP.SQL file.
9. Click Close to exit.
What is OLAP?
OLAP is abbreviation of Online Analytical Processing. This system is an application
that collects,
manages, processes and presents multidimensional data for analysis and management
purposes.
What is the difference between OLTP and OLAP?
Data Source
OLTP: Operational data; it is the original source of the data.
OLAP: Consolidated data; it comes from various sources.
Process Goal
OLTP: Snapshot of the business processes that perform the fundamental day-to-day business tasks.
OLAP: Multi-dimensional views of business activities for planning and decision making.
Queries and Process Scripts
OLTP: Simple, quick-running queries run by users.
OLAP: Complex, long-running queries by the system to update the aggregated data.
Database Design
OLTP: Normalized, small database. Speed is not an issue because the database is smaller, so normalization does not degrade performance. It adopts the entity-relationship (ER) model and an application-oriented database design.
OLAP: De-normalized, large database. Speed is an issue because the database is larger, and de-normalizing improves performance since there are fewer tables to join while performing queries. It adopts a star, snowflake or fact-constellation model and a subject-oriented database design.
Describe the foreign key columns in the fact table and dimension tables.
Foreign keys of dimension tables are primary keys of entity tables.
Foreign keys of fact tables are primary keys of dimension tables.
What is Data Mining?
Data Mining is the process of analyzing data from different perspectives and
summarizing it into useful
information.
What is the difference between view and materialized view?
A view takes the output of a query and makes it appear like a virtual table and it
can be used in place of
tables.
A materialized view provides indirect access to table data by storing the results
of a query in a separate
schema object.
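As an illustration (Oracle / PostgreSQL-style syntax; the table and column names are hypothetical):

-- A view stores only the query definition; the query is re-run every time the view is referenced
CREATE VIEW v_sales_by_product AS
SELECT product_key, SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_key;

-- A materialized view stores the query result as a separate physical object,
-- so reads are fast but the stored data must be refreshed periodically
CREATE MATERIALIZED VIEW mv_sales_by_product AS
SELECT product_key, SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_key;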
What is ER Diagram?
Entity Relationship Diagrams are a major data modelling tool and help organize the data in your project into entities and define the relationships between the entities. This process has proved to enable the analyst to produce a good database structure so that the data can be stored and retrieved in the most efficient manner.
An entity-relationship (ER) diagram is a specialized graphic that illustrates the
interrelationships between
entities in a database. It is a type of diagram used in data modeling for relational databases. These diagrams show the structure of each table and the links between tables.
What is ODS?
ODS is abbreviation of Operational Data Store. A database structure that is a
repository for near real-time
operational data rather than long term trend data. The ODS may further become the
enterprise shared
operational database, allowing operational systems that are being re-engineered to
use the ODS as their operational databases.
What is ETL?
ETL is an abbreviation of extract, transform, and load. ETL is software that enables businesses to consolidate their disparate data while moving it from place to place, and it doesn't really matter that the data is in different forms or formats. The data can come from any source; ETL is powerful enough to handle such data disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data, using rules or lookup tables, or creating combinations with other data, to convert it to the desired state. Finally, the load function is used to write the resulting data to a target database.
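When the source and target are reachable from the same database, the three steps can be sketched as a single SQL statement; all names here are hypothetical, and real ETL work is usually done with a dedicated tool:

-- Extract from the source table, transform with a rule and a lookup, load into the target
INSERT INTO dw_customer (customer_key, customer_name, country_code)
SELECT src.cust_id,                          -- extract: read from the source
       UPPER(TRIM(src.cust_name)),           -- transform: apply a cleansing rule
       ref.iso_code                          -- transform: convert via a lookup table
FROM   src_customer src
       JOIN ref_country ref
         ON ref.country_name = src.country;  -- load: INSERT writes the result to the target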
What is VLDB?
VLDB is abbreviation of Very Large DataBase. A one terabyte database would normally
be considered to
be a VLDB. Typically, these are decision support systems or transaction processing
applications serving
large numbers of users.
One common attempt at combining two fact tables is to join them on their shared dimension keys, for example:
select ...
from   ( select dimkey1, dimkey2, measure1
         from FactTable1
       ) f1
       join FactTable2 f2
         on f2.dimkey1 = f1.dimkey1
        and f2.dimkey2 = f1.dimkey2
So if we don't join 2 fact tables that way, how do we do it? The answer is to use the fact key column. It is a good practice (especially in SQL Server, because of the concept of the clustered index) to have a fact key column to enable us to identify rows on the fact table. The performance would be much better (than joining on dimension keys), but you need to plan this in advance, as you need to include the fact key column on the other fact table.
from FactTable1 f1
join FactTable2 f2
on f2.fact1key = f1.factkey
I implemented this technique originally for self-joining, but then expanded the usage to join to other fact tables. This must be used on an exception basis rather than as the norm.
Purpose: not to trap the candidate, of course, but to see if they have experience dealing with a problem which doesn't happen every day.
Explain the concepts and capabilities of Business Intelligence.
Business intelligence tools are used to report, analyze and present data. A few of the tools available in the market are:
. Eclipse BIRT Project: Based on Eclipse. Mainly used for web applications; it is open source.
. Freereporting.com: A free web-based reporting tool.
. JasperSoft: BI tool used for reporting, ETL etc.
. Pentaho: Has data mining, dashboard and workflow capabilities.
. Openl: A web application used for OLAP reporting.
. SQL Server Integration Services: Used for data transformation and creation. Used in data acquisition from a source system.
. SQL Server Analysis Services: Allows data discovery using data mining. Using business logic, it supports data enhancement.
. SQL Server Reporting Services: Used for data presentation and distribution access.
Question: From where do you get the logical query of your request?
Answer: The logical SQL generated by the server can be viewed in BI Answers. If I have not understood the question, please raise your voice.
Question: What major challenges did you face while creating the RPD?
Answer: Every now and then there are problems with the database connections, but the main problems while creating the repository (RPD) files come with complex schemas made on OLTP systems consisting of a lot of joins, and with checking the results. The type of join made needs to be checked; by default it is an inner join, but sometimes the requirement demands other types of joins. There are also a lot of problems with date formats.
Question: What are global filters and how do they differ from column filters?
Answer: Column filter - simply a filter applied on a column, which we can use to restrict our column values while pulling the data, or in charts to see the related content.
Global filter - Not sure. I understand this filter will have an impact across the application, but I really don't understand where and how it can be used. I have heard of global variables but not global filters.
How to make the Delivery Profilers work?
When we use the SA System, how does the SA Server understand that it needs to use it for getting the user profile information?
Where to configure the Scheduler?
Answer: We configure the OBIEE scheduler in the database.
Question: How to hide certain columns from a user?
Answer: Application access-level security - do not add the column in the report, or do not add the column in the presentation layer.
Question: How can we enable drills on a given column's data?
Answer: To enable drill down for a column, it should be included in the hierarchy in OBIEE. Hyperion IR has a drill-anywhere feature where you don't have to define a hierarchy and can drill to any available column.
Question: Is drill down possible without the attribute being a part of a hierarchical dimension?
Answer: No.
Question: How do you apply conditional formatting?
Answer: While creating a chart in BI Answers, you can define the conditions and apply colour formatting.
Question: What is guided navigation?
Answer: I think it is just the arrangement of hyperlinks to guide the user in navigating between the reports to do the analysis.
How is the Webcat file deployed across environments?
Question: How do the users created differ at the RPD/Answers/Dashboards level?
Answer: RPD users can do administrator tasks like adding a new data source, creating hierarchies, and changing column names, whereas Answers users may create new charts and edit those charts, and Dashboard users may only view and analyse the dashboard, or can edit the dashboard by adding/removing chart objects.
Question: Online/offline mode - how does it impact development and deployment?
Answer: Online mode - you can make changes in the RPD file and push in changes, which will be immediately visible to users who are already connected. This feature we may use in a production environment.
Offline mode - can be useful in a test or development environment.
Question: Explain the schema in your last project.
Question: What happens if you reconcile/sync both the RPD and the DB?
Q. What is OLAP?
A. OLAP stands for Online Analytical Processing. It is used for analytical reporting. It helps you do business analysis of your data, which normal reporting tools do not support well. This is the major difference between a reporting tool and an OLAP tool. It is a gateway between the business user and the DWH.
1.MOLAP
This is the traditional mode of OLAP analysis. In MOLAP, data is stored in the form of multidimensional cubes and not in relational databases. The advantage of this mode is that it provides excellent query performance, and the cubes are built for fast data retrieval. All calculations are pre-generated when the cube is created and can be easily applied while querying data.
The disadvantage of this model is that it can handle only a limited amount of data. Since all calculations have been pre-built when the cube was created, the cube cannot be derived from a large volume of data. This deficiency can be bypassed by including only summary-level calculations while constructing the cube. This model also requires huge additional investment, as cube technology is proprietary and the knowledge base may not exist in the organization.
2.ROLAP
The underlying data in this model is stored in relational databases. Since the data is stored in relational databases, this model gives the appearance of traditional OLAP's slicing and dicing functionality. The advantage of this model is that it can handle a large amount of data and can leverage all the functionalities of the relational database.
The disadvantages are that the performance is slow and that each ROLAP report is an SQL query with all the limitations of the genre. It is also limited by SQL functionality. ROLAP vendors have tried to mitigate this problem by building out-of-the-box complex functions into the tool as well as providing users with the ability to define their own functions.
3.HOLAP
HOLAP technology tries to combine the strengths of the above two models. For summary-type information, HOLAP leverages cube technology, and for drilling down into details it uses the ROLAP model.
1.Cube browsing is the fastest when using MOLAP. This is so even in cases where no aggregations have been done. The data is stored in a compressed multidimensional format and can be accessed more quickly than in the relational database. Browsing is very slow in ROLAP and about the same in HOLAP. Processing time is slower in ROLAP, especially at higher levels of aggregation.
2.MOLAP storage takes up more space than HOLAP, as data is copied, and at very low levels of aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space, as data is not duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP aggregations.
3.All data is stored in the cube in MOLAP, and data can be viewed even when the original data source is not available. In ROLAP, data cannot be viewed unless connected to the data source.
4.MOLAP can handle only very limited data, as all data is stored in the cube.
Q. How to import universes and users from BusinessObjects 6.5 to XI R2? It is showing some ODBC error; is there any setting to change?
A. You can import universes through the Import option in the File menu. If your ODBC driver is not connecting, then you can check your database driver.
2.Designer: It is the tool used to create, manage and distribute universes for BusinessObjects and WebIntelligence users. A universe is a file that contains connection parameters for one or more database middleware, and SQL structures called objects that map to actual SQL structures in the database such as columns and tables.
3.Auditor: Tool used to monitor and analyse user and system activity.
4.Application Foundation: This module covers a set of products which are used for Enterprise Performance Management (EPM). The tools are:
1.Dashboard Manager
2.Scorecard
3.Performance Management Applications
Q. What is Hyperion? Is it an OLAP tool? What is the difference between OLAP and ETL tools? What is the future of the OLAP and ETL market for the next five years?
A. It is a business intelligence tool. Brio, which was an independent product, was bought over by Hyperion, which has renamed the product Hyperion Intelligence.
1.Creating materialized views enables us to pre-run the complex joins and store the data.
2.Most DW environments have day-old data, hence they don't have a lot of overhead.
3.Running a report against a single materialized table is always faster than running against multiple tables with complex joins.
4.Indexes can be created on the materialized view to further increase performance.
g.Check to see if the performance of the SQL can be increased by using hints; if yes, then add a hint to the report SQL and freeze the SQL. This might have the additional overhead of maintaining the report.
c.The Keys tab allows you to define index awareness for an object. Index awareness is the ability to take advantage of the indexes on key columns to speed data retrieval.
1.In a typical data warehousing environment, surrogate keys are used as primary keys instead of natural keys. This primary key may not be meaningful to the end user, but Designer can take advantage of the indexes on key columns to speed data retrieval.
2.The only disadvantage is that it would not return duplicate data unless the duplicate data has separate keys.
d.Check to see if the size of the universe has increased recently.
e.Try to create a different universe for new requirements.
f.Under extreme conditions the AUTOPARSE parameter in the param file can be turned off; this could be too risky if not handled properly.
Q.What is OLAP?
A.OLAP stands for Online Analytical Processing. It is used for Anaytical
reporting.This helps to do
Business analysis of you data, but normal reporting tools are not supporting to
Business analysis. This is
the major difference between reporting tool and OLAP tool.It is a GateWay between
the Business user
and DWH.
1.MOLAP
This is the traditional mode in OLAP analysis. In MOLAP data is stored in form of
multidimensional cubes
and not in relational databases. The advantages of this mode is that it provides
excellent query
performance and the cubes are built for fast data retrieval. All calculations are
pre-generated when the
cube is created and can be easily applied while querying data.
The disadvantages of this model are that it can handle only a limited amount of
data. Since all
calculations have been pre-built when the cube was created, the cube cannot be
derived from a large
volume of data. This deficiency can be bypassed by including only summary level
calculations while
constructing the cube. This model also requires huge additional investment as cube
technology is
proprietary and the knowledge base may not exist in the organization.
2.ROLAP
The underlying data in this model is stored in relational databases. Since the data
is stored in relational
databases this model gives the appearance of traditional OLAP�s slicing and dicing
functionality. The
advantages of this model is it can handle a large amount of data and can leverage
all the functionalities of
the relational database.
The disadvantages are that the performance is slow and each ROLAP report is an SQL
query with all
the limitations of the genre. It is also limited by SQL functionality. ROLAP
vendors have tried to mitigate
this problem by building into the tool out-of-the-box complex functions as well as
providing the users with
an ability to define their own functions.
3.HOLAP
HOLAP technology tries to combine the strengths of the above two models. For
summary type
information HOLAP leverages cube technology and for drilling down into details it
uses the ROLAP
model.
2.MOLAP storage takes up more space than HOLAP as data is copied and at very low
levels of
aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space
as data is not
duplicated. However ROALP aggregations take up more space than MOLAP or HOLAP
aggregations.
3.All data is stored in the cube in MOLAP and data can be viewed even when the
original data source is
not available. In ROLAP data cannot be viewed unless connected to the data source.
4.MOLAP can handle very limited data only as all data is stored in the cube.
Q.How to Import universes and user from business object 6.5 to XI R2, it is showing
as some
ODBC error is there any setting to change?
A.You can import universes through import option in file menu.if ur odbc driver is
not connecting then u
can check ur database driver
2.Designer :It is the tool used to create, manage and distribute universe for
BusinessObjects
and WebIntelligence Users. A universe is a file that containe connection parameters
for one or more
database middleware and SQL structure called objects that map to actual SQL
structure in the database
as columns,tables and database.
3.Auditor :Tool is used for monitor and analysis user and system activity.
4.Application Foundation: This module covers a set of products used for Enterprise Performance Management (EPM). The tools are:
1.Dashboard manager
2.Scorecard
3.Performance Management Applications
Q.What is Hyperion? Is it an OLAP tool? what is the difference between OLAP and
ETL tools?
What is the future for OLAP and ETL market for the next five years?
a.It is a Business Intelligence tool. Brio, which was an independent product bought over by Hyperion, has been renamed Hyperion Intelligence.
1.Creating materialized views enables you to pre-run the complex joins and store the data (a sketch follows this list).
2.Most DW environments hold day-old data, so materialized views do not add much refresh overhead.
3.Running a report against a single materialized table is always faster than running it against multiple tables with complex joins.
4.Indexes can be created on the materialized view to further increase performance.
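As an illustration of the points above, here is a minimal Oracle-style sketch; the table, column and view names are assumptions made up for the example.

-- Pre-compute an expensive join/aggregate once and query the stored result directly.
CREATE MATERIALIZED VIEW mv_sales_summary
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT p.product_name,
       c.customer_name,
       SUM(f.sales_qty) AS total_qty
FROM   sales_fact f
JOIN   product_dim  p ON p.product_key  = f.product_key
JOIN   customer_dim c ON c.customer_key = f.customer_key
GROUP BY p.product_name, c.customer_name;

-- An index on the materialized view can speed the reports up further (point 4).
CREATE INDEX ix_mv_sales_product ON mv_sales_summary (product_name);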
g.Check whether the performance of the SQL can be increased by using hints; if so, add a hint to the report SQL and freeze the SQL. This carries the additional overhead of maintaining the report.
c.The Keys tab allows you to define index awareness for an object. Index awareness
is the ability to take
advantage of the indexes on key columns to speed data retrieval.
1.In a typical data warehousing environment, surrogate keys are used as primary keys instead of natural keys. Such a primary key may not be meaningful to the end user, but Designer can take advantage of the indexes on key columns to speed data retrieval.
2.The only disadvantage is that it would not return duplicate data unless the duplicate data has separate keys.
Ans: EME stands for Enterprise Metadata Environment, GDE for Graphical Development Environment, and the Co>Operating System is the Ab Initio server.
The relationship between the Co>Operating System, EME and GDE is as follows:
The Co>Operating System is the Ab Initio server. It is installed on a particular OS platform, which is called the native OS. The EME is much like a repository in Informatica; it holds the metadata, transformations, DB config files, and source and target information. The GDE is the end-user environment where we develop the graphs (the equivalent of mappings in Informatica).
The designer uses the GDE to design graphs and saves them to the EME or to a sandbox on the user side, whereas the EME sits on the server side.
To run a graph infinitely, the end script in the graph should call the .ksh file of the graph. Thus, if the name of the graph is abc.mp, then the end script of the graph should contain a call to abc.ksh. In this way the graph will run infinitely.
What is the difference between look-up file and look-up, with a relevant example?
Generally a lookup file represents one or more serial files (flat files). The amount of data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could be retrieved from disk.
A lookup is a component of abinitio graph where we can store data and retrieve it
by using a key
parameter.
A lookup file is the physical file where the data for the lookup is stored.
How many components are in your most complicated graph? It depends on the type of components you use; usually avoid using overly complicated transform functions in a graph.
Lookup is basically a specific dataset which is keyed. It can be used to map values according to the data present in a particular file (serial or multifile). The dataset can be static as well as dynamic (when the lookup file is generated in a previous phase and used as a lookup file in the current phase). Sometimes hash joins can be replaced by a reformat and a lookup if one of the inputs to the join contains a small number of records with a slim record length.
Ab Initio has built-in functions to retrieve values using the key for the lookup.
What is a ramp limit?
The limit parameter contains an integer that represents a number of reject events.
The ramp parameter contains a real number that represents a rate of reject events per record processed.
Number of bad records allowed = limit + (number of records * ramp).
Ramp is basically a fractional rate (from 0 to 1).
These two together provide the threshold value of bad records.
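A short worked example of the formula above (the numbers are illustrative only): with limit = 50, ramp = 0.01 and 10,000 records processed, the threshold is 50 + (10,000 * 0.01) = 150 bad records; once the 151st record is rejected the component fails.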
Multistage transform components by default uses packages. However user can create
his own set of
functions in a transfer function and can include this in other transfer functions.
If the user wants to group the records on particular field values then rollup is
best way to do that. Rollup is
a multi-stage transform function and it contains the following mandatory functions.
1. initialise
2. rollup
3. finalise
Also need to declare one temporary variable if you want to get counts of a
particular group.
For each group, it first calls the initialise function once, followed by rollup function calls for each of the records in the group, and finally calls the finalise function once after the last rollup call.
Add Default Rules - opens the Add Default Rules dialog. Select one of the following:
Match Names - generates a set of rules that copies input fields to output fields with the same name.
Use Wildcard (.*) Rule - generates one rule that copies input fields to output fields with the same name.
1)If it is not already displayed, display the Transform Editor Grid.
2)Click the Business Rules tab if it is not already displayed.
3)Select Edit > Add Default Rules.
In the case of a reformat, if the destination field names are the same as, or a subset of, the source fields, then there is no need to write anything in the reformat xfr, unless you want to apply a real transform beyond reducing the set of fields or splitting the flow into a number of flows to achieve the functionality.
What are primary keys and foreign keys?
In RDBMS the relationship between the two tables is represented as Primary key and
foreign key
relationship.Wheras the primary key table is the parent table and foreignkey table
is the child table.The
criteria for both the tables is there should be a matching column.
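A minimal SQL sketch of the parent/child relationship described above; the table and column names are assumptions made for the example.

-- Parent table: its primary key is referenced by the child table.
CREATE TABLE department (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50)
);

-- Child table: dept_id is a foreign key pointing back to the parent,
-- so every employee row must match an existing department row.
CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id  INT,
    CONSTRAINT fk_emp_dept FOREIGN KEY (dept_id) REFERENCES department (dept_id)
);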
What is the difference between clustered and non-clustered indices? ...and why do
you use a
clustered index?
What is an outer join?
An outer join is used when one wants to select all the records from a port (or table), whether or not they have satisfied the join criteria.
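A small SQL sketch of a left outer join (the table and column names are illustrative assumptions): every department row is returned even when no employee matches it.

SELECT d.dept_name, e.emp_name
FROM   department d
LEFT OUTER JOIN employee e ON e.dept_id = d.dept_id;   -- unmatched departments show NULL emp_name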
What is the difference between partitioning with key and round robin?
Ans: Partition by key (hash partition) -> This is a partitioning technique used to partition data when the keys are diverse. If one key value is present in a large volume, there can be a large data skew. Even so, this method is often used for parallel data processing.
Round robin partition is another partitioning technique to uniformly distribute the
data on each of the
destination data partitions. The skew is zero in this case when no of records is
divisible by number of
partitions. A real life example is how a pack of 52 cards is distributed among 4
players in a round-robin
manner.
How do you improve the performance of a graph?
Ans: There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory
join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with
proper driving port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the trasfer
functions
10) Avoid repartitioning of data unnecessarily
Try to run the graph as long as possible in MFS. For these input files should be
partitioned and if possible
output file should also be partitioned.
How do you truncate a table?
Ans: From Ab Initio, use the Run SQL component with the DDL "truncate table ...", or use the Truncate Table component in Ab Initio.
Have you ever encountered an error called "depth not equal"?
Ans: When two components are linked together, if their layouts do not match then this problem can occur during the compilation of the graph. A solution would be to use a partitioning component in between wherever there is a change in layout.
What is the function you would use to transfer a string into a decimal?
Ans: In this case no specific function is required if the size of the string and the decimal is the same; a decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8), then (say the field name is field1):
out.field :: (decimal(8)) in.field
If the destination field size is smaller than the input, then the string_substring function can be used, like the following.
Say the destination field is decimal(5):
out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5)) /* string_lrtrim is used to trim leading and trailing spaces */
What are Cartesian joins?
Ans: Joins two tables without a join key. Key should be {}.
What is the purpose of having stored procedures in a database?
Ans: The main purpose of a stored procedure is to reduce network traffic: the SQL statements are executed on the database server in a single call, so execution is much faster.
Why might you create a stored procedure with the 'with recompile' option?
Recompile is useful when the tables referenced by the stored proc undergo a lot of modification/deletion/addition of data. Due to the heavy modification activity the execution plan becomes outdated and hence the stored proc performance goes down. If we create the stored proc with the recompile option, SQL Server won't cache a plan for this stored proc and it will be recompiled every time it is run.
What is a cursor? Within a cursor, how would you update fields on the row just fetched?
Ans: The Oracle engine uses work areas for internal processing in order to execute SQL statements; such a work area is called a cursor. There are two types of cursors: implicit and explicit. Implicit cursors are used for internal processing, while explicit cursors are declared and opened by the user when the returned data has to be processed row by row. A row fetched through an explicit cursor opened FOR UPDATE can be changed with an UPDATE ... WHERE CURRENT OF statement, as in the sketch below.
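A minimal PL/SQL sketch of updating the row just fetched; the table and column names (employee, salary) are assumptions made for the example.

DECLARE
   CURSOR c_emp IS
      SELECT emp_id, salary FROM employee FOR UPDATE OF salary;   -- lock the rows for update
   r_emp c_emp%ROWTYPE;
BEGIN
   OPEN c_emp;
   LOOP
      FETCH c_emp INTO r_emp;
      EXIT WHEN c_emp%NOTFOUND;
      UPDATE employee
      SET    salary = r_emp.salary * 1.10        -- change the row just fetched
      WHERE  CURRENT OF c_emp;
   END LOOP;
   CLOSE c_emp;
   COMMIT;
END;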
How would you find out whether a SQL query is using the indices you expect?
Ans: The explain plan can be reviewed to check the execution plan of the query. This shows whether or not the expected indexes are being used.
How can you force the optimizer to use a particular index?
Ans: Use hints (/*+ ... */); they act as directives to the optimizer.
select /*+ index(a index_name) full(b) */ *
from table1 a, table2 b
where b.col1 = a.col1
and b.col2 = 'sid'
and b.col3 = 1;
When using multiple DML statements to perform a single unit of work, is it
preferable to use
implicit or explicit transactions, and why.
Ans: Explicit transactions are preferable, because all the statements can be committed or rolled back together as one atomic unit of work; with implicit transactions each statement commits on its own, so a failure part-way through can leave the data inconsistent.
Describe the elements you would review to ensure multiple scheduled "batch" jobs do
not
"collide" with each other.
Ans: Review the dependencies and schedules of the jobs: for example, a job should run only if the job it depends on has completed successfully, otherwise it should not execute. Checking run windows, expected durations and shared resources also helps ensure the jobs do not collide.
Describe the process steps you would perform when defragmenting a data table.
Ans: Assuming the table contains mission-critical data, there are several ways to do this:
1) Move the table within the same tablespace or to another tablespace and rebuild all the indexes on the table. ALTER TABLE ... MOVE reclaims the fragmented space in the table; ANALYZE TABLE table_name COMPUTE STATISTICS captures the updated statistics (see the sketch below).
2) A reorg can be done by taking a dump (export) of the table, truncating the table, and importing the dump back into the table.
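A hedged Oracle sketch of option 1; the table and index names are assumptions made for the example.

ALTER TABLE big_table MOVE;                     -- rebuilds the table segment and reclaims fragmented space
ALTER INDEX ix_big_table_pk REBUILD;            -- indexes are left UNUSABLE by the move, so rebuild each one
ANALYZE TABLE big_table COMPUTE STATISTICS;     -- refresh optimizer statistics after the reorganisation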
Explain the difference between the "truncate" and "delete" commands.
Ans: TRUNCATE is a DDL command whereas DELETE is a DML command. A TRUNCATE cannot be rolled back, whereas a DELETE can be rolled back. A "WHERE" clause cannot be used with TRUNCATE, whereas a "WHERE" clause can be used with DELETE.
What is the difference between a DB config and a CFG file?
Ans: A .dbc file has the information required for Ab Initio to connect to the
database to extract or load
tables or views. While .CFG file is the table configuration file created by
db_config while using
components like Load DB Table.
Describe the "Grant/Revoke" DDL facility and how it is implemented.
Ans: This is part of the DBA's responsibilities. GRANT gives a user permissions, for example GRANT CREATE TABLE, CREATE VIEW and many more. REVOKE cancels a previously granted permission. Both commands are typically issued by the DBA (or by the owner of the object being granted).
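A small illustrative sketch; the user name and table name are assumptions made for the example.

GRANT CREATE TABLE, CREATE VIEW TO report_user;      -- system privileges
GRANT SELECT, INSERT ON sales_fact TO report_user;   -- object privileges on a specific table
REVOKE INSERT ON sales_fact FROM report_user;        -- take one of them back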
SQL
1. You need to see the last fifteen lines of the files dog, cat and horse. What
command should
you use?
tail -15 dog cat horse
The tail utility displays the end of a file. The -15 tells tail to display the last
fifteen lines of each specified file.
2. Who owns the data dictionary?
The SYS user owns the data dictionary. The SYS and SYSTEM users are created when
the database is
created.
3. You routinely compress old log files. You now need to examine a log from two
months ago. In
order to view its contents without first having to decompress it, use the _________
utility.
zcat
The zcat utility allows you to examine the contents of a compressed file much the
same way that cat displays
a file.
4. You suspect that you have two commands with the same name as the command is not
producing the expected results. What command can you use to determine the location
of the
command being run?
which
The which command searches your path until it finds a command that matches the
command you are
looking for and displays its full path.
5. You locate a command in the /bin directory but do not know what it does. What
command
can you use to determine its purpose.
whatis
The whatis command displays a summary line from the man page for the specified
command.
6. You wish to create a link to the /data directory in bob's home directory, so you issue the command ln /data /home/bob/datalink, but the command fails. What option should you use in this command line to be successful?
Use the -s option
You cannot create a hard link to a directory; ln -s creates a symbolic link to it instead.
7. When you issue the command ls -l, the first character of the resulting display
represents the
file's ___________.
type
The first character of the permission block designates the type of file that is
being displayed.
8. What utility can you use to show a dynamic listing of running processes?
__________
top
The top utility shows a listing of all running processes that is dynamically
updated.
9. Where is standard output usually directed?
to the screen or display
By default, your shell directs standard output to your screen or display.
10. You wish to restore the file memo.ben which was backed up in the tarfile
MyBackup.tar.
What command should you type?
tar xf MyBackup.tar memo.ben
This command uses the x switch to extract a file. Here the file memo.ben will be
restored from the tarfile
MyBackup.tar.
11. You need to view the contents of the tarfile called MyBackup.tar. What command
would you
use?
tar tf MyBackup.tar
The t switch tells tar to display the contents and the f modifier specifies which
file to examine.
12. You want to create a compressed backup of the users' home directories. What
utility should
you use?
tar
You can use the z modifier with tar to compress your archive at the same time as
creating it.
13. What daemon is responsible for tracking events on your system?
syslogd
The syslogd daemon is responsible for tracking system information and saving it to
specified log files.
14. You have a file called phonenos that is almost 4,000 lines long. What text
filter can you use to
split it into four pieces each 1,000 lines long?
split
The split text filter will divide files into equally sized pieces. The default
length of each piece is 1,000 lines.
15. You would like to temporarily change your command line editor to be vi. What
command
should you type to change it?
set -o vi
The set command is used to assign environment variables. In this case, you are
instructing your shell to
assign vi as your command line editor. However, once you log off and log back in
you will return to the
previously defined command line editor.
16. What account is created when you install Linux?
root
Whenever you install Linux, only one user account is created. This is the superuser
account also known as
root.
17. What command should you use to check the number of files and disk space used
and each
user's defined quotas?
repquota
The repquota command is used to get a report on the status of the quotas you have
set including the amount
of allocated space and amount of used space.
SQL knowledge is usually basic knowledge required for almost all database related
technical jobs. Therefore it is good to know some SQL Interview questions and
answers. This post will mainly contain "generic" SQL questions and will focus on
questions that allow testing the candidate's knowledge about sql itself but also
logical
thinking. It will start from basic questions and finish with questions and answers for experienced candidates. If you are after a broader set of questions, I recommend visiting the links at the bottom, which will point you to more interview questions and answers related to SQL Server.
I will start with one general sql interview question and then go into basic sql
questions
and increase the difficulty. I will explain questions using standard sql knowledge
but at
the end I will add comments related to sql server. Who is it for?
These questions are mainly small tasks where the candidate can present not only
their
SQL knowledge but analytical skills and relational database understanding.
Remember if you know exactly what you need (or you know how you work) make sure
you include these kinds of questions and make them very clear to the candidate so
they
have a chance to answer them (without guessing).
SQL INTERVIEW QUESTIONS
Below is a list of the questions in this blog post so you can test your knowledge without seeing the answers. If you would like to see the questions with answers, please scroll down.
Question: What type of joins have you used?
Question: How can you combine two tables/views together? For instance, one table contains 100 rows and the other one contains 200 rows, they have exactly the same fields, and you want to show a query with all the data (300 rows). This SQL interview question can get complicated.
Question: What is the difference between where and having clause?
Question: How would you apply a date range filter?
Question: What type of wildcards have you used? This is usually one of the mandatory SQL interview questions.
Question: How do you find orphans?
Question: How would you solve the following sql queries using today's date?
First day of previous month
First day of current month
Last day of previous month
Last day of current month
Question: You have a table that records website traffic. The table contains website name (multiple websites), page name, IP address and UTC date time. What would be the query to show all websites visited in the last 30 days with the total number of visits, total number of unique page views and total number of unique visitors (using IP address)?
Question: How to display the top 5 employees with the highest number of sales (total) and display position as a field? Note that if two employees have the same total sales value they should receive the same position; in other words, Top 5 employees might return more than 5 employees.
Question: How to get accurate age of an employee using SQL?
Question: This is a SQL Server interview question. You have three fields: ID, Date and Total. Your table contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to show only one row per day. The row with the highest ID per day should be returned and the rest should be hidden from users (not returned).
Question: How to return truly random data from a table? Let's say the top 100 random rows?
Question: How to create recursive query in SQL Server?
Question: How long have you used SQL for? Did you have any breaks?
Answer: SQL skills vary a lot depending on the type of job and the experience of the candidate, so I wouldn't pay too much attention to this SQL interview question, but it is always worth having this information before setting SQL tasks, so you know whether you are dealing with someone who is truly interested in SQL (they might have just 1 year of experience but be really good at it and at answering the questions) or someone who hasn't paid much attention to gaining proper knowledge and has been like that for many years (which doesn't always mean you don't want them).
Question: What type of joins have you used?
Answer: Most people have used inner join and (left/right) outer join, which is rather mandatory knowledge, but more experienced candidates will also mention cross join and self join. In SQL Server you can also get a full outer join.
Question: How can you combine two tables/views together? For instance, one table contains 100 rows and the other one contains 200 rows, they have exactly the same fields, and you want to show a query with all the data (300 rows). This SQL interview question can get complicated.
Answer: You use the UNION operator. You can drill down into this question and ask what the difference is between UNION and UNION ALL (the first one removes duplicates, which is not always desirable; in other words it shows only DISTINCT rows, while UNION ALL simply combines the results, so it is also faster). Trickier follow-ups are how to sort the view (you use ORDER BY on the last query), how to name fields so they appear in the query results/view schema (the first query's field names are used), and how to filter groups when you use a union (you would create a separate query, use a common table expression (CTE), or put the unions inside FROM with brackets).
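A minimal sketch of the UNION points above; the table and column names are assumptions made for the example.

-- UNION ALL keeps duplicates (faster); plain UNION would remove them.
-- The column names of the first query are used; ORDER BY applies once, at the end.
SELECT customer_id, customer_name FROM customer_2010
UNION ALL
SELECT customer_id, customer_name FROM customer_2011
ORDER BY customer_name;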
Question: What is the difference between where and having clause?
Answer: In SQL, WHERE filters data at the individual row level. HAVING filters data after GROUP BY has been performed, so it filters on "groups".
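A short sketch of the distinction; the table and column names are assumptions made for the example.

SELECT   dept_id, COUNT(*) AS headcount
FROM     employee
WHERE    status = 'ACTIVE'        -- row-level filter, applied before grouping
GROUP BY dept_id
HAVING   COUNT(*) > 10;           -- group-level filter, applied after grouping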
Question: How would you apply a date range filter?
Answer: This is a tricky question. You can use simple conditions such as >= and <=, or use BETWEEN ... AND, but the trick is to know your exact data type. Sometimes date fields contain a time portion, and that is where the query can go wrong, so it is recommended to use date-related functions to remove the time issue. In SQL Server a common function for this is DATEDIFF. You also have to be aware of different time zones and the server time zone.
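A hedged SQL Server sketch of a safe date-range filter; the table, column and variable names are assumptions made for the example. Using an inclusive lower bound and an exclusive upper bound sidesteps the time-of-day problem.

DECLARE @FromDate datetime = '2011-01-01';
DECLARE @ToDate   datetime = '2011-02-01';   -- exclusive upper bound (first day after the range)

SELECT *
FROM   orders
WHERE  order_date >= @FromDate
AND    order_date <  @ToDate;                -- rows with any time on the last day are still included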
Question: What type of wildcards have you used? This is usually one of the mandatory SQL interview questions.
Answer: The first question is: what is a wildcard? Wildcards are special characters that allow matching strings without having an exact match. In simple words, they work like 'contains' or 'begins with'. Wildcard characters are software specific; in SQL Server we have %, which represents any group of characters, _, which represents exactly one (any) character, and [], where for example [ab] means the character a or b in a specific position.
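A short SQL Server sketch of the wildcards just described; the table and column names are assumptions made for the example.

SELECT customer_name
FROM   customer
WHERE  customer_name LIKE 'Sm_th%'     -- _ matches one character: Smith, Smyth, Smithson ...
OR     customer_name LIKE '[AB]%';     -- [] matches a character set: names starting with A or B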
Question: How do you find orphans?
Answer: This is a more comprehensive SQL and database interview question. First of all we test whether the candidate knows what an orphan is: an orphan is a foreign key value in the "child" table which doesn't exist in the primary key column of the parent table. To find them you can use a left outer join (important: child table on the left side) with the join condition on the primary/foreign key columns and a WHERE clause where the primary key is null. Adding DISTINCT or COUNT to the select is common practice. In SQL Server you can also use EXCEPT, which shows all unique values from the first query that don't exist in the second query.
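A minimal sketch of the orphan check described above; the table and column names are assumptions made for the example.

-- Child rows whose dept_id has no matching parent row.
SELECT DISTINCT c.dept_id
FROM   employee c                                      -- child table on the left
LEFT OUTER JOIN department p ON p.dept_id = c.dept_id
WHERE  p.dept_id IS NULL                               -- no parent found
AND    c.dept_id IS NOT NULL;                          -- ignore rows where the FK itself is null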
Question: How would you solve the following sql queries using today's date?
First day of previous month
First day of current month
Last day of previous month
Last day of current month
Answer: These tasks require a good grasp of SQL functions but also logical thinking, which is one of the primary skills involved in solving SQL questions. In this case I provided links to actual answers with code samples. Experienced people should give a correct answer almost immediately; people with less experience might need more time or some help (Google).
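One common SQL Server approach, shown here as a hedged sketch using today's date; the DATEADD/DATEDIFF arithmetic counts whole months from day 0 (1900-01-01).

SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 1, 0)                   AS first_day_prev_month,
       DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)                       AS first_day_curr_month,
       DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0))     AS last_day_prev_month,
       DATEADD(DAY, -1, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) + 1, 0)) AS last_day_curr_month;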
Question: You have a table that records website traffic. The table contains website
name (multiple websites), page name, IP address and UTC date time. What would
be the query to show all websites visited in the last 30 days with total number or
visits, total number if unique page view and total number of unique visitors (using
IP Address)?
Answer: This test is mainly about a good understanding of aggregate functions and date/time handling. We need to group by website and filter the data by date, but the trick here is to use the correct time zone. If I want to do that using UTC time then I could use GetUTCDate() in SQL Server, and the final answer relates to calculated fields using aggregate functions, which I list on separate lines below:
TotalNumberOfClicks = Count(*) 'nothing special here
TotalUniqueVisitors = Count(distinct IPAddress) 'we count the IPAddress field, but only unique IP addresses. The next field logically belongs here, but as it is more complicated I put it as the third field.
TotalNumberOfUniquePageViews = Count(distinct PageName + IPAddress) 'this one is tricky: to get unique page views we count visits per page, but only once per distinct page and IP address combination.
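Putting the pieces together as a hedged SQL Server sketch; the table and column names (WebTraffic, WebsiteName, PageName, IPAddress, VisitDateUTC) are assumptions made for the example.

SELECT WebsiteName,
       COUNT(*)                              AS TotalNumberOfClicks,
       COUNT(DISTINCT PageName + IPAddress)  AS TotalNumberOfUniquePageViews,
       COUNT(DISTINCT IPAddress)             AS TotalUniqueVisitors
FROM   WebTraffic
WHERE  VisitDateUTC >= DATEADD(DAY, -30, GETUTCDATE())   -- last 30 days, in UTC
GROUP BY WebsiteName;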
Question: How to display the top 5 employees with the highest number of sales (total) and display position as a field? Note that if two employees have the same total sales value they should receive the same position; in other words, Top 5 employees might return more than 5 employees.
Answer: Microsoft introduced ranking functions in SQL Server 2005, and they are ideal for solving this query. The RANK() function can be used to do it; DENSE_RANK() can also be used. Actually the question is ambiguous, because if your two top employees have the same total sales, which position should the third employee get: 2 (the DENSE_RANK() behaviour) or 3 (the RANK() behaviour)? In order to filter the query, a Common Table Expression (CTE) can be used, or the query can be put inside FROM using brackets ().
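A hedged SQL Server sketch of the ranking approach; the table and column names are assumptions made for the example. Swapping RANK() for DENSE_RANK() changes how positions continue after a tie.

WITH ranked AS (
    SELECT employee_id,
           SUM(sale_amount)                             AS total_sales,
           RANK() OVER (ORDER BY SUM(sale_amount) DESC) AS position    -- ties share a position
    FROM   sales
    GROUP BY employee_id
)
SELECT employee_id, total_sales, position
FROM   ranked
WHERE  position <= 5;      -- may return more than 5 rows when there are ties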
Now that we have covered basic and intermediate questions, let's continue with more complicated ones. These questions and answers are suitable for experienced candidates:
Question: This is a SQL Server interview question. You have three fields: ID, Date and Total. Your table contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to show only one row per day. The row with the highest ID per day should be returned and the rest should be hidden from users (not returned).
To better picture the question below is sample data and sample output:
ID, Date, Total
1, 2011-12-22, 50
2, 2011-12-22, 150
The correct result is:
2, 2011-12-22, 150
The correct output is a single row for the 2011-12-22 date, and this row was chosen because it has the highest ID (2 > 1).
Answer: Usually GROUP BY and an aggregate function (MAX/MIN) are used, but in this case that will not work. Removing duplicates with this kind of rule is not so easy; however, SQL Server provides ranking functions, and the candidate can use the DENSE_RANK() function partitioned by Date and ordered by ID (descending), then use a CTE or a FROM subquery and filter on rank = 1. There are several other ways to solve it, but I found this one to be the most efficient and simple.
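A hedged sketch of that approach in SQL Server; the table name DailyTotals is an assumption made for the example.

WITH ranked AS (
    SELECT ID, [Date], Total,
           DENSE_RANK() OVER (PARTITION BY [Date] ORDER BY ID DESC) AS rn   -- 1 = highest ID per day
    FROM   DailyTotals
)
SELECT ID, [Date], Total
FROM   ranked
WHERE  rn = 1;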
Question: How to return truly random data from a table? Let's say the top 100 random rows?
I must admit I didn't answer this SQL interview question correctly a few years back.
Answer: Again, this is more of a SQL Server answer: you can do it using the NEWID() function in the ORDER BY clause and TOP 100 in the SELECT. There is also the TABLESAMPLE clause, but it is not truly random, as it operates on pages rather than rows, and it might not return exactly the number of rows you wanted.
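A minimal SQL Server sketch; the table name is an assumption made for the example.

SELECT TOP 100 *
FROM   customer
ORDER BY NEWID();     -- NEWID() generates a random GUID per row, so ordering by it shuffles the rows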
Question: How to create recursive query in SQL Server?
Answer: The first question is actually: what is a recursive query? The most common example is a parent/child hierarchy, for instance an employee hierarchy where an employee can have only one manager and a manager can have none or many employees reporting to them. A recursive query can be created in SQL using a stored procedure, but you can also use a CTE (Common Table Expression); for more information visit SQL Interview question - recursive query (Microsoft). It might also be worth asking about performance, as a CTE is not always very fast, but in this case I don't know which one would perform better.
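A hedged sketch of a recursive CTE for the employee/manager hierarchy mentioned above; the table and column names are assumptions made for the example.

WITH emp_hierarchy AS (
    -- Anchor member: employees with no manager (top of the hierarchy)
    SELECT employee_id, manager_id, employee_name, 0 AS hierarchy_level
    FROM   employee
    WHERE  manager_id IS NULL
    UNION ALL
    -- Recursive member: everyone reporting to someone already in the result
    SELECT e.employee_id, e.manager_id, e.employee_name, h.hierarchy_level + 1
    FROM   employee e
    JOIN   emp_hierarchy h ON e.manager_id = h.employee_id
)
SELECT employee_id, employee_name, hierarchy_level
FROM   emp_hierarchy;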
I will try to find time to add more questions soon. Feel free to suggest new
questions
(add comments).
The following link shows SQL Queries Examples from beginner to advanced
See also:
SSIS Interview questions and answers
SSRS Interview questions and answers
SQL Server Interview questions and answers
In this blog I will post SQL query examples, as learning from examples is usually very effective, sometimes better than any tutorial, and this can also help with SQL interview questions and answers. I will start from basic SQL queries and go to advanced and complex
queries. I will use
SQL Server 2008 R2 and I will try to remember to add comments for features that are
new. The
database I use is called AdventureWorksDW2008R2 and it is Microsoft training
database that
you can download from Microsoft site. I will be posting new samples for the next
several weeks.
---
---
---
. Normal Queries
. Sub Queries
. Correlated queries
. Nested queries
. Compound queries
2. What is a transaction ?
Answer: A transaction is a set of SQL statements between any two COMMIT and
ROLLBACK statements.
5. Is there a PL/SQL engine in SQL*Plus?
Answer: No. Unlike Oracle Forms, SQL*Plus does not have a PL/SQL engine. Thus, all your PL/SQL is sent directly to the database engine for execution. This makes it much more efficient, as SQL statements are not stripped out and sent to the database individually.
Answer: Currently, the maximum parsed/compiled size of a PL/SQL block is 64K and the maximum code size is 100K. You can run the following select statement to query the size of an existing package or procedure:
SQL> select * from dba_object_size where name = 'procedure_name'
Answer: Included in Oracle 7.3 is a UTL_FILE package that can read and write files. The directory you intend writing to has to be in your INIT.ORA file (see the UTL_FILE_DIR=... parameter). Before Oracle 7.3 the only means of writing a file was to use DBMS_OUTPUT with the SQL*Plus SPOOL command.
DECLARE
fileHandler UTL_FILE.FILE_TYPE;
BEGIN
fileHandler := UTL_FILE.FOPEN('/home/oracle/tmp', 'myoutput','W');
UTL_FILE.PUTF(fileHandler, 'Value of func1 is %s\n', func1(1));
UTL_FILE.FCLOSE(fileHandler);
END;
Answer: PL/SQL V2.2, available with Oracle 7.2, implements a binary wrapper for PL/SQL programs to protect the source code. This is done via a standalone utility that transforms the PL/SQL source code into portable binary object code (somewhat larger than the original). This way you can distribute software without having to worry about exposing your proprietary algorithms and methods. SQL*Plus and SQL*DBA will still understand and know how to execute such scripts. Just be careful,
10. Can one use dynamic SQL within PL/SQL? OR Can you use a DDL in a
procedure ? How ?
Answer: From PL/SQL V2.1 one can use the DBMS_SQL package to execute dynamic
SQL statements.
Eg: CREATE OR REPLACE PROCEDURE DYNSQL AS
cur integer;
rc integer;
BEGIN
cur := DBMS_SQL.OPEN_CURSOR;
DBMS_SQL.PARSE(cur,'CREATE TABLE X (Y DATE)', DBMS_SQL.NATIVE);
rc := DBMS_SQL.EXECUTE(cur);
DBMS_SQL.CLOSE_CURSOR(cur);
END;
Answer: No.
24. Can you have two functions with the same name in a PL/SQL block ?
Answer: Yes.
25. Can you have two stored functions with the same name ?
Answer: Yes.
Answer: No.
30. Can 2 functions have same name & input parameters but differ only by
return datatype
Answer: No.
32. Why Create or Replace and not Drop and recreate procedures ?
Answer: CREATE OR REPLACE retains the existing grants and dependencies on the procedure, whereas dropping and recreating it would require all privileges on the procedure to be granted again.
37. What is the maximum no.of statements that can be specified in a trigger
statement ?
Answer: One.
Answer: No
39. What are the values of :new and :old in Insert/Delete/Update Triggers ?
40. What are cascading triggers? What is the maximum no of cascading triggers
at a time?
Answer: When a statement in a trigger body causes another trigger to be fired, the
triggers are said to be cascading.Max = 32.
Answer: A trigger giving a SELECT on the table on which the trigger is written.
Answer:
Answer: Contains pointers to locations of various data files, redo log files, etc.
Answer: It is used by Oracle to store information about the various physical and logical Oracle structures, e.g. tables, tablespaces, datafiles, etc.
Answer: No.
Answer: Yes.
52. Can Check constraint be used for self referential integrity ? How ?
Answer: Yes.In the CHECK condition for a column of a table, we can reference some
other column of the same table and thus enforce self referential integrity.
Answer: Two
54. What are the states of a rollback segment ? What is the difference between
partly available and needs recovery ?
. ONLINE
. OFFLINE
. PARTLY AVAILABLE
. NEEDS RECOVERY
. INVALID.
55. What is the difference between unique key and primary key ?
Answer: No.
Answer: Yes.
Answer: Yes.
Answer: 254.
60. What is the significance of the & and && operators in PL SQL ?
Answer: The & operator means that the PL SQL block requires user input for a
variable.
The && operator means that the value of this variable should be the same as
inputted by the user previously for this same variable
Answer: Explicit cursors can take parameters, as the example below shows.A cursor
parameter can appear in a query wherever a constant can appear.
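The example referred to above is missing from the text, so here is a minimal hedged PL/SQL sketch of a parameterised cursor; the table and column names are assumptions made for the example.

DECLARE
   CURSOR c_emp (p_dept_id NUMBER) IS          -- cursor parameter used like a constant in the query
      SELECT emp_name FROM employee WHERE dept_id = p_dept_id;
BEGIN
   FOR r IN c_emp(10) LOOP                     -- open the cursor with an actual parameter value
      DBMS_OUTPUT.PUT_LINE(r.emp_name);
   END LOOP;
END;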
Answer: Yes
Answer: Yes
Answer: Yes
Answer: 9 rows
Answer: No rows
69. Which symbol preceeds the path to the table in the remote database ?
Answer: @
70. Are views automatically updated when base tables are updated ?
Answer: Yes
Answer: No
72. If all the values from a cursor have been fetched and another fetch is
issued, the output will be : error, last record or first record ?
Answer: 7.5
Answer: 3
Answer: A Relational Database is a database where all data visible to the user is
organized strictly as tables of data values and where all database operations work
on
these tables.
In a dedicated (single-task) architecture the database manager creates a separate process for each database user, but in MTA (multi-threaded architecture) the database manager can assign multiple users (multiple user processes) to a single dispatcher (a server process), a controlling process that queues requests for work, thus reducing the database's memory requirements and resources.
Answer:
. RDBMS - R system
. Hierarchical - IMS
. N/W - DBTG
Answer: Features in Oracle 7 versus Oracle 6:
. Truncate command - Oracle 7: available; Oracle 6: no provision
. Distributed database - Oracle 7: available; Oracle 6: no provision
. Distributed query - Oracle 7: available; Oracle 6: no provision
. Client/Server technology - Oracle 7: available; Oracle 6: no provision
Answer: The database has the ability to audit all actions that take place within it: a) login attempts, b) object access, c) database actions.
The result of Greatest(1,NULL) or Least(1,NULL) is NULL.
Answer: An index should be created when the table is queried for only a small fraction of its rows - roughly less than 2-4% of the rows for large tables, up to about 25% for small tables.
Answer: Error
87. Can database trigger written on synonym of a table and if it can be then
what would be the effect if original table is accessed.
Answer: No
Answer: No.
Answer: A synonym is just a second name for a table, often used with database links. A view can be created over many tables, with virtual columns and with conditions. A synonym, however, can also be created on a view.
92. What is the difference between foreign key and reference key ?
Answer: A foreign key is the attribute that refers to the primary key of another table. The reference key is the primary key of the table referred to by the other table.
Answer: Yes
94. If content of dual is updated to some value computation takes place or not ?
Answer: Yes
95. If any other table same as dual is created would it act similar to dual?
Answer: Yes
96. For which relational operators in where clause, index is not used ?
97. Assume that there are multiple databases running on one machine. How can you switch from one to another?
are unlikely to be left in the lurch by Oracle and there are always lots of third
party
interfaces available. Backup and Recovery : Oracle provides industrial strength
support for on-line backup and recovery and good software fault tolerance to disk
failure.You can also do point-in-time recovery. Performance : Speed of a 'tuned'
Oracle Database and application is quite good, even with large databases.Oracle can
manage > 100GB databases. Multiple database support : Oracle has a superior
ability to manage multiple databases within the same transaction using a two-phase
commit protocol.
Answer: PL/SQL requires that you declare an identifier before using it. Therefore, you must declare a subprogram before calling it. This declaration at the start of a subprogram is called a forward declaration. A forward declaration consists of a subprogram specification terminated by a semicolon.
Answer: In our case, db_block_buffers was changed from 60 to 1000 (std values are
60, 550 & 3500) shared_pool_size was changed from 3.5MB to 9MB (std values are
3.5, 5 & 9MB) open_cursors was changed from 200 to 300 (std values are 200 &
300) db_block_size was changed from 2048 (2K) to 4096 (4K) {at the time of
database creation}. The initial SGA was around 4MB when the server RAM was 32MB
and The new SGA was around 13MB when the server RAM was increased to 128MB.
Answer: Yes
. Equijoins
. Non-equijoins
. self join
. outer join
Answer: A package cursor is a cursor which you declare in the package specification
without an SQL statement.The SQL statement for the cursor is attached dynamically
at runtime from calling procedures.
106. If you insert a row in a table, then create another table and then say Rollback, will the row still be inserted?
Answer: Yes, because CREATE TABLE is a DDL statement which commits automatically as soon as it is executed. The DDL commits the transaction even if the create statement fails internally (e.g. a "table already exists" error), as long as it is syntactically valid.
Answer:
All devices are represented by files called special files that are located in the /dev directory. Thus, device files and other files are named and accessed in the same
way. A 'regular file' is just an ordinary data file in the disk. A 'block special
file'
represents a device with characteristics similar to a disk (data transfer in terms
of
blocks). A 'character special file' represents a device with characteristics
similar to a
keyboard (data transfer is by stream of bits in sequential order).
2. What is 'inode'?
Answer:
All UNIX files have its description stored in a structure called 'inode'. The inode
contains info about the file-size, its location, time of last access, time of last
modification, permission and so on. Directories are also represented as files and
have an associated inode. In addition to descriptions about the file, the inode
contains pointers to the data blocks of the file. If the file is large, inode has
indirect
pointer to a block of pointers to additional data blocks (this further aggregates
for
larger files). A block is typically 8k.
Answer:
Answer:
The difference between fcntl and ioctl is that the former is intended for any open
file,
while the latter is for device-specific operations.
Answer:
Example 1:
To change mode of myfile to 'rw-rw-r--' (ie. read, write permission for user -
read,write permission for group - only read permission for others) we give the args
as:
chmod(myfile,0664) .
Each operation is represented by discrete values
'r' is 4
'w' is 2
'x' is 1
Therefore, for 'rw' the value is 6(4+2).
Example 2:
To change mode of myfile to 'rwxr--r--' we give the args as:
chmod(myfile,0744).
Answer:
A link is a second name (not a file) for a file. Links can be used to assign
more than one name to a file, but cannot be used to assign a directory more
than one name or link filenames on different computers.
A symbolic link is a file that only contains the name of another file. An operation on the symbolic link is directed to the file it points to. Both of the limitations of links are eliminated in symbolic links.
7. What is a FIFO?
Answer:
FIFO are otherwise called as 'named pipes'. FIFO (first-in-first-out) is a special
file which is said to be data transient. Once data is read from named pipe, it
cannot be read again. Also, data can be read only in the order written. It is
used in interprocess communication where a process writes to one end of the
pipe (producer) and the other reads from the other end (consumer).
8. How do you create special files like named pipes and device files?
Answer:
The system call mknod creates special files in the following sequence.
For example:
If the device is a disk, major device number refers to the disk controller and
minor device number is the disk.
Answer:
The privileged mount system call is used to attach a file system to a directory
of another file system; the unmount system call detaches a file system. When
you mount another file system on to your directory, you are essentially
splicing one directory tree onto a branch in another directory tree. The first
argument to mount call is the mount point, that is , a directory in the current
file naming system. The second argument is the file system to mount to that
point. When you insert a cdrom to your unix system's drive, the file system in
the cdrom automatically mounts to /dev/cdrom in your system.
Answer:
Inode has 13 block addresses. The first 10 are direct block addresses of the
first 10 data blocks in the file. The 11th address points to a one-level index
block. The 12th address points to a two-level (double in-direction) index
block. The 13th address points to a three-level(triple in-direction)index block.
This provides a very large maximum file size with efficient access to large
files, but also small files are accessed directly in one disk read.
11. What is a shell?
Answer:
12. Brief about the initial process sequence while the system boots up.
Answer:
This is done by executing the file /etc/init. Process dispatcher gives birth to
the shell. Unix keeps track of all the processes in an internal data structure
called the Process Table (listing command is ps -el).
Answer:
Unix identifies each process with a unique integer called ProcessID. The
process that executes the request for creation of a process is called the
'parent process' whose PID is 'Parent Process ID'. Every process is associated
with a particular user called the 'owner' who has privileges over the process.
The identification for the user is 'UserID'. Owner is the user who executes the
process. Process also has 'Effective User ID' which determines the access
privileges for accessing resources like files.
getpid() -process id
getppid() -parent process id
getuid() -user id
geteuid() -effective user id
Answer:
The 'fork()' used to create a new process from an existing process. The new
process is called the child process, and the existing process is called the
parent. We can tell which is which by checking the return value from 'fork()'.
The parent gets the child's pid returned to him, but the child gets 0 returned
to him.
Answer:
Explanation:
The fork creates a child that is a duplicate of the parent process. The child
begins from the fork().All the statements after the call to fork() will be
executed twice.(once by the parent process and other by child). The
statement before fork() is executed only by the parent process.
Answer:
Explanation:
Answer:
Answer:
Answer:
A parent and child can communicate through any of the normal inter-process
communication schemes (pipes, sockets, message queues, shared memory),
but also have some special ways to communicate that take advantage of their
relationship as a parent and child. One of the most obvious is that the parent
can get the exit status of the child.
Answer:
When a program forks and the child finishes before the parent, the kernel still
keeps some of its information about the child in case the parent might need it
- for example, the parent may need to check the child's exit status. To be
able to get this information, the parent calls 'wait()'; In the interval between
the child terminating and the parent calling 'wait()', the child is said to be a
'zombie' (If you do 'ps', the child will have a 'Z' in its status field to indicate
this.)
21. What are the process states in Unix?
Answer:
Answer:
When you execute a program on your UNIX system, the system creates a
special environment for that program. This environment contains everything
needed for the system to run the program as if no other program were
running on the system. Each process has process context, which is everything
that is unique about the state of the program you are currently running.
Every time you execute a program the UNIX system does a fork, which
performs a series of operations to create a process context and then execute
your program in that context. The steps include the following: Allocate a slot
in the process table, a list of currently running programs kept by UNIX. Assign
a unique process identifier (PID) to the process. Copy the context of the
parent, the process that requested the spawning of the new process. Return
the new PID to the parent process. This enables the parent process to
examine or control the process directly.
After the fork is complete, UNIX runs your program.
Answer:
When you enter 'ls' command to look at the contents of your current working
directory, UNIX does a series of things to create an environment for ls and
the run it: The shell has UNIX perform a fork. This creates a new process that
the shell will use to run the ls program. The shell has UNIX perform an exec
of the ls program. This replaces the shell program and data with the program
and data for ls and then starts running that new program. The ls program is
loaded into the new process context, replacing the text and data of the shell.
The ls program performs its task, listing the contents of the current directory.
A daemon is a process that detaches itself from the terminal and runs,
disconnected, in the background, waiting for requests and responding to
them. It can also be defined as the background process that does not belong
to a terminal session. Many system functions are commonly performed by
daemons, including the sendmail daemon, which handles mail, and the NNTP
daemon, which handles USENET news. Many other daemons may exist. Some
of the most common daemons are: init: Takes over the basic running of the
system when the kernel has finished the boot process. inetd: Responsible for
starting network services that do not have their own stand-alone daemons.
For example, inetd usually takes care of incoming rlogin, telnet, and ftp
connections. cron: Responsible for running repetitive tasks on a regular
schedule.
Answer:
The ps command prints the process status for some or all of the running
processes. The information given are the process identification number
(PID),the amount of time that the process has taken to execute so far etc.
Answer:
The kill command takes the PID as one argument; this identifies which
process to terminate. The PID of a process can be got using 'ps' command.
Answer:
The most common reason to put a process in the background is to allow you
to do something else interactively without waiting for the process to
complete. At the end of the command you add the special background
symbol, &. This symbol tells your shell to execute the given command in the
background.
Example:
cp *.* ../backup& (cp is for copy)
The system calls used for low-level process creation are execlp() and
execvp(). The execlp call overlays the existing program with the new one ,
runs that and exits. The original program gets back control only when an
error occurs.
execlp(path,file_name,arguments..); //last argument must be NULL
A variant of execlp called execvp is used when the number of arguments is
not known in advance. execvp(path,argument_array); //argument array
should be terminated by NULL
Answer:
Pipes:
One-way communication scheme through which different process can
communicate. The problem is that the two processes should have a common
ancestor (parent-child relationship). However this problem was fixed with the
introduction of named-pipes (FIFO).
Message Queues :
Message queues can be used between related and unrelated processes
running on a machine.
Shared Memory:
This is the fastest of all IPC schemes. The memory to be shared is mapped
into the address space of the processes (that are sharing). The speed
achieved is attributed to the fact that there is no kernel involvement. But this
scheme needs synchronization.
Answer:
Swapping:
Whole process is moved from the swap device to the main memory for
execution. Process size must be less than or equal to the available main
memory. It is easier to implement but adds overhead to the system. Swapping systems do not handle memory as flexibly as paging systems.
Paging:
Only the required memory pages are moved to main memory from the swap
device for execution. Process size does not matter. Gives the concept of the
virtual memory. It provides greater flexibility in mapping the virtual address
space into the physical memory of the machine. It allows more processes to fit in the main memory simultaneously and allows a process size greater than the available physical memory. Demand paging systems
handle the memory more flexibly.
31. What is major difference between the Historic Unix and the new BSD release of
Unix
System V in terms of Memory Management?
Answer:
Answer:
It decides which process should reside in the main memory, Manages the
parts of the virtual address space of a process which is non-core resident,
Monitors the available main memory and periodically write the processes into
the swap device to provide more processes fit in the main memory
simultaneously.
Answer:
A Map is an Array, which contains the addresses of the free space in the swap
device that are allocatable resources, and the number of the resource units
available there. This allows First-Fit allocation of contiguous blocks of a
resource. Initially the Map contains one entry – address (block offset from
the starting of the swap area) and the total number of resources.
Kernel treats each unit of Map as a group of disk blocks. On the allocation and
freeing of the resources Kernel updates the Map for accurate information.
34. What scheme does the Kernel in Unix System V follow while choosing a swap
device
among the multiple swap devices?
Answer:
Kernel follows Round Robin scheme choosing a swap device among the
multiple swap devices in Unix System V.
Answer:
36. What are the events done by the Kernel after a process is being swapped out
from
the main memory?
Answer:
When Kernel swaps the process out of the primary memory, it performs the
following:
37. Is the process the same before and after the swap? Give a reason.
Answer:
While swapping the process once again into the main memory, the Kernel
referring to the Process Memory Map, it assigns the main memory accordingly
taking care of the empty slots in the regions.
Answer:
This contains the private data that is manipulated only by the Kernel. This is
local to the Process, i.e. each process is allocated a u-area.
39. What are the entities that are swapped out of the main memory while swapping
the
process out of the main memory?
Answer:
All memory space occupied by the process, process's u-area, and Kernel stack
are swapped out, theoretically.
Practically, if the process's u-area contains the Address Translation Tables for
the process then Kernel implementations do not swap the u-area.
Answer:
fork() is a system call to create a child process. When the parent process calls
fork() system call, the child process is created and if there is short of memory
then the child process is sent to the ready-to-run state in the swap device, and
return to the user state without swapping the parent process. When the
memory will be available the child process will be swapped into the main
memory.
Answer:
At the time when any process requires more memory than it is currently
allocated, the Kernel performs Expansion swap. To do this Kernel reserves
enough space in the swap device. Then the address translation mapping is
adjusted for the new virtual address space but the physical memory is not
allocated. At last Kernel swaps the process into the assigned space in the
swap device. Later, when the Kernel swaps the process into the main memory, it assigns memory according to the new address translation mapping.
Answer:
The swapper is the only process that swaps processes. The swapper operates only in Kernel mode and does not use system calls; instead it uses internal Kernel functions for swapping. It is the archetype of all kernel processes.
43. What are the processes that are not bothered by the swapper? Give Reason.
Answer:
Answer:
The swapper works on the highest scheduling priority. Firstly it will look for
any sleeping process, if not found then it will look for the ready-to-run
process for swapping. But a major requirement for the swapper to work is that a ready-to-run process must have been core-resident for at least 2 seconds before being swapped out, and for swapping in, the process must have resided in the swap device for at least 2 seconds. If the requirement is not satisfied then the
swapper will go into the wait state on that event and it is awaken once in a
second by the Kernel.
45. What are the criteria for choosing a process for swapping into memory from the
swap device?
Answer:
The resident time of the processes in the swap device, the priority of the
processes and the amount of time the processes had been swapped out.
46. What are the criteria for choosing a process for swapping out of the memory to the swap device?
Answer:
The process must be sleeping or ready-to-run; among such candidates the Kernel
considers the priority of the process, its nice value, and the time it has been
resident in main memory.
47. What do you mean by nice value?
Answer:
The nice value is a value that controls (increments or decrements) the priority
of the process. It is the value returned by the nice() system call. The equation
that uses the nice value is:
Priority = ("recent CPU usage" / constant) + (base priority) + (nice value)
Only the superuser can supply a negative nice value (raising priority); ordinary
users can only increase it. The nice() system call applies to the calling
process only, and the nice value of one process cannot affect the nice value of
another process.
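A small sketch of the priority recalculation above together with the user-level nice() call; the DECAY_CONSTANT and BASE_PRIORITY values are assumptions chosen for the demo, not the exact System V constants.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define DECAY_CONSTANT 2    /* assumed divisor for recent CPU usage */
#define BASE_PRIORITY  60   /* assumed base user-level priority     */

/* Lower numerical values mean a more favoured priority in System V. */
static int recalc_priority(int recent_cpu_usage, int nice_value)
{
    return recent_cpu_usage / DECAY_CONSTANT + BASE_PRIORITY + nice_value;
}

int main(void)
{
    printf("priority = %d\n", recalc_priority(40, 0));   /* 80 */
    printf("priority = %d\n", recalc_priority(40, 10));  /* 90: "nicer", less favoured */

    /* The real nice() call raises the calling process's nice value. */
    errno = 0;
    int new_nice = nice(5);
    if (new_nice == -1 && errno != 0)
        perror("nice");
    else
        printf("new nice value: %d\n", new_nice);
    return 0;
}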
48. What are conditions on which deadlock can occur while swapping the processes?
Answer:
Deadlock can occur while swapping when all processes in main memory are asleep,
all ready-to-run processes are swapped out, there is no space on the swap device
for new processes being swapped out, and there is no space in main memory for
incoming processes.
50. What is 'the principle of locality'?
Answer:
It is in the nature of processes that they refer to only a small subset of the
total data space of the process, i.e. the process frequently calls the same
subroutines or executes loop instructions.
51. What is the working set of a process?
Answer:
The set of pages that are referred to by the process in the last 'n' references,
where 'n' is called the window of the working set of the process.
52. What is the window of the working set of a process?
Answer:
The window of the working set of a process is the total number of references 'n'
within which the process referred to the set of pages that form its working set.
53. What is called a page fault?
Answer:
A page fault refers to the situation in which the process addresses a page in
its working set but fails to locate that page in main memory. On a page fault
the kernel updates the working set by reading the page in from the secondary
device.
54. What are data structures that are used for Demand Paging?
Answer:
The Kernel uses four data structures for demand paging: page table entries, disk
block descriptors, the page frame data table (pfdata), and the swap-use table.
55. What are the bits that support the demand paging?
Answer:
Valid, Reference, Modify, Copy on write, and Age. These bits are part of the
page table entry, which also includes the physical address of the page and the
protection bits.
Page table entry layout: Page address | Age | Copy on write | Modify | Reference | Valid | Protection
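An illustrative C struct with bit-fields for such a page table entry; the field widths and ordering are assumptions, since the real layout is hardware- and implementation-specific.

#include <stdio.h>
#include <stdint.h>

/* Field widths and ordering are assumptions; real layouts are
 * dictated by the memory-management hardware. */
struct pte {
    uint32_t page_addr     : 20;  /* physical page (frame) address */
    uint32_t age           : 3;   /* aging counter used by the page stealer */
    uint32_t copy_on_write : 1;   /* set on shared pages during fork() */
    uint32_t modify        : 1;   /* page has been written to */
    uint32_t reference     : 1;   /* page has been referenced */
    uint32_t valid         : 1;   /* page is present in main memory */
    uint32_t protection    : 2;   /* access-permission bits */
};

int main(void)
{
    printf("illustrative pte size: %zu bytes\n", sizeof(struct pte));
    return 0;
}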
56. How does the Kernel handle the fork() system call in traditional Unix and in System V Unix, while swapping?
Answer:
While swapping, the Kernel in traditional Unix makes a duplicate copy of the
parent's address space and attaches it to the child process. The Kernel in
System V Unix instead manipulates the region tables, page table, and pfdata
table entries, incrementing the reference count of the region table entries of
shared regions.
57. What is the difference between the fork() and vfork() system calls?
Answer:
During the fork() system call the Kernel makes a copy of the parent process's
address space and attaches it to the child process.
The vfork() system call, however, does not make any copy of the parent's address
space, so it is faster than fork(). The child process created by vfork() is
expected to call exec() immediately. Until then it executes in the parent's
address space (and can therefore overwrite the parent's data and stack), which
is why the parent process is suspended until the child calls exec() or exits.
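A minimal vfork()/exec() usage example; the child calls exec() (or _exit()) immediately, precisely because it runs in the parent's address space.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();

    if (pid < 0) {
        perror("vfork");
        return 1;
    } else if (pid == 0) {
        /* Child: borrow the parent's address space, then replace it. */
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);               /* reached only if exec fails */
    }
    /* Parent was suspended until the child called exec or _exit. */
    waitpid(pid, NULL, 0);
    printf("parent: child %d done\n", (int)pid);
    return 0;
}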
58. What do you mean by BSS (Block Started by Symbol)?
Answer:
BSS is a machine-level data representation that describes the program's
un-initialized data: it tells the Kernel how much space to allocate for that
data when the program starts. The Kernel initializes this area to zero at run
time.
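A tiny example of data that lives in BSS: an uninitialized object with static storage duration takes no initialized space in the executable and is guaranteed to read as zero when the program starts.

#include <stdio.h>

static int bss_buffer[1024];   /* no initializer: placed in BSS, zeroed at start */
static int data_value = 42;    /* initialized: placed in the data section        */

int main(void)
{
    printf("bss_buffer[0] = %d (guaranteed 0)\n", bss_buffer[0]);
    printf("data_value    = %d\n", data_value);
    return 0;
}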
59. What is the Page-Stealer process?
Answer:
The Page-Stealer is the Kernel process that makes room for incoming pages by
swapping out memory pages that are not part of the working set of a process.
The Page-Stealer is created by the Kernel at system initialization and is
invoked throughout the lifetime of the system. The Kernel locks a region when a
process faults on a page in that region, so that the Page-Stealer cannot steal
the page that is being faulted in.
61. What are the phases of swapping a page from the memory?
Answer:
Page stealer finds the page eligible for swapping and places the page number
in the list of pages to be swapped.
Kernel copies the page to a swap device when necessary and clears the valid
bit in the page table entry, decrements the pfdata reference count, and places
the pfdata table entry at the end of the free list if its reference count is 0.
62. What is a page fault and what are its types?
Answer:
A page fault refers to the situation of not having a page in main memory when a
process references it. There are two types of page fault: validity fault and
protection fault.
63. In what way are fault handlers and interrupt handlers different?
Answer:
A fault handler is also an interrupt handler, with the exception that interrupt
handlers cannot sleep. A fault handler sleeps in the context of the process that
caused the memory fault; the fault refers to the running process, so no
arbitrary process is put to sleep.
64. What is validity fault?
Answer:
If a process refers to a page in main memory whose valid bit is not set, a
validity fault results.
The valid bit is not set for pages that are outside the virtual address space of
the process, or that are part of the virtual address space of the process but
have no physical address assigned to them.
65. What does the swapping system do if it identifies an illegal page for swapping?
Answer:
If the disk block descriptor does not contain any record of the faulted page,
the attempted memory reference is invalid and the kernel sends a "Segmentation
violation" signal to the offending process. This happens when the swapping
system identifies an invalid memory reference.
66. What are the states that a page can be in after causing a page fault?
Answer:
The page can be: on a swap device and not in memory, on the free page list in
main memory, in an executable file, marked "demand zero", or marked
"demand fill".
67. In what way does the validity fault handler conclude?
Answer:
It sets the valid bit of the page and clears the modify bit.
It recalculates the process priority.
69. What do you mean by protection fault?
Answer:
A protection fault occurs when a process accesses a page for which it does not
have access permission. A process also incurs a protection fault when it
attempts to write a page whose copy on write bit was set during the fork()
system call.
70. How does the Kernel handle the copy on write bit of a page when the bit is set?
Answer:
When the copy on write bit of a page is set and the page is shared by more than
one process, the Kernel allocates a new page and copies the contents to it; the
other processes retain their references to the old page. After copying, the
Kernel updates the page table entry with the new page number and then decrements
the reference count of the old pfdata table entry.
When the copy on write bit is set and no other process is sharing the page, the
Kernel allows the physical page to be reused by the process. It clears the copy
on write bit and disassociates the page from its disk copy (if one exists),
because another process may share the disk copy. It then removes the pfdata
table entry from the page queue, as the new copy of the virtual page is not on
the swap device, and decrements the swap-use count for the page; if the count
drops to 0, it frees the swap space.
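The kernel-side bookkeeping above is not visible at user level, but its effect is: after fork(), a write by one process gives the writer a private copy and leaves the other process's view unchanged, as this small demo shows.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;   /* shared between parent and child after fork() */

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    } else if (pid == 0) {
        /* On a copy-on-write kernel, this write triggers the fault above. */
        value = 99;
        printf("child sees        %d\n", value);   /* 99 */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", value);       /* 1: parent keeps its own copy */
    return 0;
}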
71. For which kind of fault is the page checked first?
Answer:
The page is checked for a validity fault first. If the page is found to be
invalid (the valid bit is clear), the protection fault handler returns
immediately and the process incurs a validity page fault. The Kernel handles the
validity fault, and the process will then incur the protection fault if one is
still present.
72. In what way does the protection fault handler conclude?
Answer:
After the fault handler finishes executing, it sets the modify and protection
bits and clears the copy on write bit. It recalculates the process priority and
checks for signals.
73. How does the Kernel handle both the page stealer and the fault handler?
Answer:
The page stealer and the fault handler thrash because of a shortage of memory.
If the sum of the working sets of all processes is greater than the physical
memory, the fault handler will usually sleep because it cannot allocate pages
for a process. This reduces system throughput, because the Kernel spends too
much time on overhead, rearranging memory at a frantic pace.