DataStage ETL Concepts Guide
:: FUNDAMENTAL CONCEPTS::
                                                                                             DAY 1
                                                          Introduction to the Phases of DataStage
                                                                                             DAY 2
                                                   How does the ETL programming tool work?
      Pictorial view: [Figure: sources such as databases, flat files and MS Excel feed the ETL process, which loads the DWH (db); from the warehouse the data goes to BI and then to the DM.]
                                                                                       DAY 3
                                                                                 Continued…
[Figure: OLTP source -> extract window -> a series of staging areas (S/T steps) -> DWH. The extracted data is loaded into .txt files (ASCII code), and the staging data resides permanently in a database or in the local repository.]
Home Work (HW): one record for each customer (multiple records for multiple addresses and
dummy records for joint accounts).
                                                                                       DAY 4
                                                                 ETL Developer Requirements
   •   Q: One record for each customer (multiple records for multiple addresses and dummy
       records for joint accounts)?
   •   A customer maintained as one record while handling different addresses is called a
       'single view customer' or a 'single version of truth'.
HW explanation: here we must read the query very carefully and understand the terminology of
the words from a business perspective. Multiple records means multiple records for a
customer, and multiple addresses means one customer (one account) maintaining multiple
addresses, e.g. savings / credit card / current account / loan.
Inputs to the developer: the HLD and the LLD, where
                                                                     HLD - high level document
                                                                     LLD - low level document
                                                                                        DAY 5
                                                      How is the DWH project undertaken?
Process:
[Figure: the requirements flow through the HLD for the warehouse (WH) and the TD/LLD to the developer and the system engineer. Share of jobs: Developer 70% - 80%, Production 10%, Migration 30%.]
This environment worked up to 2002; after 2002 and up to now, IBM's X-Migrator is used, which
converts server jobs to parallel jobs. It converts up to 70% automatically;
the remaining 30% is done manually.
Projects are divided into categories according to their period (the duration of the project), as
shown below:
       Category          -       Period (in months and years)
       Simple                    6 months
       Medium                    6 months - 1 year
       Complex                   1 - 1.5 years
       Too complex               1.5 - 5 years and beyond (it may take many years, depending on the project)
       HLD (Warehouse):        Architecture
                               Schema (structure)
                               Dimensions and tables (target tables)
                               Facts
       LLD / TD (low level doc's):
                               Mapping doc's (specifications - spec's)
                               Test spec's
                               Naming doc's
Funneling (e.g. of source S2): combining can be horizontal combining or vertical combining.
[Figure: Emp and Dept are horizontally combined (HC) into the target (Trg).]
Here, HC (horizontal combining) is used to combine primary rows with secondary rows.
       As a developer you will get at most around 30 target fields.
       As a developer you will get at most around 100 source fields.
"Look Up!" means cross-verification from the primary table.
After the document: [Figure: sources S1 and S2 are horizontally combined (HC) and then loaded to the target (TRG).]
                                                                                     DAY 6
                                                                       Architecture of DWH
[Figure: Reliance group branches (e.g. Reliance Comm.), each with its own manager, reporting to one top-level manager.]
Explanation of the above example: the Reliance group has several branches and every branch
has one manager. Above all these managers there is one top-level manager (TLM), and the
TLM needs the details listed above for analysis.
How the ETL process is done for the above example is shown below.
[Figure: on the left, each branch (e.g. Reliance Fresh) has its own ETL process feeding its own data mart - an independent data mart. On the right, the ERP source feeds an ETL process into the DWH, from which bottom-level data marts (mini WH, e.g. RC-mgr) are built - dependent data marts.]
Dependent Data Mart: the ETL process takes all the managers' information (db) and keeps it
in the warehouse; the data marts are then built from the warehouse, so the warehouse and the
data marts depend on each other. Here the data mart is also called the 'bottom level' or 'mini
WH' (shown in blue in the figure above), i.e. the data of an individual manager (RF, RC, RP and
so on). Because the data mart depends on the WH, it is called a dependent data mart.
Independent Data Mart: only one individual manager's data mart is loaded directly from the
ETL process, without any help from the warehouse. That is why it is called an independent data mart.
[Figure: Layer I / Layer II for the Reliance group. In the first arrangement, R Power, R Fresh, etc. feed the ETL process, which loads the warehouse, and the data marts are then built from it. In the second arrangement, R Comm., R Power and R Fresh feed the ETL process into data marts (bottom level), which are then consolidated into the DWH (top level).]
Programming (coding)
6.2.1. Layer I:
[Figure: Layer I - in one case the source feeds the DWH directly; in the other case the source feeds a group of data marts (DM) directly.]
In this layer the data is sent directly - in the first case from the source to the data warehouse
(DWH), and in the second case from the source to a group of data marts (DM).
[Figure: Layer I / Layer II in two arrangements. Top-bottom approach: SRC -> DWH -> DMs. Bottom-top approach: SRC -> DMs -> DWH.]
In this layer the data flows from source to data warehouse to data marts; this type of flow is
called the "top-bottom approach". In the other case the data flows from source to data marts to
data warehouse; this type of flow is called the "bottom-top approach". This Layer II architecture
is explained in the example shown above (the Reliance group).
* (99.99% of projects use the layer 3 and layer 4 architectures)
[Figure: Layer I -> Layer II -> Layer III, ending in data marts (DM).]
In this layer the data flows from source -> ODS (operational data store) -> DWH -> data marts.
The new concept added here is the ODS: it stores operational data for a period such as 6
months or one year, and that data is used to solve instant problems; the ETL developer is not
involved here.
        The team that solves these instant/temporary problems is called the interface team.
After the period, the ODS data is stored into the DWH, and from there it goes to the DMs; the
ETL developers are involved here in layer 3.
       A clear explanation of the layer 3 architecture is given in the example below; it is the best
example for a clear explanation.
Example #1:
[Figure: an aeroplane (the source) is waiting to land because of a technical problem; solving the problem takes at most about 2 hours. Layer I: the interface team is involved here and the problem information is stored in the ODS for future reference. Later the data flows on to the DWH and the DMs, where the ETL developers are involved.]
Example explanation:
       In this example, the source is an aeroplane that is waiting to land at the airport
terminal. It is not allowed to land because of some technical problem at the airport base
station. To solve this type of operational problem a special team is involved, i.e. the interface
team. At the airport base station the technical problem is captured into the Operational Data
Store (ODS) in the db, i.e. simply put, the problem information is captured.
       But the ODS stores the data for one year only. Years of data are then stored in the data
warehouse, so that the technical problems are not repeated and for future reference. From the
DWH the data goes to the data marts, where the ETL developers are involved to solve such
technical problems; this is called the layer 3 architecture of the data warehouse.
DAY 7
Continued…
[Figure: overall architecture. Source 1 (flat files) and source 2 are read by the ETL process (DataStage reads the flat files through DS); a lookup is applied; condition-mismatch and format-mismatch records are dropped on the way to the ODS (Source / Layer I, L2). From the ODS the data goes to the DW (L3), then to the BI DM and the SVC DM (L4) for reporting; the BI DM serves as a data backup of the DWH & SVC.]
Note: (information about data dropped during the transmission when the ETL reads the flat
files (.txt, .csv, .xml and so on) into the ODS.)
Two types of mismatched data:
     •   Condition mismatch (CM): the data from the flat files is verified against the given
         conditions; if a record does not satisfy them, it is dropped automatically. To see the
         dropped data a reference link is used, and it shows which records are condition
         mismatches.
     •   Format mismatch (FM): like the condition mismatch, but it checks the format - whether
         the format of the incoming data or record is correct or mismatched. Here also a
         reference link is used to see the dropped data.
Example of a condition mismatch: an employee table contains some data.
SQL> select * from emp;

       EID     ENAME      DNO
       08      Naveen     10
       19      Munna      20
       99      Suman      30
       15      Sravan     10

The source contains dno values 10, 20, 30 and 10, but the target requires only dno = 10. The
records with dno 20 and 30 are dropped, and the reference link captures these dropped
records; a plain-SQL sketch of this filtering is shown below.
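A minimal SQL sketch of this condition check, assuming the EMP table above (EID, ENAME, DNO) exists in the source database; the second query stands in for what the reference link would capture:

    -- rows that satisfy the condition go to the target
    SELECT eid, ename, dno FROM emp WHERE dno = 10;

    -- rows that fail the condition are the condition-mismatch (dropped) records
    SELECT eid, ename, dno FROM emp WHERE dno <> 10;

This is only an analogy for the stage behaviour; in the job itself the condition is set in the stage properties and the dropped rows flow down the reference link.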
NOTE: Business intelligence (BI DM) is for data backup of the DWH & SVC (single version of truth).
                                                                                                  DAY 8
Dimensional Model
Modeling: representing the physical or logical design from our source system to the target system.
           o Logical design: client perspective.
           o Physical design: database perspective.
      Pictorial view (logical view): [Figure: the EMP and Dept entities as seen in the logical design, with optional/manual markers.]
    Meta Data: every entity has a structure, and that structure is called metadata (simply put, 'a data to a data' - data about data).
           o In a table there are attributes and domains; there are two types of domain: 1. alphabetical and 2. numeric.
For example:
Q: A client requires the experience of an employee.
The source EMP table does not hold experience directly, but it has a Hire Date column; the
experience of the employee is therefore an implicit requirement, derived from Hire Date (a SQL
sketch follows below).
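A hedged SQL sketch of deriving this implicit requirement, assuming the standard Oracle EMP table with a HIREDATE column (the table and column names are illustrative):

    -- experience in years, derived from the hire date
    SELECT ename,
           ROUND(MONTHS_BETWEEN(SYSDATE, hiredate) / 12, 1) AS experience_years
    FROM   emp;

In the job such a derivation would normally be written in a Transformer, but the idea is the same: the target column is computed, not read.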
[Figure: example tables (Product_ID, PRD_Desc, PRD_TYPE_ID; PRD_SP_ID, SName, ADD1) illustrating primary-key (Pk) / foreign-key (Fk) relationships.]
Normalization and De-Normalization:
    De-normalization means combining multiple tables into one table, and the combining is
      done by horizontal combining (HC).
    But this is not required in all cases; it is not mandatory that the data be de-normalized.
DAY 9
E-R Model
    Of the two tables above, the primary table is the DEPT table, because it does not depend
      on any other table; the EMP table is the secondary table, because it depends on the
      DEPT table.
    But in real time, when we join the two tables using horizontal combining, the EMP table is
      taken as the primary table and the DEPT table as the secondary table.
9.1. Horizontal Combine:
 1 - primary, n - secondary.
    EMP table:  ENO, EName, Designation, DNo (Fk)
    DEPT table: DNo (Pk), Higher Quali., Add1, Add2
After combining (joining) the tables using HC, the result looks like this (a SQL sketch follows below):
                  ENO | EName | Designation | DNo | Higher Quali. | Add1 | Add2
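A minimal SQL sketch of this horizontal combining, using the column names from the sketch above (EMP: eno, ename, designation, dno; DEPT: dno, higher_quali, add1, add2 - the exact names are illustrative):

    SELECT e.eno, e.ename, e.designation, e.dno,
           d.higher_quali, d.add1, d.add2
    FROM   emp  e
    JOIN   dept d ON d.dno = e.dno;

The dno key links the two tables; this is the same combining a join stage performs.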
   1. STAR Schema:
    In the star schema, you must know about two things:
           o the dimension table, and
           o the fact table.
    Dimension table: holds the 'lowest level of detailed information' of a table.
    Fact table: a collection of foreign keys from the n dimension tables.
    In practice the data goes directly from the source into the dimension tables and the fact table.
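A hypothetical DDL sketch of what "a collection of foreign keys" looks like; the table and column names are illustrative only (a measure column is included, as fact tables usually carry one):

    CREATE TABLE sales_fact (
      cust_id  NUMBER,   -- foreign key to cust_dim_tbl
      prd_id   NUMBER,   -- foreign key to prd_dim_tbl
      loc_id   NUMBER,   -- foreign key to loc_dim_tbl
      time_id  NUMBER,   -- foreign key to time_dim_tbl
      quantity NUMBER    -- the measure
    );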
[Figure: Source -> (transform) -> DIM table and FACT table.]
Q: Display what Suman bought - a Lux product, in Ameerpet, in the first week of January.
[Figure: a fact table holding foreign keys (Fk) to the dimension tables - Customer table / Cust_Dim_tbl, Customer_Category_table, Unit table, Location table / Loc_Dim_tbl - each dimension carrying its primary key (Pk). A hedged SQL sketch of such a star-schema query follows below.]
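A hedged SQL sketch of how such a question is answered against a star schema, reusing the illustrative fact/dimension names from the DDL sketch above; none of these names come from the notes themselves:

    SELECT c.cust_name, p.prd_desc, l.area, t.calendar_date, f.quantity
    FROM   sales_fact   f
    JOIN   cust_dim_tbl c ON c.cust_id = f.cust_id
    JOIN   prd_dim_tbl  p ON p.prd_id  = f.prd_id
    JOIN   loc_dim_tbl  l ON l.loc_id  = f.loc_id
    JOIN   time_dim_tbl t ON t.time_id = f.time_id
    WHERE  c.cust_name = 'Suman'
    AND    p.prd_desc  = 'Lux'
    AND    l.area      = 'Ameerpet'
    AND    t.calendar_date BETWEEN DATE '2010-01-01' AND DATE '2010-01-07';

The fact table contributes only keys and measures; every descriptive attribute in the WHERE clause comes from a dimension table.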
For example: [Figure: EMP_tbl -> Dept_tbl -> Locations -> Area, each pair linked by a foreign key (Fk) to a primary key (Pk).]
    If we require information from the location table, it is fetched from that table and
      displayed as the client requires.
    To minimize the huge amount of data held at once in a single dimension table - sometimes
      it is not possible to bring the data quickly when a dimension table is huge -
    we divide the dimension table into several tables, and those tables are known as
      "look-up tables".
[Figure: source -> DWH -> reports, with the dimension tables normalized into look-up tables - this is where the snowflake schema works effectively.]
HISTORY of DataStage
       As of the year 2006 there were around 600 ETL tools in the market; some of them are:
    DataStage Parallel Extender,
    ODI (OWB),
    SAS (ETL Studio),
    BODI,
    Ab Initio, and so on…
But DataStage is very famous and widely used in the market, and it is expensive too.
  Q: What is DataStage?
  ANS: DataStage is a comprehensive ETL tool which provides end-to-end Enterprise
  Resource Planning (ERP) solutions (here, comprehensive means good in all areas).
History begins:
   -   In 1997, the first version of DataStage was released by the VMARK company (a US-based
       company), and Mr. LEE SCHEFFLER is the father of DataStage.
   -   Only 5 members were involved in releasing the software into the market.
   -   In those days DataStage was called "Data Integrator".
   -   There have been about 90% changes from 1997 to 2010 across the released versions.
   -   In 1997, Data Integrator was acquired by a company called TORRENT.
   -   Two years later, in 1999, the INFORMIX company acquired Data Integrator from the
       TORRENT company.
   -   In 2000, the ASCENTIAL company acquired both the database and Data Integrator, and
       after that Ascential DataStage Server Edition was released in the same year.
          o It is this company that popularized DataStage in the market from that year on.
          o The released software used around 30 tools to run.
   -   In 2004, version 7.5.x2 was released; it supports the server configuration on the Windows
       platform as well.
          o   For this, ADSSPX is integrated with the MKS_TOOL_KIT.
          o   MKS_TOOL_KIT is a virtual UNIX machine that gives Windows the capability to
              support the server configuration.
          o   NOTE: after installing ADSSPX + MKS_TOOL_KIT on Windows, all the UNIX
              commands work on the Windows platform.
   -   In December 2004, version 7.5.x2 included the ASCENTIAL suite components.
          o They are
                     Profile stage,
                     Quality stage,
                     Audit stage,                          (these are individual tools)
                     Meta stage,
                     DataStage PX,
                     DataStage TX,
                     DataStage MVS, and so on.
          o There are 12 types of ASCENTIAL suite components.
   -   In February 2005, IBM acquired all the ASCENTIAL suite components and released
       IBM DS EE, i.e. the Enterprise Edition.
   -   In 2006, IBM made some changes to IBM DS EE: it integrated the profiling stage and the
       audit stage into one, along with the quality stage, the meta stage and DataStage PX.
          o   With the combination of these four stages they released
              "IBM WEBSPHERE DS & QS 8.0".
          o This is also called an "Integrated Development Environment", i.e. IDE.
   -   In 2009, IBM released another version, "IBM INFOSPHERE DS & QS 8.1".
          o In the current market,
                     7.5.x2 is used by 40 - 50%,
                     8.0.1 is used by 30 - 40%,
                     8.1   is used by 10 - 20%.
                                                                                   DAY 11
                                                                      DataStage FEATURES
Features of DS:
        There are 5 important features of DataStage; they are:
   -    Any to Any,
   -    Platform independent,
   -    Node configuration,
   -    Partition parallelism, and
   -    Pipeline parallelism.
    Any to Any:
             o DataStage is capable of extracting from any source and loading to any target.
[Figure: a UNI processor machine (one CPU with its RAM and HDD) compared with SMP machines SMP-1 … SMP-n, each with multiple CPUs sharing RAM and HDD.]
    Node Configuration:
           o    A node is software that is created in the operating system.
           o    "A node is a logical CPU, i.e. an instance of a physical CPU."
           o    Hence, "the process of creating virtual CPUs using software is called node
                configuration."
           o The node configuration concept works exclusively in DataStage; it is the best
                feature compared with other ETL tools.
           o For example:
                         An ETL job has to process 1000 records.
                         A UNI processor takes 10 minutes to execute the 1000 records.
                         But for the same job an SMP machine takes 2.5 minutes to
                          execute the 1000 records.
                         This is explained clearly in the diagram below.
          o As per the above example, node configuration can likewise create virtual CPUs to
              reduce the execution time on a UNI processor from 10 minutes to 2.5 minutes.
    Partition parallelism:
          o   Partitioning is distributing the data across the nodes, based on partition
              techniques.
          o Consider one example of why we use partition techniques.
          o Example: take some records in an EMP table and some in a DEPT table:
                    the EMP table has 9 records,
                    the DEPT table has 3 records.
          o After partitioning, the joined output must have 9 records, because the
              primary table has 9 records.
EMP(10,20,10,30,20,10,10,20,30) and DEPT(10,20,30)
          o   If the data is spread across the nodes without a suitable technique, only 4 records
              appear in the final output and 5 records are missing; for this reason the partition
              techniques were introduced.
          o   There are two categories of partition parallelism, and in total 8 types of
              partition techniques:
                      Key based
                          •   Hash
                          •   Modulus
                          •   Range
                          •   DB2
                      Key less
                          •   Same
                          •   Random
                          •   Entire
                          •   Round robin
          o The key-based category (key-based techniques) gives the assurance that rows with
              the same key column value are collected into the same partition (see the sketch
              after this list).
o The key-less techniques are used to spread or append the data when joining the given tables
  (e.g. EMP and DEPT joined on DNO, with dno 10, 20, 30 going to nodes N1, N2, N3).
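A small SQL sketch of the key-based idea, assuming the EMP data above and 3 nodes; MOD stands in for the modulus partition technique (this is only an analogy, not how the engine itself is implemented):

    SELECT eno, dno, MOD(dno, 3) AS assigned_node
    FROM   emp;

All rows with the same dno get the same assigned_node, which is exactly the guarantee a key-based technique gives; key-less techniques (same, random, entire, round robin) ignore the column values when spreading the rows.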
                                                                                    DAY 12
Continued…
Features of DataStage
    Partition Parallelism (continued):
          o   Re-partition means re-distributing the already distributed data.
              Example table:
                        ENO    EName     DNo    Loc
                        111    Naveen    10     AP
                        222    Munna     20     TN
                        333    Sravan    10     KN
                        444    Raju      30
              [Figure: EMP and DEPT are joined on DNo; the first partition sends dno 10, 20, 30 to nodes N1, N2, N3, and the data is then re-distributed by the Loc column (AP, TN, KN) across the same nodes.]
          o    The first partition is done with a key-based technique on dno; then, taking a separate
               column such as location, the already distributed data is re-distributed on that
               column - this is known as re-partitioning.
          o Reverse Partitioning:
                       It is also called collecting. It happens in one case (one situation) only:
                        "when the data moves from a parallel stage to a sequential stage,
                        collecting happens" - in this case only.
                       A job is designed in "stages".
                       A link (also called a pipe) is the channel that moves the data from
                        one stage to another stage.
              [Figure: SRC (S1) -> TRSF (S2) -> TRG (S3).]
                       Example: [Figure: data on nodes N1 … Nn (parallel files, stage S1) is
                        collected into a sequential/single file (stage S2).]
                                •   Sort - merge
                                •   Auto
                          Example of the collecting techniques (node N1 holds a,x; N2 holds b,y;
                          N3 holds c,z), showing the order in which the rows are collected:
                               Order   Auto   RR   SM
                               a       a      a    a
                               x       b      b    z
                               b       c      c    y
                               y       x      x    c
                               c       y      y    x
                               z       z      z    b
    Pipeline Parallelism:
                 "All pipes carry the data in parallel and the processing is done simultaneously."
              [Figure: Extract (S1) -> Transform (S2) -> Load (S3), with records R1 … R5 flowing through the pipes at the same time.]
                        Here, all the pipes carry the data in parallel and the job is processed
                         simultaneously, so the execution takes only 10 minutes to complete.
                        By using pipeline parallelism we can reduce the processing time.
                                                                                             DAY 13
                                                              Differences between 7.5.x2 & 8.0.1
Differences:
                7.5.x2                                              8.0.1
                         •   Mainframes job
                         •   Server job
                         •   Parallel job
                         •   Job sequence job
                         •   Data quality job
  -   DS Director: the same as shown above for 7.5.x2.
  -   DS Administrator: the same as shown above for 7.5.x2.
  -   Web Console: an administration component through which the following are performed:
                    Security services
                    Scheduling services
                    Logging services
                    Reporting services
                    Session management
                    Domain manager
  -   Information Analyzer: also called the console for IBM INFO SERVER.
                    It performs all the phase-I activities:
                         •   Column analysis,
                         •   Primary key analysis,
                         •   Foreign key analysis,
                         •   Baseline analysis, and
                         •   Cross-domain analysis.
  But some information needs to be known about the Web Console, the Information Analyzer
  and the DS Administrator.
                                                                                    DAY 14
                                                   Description of 7.5.x2 & 8.0.1 Architecture
    Package Installer: this component contains two types of packages - one is plug-ins
       and the other is packs.
       Example:
       [Figure: a computer needs 1100 drivers installed so that the interface can drive the printer; in the same way, packs provide the software that connects DataStage (DS) to an ERP system.]
    A good analogy is that normal Windows XP needs Service Pack 2 for more capabilities.
    Here, packs are used to configure DataStage for an ERP solution.
   d. DS Administrator
            These categories and what they handle are shown above.
14.2. Architecture of 8.0.1:
   3. Common Engine:
           o It is responsible for
                     data profiling analysis,
                     data quality analysis, and
                     data transformation analysis.
   4. Common Connectivity:
              It provides the connections to the common repository.
[Figure: the 8.0.1 architecture. Client tier (WC, IA, DE, DI, DA) -> common shared services -> common engine (DP, DQ, DT, DA) -> common connectivity -> common repository (the MD server, holding project-level, design-level and operation-level metadata).]
Common connectivity options include:
                   Classic federation
                   ODBC connector
                   NETEZZA
          o Enhanced Stages:
                    All the stage techniques can be used with respect to the SQL Builder.
DAY 16
To start using DataStage on the system we must follow these steps to build a job.
• Five-step job development process (this is for designing a job):
 Palette (open it from the tool bar): it lists the stage categories -
       General, Data Quality, Database, File, Development & Debug, Processing, Real Time, Restructure.
 Designer canvas (or editor): the place where we design the job, e.g. Seq to Seq.
 Link the stages (give the connectivity), and after that setting the properties is important.
 Run the Director (to see views), i.e. to view the status of your job.
DAY 17
Process:
      On the computer desktop, the currently running processes are shown at the left corner;
           a round symbol in green colour has to be started if it does not start automatically -
           i.e. check whether the DataStage server has started or not, and start it manually if not.
      When the 8th version of DataStage is installed, five client component shortcuts are
           visible on the desktop:
                        Web Console
                        Information Analyzer
                        DS Administrator
                        DS Designer
                        DS Director
      Web Console: when you click it, the login page appears.
         o If the server is not started, a "the page cannot be opened" error will appear.
         o If such an error occurs, the server must be restarted before doing or creating jobs.
     DS Director: it is for viewing the status of the executed jobs, and to view the log, status
       and warnings.
     DS Designer: when you click on the Designer icon, it asks you to attach to a project for
       creating a new job, as shown below:
         o User id: admin
         o Password: ****
         o If authentication fails at login, it is because of a repository interface error.
     The figure below shows how to authenticate, and shows the designer canvas for creating
       jobs.
[Figure: the attach-to-project dialog (user name admin, password, project e.g. Teleco, cancel button), followed by the new-job type list: Main frames, Parallel, Sequential, Server jobs.]
        After clicking on parallel jobs, go to the tool bar - View - Palette.
        In the palette, the 8 types of stage categories are displayed for designing a job; they are:
                       General
                       Data Quality
                       Data Base
                       File
                       Development & Debug
                       Processing
                       Real Time
                       Re - Structure
17.1. File Stage:
        Q: How can data be read from files?
        The file stage can read only flat files; the formats of flat files are .txt, .csv and .xml.
        Within .txt there are different types of formats, like fwf, sc, csv, s & t, H & T.
        .csv means comma separated values.
        .xml means extensible markup language.
   -    In the file stage there are sub-stages like the sequential file stage, data set, file set and so on.
            o Example of how a job can execute:
                one sequential file (SF) to another SF.
Source -> Target
1. File selection:
   -   File: c:\data\se_source_file.txt
   -   File: \? (this option is for multiple purposes) - use the browse button to pick
       C:\data\se_source_file.txt
2. Format selection:
   -   As per the input file, the data must be in the given format.
   -   Like "tab / space / comma" - one of them must be selected.
3. Load (to get the structure of the file):
   -    Steps to load a structure:
   -    Import
            o Sequential file
                      Browse the file and import it
                           •    Select the import file
                                     o Define the structure.
 These three are the general properties when we design a simple job.
                                                                                           DAY 18
                                                                            Sequential File Stage
                          Output Properties                       Input Properties
        Step 1: the sequential file is a file stage; it reads flat files with different
           extensions (.txt, .csv, .xml).
 Step 3: the sequential file stage supports one input (or) one output, and one reject link.
Link:
         A link is also a stage; it transfers data from one stage to another stage.
             o Links are divided into categories:
                         Stream link         SF --------> SF
                         Reject link         SF --------> SF
                         Reference link      SF --------> SF
Link Marker:
         It shows how the link behaves during the transmission from source to target.
1. READY BOX: it indicates that "a stage is ready with metadata", and data is transferred from
   sequential stage to sequential stage.
2. FAN IN: it indicates that "data is transferred from a parallel stage to a sequential stage"; it
   appears when collecting happens.
3. FAN OUT: it indicates that "data is transferred from a sequential stage to a parallel stage";
   it is also called auto partitioning.
4. BOX: it indicates that "data is transferred from a parallel stage to a parallel stage"; it is
   also known as partitioning.
5. BOW - TIE: it indicates that "data is transferred from a parallel stage to a parallel stage"
   with re-distribution; it is also known as re-partitioning.
Link Color:
        The link color indicates the state of the process during the execution of a job.
      RED:
           o A link in RED color means
                     case1: a stage not connected properly and
                     case2: job aborted
      BLACK:
           o A link in BLACK color means “a stage is ready”.
      BLUE:
           o   A link in BLUE color means "a job execution is in progress".
      GREEN:
           o A link in GREEN color means “execution of job finished”.
Compile:
        A compiler is a translator that converts source code into target code.
       Compiling a C program:
              [Figure: .c (HLL) -> .obj (ALL) -> .exe (BC)]
                                                      *HLL - High Level Language
                                                      *ALL - Assembly Level Language
                                                      *BC - Binary Code
       Compiling process in DataStage:
              [Figure: GUI design -> .obj (OSH code & C++) -> .exe (MC)]
                                                                                               DAY 19
                                                                   Sequential File Stage Properties
Properties:
    Read Method: two options -
          o   Specific File: the user or client gives each file name specifically.
          o   File Pattern: we can use wild-card characters and search for a pattern, i.e. * and ?
                     For example: C:\eid*.txt
                                       C:\eid??.txt
    Reject Mode: to handle "format / data type / condition" mismatch records.
      Three options:
          o   Continue: drops the mismatched records and continues with the other records.
          o   Fail: the job is aborted.
          o   Output: captures the dropped data through a link to another sequential file.
                     First line is column names: true/false.
                                   o    If it is false, the first line is also shown as a dropped record.
                                   o    If it is true, it does not drop the first record.
    Missing File Mode: this option is used if any file name is missing.
      Two options:
          o   OK: drops the file name when it is missing.
          o   Error: if a file name is missing, it aborts the job.
    File Name Column: "source information at the target" - it gives information about which
      record came from which address (file) on the local server.
      It directly adds a new column to the existing table and displays the file name in that column.
    Row Number Column: "source record number at the target" - it gives information about
      which source record number each record has at the target table.
     It also directly adds a new column to the existing table and displays the number in that column.
    Read First Rows: "will get you the top first n rows".
         o The Read First Rows option asks you for an n value to display the first n records.
    Read from Multiple Nodes: we can read the data in parallel using the sequential file stage.
                      Reading in parallel is possible;
                      loading in parallel is not possible.
     LIMITATIONS of the SF stage:
         o It does sequential processing (processes the data sequentially).
         o Memory limit of 2 GB (.txt format).
         o The problem with the sequential file is the conversions,
                      like ASCII - NF - ASCII - NF.
         o It lands (resides) the data "outside the boundary" of DataStage.
                                                                                         DAY 20
                                                 General settings DataStage and about Data Set
General Stage:
        In this category some stages are used for commenting on a stage - what it does or what
it performs - i.e. simply giving comments for a stage.
[Figure: SRC -> TRG; extracting from the source, and landing/residing the data into LS/RR/db at the target.]
Q: In which format is the data sent between the source file and the target file?
A: If we send a .txt file from the source, it is in ASCII format, because a .txt file supports only
ASCII, while DataStage supports only the Native Format (NF); so the ASCII code is converted into
Native Format, which DataStage understands, and at the target the NF is converted back into
the .txt (ASCII) format that is visible to the user/client.
[Figure: src_f.txt (ASCII) -> NF inside DataStage -> trg_f.txt (ASCII). When we convert ASCII into NF, the source needs to import an operator; when we convert NF back into ASCII, the target needs to import an operator.]
Q: How does the Data Set overcome the sequential file limitations?
    -    By default the data is processed in parallel.
    -    It can hold more than 2 GB.
    -    No conversion is needed, because the Data Set holds the data directly in Native Format.
    -    The data lands in the DataStage repository.
    -    The Data Set extension is *.ds
       [Figure: src_f.txt -> trg_f.ds, and trg_f.ds -> trg_f.txt]
    -    A Data Set can read only Native Format files, since DataStage reads only the
         orchestrate format.
                                                                                           DAY 21
                                                                            Types of Data Set (DS)
    -    Virtual: a Data Set whose data moves in the link from one stage to another stage,
         i.e. the link holds the data temporarily.
    -    Persistent: the data sent through the link lands directly in the repository;
         that data is permanent.
Q: How many files are created internally when we create a Data Set?
A: A Data Set is not a single file; it creates multiple files internally when it is created:
             o Descriptor file
             o Data file
             o Control file
             o Header file
Q: How can we organize a Data Set, i.e. view/copy/delete it, in real time?
A:     Case 1: we cannot delete the Data Set directly.
       Case 2: we cannot see or view it directly.
      A Data Set is organized using utilities:
           o using the GUI, i.e. the utility in the Tools menu (Data Set management), or
           o using the command line, e.g. starting with $ orchadmin grep "moon";
      At the command line:
           o $ orchadmin rm dataset.ds (this is the correct process) \\ this command removes
              a data set file
           o $ rm dataset.ds (this is the wrong process) \\ it cannot be written like this
           o $ ds records \\ to view the files in a folder
Data Set Version:
   -    Data Sets have version control.
   -    A Data Set has a version for each DataStage version.
        After doing this, when we want to save the job, it asks which version you want.
                                                                                      DAY 22
                                             File Set & Sequential File (SF) input properties
           The DS (Data Set) is exclusively for internal use within the DataStage environment;
            for an external application to create the file we use the FS (File Set), which any
            other application can use.
Reject mode: here the reject mode is the same as in the output properties we discussed before.
In this we have three options - continue / fail / output:
      Continue - it just drops the record when the format/condition/data type mismatches and
         continues processing the remaining records.
      Fail - it aborts the job when a format/condition/data type mismatch is found.
      Output - it captures the dropped record data.
                                                                                     DAY 23
                                                               Development & Debug Stage
The development and debug category has three sub-categories.
      Simply put, in development and debug we have 6 types of stages, and the 6 stages are
         divided into the three categories shown above.
   -   The Row Generator is for generating sample data; it is used in some cases.
   -   Some of those cases are:
           o   when the client is unable to give the data,
           o for testing purposes,
           o    to make the job design simple.
   -   The Row Generator can generate junk data automatically by considering the data type, or
       we can manually set some related, understandable data by giving user-defined values.
   -   It has only one property: select a structure for creating the junk data.
Row Generator design: as below.
Column Generator: "it has one input and one output".
   -   The main purpose of the Column Generator is to add columns to an existing table.
   -   Using it we add an extra column, and for the added column junk data is generated in the
       output.
   -   Here the mapping should be done in the Column Generator properties - just drag and
       drop the created column into the existing table.
In the output:
   -   The junk data is generated automatically for the extra added columns.
   -   Manually, we can generate some meaningful data for the extra columns.
   -   Navigation for the manual option:
           o Column
                       Ctrl+E
                            •      Generator
                                                                                       DAY 24
                                                                       Pick sample Data & Peek
24.1. Pick sample data: "it is a debug stage; there are three types of pick sample data":
   -    Head
   -    Tail
   -    Sample
    Head: "it reads the top 'n' records of every partition".
            o It has one input and one output.
            o    In the Head stage, mapping must be done.
        Properties of Head:
            o Rows
   Tail: "it is a debug stage that reads the bottom 'n' rows from every partition".
         o The Tail stage has one input and one output.
         o In this stage mapping must be done; the mapping is done in the Tail output
             properties.
     Properties of Tail:
         o The properties of Head and Tail are similar, as shown above.
         o Mainly we must give the value for "number of rows to display".
     Sample - Period: if I have some records in the source table and give an 'n' value for the
        period, it displays or retrieves every nth record from the source table.
 Sample - Skip: it skips the given number of records at the start and then retrieves records from
   the source table (a plain-SQL sketch of these ideas follows below).
[Figure: SF_SRC -> SAMPLE -> Target1, Target2, Target3.]
NOTE: the sum of the percentages of all the outputs must be less than or equal to ('<=') 100% of
the 'n' input records.
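Plain-SQL sketches of the head and period ideas, assuming an EMP source table (Oracle syntax; the names are illustrative):

    -- head: the top 5 records
    SELECT * FROM emp WHERE ROWNUM <= 5;

    -- period = 3: every 3rd record
    SELECT eno, ename
    FROM  (SELECT eno, ename, ROW_NUMBER() OVER (ORDER BY eno) AS rn FROM emp) t
    WHERE  MOD(t.rn, 3) = 0;

In the job these are simply stage properties (rows / period / skip); the SQL only shows what the selection means.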
[Figure: SF_SRC -> PEEK.]
           o   If we set column name = false, it does not show the column names in the log.
    To see the log records that we stored:
           o in the DS Director,
 go to Peek - log - peek: there we see the 'n' values of the records and fields.
In the database stage category we generally use Oracle Enterprise, ODBC Enterprise, Teradata
with ODBC, dynamic RDBMS and so on.
                                                 Password: tiger
                                                 After loading, select the specific table and import it.
                           •   After importing into the columns, in 'define' we must change the hire date.
Q: A table contains 300 records; I need only 100 records from it - how?
A: In the read method we use a user-defined SQL query to solve this problem, by writing a query
that reads only 100 records (a sketch is shown below).
    Alternatively, with the first read method option we can auto-generate the query and then
       reuse it by copying the generated statement into the user-defined SQL.
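A sketch of such a user-defined SQL query, assuming an Oracle source table EMP (ROWNUM is Oracle syntax; other databases would use LIMIT or FETCH FIRST):

    SELECT * FROM emp WHERE ROWNUM <= 100;

Only the first 100 records are read out of the 300 in the table.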
NOTE: in version 7.5.x2 we do not have the option of saving and reusing the connection properties.
                                 o Naveen_dbc \\ it is a saved data connection
                                 o Save it in the table definitions.
                                                                                     DAY 26
                                                                             ODBC Enterprise
    Coming to connections: Oracle Enterprise connects directly to the Oracle database, but
       ODBC Enterprise needs OS drivers to hit (connect to) the Oracle database.
[Figure: Oracle Enterprise (OE) hits the Oracle DB directly from DataStage, while ODBC Enterprise (ODBCE) goes through the OS to reach the Oracle DB.]
Connection properties:
        o Password = ******
        o User = Scott
    The ODBCE driver at the OS level has a lengthy process to connect; to overcome this, the
      ODBC Connector was introduced.
    Using the ODBC Connector is a quicker process compared with ODBCE.
    The best feature of the ODBC Connector is "schema reconciliation": it automatically
      handles data type mismatches between the source data types and the DataStage
      data types.
    Differences between ODBCE and the ODBC Connector:
              ODBCE                                     ODBC Connector
         It cannot make the list of               It provides the list of DSNs
         Data Source Names (DSN).                 available in ODBC.
         In ODBCE there is "no                    In this we can test the
         testing the connection".                 connection with the test button.
         ODBCE reads sequentially                 It reads in parallel and loads
         and loads sequentially.                  in parallel (good performance).
          o Password = *****
          o SQL query
    The first step is to create an MS Excel file, which is called a "workbook"; it can hold 'n' number of sheets.
    For example, a CUST workbook is created.
    Connections
          o DSN = EXE
          o Password = *****
          o User = xxxxx
    Column
          o Load
                       Import ODBC table definitions
                           •   DSN \\ here select work book
                           •   User id & password
           o Add in ODBC
                     MS EXCEL drivers
                          •   Name = EXE \\ it is DSN
         o Connections
                   DSN = tduser
                   Uid = tduser
                   Pwd = tduser
   Column
         o Load
                   Import
                        •    Table definitions\plug-in\Teradata
                        •    Server: 127.0.0.1
                        •    Uid = tduser
                        •    Pwd = tduser
 After all this navigation, finally we view the data that we have loaded from the source.
                                                                                    DAY 27
                                             Dynamic RDBMS and PROCESSING STAGE
Design: DRS → Ln_EMP_Data → Data Set; DRS → Ln_DEPT_Data → Data Set
         o Column
                      Load
                          •    Meta data for table EMP & DEPT.
   In Oracle Enterprise we can read multiple tables, but we can't load into multiple targets.
   We can solve this problem with DRS, which can read from multiple sources and load into multiple targets.
Here, on the input side we set the connection and load the metadata into the Columns tab; on the output side the source fields and structure are available and the mapping should be done.
                                                          Transformer Stage is “all in one
                                                             stage”.
Stage Variable: "it is a temporary variable which holds a value until the process completes and which does not send its result to the output".
      The stage variable button is shown in the toolbar of the transformer properties.
      After clicking it, the stage variables become visible in the transformer properties.
1. Left Function
2. Right Function
3. Substring Function
4. Concatenate Function
5. Field Function
6. Constraints Function (Filter)
For example, from the word MINDQUEST we need only QUE.
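As a minimal sketch (plain Python, not DataStage expressions), this is how the left/right/substring idea pulls QUE out of MINDQUEST; in the transformer the equivalent would be nesting Left and Right:

    def left(s, n):  return s[:n]    # like Left(s, n)
    def right(s, n): return s[-n:]   # like Right(s, n)

    word = "MINDQUEST"
    print(left(right(word, 5), 3))   # Right -> "QUEST", Left -> "QUE"
    print(word[4:7])                 # substring form, also "QUE"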
Constraints:
       "In the transformer, constraints are used as filters; in other words, a constraint is also called a filter."
Q: How is a constraint used in the Transformer?
A: In the transformer properties there is a constraints row for each output link; double-click it to write the derivation.
NOTE: Tx is very sensitive with respect to data types; the source and target columns cannot have different data types.
Q: How can the below file be read and operated on: filtering, separating using the left, right, and substring functions, and displaying the date as DD-MM-YYYY?
A: File.txt
   HINVC23409CID45432120080203DOL
   TPID5650        5       8261.99
   TPID5655        4       2861.69
   TPID5657        7       6218.96
   HINVC12304CID46762120080304EUO
   TPID5640        3       5234.00
   TPID5645        2       7855.67
   TPID5657        9       7452.28
   HINVC43205CID67632120080405EUO
   TPID5630        8       1657.57
   TPID5635        6       9564.13
   TPID5637        1       2343.64
Design:
SF → (IN1) → Tx1 → (IN2) → Tx2 → (IN3) → Tx3 → (OUT) → DS
Step 1: Load file.txt into a sequential file; in the sequential file properties load the whole data as one record, i.e. create a single column called REC, so there is no need to load metadata for this.
Step 2: IN1 → Tx1 properties: in this step we filter the records starting with "H" from the given file. Here we create two columns, TYPE and DATA.
IN1 (REC) → IN2, with constraint: Left(IN1.REC, 1) = "H"
Derivation → Column:
Left(IN1.REC, 1) → TYPE
IN1.REC → DATA
Step 3: IN2 → Tx2 properties: here we create four columns and separate the data into them.
IN2 (TYPE, DATA) → IN3
Derivation → Column:
Left(IN1.REC, 1) → INVCNO
Left(Right(IN2.DATA, 21), 9) → CID
IN2.DATA[20, 8] → BILL_DATE
Step 4: IN3 → Tx3 properties: here the BILL_DATE column is converted into DD-MM-YYYY format using stage variables.
IN3 columns: INVCNO, CID, BILL_DATE, CURR
Stage Variables (Derivation → Variable):
Right(IN3.BILL_DATE, 2) → D
Right(Left(IN3.BILL_DATE, 6), 2) → M
Left(IN3.BILL_DATE, 4) → Y
OUT (Derivation → Column):
IN3.INVCNO → INVCNO
IN3.CID → CID
D : '-' : M : '-' : Y → BILL_DATE
Step 5: here we set the target file name so the formatted BILL_DATE can be displayed.
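The same job can be sketched in a few lines of Python to make the flow clearer. The field widths (9, 9, 8, 3) are assumptions read off the sample data above, not values taken from the original job:

    # Keep only the "H" header records, split DATA into fields, and
    # reformat the date as DD-MM-YYYY (sketch of Tx1/Tx2/Tx3 above).
    def parse_invoice_headers(lines):
        out = []
        for rec in lines:
            if rec[:1] != "H":                       # constraint: Left(REC, 1) = "H"
                continue
            data = rec[1:]                           # DATA column
            invcno, cid = data[0:9], data[9:18]
            bill_date, curr = data[18:26], data[26:29]
            y, m, d = bill_date[0:4], bill_date[4:6], bill_date[6:8]
            out.append((invcno, cid, d + "-" + m + "-" + y, curr))
        return out

    sample = ["HINVC23409CID45432120080203DOL",
              "TPID5650        5       8261.99"]
    print(parse_invoice_headers(sample))
    # -> [('INVC23409', 'CID454321', '03-02-2008', 'DOL')]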
                                                                                    DAY 29
                                                                 Transformer Functions-II
Q: A file.txt contains special characters, comma delimiters, and spaces (before, after, and in between). How can we solve this with the above functions so that, at the end, each row becomes one clean record?
File.txt
EID,ENAME,STATE
111, NaVeen, AP
222@, MUnNA, TN
555, anvesh,MH
Design:
SF → (IN1) → Tx → (IN2) → Tx → (IN3) → Tx → (OUT) → DS
Step 1: Here we extract file.txt and put all the data into one record in a new column called REC; no need to load metadata for this.
    Point to remember: keep "first line is column name" = true.
Step 2: IN1 → IN2 (splitting REC into the EID, ENAME, and STATE columns).
IN2 (EID, ENAME, STATE) → IN3
Derivation → Column:
Trim(IN2.EID, "@", "") → EID
Upcase(Trim(SWS(IN2.ENAME, "@", ""))) → ENAME
IN3 (EID, ENAME, STATE) → OUT
Derivation → Column:
IN3.EID : IN3.ENAME : IN3.STATE → REC
Step 5:
    For the output, assign a target file. The answer is then displayed with each row as one record, and all the special characters and extra spaces are removed after applying the transformer functions to the above file.txt.
Final output:
                 Trg_file.ds     REC
                                 111NAVEEN AP
                                 222 MUNNATN
                                 333SRAVAN KN
                                 444SAN DEEPKN
                                 555 ANVESHMH
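A small Python sketch of the same clean-up idea (strip the '@' characters and spaces, upper-case the name, concatenate back into one REC); it illustrates the effect, not the exact derivations used above:

    def clean_row(eid, ename, state):
        eid   = eid.replace("@", "").strip()
        ename = ename.replace("@", "").replace(" ", "").upper()
        state = state.replace("@", "").strip()
        return eid + ename + state              # like IN3.EID : IN3.ENAME : IN3.STATE

    print(clean_row("222@", " MUnNA", " TN"))   # -> "222MUNNATN"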
Column Export:
    "It is used to combine multiple columns into a single column", similar to concatenation in the transformer.
    Properties:
         o Input
                     Column method = explicit
                     Column To Export = EID
                     Column To Export = ENAME
                     Column To Export = STATE
         o Output
                     Export column type = “varchar”
                     Export output column = REC
Column Import:
    "It is used to split a single column into multiple columns", similar to the field function in the transformer.
   Properties:
         o Input
                     Column method=
                     Column To Import = REC
         o Output
                     Import column type = “varchar”
                     Import output column= EID
                     Import output column= ENAME
                     Import output column= STATE
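A minimal Python sketch of what column export and column import do (a comma delimiter is assumed here purely for illustration):

    def column_export(row, cols):
        return ",".join(str(row[c]) for c in cols)        # many columns -> one REC

    def column_import(rec, cols):
        return dict(zip(cols, rec.split(",")))            # one REC -> many columns

    row = {"EID": "111", "ENAME": "NAVEEN", "STATE": "AP"}
    rec = column_export(row, ["EID", "ENAME", "STATE"])
    print(rec)                                            # -> "111,NAVEEN,AP"
    print(column_import(rec, ["EID", "ENAME", "STATE"]))  # -> the columns back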
                                                                                     DAY 30
                                                       JOB Parameters (Dynamic Binding)
Dynamic Binding:
       “After compiling the job and passing the values during the runtime is known as
dynamic binding”.                                                                             2010
    Assuming one scenario that when we taking a oracle enterprise, we must provide the
       table and load its meta data. Here table name must be static bind.
    But there is no need for giving the authentication to oracle are to be static bind,
       because of some security reasons. For this we can use job parameters that can provide
       values at runtime to authenticate.
Job parameters:
       “job parameters is a technique that passing values at the runtime, it is also called
dynamic binding”.
    Job parameters are divided into two types, they are
          o Local variables
          o Global Variable
    Local variables (params): "they are created in the DS Designer only and can be used within that job only".
    Global variables: "they are also called environment variables"; they are divided into two types:
          o   Existing: these come with DataStage, in two groups, general and parallel; under parallel, the compiler, operator specific, and reporting options are available.
          o   User Defined: these are created in the DataStage Administrator only.
 NOTE: "Local parameters created in one job cannot be reused in another job; this holds up to version 7. In version 8 we can reuse them with a technique called a parameter set." In version 7 we can still reuse parameters as user-defined values via the DataStage Administrator.
Q: How to give Runtime values using parameters for the following list?
   a. To give runtime values for user ID, password, and remote server?
   b. Department number (DNO) to keep as constraint and runtime to select list of any
         number to display it?
   c. Add BONUS to SAL + COMM at runtime?
   d. Providing target file name at runtime?
   e. Re-using the global and parameter set?
Design:
Step 1:
    "Create job parameters for the given question as local variables".
    Job parameters
               o Parameters: Name = UID, Prompt = USER, Type = String, Default value = SCOTT
    DS Administrator
              o Select a project
                        Properties
                            •   General
                                      o Environment variables
                                                  User defined (there we can write the parameters):
          Name = UID, Prompt = USER, Type = String, Default value = SCOTT
          Name = PWD, Prompt = Password, Type = Encrypted, Default value = ******
          Name = RS, Prompt = SERVER, Type = String, Default value = ORACLE
    Here, global parameters are preceded by $ symbol.
    For Re-use, we must
               o Add environment variables
                         User defined
                             •   UID       $UID
                             •   PWD $PWD
                             •   RS        $RS
Step 3:
          "Create a parameter set for multiple values, providing UID, PWD, and the other values for DEV, PRD, and TEST".
    In local variables job parameters
               o Select multiple of values by clicking on
                         And create parameter set
                             •   Providing name to the set
                                       o   SUN_ORA
                                                   Saving in Table definition
                                                       •   In table definition
NOTE: "A parameter set can be used only by jobs within the same project".
Step 4:
          "In the Oracle Enterprise properties, select the table name and then assign the created job parameters, as shown below".
                                                     Parameters
Step 5:
          "In the Tx properties, use the dept no (DNO) as a constraint and assign the bonus to the bonus column".
Stage Variable (Derivation → Variable):
IN.SAL + NullToZero(IN.COMM) → NS
IN columns: EID, ENAME, STATE, SAL, COMM, DEPTNO
OUT (Derivation → Column):
IN.EID → EID
IN.ENAME → ENAME
NS → NETSAL
NS + BONUS → BONUS
Here, DNO and BONUS are the job parameters we created above to use here.
For that, simply right-click → job parameters → DNO/BONUS (choose the one you want).
Step 6:
          "The target file name is set at runtime; follow the steps below to keep it as a runtime value".
                                                                                       DAY 31
                                                                  Sort Stage (Processing Stage)
Q: What is sorting?
       “Here sorting means higher than we know actually”.
                      Go to OE properties
                           •   Select user define SQL
                                   o   Query: select * from EMP order by DEPTNO.          2010
Design: OE → JOIN → DS
    Sorting is done in a stage called the sort stage; in the properties of the sort stage, under options, keep create key change column (CKCC) = "true" (the default is false).
    Here we must select the column on which you want the group id.
Sort Stage:
       "It is a processing stage that can sort the data in a traditional sort or in a complex sort".
    Complex sort means creating group ids, blocking unwanted sorting, and group-wise sorting for stages that need sorted input, such as join, merge, aggregator, and remove duplicates.
Sort Properties:
    Input properties
           o Sorting key = EID (select the column from source table)
           o Key mode = sort (sort/ don’t sort (previously sorted)/ don’t sort (previously
               grouped))
           o Options
                      Create cluster key change column = false (true/ false)
                      Create key change column = (true/ false)
                           •   True = enables group id.
                           •   False = disables the group id.
    Output properties
           o Mapping should be done here.
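A rough Python sketch of what the create key change column option produces, assuming the data is sorted on the key: the first row of every key group gets keyChange = 1 and the remaining rows of the group get 0:

    def add_key_change(rows, key):
        out, prev = [], object()                 # sentinel so the first row gets 1
        for r in sorted(rows, key=lambda r: r[key]):
            out.append({**r, "keyChange": 1 if r[key] != prev else 0})
            prev = r[key]
        return out

    data = [{"EID": 1}, {"EID": 1}, {"EID": 2}]
    print(add_key_change(data, "EID"))
    # -> EID 1 gets keyChange 1 then 0; EID 2 gets keyChange 1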
                                                                                       DAY 32
                                                             A Transformer & Sort stage job
Q: Sort the given file, collect all the addresses (account types) of each unique record into one column, and put the count of the addresses into a new column.
File.txt
            EID, ENAME, ACCTYPE
Design:
SF → Sort1 → Tx → Sort2 → DS
   Sequential File (SF): here reads the file.txt for the process.
   Sort1: here sorting key = EID
                    And enables the CKCC for group id.
Stage Variables (Derivation → Variable):
If (IN2.keyChange = 1) Then IN2.ACCTYPE Else func1 : ',' : IN2.ACCTYPE → func1
If (IN2.keyChange = 1) Then 1 Else c + 1 → c
IN2 columns: EID, ENAME, ACCTYPE, keyChange
OUT (Derivation → Column):
IN2.EID → EID
IN2.ENAME → ENAME
func1 → ACCTYPE
c → COUNT
   Sort2:
        o    Here, in the properties we must set as below.
                   Stage
                        •    Key=ACCTYPE
                                o      Sort key mode = sort
                                o      Sort order = Descending order
                   Input
                        •    Partition type: hash
                        •    Sorting
                                o Perform sort
                                             Stable (uncheck)
                                             Unique (check this)
                                o Selected
                                             Key= count
                                             Usage= sorting, partitioning
                                             Options= ascending, case sensitive
                   Output
                        •    Mapping should be done here.
Final output columns: EID, ENAME, ACCTYPE, COUNT
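The stage-variable logic of this job can be sketched in Python, under the assumption that the rows arrive sorted on EID (which is what Sort1 and the keyChange column provide): restart the accumulator when the key changes, otherwise append the next ACCTYPE and bump the count:

    def collect_acctypes(rows):
        result = {}
        for r in sorted(rows, key=lambda r: r["EID"]):
            if r["EID"] not in result:                      # keyChange = 1
                acc, count = r["ACCTYPE"], 1
            else:                                           # keyChange = 0
                _, acc, count = result[r["EID"]]
                acc, count = acc + "," + r["ACCTYPE"], count + 1
            result[r["EID"]] = (r["ENAME"], acc, count)
        return result

    rows = [{"EID": 111, "ENAME": "naveen", "ACCTYPE": "savings"},
            {"EID": 111, "ENAME": "naveen", "ACCTYPE": "credit"},
            {"EID": 222, "ENAME": "munna",  "ACCTYPE": "loan"}]
    print(collect_acctypes(rows))
    # -> {111: ('naveen', 'savings,credit', 2), 222: ('munna', 'loan', 1)}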
Filter means "blocking the unwanted data". In DataStage, filtering can be performed at three levels:
1. Source level
2. Stage level
3. Constraints (in the transformer)
    Source Level Filter: "it can be done in the database as well as in a file, at the source level".
           o      Database: by writing filter queries like "select * from EMP where DEPTNO = 10".
           o Source file: here we have an option called filter where we can write filter commands like grep "moon", grep -i "moon", or grep -w "moon".
    Stage Filter:
           o      “Stage filters use in three stages, and they are 1. Filter, 2. Switch and 3.
                  External filter”.
           o Difference between IF and SWITCH:
                      IF: poorer performance.
                      SWITCH: better performance than IF.
Filter stage: "it has one input, 'n' outputs, and one reject link".
    The symbol of the filter stage is shown below.
Q: How does the filter stage send the data from source to target?
Design: OE → Filter → T1 (DS), T2 (DS), Reject (DS)
Step1:
    Connecting to the oracle for extracting the EMP table from it.
Step2:
Filter properties
    Predicates
              o Where clause = DEPTNO = 10
                        Output link =1
              o Where clauses = SAL > 1000 and SAL < 3000
                        Output link = 2
              o Output rejects = true // it is for output reject data.
    Link ordering
              o Order of the following output links
    Output:
              o Mapping should be done for links of the targets we have.
                        Here, Mapping for T1 and T2 should be done separately for both.
Step 3:
    "Assign the target file names in the targets".
The filter stage has no built-in reject link; we must convert one of the output links into a reject link, because the stage has 'n' outputs.
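A small Python sketch of how the filter stage routes rows for this job (note that, as in the filter stage, a row may satisfy more than one where clause and then goes to both targets; only rows that match nothing go to the reject output):

    def run_filter(rows):
        t1, t2, reject = [], [], []
        for r in rows:
            matched = False
            if r["DEPTNO"] == 10:                 # where clause for output link 1
                t1.append(r); matched = True
            if 1000 < r["SAL"] < 3000:            # where clause for output link 2
                t2.append(r); matched = True
            if not matched:
                reject.append(r)                  # output rejects = true
        return t1, t2, reject

    emp = [{"DEPTNO": 10, "SAL": 5000}, {"DEPTNO": 20, "SAL": 1500}, {"DEPTNO": 30, "SAL": 900}]
    print(run_filter(emp))   # row 1 -> target1, row 2 -> target2, row 3 -> reject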
                                                                                         DAY 34
                                                Jobs on Filter and properties of Switch stage
Assignment Job 1:
   a. Only DEPTNO 10 to target1?
   b. Condition SAL>1000 and SAL<3000 satisfied records to target2?
   c. Only DEPTNO 20 records where SAL<1000 or SAL>3000 to target3?
   d. Reject data to target4?
Design: EMP_TBL → Filter(s) → T1, T2, T3, Reject (T4)
Step1: “For target1: In filter where clause for target1 is DEPTNO=10 and link order=0”.
Step2: “For target2: where clause = SAL>1000 and SAL<3000 and link order=1”.
Job 2:
   a. All records from source to target1?
   b. Only DEPTNO=30 to target2?
   c. Where clause = SAL<1000 or SAL>3000 to target3?
   d. Reject data to target4?
Design: EMP_TBL → Copy → T1; Copy → Filter → T2, T3, Reject (T4)
Step1: “For target1 mapping should be done output links for this”.
Step 2: "For target2: where clause = DEPTNO=30 and link order = 0".
Step 3: "For target3: where clause = SAL<1000 or SAL>3000 and link order = 1".
Step4: “For target4 convert link into reject link and output reject link=true”.
Job 3:
   a. All unique records of DEPTNO to target1?
   b. All duplicates records of DEPTNO to target2?
   c. All records to target3?
   d. Only DEPTNO 10 records to target4?
   e. Condition SAL>1000 & SAL<3000, but no DEPTNO=10 to target5?
Design: EMP_TBL → copy and filter stages (key = DEPTNO) → targets T1–T5 as per the requirements above.
Step3: “For target3: mapping should be done output links for this”.
Step4: “For target4: where clause= DEPTNO=10”.
Step5: “For target5: in filter properties put output rows only once= true for where clause
SAL>1000 & SAL<3000”.
SWITCH Stage:
       "A condition on a single column; it has only 1 input, up to 128 outputs, and 1 default (reject) link".
Picture of switch stage:
    Cases                     values
           o   Case = 10 = 0                 link order
           o Case = 20 = 1
 Options
                                                                                         DAY 35
                                                                External Filter and Combining
External Filter: "It is a processing stage which can perform filtering through UNIX commands".
    It has 1 input, 1 output, and 1 reject link.
    To filter a text file, it must first be read as a single record in the input.
    Example filter command: grep "newyork".
Combining is of three types:
          o Horizontal combining
          o Vertical combining
          o Funnelling combining
Horizontal combining: combining primary rows with secondary rows based on a primary key.
          o This is performed by the JOIN, LOOKUP, and MERGE stages.
    These three stages differ from each other with respect to:
          o input requirements,
          o treatment of unmatched records, and
          o memory usage.
                                                                                DAY 36
                                Horizontal Combining (HC) and Description of HC stages
Horizontal Combining (HC): “combining the primary rows with secondary rows based on
primary key”.
Example: an employee table (ENO, ENAME, DNO: 111 naveen 10, 222 munna …) and a department table (DNO, DNAME, LOC: 10 IT HYD, 20 SE SEC, 40 SA …) are combined by HC on DNO to give (DNO, DNAME, LOC, ENO, ENAME).
Here we can combine using:
                              inner join,
                              left outer join,
                              right outer join, and
                              full outer join.
   The differences between join, lookup, and merge with respect to the above points are shown below.
JOIN vs LOOKUP vs MERGE:
 Input names:
 o JOIN: when we do HC with JOIN, the first source is the left table, the last source is the right table, and all the middle sources are intermediate tables.
 o LOOKUP: the first link from the source is the primary/input link and the remaining links are lookup/reference links.
 o MERGE: the first table is the master table and the remaining tables are update tables.
 Input / output / rejects:
Join Types:
 :: MEMORY USAGE::
 o JOIN: light memory; LOOKUP: heavy memory; MERGE: light memory.
                                                                                    DAY 37
                                                          LOOKUP stage (Processer Stage)
Lookup stage:
    In real-time projects, 95% of horizontal combining is done with this stage.
    "The lookup stage is for cross verification of primary records with secondary records".
    DataStage version 8 supports four types of LOOKUP; they are
          o Normal LOOKUP
          o Sparse LOOKUP
          o Range LOOKUP
          o Caseless LOOKUP
Design: primary table (ENO, ENAME, DNO) and reference table (DNO, DNAME, LOC) → LOOKUP → target (ENO, ENAME, DNAME).
The toolbar of the LOOKUP stage has a constraints button, in which we have to select how unmatched primary records are handled:
    Continue: this option gives a left outer join.
    Drop: this gives an inner join.
    Fail: it aborts the job if there are unmatched primary records.
    Reject: it captures the unmatched primary records.
                                                                                        DAY 38
                                                                Sparse and Range LOOKUP
Sparse LOOKUP:
    If the source is a database, it supports only two inputs.
    Normal lookup: "cross verification of primary records with secondary records in memory".
    Sparse lookup: "cross verification of primary records with secondary records at the source level itself".
    To set a sparse lookup we must set the key type to sparse on the reference table only.
But in ONE case a sparse-style LOOKUP can support 'n' references: by taking a lookup file set.
Job 1: a sequential file extracts a text file and loads it into a lookup file set (lfs).
Job 2: in this job we use the lookup file set as a sparse lookup.
Design: SF → LOOKUP (references: LFS …… LFS) → DS
     In the lookup file set we must paste the path of the lfs created above.
     A lookup file set supports 'n' references, which indirectly means sparse lookup supports 'n' references.
                                                                                           DAY 39
                                                              Funnel, Copy and Modify stages
Funnel Stage:
       “It is a processing stage which performs combining of multiple sources to a target”.
Copy Stage:
       "It is a processing stage which can be used for":
   1. Copying source data to multiple targets.
   2. Changing the column names.
   3. Dropping columns.
   4. Acting as a stub stage.
NOTE: it is best for changing column names and dropping columns.
Modify Stage:
       "It is a processing stage which can perform":
   1. Dropping columns.
   2. Keeping columns.
   3. Changing the column names.
   4. Modifying the data types.
   5. Altering the data.
From OE, using the modify stage, send data into a data set exercising the above five points.
In modify properties:
                                                                                         DAY 40
                                                                 JOIN Stage (processing stage)
The join stage is used in horizontal combining; it has its own behaviour with respect to input requirements, treatment of unmatched records, and memory usage.
    Join stage input names are the left table, the right table, and intermediate tables.
    The join stage has 'n' inputs (for inner, LOJ, ROJ), 2 inputs (for FOJ), 1 output, and no reject link.
    The types of join are inner join, left outer join, right outer join, and full outer join.
    Input requirements with respect to sorting: sorting is mandatory on both the primary and secondary tables.
JOIN properties:
    In the join stage, when we sort on key columns with different names, the job still executes but it affects the performance (simply put, WARNINGS will occur).
                                                                                     DAY 41
                                                            MERGE Stage (processing stage)
The merge stage is a processing stage; it performs horizontal combining and has its own behaviour with respect to input requirements, treatment of unmatched records, and memory usage.
    Input requirements with respect to de-duplication: we get warnings when we don't remove the duplicates from the primary (master) table; for the secondary (update) tables nothing happens, it is OK if we don't remove the duplicates.
    Treatment of unmatched records: for the primary table, Drop drops and Keep keeps the unmatched master records; for the secondary tables, the unmatched update records are dropped or captured through reject links.
    In the merge stage the memory usage is LIGHT.
    The key column names must be the SAME.
    With the inner-join type it compares against ANY of the update tables.
NOTE:
    Static information stores in the master table.
    All changes information stores in the update tables.
    Merge operates with only two options
           o Keep (left outer join)
           o Drop (inner Join)
Design: master table + update tables (U1, U2) → MERGE → TRG (with optional reject links).
In the MERGE properties:
     Merge has an inbuilt sort = (ascending order/descending order).
     The link order must be followed.
     Merge supports (n-1) reject links.
     NOTE: there have to be the same number of reject links as update links, or zero reject links.
     Here a COPY stage can act as a STUB stage, meaning it holds the data without sending it on to any target.
                                                                                       DAY 42
                                                   Remove Duplicates & Aggregator Stages
Remove Duplicates:
        "It is a processing stage which removes the duplicates on a key column and retains the first or last of the duplicate rows".
o Duplicate to retain = (first/last)
Aggregator:
       "It is a processing stage that performs counts of rows and different calculations between columns, i.e. the same operation as GROUP BY in Oracle".
       SF                             Aggregator                                DS
       SF                             Aggregator                                DS
Properties of Aggregator:
    Grouping keys:
            o Group= Deptno
    Aggregator
            o Aggregator type = count rows (count rows/ calculation/ re – calculation)
            o Count output column= count <column name>
1Q: Count the number of all records and deptno wise in a EMP table?
1 Design:
3Q: Find the maximum salary in the EMP table of a company, and all the details of that employee?
                                        &
4Q: Find the max, min, and sum of salary per DEPTNO in the EMP table?
3 & 4 Design: emp source → aggregator (max salary, DEPTNO-wise) compared/joined back with the detail rows per DEPTNO (dno=10, dno=20, …), then funnelled (UNION ALL) to give the overall max.
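A short Python sketch of the aggregations asked for above (count of rows per DEPTNO and the maximum salary per DEPTNO, i.e. the GROUP BY idea):

    from collections import defaultdict

    def aggregate(emp):
        counts, max_sal = defaultdict(int), defaultdict(int)
        for r in emp:
            counts[r["DEPTNO"]] += 1                                    # count rows
            max_sal[r["DEPTNO"]] = max(max_sal[r["DEPTNO"]], r["SAL"])  # calculation
        return dict(counts), dict(max_sal)

    emp = [{"DEPTNO": 10, "SAL": 5000}, {"DEPTNO": 10, "SAL": 3000}, {"DEPTNO": 20, "SAL": 2000}]
    print(aggregate(emp))   # -> ({10: 2, 20: 1}, {10: 5000, 20: 2000})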
                                                                                           DAY 43
                                                    Slowly Changing Dimensions (SCD) Stage
Example: #1
Example: #2
Before Data:
CID                              CNAME                         ADD
11                               A                             HYD
22                               B                             SEC
33                               C                             DEL
33 D PUN
 Extract the after and before data from the DW (or) database to compare and upsert.
Explanation:
SCD – I:        "it maintains only the current update; no historical data is kept".
As per SCD – I, the before data is simply updated with the after data and no history remains after the execution.
SCD – II:      "it maintains both the current data and the historical data", with some special operational columns: surrogate key, active flag, effective start date, and effective end date.
      In SCD – II there is no natural primary key, so a system-generated primary key is needed, i.e. the surrogate key. Here the surrogate key acts as the primary key.
      When SCD – II is performed we get a practical problem: identifying the old versus the current record. We solve that with the active flag: "Y" or "N".
      In SCD – II, new concepts are introduced: the effective start date (ESDATE) and the effective end date (EEDATE).
      Record version: a concept used when ESDATE and EEDATE cannot be used under some conditions.
      Unique key: the comparison is done on the unique (business) key.
SCD – III: SCD – I (+) SCD – II: "maintains the history but no duplicates".
Table: a table describing the six SCD types; the description is shown above.
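A compact Python sketch of the difference between SCD I and SCD II on one changed customer address (the column names SKID, ACF, ESDATE, EEDATE follow the ones above; the dimension layout itself is assumed for illustration):

    import datetime

    def scd1_update(dim, cid, new_add):
        dim[cid]["ADD"] = new_add                        # overwrite: no history kept

    def scd2_update(rows, next_skid, cid, new_add, today=None):
        today = today or datetime.date.today()
        for r in rows:
            if r["CID"] == cid and r["ACF"] == "Y":      # expire the current version
                r["ACF"], r["EEDATE"] = "N", today - datetime.timedelta(days=1)
        rows.append({"SKID": next_skid, "CID": cid, "ADD": new_add,
                     "ACF": "Y", "ESDATE": today, "EEDATE": None})

    rows = [{"SKID": 1, "CID": 11, "ADD": "HYD", "ACF": "Y",
             "ESDATE": datetime.date(2010, 1, 1), "EEDATE": None}]
    scd2_update(rows, next_skid=2, cid=11, new_add="DEL")
    print(rows)   # old row flagged 'N' with an end date, new active row for DEL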
                                                                                        DAY 44
                                                     SCD I & SCD II (Design and Properties)
Design: OE_SRC (after data) and DS_TRG_DIM (before data) as inputs; DS_TRG_DIM (dimension) and OE_UPSERT as outputs.
Step 1: Load the plug-in metadata from Oracle for the before and after data, as shown in the above links coming from the different sources.
Fast path 2 of 5: map the key column values between the before and after tables.
AFTER (SNO, SNAME) ↔ BEFORE (Key Expr | Column Name | Purpose):
| SKID | surrogate key
AFTER.SNO | SNO | business key
NOTE: every time we run the job we should empty the surrogate key source file, i.e. empty.txt; otherwise the surrogate key will continue from the last stored value.
AFTER (SNO, SNAME) → DIM (Derivation | Column Name | Purpose):
next sk() | SKID | surrogate key
AFTER.SNO | SNO | business key
AFTER (SNO, SNAME) and BEFORE (SKID, SNO, SNAME) → FACT (Derivation | Column Name):
BEFORE.SKID | SKID
AFTER.SNO | SNO
Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when loading into Oracle we must set write method = upsert; it has two options:
-update then insert    \\ if the key column value already exists
-insert then update    \\ if the key column value is new
Design: OE_SRC (after: 10, 20, 40) and DS_TRG_DIM (dim: 10, 20, 20, 30, 40) → OE_UPSERT (update and insert) → DS_TRG_DIM.
Add some columns to the before table: convert the EEDATE and ESDATE columns into timestamps in a transformer stage to perform SCD II.
In the Tx properties:
BEFORE (SKID, SNO, SNAME, ESDATE, EEDATE, ACF) → BEFORE_TX (Derivation | Column Name):
BEFORE.SKID | SKID
BEFORE.SNO | SNO
BEFORE.SNAME | SNAME
In the SCD II properties:
Fast path 2 of 5: map the key column values between the before and after tables.
AFTER (SNO, SNAME) ↔ BEFORE (Key Expr | Column Name | Purpose):
| SKID | surrogate key
AFTER.SNO | SNO | business key
| SNAME | Type 2
| ESDATE | effective date
NOTE: every time we run the job we should empty the surrogate key source file, i.e. empty.txt; otherwise the surrogate key will continue from the last stored value.
AFTER (SNO, SNAME) → DIM (Derivation | Column Name | Purpose):
next sk() | SKID | surrogate key
AFTER.SNO | SNO | business key
AFTER.SNAME | SNAME | Type 2
curr date() | ESDATE | effective date
Expire expression: DateFromJulianDay(JulianDayFromDate(CurrentDate()) – 1) → EEDATE
Fast path 5 of 5: set the output paths to the FACT data set.
AFTER (SNO, SNAME) and BEFORE (SKID, SNO, SNAME, ESDATE, EEDATE, ACF) → FACT (Derivation | Column Name):
BEFORE.SKID | SKID
AFTER.SNO | SNO
AFTER.SNAME | SNAME
BEFORE.ESDATE | ESDATE
Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when loading into Oracle we must set write method = upsert; it has two options:
-update then insert    \\ if the key column value already exists
-insert then update    \\ if the key column value is new
Change Capture / Change Apply:
Design: before.txt and after.txt → Change Capture (key = EID, option: explicit keys & values, change code = all) → Change Apply → target.
In version 7.5.x2: "the surrogate key stage generates its values afresh each time the job is compiled and run; to identify the last value generated, in version 7 we have to build another job to store the last generated value".
And that job in version 7.5.x2 looks like this:
Design: SF → Surrogate Key → Copy → DS; Copy → Tail → Peek
     In this job, the surrogate key stage generates the system key column values, which act like primary key values. But it generates them afresh on each compile.
     By taking the tail stage, we trace the last value and store it via the peek stage, i.e. in a buffer.
     With that buffered value we can continue generating the sequence values, that is the surrogate key, in version 7.5.x2.
In version 8.0:
         "The above problem with version 7 is overcome in version 8.0: the surrogate key stage takes an empty text file (empty.txt), stores the last-value information in that file, and uses it to generate the next sequence values".
Properties of SK version8:
Option 1: generated output column name = skid
          Source name = g:\data\empty.txt
          Source type = flat file
Option 2: database type= oracle (DB2/ oracle)
          Source name = sq9 (in oracle – create sequence sq9)\\ it is like empty.txt
          Password= tiger
          User id= scott
          Server name= oracle
          Source type = database sequence
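The empty.txt idea can be sketched in Python: keep the last issued value in a small state file so that each run of the job continues the sequence instead of restarting (the file name used here is only an illustration):

    import os

    def next_surrogate_keys(state_file, how_many):
        last = 0
        if os.path.exists(state_file):
            with open(state_file) as f:
                text = f.read().strip()
                last = int(text) if text else 0
        keys = list(range(last + 1, last + 1 + how_many))
        with open(state_file, "w") as f:
            f.write(str(keys[-1]))               # remember the last value issued
        return keys

    print(next_surrogate_keys("skid_state.txt", 3))   # first run  -> [1, 2, 3]
    print(next_surrogate_keys("skid_state.txt", 2))   # second run -> [4, 5]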
                                                                                        DAY 46
                                                                           DataStage Manager
Export:
       "Export is used to save a group of jobs to a file so they can be moved wherever we want".
Navigation – "how to export"?
DataStage toolbar
    Change selection: ADD or REMOVE or SELECT ALL
           o Job components to export
 Source name\.....
Import:
       "Import is used to bring .dsx or .xml files into a particular project and also to import some definitions, as shown below".
Options of import are
          o DataStage components…
          o DataStage components (xml)…
          o External function definitions
          o Web services function definitions
          o Table definitions
          o   IMS definitions
                       In IMS two options are,
                           •   Database description (DBD)
                           •   Program Specification Block (PSB / PCB)
    In DataStage components..
          o Import from file
Generate Report:
       “It is for to generate report to a job or a specific, that it generates a report to a job
instantly”.
For that, go to
    File
              o Generate report
                        Report name
                               •   Options
                                   Use default style sheet
                                   Use custom style sheet
After finishing the settings:
    It’s generates in default position
                  “/reportingsendfile/ send file/ tempDir.tmp”
Node Configuration:
Q: How to see the nodes in a project?
              o Go to the DS Director
                        Check in the logs
                               •   Double click on the main program: APT config file
Q: Which node runs each and every job, and what is the name of the configuration file?
           o Every job runs on the APT node(s) named in the default configuration file.
           o   The name of the configuration file is C:\ibm\.........\default.apt
Q:      If a uni-processing system with 1 CPU needs a minimum of 1 node to run a job, then how many minimum nodes does an SMP with 4 CPUs need?
           o Only 1 node.
Advanced Find:
        “It is the new feature to version8”
        It consists of to find objects of a job like list shown below
        1. Where used,
        2. Dependency,
        3. Compared report.
Q: How to run a job within a job?
        Navigation for how to run a job in a job:
      Job properties
           o Job control
                       Select a job
                           •    -------------
                           •    -------------       here, the Job Control Language (JCL) script is present.
                           •    -------------
                                    o Dependencies
                                                   Select the job (compile that job first, before the main job)
Q: Repository of Advanced Find (i.e. the palette of Advanced Find)?
           o Name to find:       Nav*
          o Last modification
          o Where used
                     Find objects that use any of the following objects.
                                                                                      DAY 47
                                                                            DataStage Director
DS Director maintains:
    Schedule
    Monitor
    Views
            o Job view
            o Status view
            o Log view
    Message Handling
    Batch jobs
    Unlocking
Schedule:
         “Schedule means a job can run in specific timings”
    To set timings for that,
            o Right click on job in the DS Director
                       Click on “add to schedule…”
                           •   And set the timings.
    In real time, job sequences are scheduled with the tools shown below
            o Tools to schedule jobs (this happens in production only)
                       Control M
                       Cron tab
                       Autosys
Purge:
         "Purge means cleaning, washing out, or deleting the already created logs"
                           -   The log of a job can be cleared.
                           -   The job log has an option called FILTER; by right-clicking we can filter.
    Navigation to set the purge:
            o Tool bar
                      Job
                              -   Clear log (choose the option)
                                     o Immediate purge
                                     o Auto purge
Monitor:
       “It shows the Status of job, numbers of row where executed, started at (time), elapsed
time (i.e. rows/sec), percentage used by CPU)”
    Navigation for job that how to monitor.
            o Right click on job
                      Click monitor
                              •   “it shows performance of a job”
       NOTE: Based on this we can check the performance tuning of a stage in a particular
       job.
Abort a job:
Q: How can we abort a job conditionally?
    Conditionally
          o When we Run a job
                     There we can keep a constraint
                         •   Like warnings
                                 o No limit
                                                        5
                                 o Abort job after:
    In transformer stage
          o Constraint
                     Otherwise/log
                         •   Abort after rows: 5 (if 5 records not meet the constraint it’s
                             simple aborts the job)
    We can keep constraint same like this only in Range Lookup.
Message Handling:
       "If the warnings cannot be fixed at the source, we handle them through message handling"
    Navigation for how to add a rule set to handle a warning message:
    Job logs
           o Right click on a warning
                      Add rule to message handler
                      Two options
                          •    Suppress from log
                          •    Demote to information
                      Choose one of the above options and add the rule.
Batch jobs:
       “Executing set of jobs in a order”
Q: How to create a Batch?
Navigation for creating a batch
    DS Director
           o Tools
                      Batch
                          •    New (give the name of batch)
                          •    Add jobs in created job batch
                                  o Just compile after adding in new batch.
       “We can unlock the jobs for multiple instances by release all the permissions”
Navigation for unlock the job
DS Director
    Tool bar
           o Job
                      Cleanup resources
                      Processes
                           •    Show by job
                           •    Show all
                                   o Release all
To see the PIDs of jobs globally:
    DS Administrator
           o General
                      Environment variables
                           •    Parallel
                                   o Reporting
                                                Add (APT_PM_SHOW_PIDS)
                                                   •   Set as (true/false)
                                                                               DAY 48
                                                              Web Console Administrator
Components of administrator:
    Administration:
         o User & group
                    Users
                        •    User name & password is created here.
                        •    And assigning permissions
    Session managements:
         o Active sessions
                    For admin
    Reports:
         o DS
                    INDIA (server/system name)
                        •    View report.
                        •    We can create the reports.
    Domain Management:
         o License
                    Update the license here
                    Upload to review
    Scheduling management:
      “It is know what user is doing from part”
         o Scheduling views
                    New
                          •    schedule | Run
                          •    creation task run | last update
                                                                                         DAY 49
Job Sequencing
Job Activity:
       "The job activity holds a job; it has 1 input and 'n' outputs"
How is the Job Activity dragged onto the design canvas?
Triggers from a job activity: OK, WARNING, FAIL — for example, the FAIL trigger can be linked to a Terminator activity.
Check Point:
       "A checkpoint lets the sequence re-start from where it aborted"
    It is a special option that we must enable manually.
    Go to
           o   Job properties in DS Designer
                      Enable the checkpoint option
Parameter mapping:
       “If job have already some parameters to that we can map to the another job if we need”
                                                                                         2010
Triggers:
       "A trigger holds the link expression type, i.e. how the link should act"
Terminator Activity:
       "It is the stage that handles the error if a job fails"
Properties:
It consists of two options, for when any subordinate jobs are still running:
    On job failure
              o Send STOP requests to all running jobs
                        And wait for all jobs to finish
Notification Activity:
        “It is sending acknowledgement in between the process”
Option to fill in the properties:
        SMTP Mail server Name:
        Senders email address:
        Recipients email address:
        Email subject:
        Attachments:
        Email body:
              D:\DS\SCD_LOAD
                     browse file
Wait For File activity:
File name:
Two options:            wait for file to appear
                        Wait for file to disappear
Timeout length (hh:mm:ss) or Do not timeout (no time limit for the above options)
                                                                                        DAY 50
                                       Performance tuning w.r.t partition techniques & Stages
NOTE: Modulus gives higher performance than Hash, because of the way it groups the data based on the mod value.
NOTE: But modulus can only be selected if there is only one key column and its data type is integer.
In the key-less partition techniques:
     Same: it never re-distributes the data; it carries forward whatever partitioning the previous stage produced.
     Entire: it distributes the same (complete) group of records to all nodes, with the purpose of avoiding mismatched records during the operation.
     Round Robin: generator-type stages such as the Column Generator are associated with this partition technique.
            o It is a better partition technique compared to random.
     Random: stages that use key-less partitioning take this technique as the default.
 Conversions: use the Modify stage or the Transformer stage (the transformer takes more compile time).
DAY 51
Compress Stage:
       “It is a processing stage that compresses the records into single format means in single
file or it compresses the records into zip”.
    It supports – “1 input and 1 output”.
Properties:
    Stage
              o Options
                        Command= (compress/gzip)
    Input
              o <do nothing>
    Output
              o Load the ‘Meta’ data of the source file.
Expand Stage:
       “It is a processing stage the extract the compress data or its extract the zip data into
unzip data”.
    It supports – “1 input and 1 output”.
Properties:
    Stage:
              o Options : - command= (uncompress/gunzip)
    Input:
              o <do nothing>
    Output:
              o Load the Meta data of the source file for the further process.
Encode Stage:
       "It is a processing stage that encodes the records into a single format with the support of a command-line command".
    It supports – “1-input and 1-output”.
Properties:
    Stage
              o   Options: Command line = (compress/ gzip)
    Input
              o <do nothing>
    Output
              o Load the ‘Meta’ data of the source file.
Decode Stage:
       “It is processing stage that decodes the encoded data”.
    It supports – “1-intput and 1-output”.
Properties:                                                                                 2010
     Stage
              o Options: command line = (uncompress/gunzip)
     Output
              o   Load the ‘Meta’ data of the source file.
Generic Stage:
        "It is a processing stage into which any operator can be called, but the properties must be filled in properly".
     It supports "n inputs and n outputs, but no rejects".
     When the job is compiled, the job-related OSH code is generated.
     The generic stage can call any operator available in DataStage.
     Its purpose is migrating server jobs to parallel jobs (IBM has an X-migrator that converts about 70%).
     It can call ANY operator here, but the properties must be filled in properly.
Properties:
     Stage
              o Options
                        Operator: copy (we can write any stage operator here)
     Input
              o <do nothing>
     Output
              o Load the Meta data of the source file.
Pivot Stage:
       "It is a processing stage that converts rows into columns in a table".
    It supports "1 input and 1 output".
XML Input:
       “”.