What Does Ab Initio Mean?
Ab Initio is a Latin phrase that means:
Of, relating to, or occurring at the beginning; first
From first principles, in scientific circles
From the beginning, in legal circles
About Ab Initio
Ab Initio is a general-purpose data processing platform for enterprise-class,
mission-critical applications such as data warehousing, clickstream
processing, data movement, data transformation, and analytics.
It supports integration of arbitrary data sources and programs, and
provides complete metadata management across the enterprise.
It is a proven, best-of-breed ETL solution.
Applications of Ab Initio:
– ETL for data warehouses, data marts, and operational data stores.
– Parallel data cleansing and validation.
– Parallel data transformation and filtering.
– High-performance analytics.
– Real-time, parallel data capture.
Ab Initio Platforms
No problem is too big or too small for Ab Initio.
Ab Initio runs on a few processors or a few hundred processors, and on
virtually every kind of hardware:
SMP (Symmetric Multiprocessor) systems
MPP (Massively Parallel Processor) systems
Clusters
PCs
Ab Initio runs on many operating systems:
Compaq Tru64 UNIX
Digital UNIX
Hewlett-Packard HP-UX
IBM AIX
NCR MP-RAS
Red Hat Linux
IBM/Sequent DYNIX/ptx
Siemens Pyramid Reliant UNIX
Silicon Graphics IRIX
Sun Solaris
Windows NT and Windows 2000
Ab Initio base software consists of three main pieces:
Ab Initio Co>Operating System and core components
Graphical Development Environment (GDE)
Enterprise Metadata Environment (EME)
Ab Initio Architecture
From top to bottom, the layers are:
Applications
Ab Initio application development environments (Graphical, C++, Shell) and the Metadata Repository
Component Library, user-defined components, and third-party components
Ab Initio Co>Operating System
Native operating system (UNIX or Windows NT)
Ab Initio Overview
Users create all graphs in the GDE and run them on the Co>Operating System.
The EME is a repository that stores all variables, collects all metadata
about graphs developed in the GDE, and is also used for version control.
A graph, when deployed, generates a .ksh script.
A scheduler (DTM) is used to schedule graphs developed in the GDE; it also
has the capability to maintain dependencies between graphs.
Co>Operating System
The Co>Operating System is the core software that unites a
network of computing resources (CPUs, storage disks,
programs, datasets) into a production-quality data
processing system with scalable performance and
mainframe reliability.
The Co>Operating System is layered on top of the native
operating systems of a collection of computers. It provides
a distributed model for process execution, file
management, process monitoring, checkpointing, and
debugging.
Graphical Development Environment (GDE)
The GDE lets you create applications by dragging and dropping
components onto a canvas, configuring them with
familiar, intuitive point-and-click operations, and
connecting them into executable flowcharts.
These diagrams are architectural documents that
developers and managers alike can understand and
use. The Co>Operating System executes these flowcharts
directly, so there is a seamless and solid
connection between the abstract picture of the
application and the concrete reality of its execution.
Graphical Development Environment (GDE)
The Graphical Development Environment (GDE) provides
a graphical user interface to the services of the
Co>Operating System.
Unlimited scalability: data parallelism results in speedups
proportional to the hardware resources provided; double
the number of CPUs and execution time is halved.
Flexibility: provides a powerful and efficient data
transformation engine and an open component model for
extending and customizing Ab Initio’s functionality.
Portability: runs heterogeneously across a huge variety of
operating systems and hardware platforms.
Graphical Method for Building
Business Applications
A Graph is a picture that represents the various
processing stages of a task and the streams of data
as they move from one stage to another.
If one picture is worth a thousand words, is one
graph worth a thousand lines of code? Ab Initio
application graphs often represent in a diagram or
two what might otherwise have taken hundreds to
thousands of lines of code. This can dramatically
reduce the time it takes to develop, test, and
maintain an application.
What is Graph Programming?
Ab Initio has based the GDE on the Data Flow Model.
Data flow diagrams allow you to think in terms of
meaningful processing steps, not microscopic
lines of code.
Data flow diagrams capture the movement of
information through the application.
Ab Initio calls this development method Graph
Programming.
Graph Programming?
The process of constructing Ab Initio applications
is called Graph Programming.
In Ab Initio’s Graphical Development
Environment, you build an application by
manipulating components, the building blocks of
the graph.
Ab Initio Graphs are based on the Data Flow
Model. Even the symbols are similar. The basic
parts of Ab Initio graphs are shown below.
Symbols
Boxes for processing and data transforms
Arrows for data flows between processes
Cylinders for serial I/O files
Divided cylinders for parallel I/O files
Grid boxes for database tables
Graph Programming
Working with the GDE on your desktop is easier
than drawing a data flow diagram on a whiteboard.
You simply drag and drop functional
modules called Components and link them with a
swipe of the mouse. When it’s time to run the
application, the Ab Initio Co>Operating System turns
the diagram into a collection of processes running on
servers.
The Ab Initio term for a running data flow diagram is
a Graph. The inputs and outputs are dataset
components; the processing steps are program
components; and the data conduits are flows.
Anatomy of a Running Job
What happens when you push the “Run”
button?
Your graph is translated into a script that can be executed in
the Shell Development Environment.
This script and any metadata files stored on the GDE client
machine are shipped (via FTP) to the server.
The script is invoked (via REXEC or TELNET) on the server.
The script creates and runs a job that may run across many
nodes.
Monitoring information is sent back to the GDE client.
Anatomy of a Running Job
Host Process Creation
Pushing the “Run” button generates a script.
The script is transmitted to the Host node.
The script is invoked, creating the Host process.
Anatomy of a Running Job
Agent Process Creation
The Host process spawns Agent processes.
Anatomy of a Running Job
Component Process Creation
Agent processes create Component processes on each processing
node.
Anatomy of a Running Job
Component Execution
Component processes do their jobs.
Component processes communicate directly with datasets and each
other to move data around.
Anatomy of a Running Job
Successful Component Termination
As each Component process finishes with its data, it exits with
success status.
Anatomy of a Running Job
Agent Termination
When all of an Agent’s Component processes exit, the Agent informs
the Host process that those components are finished.
The Agent process then exits.
Anatomy of a Running Job
Host Termination
When all Agents have exited, the Host process
informs the GDE that the job is complete.
The Host process then exits.
Ab Initio S/w Versions & File Extensions
Software Versions
– Co>Operating System Version => 2.8.32
– GDE Version => 1.8.22
File Extensions
– .mp Stored Ab Initio graph or graph component
– .mpc Program or custom component
– .mdc Dataset or custom dataset component
– .dml Data Manipulation Language file or record type
definition
– .xfr Transform function file
– .dat Data file (either serial file or multifile)
Versions
To find the GDE version, select
Help >> About Ab Initio from the
GDE window.
To find the Co>Operating System
version, select Run >> Settings
from the GDE window and look for
the Detected Base System Version.
Connecting to Co>op Server from GDE
Host Profile Setting
1. Choose Settings from the Run menu.
2. Check the Use Host Profile Setting checkbox.
3. Click the Edit button to open the Host Profile dialog.
4. If running Ab Initio on your local NT system, check the Local
Execution (NT) checkbox and go to step 6.
5. If running Ab Initio on a remote UNIX system, fill in the
Host, Host Login, and Password.
6. Type the full path of the Host directory.
7. Select the Shell Type from the pull-down menu.
8. Test the login and make changes if necessary.
Host Profile
In the Host Profile dialog, enter the Host, Login, Password, and
Host directory, and select the Shell type.
Ab Initio Components
Ab Initio provides a rich set of
built-in components. The Dataset,
Partition, Transform, Sort, and
Database categories are the most
frequently used.
Creating Graph
Type the label for the component and specify the input .dat file.
Create Graph - Dml
Specify the record format’s .dml file in one of four ways:
– Propagate from Neighbors: copy record formats from the connected flow.
– Same As: copy the record format from a specific component’s port.
– Path: store record formats in a local file, a host file, or the Ab
Initio repository.
– Embedded: type the record format directly in a string.
Creating Graph - dml
DML is Ab Initio’s Data Manipulation Language. DML describes data
in terms of:
– Record Formats that list the fields and format of input,
output, and intermediate records.
– Expressions that define simple computations, for example,
selection.
– Transform Functions that control reformatting, aggregation,
and other data transformations.
– Keys that specify grouping, ordering, and partitioning
relationships between records.
You can edit a .dml file through the Record Format Editor (Grid View).
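For illustration, here is a minimal DML record format of the kind such a
.dml file might contain. It mirrors the Customers format used later in
this deck; the comments are added here for explanation:

    record
        decimal(2) id;       /* two-digit customer id */
        string(5) name;      /* fixed-width name field */
        decimal(5) zipcode;  /* used later as a grouping key */
        decimal(3) amount;   /* transaction amount */
        string(1) newline;   /* record delimiter */
    end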
Creating Graph - Transform
A transform function is either a DML file or a DML string that
describes how you manipulate your data. You specify the
transform’s .xfr file on the component.
Ab Initio transform functions mainly consist of a series of
assignment statements. Each statement is called a business rule.
When Ab Initio evaluates a transform function, it performs the
following tasks:
– Initializes local variables
– Evaluates statements
– Evaluates rules
Transform function files have the .xfr extension.
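A minimal sketch of what such a .xfr file might contain, assuming the
Customers record format shown earlier (the function name and the bonus
rule are illustrative assumptions, not from the deck):

    out :: add_bonus(in) =
    begin
        out.id      :: in.id;
        out.name    :: in.name;
        out.zipcode :: in.zipcode;
        out.amount  :: in.amount + 10;  /* a business rule: add a 10-unit bonus */
        out.newline :: in.newline;
    end;

Each line of the form out.field :: expression; is one business rule.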
Creating Graph - xfr
Transform function: a set of rules that compute output values from
input values.
Business rule: the part of a transform function that describes how
you manipulate one field of your output data.
Variable: an optional part of a transform function that provides
storage for temporary values.
Statement: an optional part of a transform function that assigns
values to variables in a specific order.
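A hedged sketch showing all four pieces together (the names and the
decimal(8) type are assumptions for illustration):

    out :: score_customer(in) =
    begin
        let decimal(8) doubled = 0;  /* variable: temporary storage */
        doubled = in.amount * 2;     /* statement: assigns the variable */
        out.id     :: in.id;         /* business rules compute the output fields */
        out.name   :: in.name;
        out.amount :: doubled;
    end;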
Sample Components
Sort
Dedup
Join
Replicate
Rollup
Filter by Expression
Merge
Lookup
Reformat, etc.
Creating Graph – Sort Component
Sort: the Sort component reorders data. It has two parameters:
key and max-core. You specify the key for the sort on the component.
Key: the key parameter describes the collation order.
Max-core: the max-core parameter controls how often the Sort
component dumps data from memory to disk.
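As a sketch, the key parameter is written as a DML key specifier.
Assuming the Customers format used in this deck, sorting by zipcode
and then by descending amount might be written:

    {zipcode; amount descending}

with max-core left at its default unless memory pressure requires tuning.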
Creating Graph – Dedup Component
The Dedup component removes duplicate records.
The dedup criterion, which you select on the component, is one of
unique-only, First, or Last.
Creating Graph – Replicate Component
Replicate combines the data records from the input into one flow
and writes a copy of that flow to each of its output ports.
Use Replicate to support component parallelism.
Creating Graph – Join Component
• Specify the key for the join.
• Specify the type of join.
Database Configuration (.dbc)
A file with a .dbc extension provides the GDE with the information
it needs to connect to a database. A configuration file contains
the following information:
– The name and version number of the database to which you want
to connect.
– The name of the computer on which the database instance or
server to which you want to connect runs, or on which the database
remote access software is installed.
– The name of the database instance, server, or provider to which
you want to connect.
You generate a configuration file by using the Properties dialog box
for one of the Database components.
Creating Parallel Applications
Types of Parallel Processing
– Component-level parallelism: an application with multiple
components running simultaneously on separate data uses
component parallelism.
– Pipeline parallelism: an application with multiple components
running simultaneously on the same data uses pipeline parallelism.
– Data parallelism: an application with data divided into segments,
operating on each segment simultaneously, uses data parallelism.
Partition Components
Partition by Expression: dividing data according to a DML expression.
Partition by Key: grouping data by a key.
Partition with Load Balance: dynamic load balancing.
Partition by Percentage: distributing data so that the output is
proportional to fractions of 100.
Partition by Range: dividing data evenly among nodes, based on a key
and a set of partitioning ranges.
Partition by Round-robin: distributing data evenly, in block-size
chunks, across the output partitions.
Departition Components
Concatenate: produces a single output flow that contains all the
records from the first input partition, then all the records from
the second input partition, and so on.
Gather: collects inputs from multiple partitions in an arbitrary
manner and produces a single output flow; it does not maintain
sort order.
Interleave: collects records from many sources in round-robin
fashion.
Merge: collects inputs from multiple sorted partitions and maintains
the sort order.
Multifile systems
A multifile system is a specially created set of directories, possibly on
different machines, that have an identical substructure.
Each directory is a partition of the multifile system. When a multifile is
placed in a multifile system, its partitions are files within each of the
partitions of the multifile system.
A multifile system yields better performance than a flat file system
because it can divide your data among multiple disks or CPUs.
Typically (an SMP machine is the exception), a multifile system is created
with the control partition on one node and the data partitions on other
nodes, to distribute the work and improve performance. To do this, use
full Internet URLs that specify file and directory names and locations
on the remote machines.
SANDBOX
A sandbox is a collection of graphs and related files that are stored
in a single directory tree and treated as a group for purposes of
version control, navigation, and migration. A sandbox can be a file
system copy of a datastore project.
In a graph, instead of specifying the entire path for a file location,
we specify only a sandbox parameter variable, for example
$AI_IN_DATA/customer_info.dat, where $AI_IN_DATA contains the entire
path with reference to the sandbox’s $AI_HOME variable. The actual
in_data directory is $AI_HOME/in_data in the sandbox.
SANDBOX
The sandbox provides an excellent mechanism for maintaining
uniqueness while moving from the development environment to the
production environment, by means of switch parameters.
We can define parameters in the sandbox that can be used across all
the graphs pertaining to that sandbox.
The topmost variable, $PROJECT_DIR, contains the path of the home
directory.
Deploying
Every graph, after validation and testing, has to be deployed as a
.ksh file into the run directory on UNIX.
This .ksh file is an executable file that is the backbone of the
entire automation/wrapper process.
The wrapper automation consists of .run and .env files, a dependency
list, a job list, etc.
For a detailed description of the wrapper and the different
directories and files, please refer to the documentation on the
wrapper / UNIX presentation.
Parallelism
Component parallelism
Pipeline parallelism
Data parallelism
Component Parallelism
Example: one component sorting Customers while another component
sorts Transactions.
Component Parallelism
Comes “for free” with graph programming.
Limitation:
– Scales to the number of “branches” in a graph.
Pipeline Parallelism
Example: one component processing record 100 while a downstream
component is still processing record 99.
Pipeline Parallelism
Comes “for free” with graph programming.
Limitations:
– Scales to the length of “branches” in a graph.
– Some operations, like sorting, do not pipeline.
Data Parallelism
Two ways of looking at data parallelism: the expanded view shows
every partition explicitly, while the global view shows the
partitioned flow as a single object.
Data Parallelism
Scales with data.
Requires data partitioning.
Different partitioning methods for different
operations.
Data Partitioning
A partitioning component divides one flow of data into many. In the
expanded view, each partition appears separately; in the global view,
the partitioner is drawn once, as a fan-out flow, and the number of
partitions it feeds is the degree of parallelism.
Session III
Partitioning
Partitioning Review
A partitioner is drawn as a fan-out flow. For the various
partitioning components, ask:
– Is it key-based? Does the problem require a key-based partition?
– Performance: are the partitions balanced or skewed?
Partitioning: Performance
Balanced: processors get neither too much nor too little.
Skewed: some processors get too much, others too little.
Sample Data to be Partitioned
Customers:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2
Record format:
record
decimal(2) id;
string(5) name;
decimal(5) zipcode;
decimal(3) amount;
string(1) newline;
end
Partition by Round-robin
Partition 0: 42John 02116 30, 45Sue 02241 92, 48Mary 02116 38
Partition 1: 43Mark 02114 9, 46Rick 02116 23, 49Jane 02241 2
Partition 2: 44Bob 02116 8, 47Bill 02114 14
Partition by Round-robin
Not key-based.
Results in very well balanced data, especially with a block-size of 1.
Useful for record-independent parallelism.
Partition by Key
Partition on zipcode:
Partition 0: 43Mark 02114 9, 45Sue 02241 92, 47Bill 02114 14, 49Jane 02241 2
Partition 1: 42John 02116 30, 44Bob 02116 8, 46Rick 02116 23, 48Mary 02116 38
Partition by Key Often Followed by a Sort
Sort on zipcode:
Partition 0: 43Mark 02114 9, 47Bill 02114 14, 45Sue 02241 92, 49Jane 02241 2
Partition 1: 42John 02116 30, 44Bob 02116 8, 46Rick 02116 23, 48Mary 02116 38
Rollup by zipcode (totals by zipcode):
Partition 0: 02114 23, 02241 94
Partition 1: 02116 99
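A hedged sketch of the rollup transform that would produce these totals,
in standard DML rollup-template style (the output field name total is an
assumption for illustration):

    out :: rollup(in) =
    begin
        out.zipcode :: in.zipcode;      /* the grouping key */
        out.total   :: sum(in.amount);  /* aggregate over each zipcode group */
    end;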
Partition by Key
Key-based.
Usually results in well balanced data.
Useful for key-dependent parallelism.
Partition by Expression
Expression: amount/33
Partition 0: 42John 02116 30, 43Mark 02114 9, 44Bob 02116 8, 46Rick 02116 23, 47Bill 02114 14, 49Jane 02241 2
Partition 1: 48Mary 02116 38
Partition 2: 45Sue 02241 92
Partition by Expression
Key-based, depending on the expression.
Resulting balance is very dependent on the expression and on the data.
Various application-dependent uses.
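The expression is ordinary DML over the input fields; here integer
division maps each record’s amount to a partition number (comment added
for illustration):

    amount / 33  /* 30/33 = 0, 38/33 = 1, 92/33 = 2: partitions 0, 1, and 2 */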
Partition by Range
With splitter values of 9 and 23:
Partition 0 (amount <= 9): 43Mark 02114 9, 44Bob 02116 8, 49Jane 02241 2
Partition 1 (9 < amount <= 23): 46Rick 02116 23, 47Bill 02114 14
Partition 2 (amount > 23): 42John 02116 30, 45Sue 02241 92, 48Mary 02116 38
Range+Sort: Global Ordering
Sort following a partition by range:
Partition 0: 49Jane 02241 2, 44Bob 02116 8, 43Mark 02114 9
Partition 1: 47Bill 02114 14, 46Rick 02116 23
Partition 2: 42John 02116 30, 48Mary 02116 38, 45Sue 02241 92
Partition by Range
Key-based.
Resulting balance depends on the set of splitters chosen.
Useful for “binning” and global sorting.
Partition with Load Balance
If the middle node is highly loaded:
Partition 0: 42John 02116 30, 43Mark 02114 9, 44Bob 02116 8, 49Jane 02241 2
Partition 1: 45Sue 02241 92
Partition 2: 46Rick 02116 23, 47Bill 02114 14, 48Mary 02116 38
Partition with Load Balance
Not key-based.
Results in a skewed data distribution to complement a skewed load.
Useful for record-independent parallelism.
Partition with Percentage
With percentages 4 and 20 (the remainder, 76, goes to the last partition):
Partition 0 (4%): 42John 02116 30, 43Mark 02114 9, 44Bob 02116 8, 45Sue 02241 92
Partition 1 (20%): 46Rick 02116 23, 47Bill 02114 14, 48Mary 02116 38, 49Jane 02241 2, plus the next 16 records
Partition 2 (76%): the next 76 records
Partition by Percentage
Not key-based.
Results in a data distribution, usually skewed, that conforms to the
provided percentages.
Useful for record-independent parallelism.
Broadcast (as a Partitioner)
Unlike all other partitioners, which write a record to ONE output
flow, Broadcast writes each record to EVERY output flow.
Partitions 0, 1, and 2 each receive all eight records:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2
Broadcast
Not key-based.
Results in perfectly balanced partitions.
Useful for record-independent parallelism.
Session IV
De-Partitioning
Departitioning
Departitioning combines many flows of data to
produce one flow. It is the opposite of partitioning.
Each departition component combines flows in a
different manner.
Departitioning
Expanded view: three scored partitions (Score 1, Score 2, Score 3)
flow into a departitioner, which writes a single Output File. In the
global view, the same structure is drawn as a single fan-in flow.
Departitioning
A departitioner is drawn as a fan-in flow. For the various
departitioning components, ask:
– Is it key-based?
– What is the result ordering?
– What is the effect on parallelism?
– What are its uses?
Concatenation
Globally ordered, partitioned data:
Partition 0: 49Jane 02241 2, 44Bob 02116 8, 43Mark 02114 9
Partition 1: 47Bill 02114 14, 46Rick 02116 23
Partition 2: 42John 02116 30, 48Mary 02116 38, 45Sue 02241 92
Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
Not key-based.
Result ordering is by partition.
Serializes pipelined computation.
Useful for:
– creating serial flow from partitioned data
– appending headers and trailers
– writing DML
Used infrequently.
Merge
Round-robin partitioned and sorted by amount:
Partition 0: 42John 02116 30, 48Mary 02116 38, 45Sue 02241 92
Partition 1: 49Jane 02241 2, 43Mark 02114 9, 46Rick 02116 23
Partition 2: 44Bob 02116 8, 47Bill 02114 14
Sorted data, following merge on amount:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Merge
Key-based.
Result ordering is sorted if each input is sorted.
Possibly synchronizes pipelined computation;
may even serialize.
Useful for creating ordered data flows.
Used more often than Concatenate, but still infrequently.
Interleave
Round-robin partitioned and scored:
Partition 0: 42John 02116 30A, 45Sue 02241 92A, 48Mary 02116 38A
Partition 1: 43Mark 02114 9C, 46Rick 02116 23B, 49Jane 02241 2C
Partition 2: 44Bob 02116 8C, 47Bill 02114 14B
Scored dataset in original order, following interleave:
42John 02116 30A
43Mark 02114 9C
44Bob 02116 8C
45Sue 02241 92A
46Rick 02116 23B
47Bill 02114 14B
48Mary 02116 38A
49Jane 02241 2C
Interleave
Not key-based.
Result ordering is inverse of round-robin.
Synchronizes pipelined computation.
Useful for restoring original order following a
record-independent parallel computation
partitioned by round-robin.
Used in rare circumstances.
Gather
Round-robin partitioned and scored:
Partition 0: 42John 02116 30A, 45Sue 02241 92A, 48Mary 02116 38A
Partition 1: 43Mark 02114 9C, 46Rick 02116 23B, 49Jane 02241 2C
Partition 2: 44Bob 02116 8C, 47Bill 02114 14B
Scored dataset in random order, following gather:
43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Gather
Not key-based.
Result ordering is unpredictable.
Neither serializes nor synchronizes
pipelined computation.
Useful for efficient collection of data from
multiple partitions and for repartitioning.
Used most frequently.
Layout
Layout determines the location of a resource.
A layout is either serial or parallel.
A serial layout specifies one node and one
directory.
A parallel layout specifies multiple nodes and multiple directories.
The same node may be repeated.
Layout
The location of a Dataset is one or more
places on one or more disks.
The location of a computing component is one or more directories on
one or more nodes. By default, the node and directory are unknown.
Computing components propagate their layouts from their neighbors,
unless the user specifically assigns a layout.
Session V
Join
Join Types
The join-type parameter offers three choices:
• Inner join: sets the record-required parameters for all ports
to True.
• Outer join: sets the record-required parameters for all ports
to False.
• Explicit: allows you to set the record-required parameter
for each port individually.
Join Types (contd.)
Case 1: join-type is Inner Join.
Case 2: join-type is Full Outer Join.
Case 3: Explicit join-type, with record-required0 false and
record-required1 true.
Case 4: Explicit join-type, with record-required0 true and
record-required1 false.
Some key Join Parameters
key
Name(s) of the field(s) in the input records that must have
matching values for Join to call the transform function.
driving
Number of the port to which you want to connect the driving input.
The driving input is the largest input; all other inputs are read
into memory.
The driving parameter is available only when the sorted-input
parameter is set to "In memory: Input need not be sorted".
Some key Join Parameters
dedupn
Set the dedupn parameter to true to remove duplicates from the
corresponding inn port before joining. This lets you choose only one
record from each group with matching key values as the argument to
the transform function. The default is false, which does not remove
duplicates.
override-keyn
Alternative name(s) for the key field(s) for a particular inn port.
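To tie the parameters together, a hedged DML sketch of a two-input join
transform keyed on id, reusing the Customers fields (the choice of output
fields is an assumption for illustration):

    out :: join(in0, in1) =
    begin
        out.id     :: in0.id;      /* the matching key value */
        out.name   :: in0.name;    /* taken from the in0 record */
        out.amount :: in1.amount;  /* taken from the matching in1 record */
    end;

Here the key parameter would be {id}; whether the transform is called for
unmatched records depends on the join-type and record-requiredn settings
described above.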
References
Ab Initio Tutorial
Ab Initio Online Help
Website (abinitio.com)