Talend Tech Boot Camp
Job Design Patterns & Best Practices
Agenda
•   Introduction & Overview
•   SDLC – Development Guidelines
•   Job Design Patterns
•   Best Practices 1-16
•   DDLC – Database Modeling
•   Best Practices 17-32
•   Open Discussions & Q&A
Question
“What is the best way for me to write a Talend job?”
     • It needs to be:
         • EASY to READ
         • EASY to WRITE
         • EASY to MAINTAIN
                   ... but honestly, you can only effectively pick 2!
Answer
If the Job will change over time:
     • Priority is to make it:
       - EASY to MAINTAIN
       - EASY to READ
       - EASY to WRITE
If there are multiple developers:
     • Priority is to make it:
       - EASY to READ
       - EASY to MAINTAIN
       - EASY to WRITE
If the Job is unlikely to change:
     • Priority is to make it:
       - EASY to WRITE
       - EASY to READ
       - EASY to MAINTAIN
Paint your Talend Code
After several years of developing visual code, patterns started to emerge:
  •   Canvas Layout & Spacing
  •   Process & Data Flow
  •   Modularized Code
  •   Consistent Job Types
        • Harness / Process Driven
        • Stateless / Stateful
        • Atomic (Parent/Child & Joblets)
But Basics Still Count:
  •   Functionality
  •   Error Handling
  •   Memory Management
  •   Performance
  •   Naming Conventions
Avoiding Complexity:
  •   NO Monolithic Jobs
  •   Minimize Depth Levels
  •   Scrunched Components
  •   Overlapping Links
Talend SDLC Best Practices Guide
                            Talend’s Software Architecture is:
                              •   Comprehensive
                              •   Multi-Faceted
                              •   Robust
                              •   Flexible
                              •   Serious Stuff
                                          …. so take it Seriously
Continuous Integration/Deployment
         Talend Software Development Life Cycle Best Practices Guide
SDLC – Development Guidelines
Formulating the Basics
Foundational Precepts
 ✓   Readability:            creating code that can be easily figured out and understood
 ✓   Writeability:           creating straightforward, simple code in the least amount of time
 ✓   Maintainability:        creating appropriate complexity with minimal impact from change
 ✓   Functionality:          creating code that delivers on the requirements
 ✓   Reusability:            creating sharable objects and atomic units of work
 ✓   Conformity:             creating real discipline across teams, projects, repositories, and code
 ✓   Pliability:             creating code that will bend but not break
 ✓   Scalability:            creating elastic modules that adjust throughput on demand
 ✓   Consistency:            creating commonality across everything
 ✓   Efficiency:             creating optimized data flow and component utilization
 ✓   Compartmentalization:   creating atomic, focused modules that serve a single purpose
 ✓   Optimization:           creating the most functionality with the least amount of code
 ✓   Performance:            creating effective modules that provide the fastest throughput
SDLC – Developer Guidelines
Guidelines NOT Standards – It’s about Discipline
 • Standards are rigid, leaving no room for the unexpected
 • Guidelines are pliable: they bend but rarely break
Create a Development Guidelines Document
 • Involvement and Adoption from all teams is essential
 • Incorporates Corporate SDLC process
 • Defines the foundation, structure, & context
Other Useful Documents
 • Code Module Library
 • Data Dictionary
 • Data Access Layer
Just One More Thing
Instill Good Habits
 • Start easy – something everyone can adopt
    • Agree to label every component for code readability; a foundational Precept!
 • Incrementally raise the bar – over time
    •   Organize Repository Folders
    •   Establish & utilize naming conventions; Conformity!
    •   Adopt logging, messaging, & error handlers
    •   Build out reusable/shared code modules; Several precepts used here!
As the Development Guidelines Document Evolves
 • Discipline improves
 • Project code becomes:
    • EASIER TO READ
    • EASIER TO WRITE
    • EASIER TO MAINTAIN
Job Design Patterns & Best Practices
What Are Job Design Patterns?
Template or Skeleton Layouts
 • Focus on essential and/or required elements
 • Often bound around use case
 • Target Common and/or Reusable code modules
 • When identified and implemented properly they:
    • Strengthen overall code base
    • Condense overall effort
    • Reduce repetitive but similar code
Adopt a Repeatable Coding Style for Easy: R/W/M
 • So every developer can view and understand any other’s code
 • Jumpstarts development for any new project
 • This is where Best Practices come in…
Best Practices
Best Practice #1
Consider carefully how to lay out your Job Canvas
 • Don’t just splash objects on your canvas
   with the idea to clean it up later
    • Use discipline to paint it right the 1st time
    • Allow space for Readability & Maintainability
    • Line up the Flows & Components
 • Preferred layout is
   ‘Top-to-Bottom’ then ‘Left-to-Right’
    • A ‘Zig-Zag’ or ‘Snake’ layout can be easy to write,
      and maybe easy to follow along
    • But inserting new functionality can lead to
      re-factoring the whole job layout
Best Practice #2
Atomic Job Modules – Parent/Child Jobs
 • Avoid Big, Monolithic Jobs
    • They can be hard to read and maintain
    • Plus they can perform poorly
 • Break Big Process flows into smaller Jobs
    •   Establish a Parent/Child hierarchy
    •   But keep the nesting levels to a minimum
    •   Recommended maximum nesting is 5
    •   Consider Job Memory Settings at each level
    •   Carefully set the checkboxes:
          • ‘Use an independent process to run subjob’
          • ‘Die on child error’
          • ‘Transmit whole context’
Best Practice #3
Joblets versus tRunJob Component
 • INCLUDED code vs CALLED code
    • Joblets are common code you ‘Include’ in your job
    • tRunJob ‘Calls’ a Child job from a Parent job
 • Both promote code reusability
    • Establish a Parent/Child hierarchy
    • A highly effective strategy when used appropriately
Best Practice #4
Job Entry & Exit Points
 • Talend code needs to Start & Stop
   somewhere
    • tPreJob & tPostJob components are highly advised
    • tPreJob executes 1st, then continues
    • tPostJob wraps it up (like ‘finally’ for you OOP guys)
 • Use tWarn & tDie Components effectively
    • They provide programmable control over where and
      when a job should complete
    • Note that the tDie component can set IF the JVM will
      exit immediately or not!
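In plain Java terms, the tPreJob / tPostJob pairing resembles a setup step followed by try/finally. The sketch below is only an analogy under that assumption: the class and method names are hypothetical, and it is not the code Talend generates.

```java
import java.util.ArrayList;
import java.util.List;

// Analogy only: hypothetical names, not Talend-generated code.
class PrePostJobSketch {

    // Records the order in which the steps ran.
    static List<String> run(boolean failInMain) {
        List<String> steps = new ArrayList<>();
        steps.add("pre");                 // tPreJob: open connections, load contexts
        try {
            steps.add("main");            // main flow of the job
            if (failInMain) {
                throw new IllegalStateException("main flow failed");
            }
        } catch (IllegalStateException e) {
            steps.add("error");           // a common error handler would log here
        } finally {
            steps.add("post");            // tPostJob: cleanup runs either way
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(run(false));   // prints [pre, main, post]
        System.out.println(run(true));    // prints [pre, main, error, post]
    }
}
```

The point of the analogy is that cleanup placed in tPostJob does not need to be repeated on every error branch.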
Best Practice #5
Error Handling & Logging
 • One of the MOST IMPORTANT
   things you can incorporate into
   your Jobs
    • Creating a common Error Handler is
      highly advised
    • Incorporate well defined ‘Return Codes’
 • Use Project Settings>Log4J
    • Configure and use the Logstash server
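One way to make ‘Return Codes’ well defined is to keep them in a single shared place, for example a code routine. The sketch below assumes a small enum; the names, numbers, and the "stop at DATA_ERROR or worse" rule are all hypothetical and would be set by your own Development Guidelines.

```java
// Hypothetical return codes; names, numbers, and threshold are illustrative only.
class ReturnCodeSketch {

    enum ReturnCode {
        SUCCESS(0), WARNING(4), DATA_ERROR(8), FATAL(16);

        final int code;

        ReturnCode(int code) {
            this.code = code;
        }

        // Assumed rule: the job stops when the code is DATA_ERROR or worse.
        boolean shouldStop() {
            return code >= DATA_ERROR.code;
        }

        // Maps a numeric code (for example, one returned by a child job) back to its name.
        static ReturnCode fromCode(int code) {
            for (ReturnCode rc : values()) {
                if (rc.code == code) {
                    return rc;
                }
            }
            throw new IllegalArgumentException("Unknown return code: " + code);
        }
    }

    public static void main(String[] args) {
        ReturnCode rc = ReturnCode.fromCode(8);
        System.out.println(rc + " stop=" + rc.shouldStop()); // prints DATA_ERROR stop=true
    }
}
```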
Best Practice #6
OnSubJobOk/ERROR vs OnComponentOK/ERROR
 • Often Misunderstood
    • These ‘trigger’ links do affect job design
      flow and must be considered properly
    • OK vs ERROR is obvious
    • OnSubJob will pass control only after
      the current Sub Job has executed fully
    • OnComponent will pass control only
      after the component has processed a
      row or a data set (depending upon the
      component)
 • Also ‘Run If’ linkage
    • Quite useful when continuation of the
      process must be controlled programmatically
Best Practice #7
What is a Job Loop?
 • A Highly Significant Job Design
   Consideration!
    • These are decision points where control
      of the next step in the process is made
    • All Job designs should identify One (1)
      ‘Main Loop’ where exit control can be
      established
    • Again use established ‘Return Codes’
      and exit strategies
    • ‘Secondary Loops’ are OK, just ensure
      the process flow makes sense
Best Practice #8
Software Development Life Cycle (SDLC)
                  • “People, Product, & Process” Marcus Lemonis, “The Profit” (CNBC)
                     • These 3 keys can determine the Success or Failure of any Business
                     • The same is true for Software Development
                     • Talend’s SDLC Best Practice Guide provides a deep look into the
                       concepts, principles, specifications, and details of the Continuous
                       Integration/Deployment practices available to Talend developers
                     • Incorporation of any SDLC Best practice into a ‘Development
                       Guidelines’ document is highly advised
Best Practice #9
Managing Workspaces
 • Talend Studio installations use a ‘Workspace’
    • Typically created on your local disk drive C:
    • As in many software installations a ‘Default’ location is assigned
    • Usually placed alongside the Software executables
We Recommend you Change your Workspace!
 • The default location may not be the best place to store your code
    •   These directories are attached to a Source Code Control System (SVN or GIT)
    •   The TAC manages synchronization of these workspaces with the SCCS
    •   Backup/Restore & Import/Export operations are clunky when located with executables
    •   Might even be a good idea to place your workspace on a separate disk drive
Best Practice #10
Reference Projects
 • Do you know what they are?
    • We all want re-usable, common, or generic code that can be shared across projects
    • Avoid cut-and-paste and/or copying similar code; Use Reference Projects!
    • Limit the number of Reference Projects as too many defeat their purpose
Best Practice #11
Object Naming Conventions
 • “A rose by any other name is still a rose!” who said that anyway?
    •   The answer may not matter, but Naming Conventions do!
    •   All Talend Objects have unique internally used names
    •   Adopt Conventions of Object Naming in Talend
    •   Clearly define them in your ‘Development Guidelines’ document
    •   Have the entire team adopt these conventions
Objects To Consider
 • Directories, Folders, & Workspaces
 • Data File ‘root’ & I/O locations & names
 • Jobs, Joblets, Code Routines
 • Context Groups & Variables
 • Database Connections
Best Practice #12
           Project Repository
            • Where all project objects reside
               • Several Important Sections include:
                  •   Job Designs         - where your jobs are located
                  •   Contexts            - groups reusable variables
                  •   Code                - add java code modules
                  •   Metadata            - variety of schema definitions
                  •   Documentation       - auto-generate project Wiki
Best Practice #13
Version Control
 • Job Properties allow setting ‘M’ajor & ‘m’inor version numbering
    • Allows a status of ‘development’, ‘test’, ‘production’, or ‘user defined’
    • This is designed for Single User Environments ONLY!
    • When used in conjunction with a SCCS, considerable workspace ‘bloat’ occurs
 • Instead use Project Branching & Tagging with your SCCS
    •   Cooperative development and seamless source code control require a different method
    •   SVN and GIT both provide a strategy for Branching & Tagging code
    •   Talend v6.2.1 GIT supports a graphical Diff/Merge feature @ Job level
    •   Talend v6.3.1 GIT supports a graphical Diff/Merge feature @ Component level
    •   Clearly define your preferred method in your ‘Development Guidelines’ document
    •   Have the entire team adopt these conventions
Best Practice #14
Memory Management
 • So, you want to run your job?
 • Have you considered its Memory
   needs?
   • Is the data flow processing Millions of Rows or
     have lots of columns?
   • How many tMap Lookups are employed?
   • Do you know how much memory your Job Server
     has?
   • How many levels of Parent/Child job nesting are
     there?
   • Are Child jobs run in a separate JVM?
   • Are you using ESB Jobs? How many Routes?
   • Are you using Parallelization?
 • Check Job Run>Advanced Settings to
   make appropriate adjustments
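Heap limits are usually set with the standard JVM flags -Xms (initial heap) and -Xmx (maximum heap). As a quick sanity check, the sketch below reports what the running JVM actually received; the class name is hypothetical, and the same three calls could be placed in a tJava component instead.

```java
// Prints the heap limits of the JVM this code runs in.
class HeapCheckSketch {

    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory reflects -Xmx; totalMemory is the heap currently reserved.
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it once per nesting level (parent, child in an independent process) shows whether each JVM has the heap you expected.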
Best Practice #15
Dynamic SQL Syntax
 • Talend Database Input components support SQL syntax
 • Developers can generate a query based upon the specified schema
 • Developers can also hard-code the query as desired
 • What about when the query is unknown until Run-Time?
    • ‘Context Variables’ to the rescue
    • A tJava component can construct the SQL syntax in context variables
    • Specified in the Database Input component, these variables supply the constructed SQL to execute
             ✓ sqlCOLUMNS                       ✓ sqlFROM
             ✓ sqlWHERE                         ✓ sqlGROUPBY
             ✓ sqlORDERBY                       ✓ sqlLIMITS
         "SELECT " + context.sqlCOLUMNS + context.sqlFROM + context.sqlWHERE
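The run-time construction can be sketched in plain Java. The `Context` class below is only a stand-in for the context object a Talend job provides, the field names follow the slide, and the table, columns, and filter are hypothetical.

```java
// Sketch of run-time SQL assembly; Context is a stand-in, not Talend's class.
class DynamicSqlSketch {

    static class Context {
        String sqlCOLUMNS = "";
        String sqlFROM = "";
        String sqlWHERE = "";
    }

    // What a tJava component might do before the input component runs.
    static Context prepare(String table, String filter) {
        Context context = new Context();
        context.sqlCOLUMNS = "id, name";            // hypothetical column list
        context.sqlFROM = " FROM " + table;
        context.sqlWHERE = (filter == null || filter.isEmpty()) ? "" : " WHERE " + filter;
        return context;
    }

    // The expression placed in the Query field of the database input component.
    static String query(Context context) {
        return "SELECT " + context.sqlCOLUMNS + context.sqlFROM + context.sqlWHERE;
    }

    public static void main(String[] args) {
        // prints SELECT id, name FROM customers WHERE active = 1
        System.out.println(query(prepare("customers", "active = 1")));
    }
}
```

Each fragment carries its own leading space and keyword, so an empty fragment simply drops out of the statement. Fragments built from outside input should be validated before use, since they are concatenated into the SQL as-is.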
Best Practice #16
Parallelization Options
 • There are several mechanisms to enable code parallelization
 • Use them correctly, efficiently, and with serious consideration
 • Used inappropriately, they may have a negative impact on CPU & RAM utilization
 • Used properly, highly performing Job Design Patterns can be created
Common Sense Utilization
 • Use parallelization sparingly
 • Do not use parallelization for code segments that already perform well
 • Do use parallelization for code segments that need high throughput or
   where processing bottlenecks occur
Best Practice #16
Parallelization Option Stack
 • Execution Plan (TAC): Multiple jobs/tasks can be configured to run in parallel
 • Multiple Job Flows (Job): Within a single job, multiple starting points can be created which will
   execute simultaneously yet share the same thread; preference should be to create separate child jobs
 • Parent/Child Jobs: When calling a child job, the tRunJob component supports the ‘Use an independent
   process to run subjob’ check box; when checked, it establishes a separate JVM heap/thread in which
   to run the child job
 • Components: The tParallelize component links multiple process flows for simultaneous execution;
   the tPartitioner, tDepartitioner, tCollector, and tRecollector components offer direct control
   over the number of parallel threads for a specific data flow
 • DB Components: Most of the database components offer an advanced setting to enable
   parallelization thread counts on specific SQL statements (like INSERT or UPDATE); these can be
   highly efficient, but setting the number too high may have the opposite effect; 2-5 threads is
   a recommended Best Practice
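The effect of a bounded thread count can be illustrated outside Talend with a fixed-size thread pool. This is a generic Java sketch, not how the database components are implemented; the batch size, the thread count of 3, and the placeholder "work" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Generic illustration of bounded parallelism; not Talend's implementation.
class ParallelBatchSketch {

    // Splits the rows into batches and processes them on a fixed number of threads.
    static int processInParallel(List<Integer> rows, int threads, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += batchSize) {
                final List<Integer> batch =
                        rows.subList(start, Math.min(start + batchSize, rows.size()));
                // Placeholder work: a real job would write the batch to a database.
                Callable<Integer> task = () -> batch.size();
                results.add(pool.submit(task));
            }
            int processed = 0;
            for (Future<Integer> result : results) {
                processed += result.get();   // waits for each batch to finish
            }
            return processed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Batch processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        System.out.println(processInParallel(rows, 3, 100)); // prints 1000
    }
}
```

Only `threads` batches run at any moment; raising that number past what the CPU or the target database can absorb adds contention rather than throughput.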
BREAK TIME
Best Practice #17
Code Routines
 • You can add tJava components that
   embed java code as needed into a flow
 OR
 • You can add custom Java methods to
   the project repository which can be
   used in a variety of ways
 • Many built-in functions, like:
  • getCurrentDate()
  • sequence(String seqName, int startVal, int step)
  • ISNULL(Object variable)
 • Make sure to incorporate comments
   which provide function ‘helper’ text
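A custom routine with ‘helper’ text might look like the sketch below. The routine itself is hypothetical; the brace tags in the comment are believed to mirror the template Talend Studio generates for a new routine, so check them against a freshly created routine in your Studio version.

```java
// Hypothetical user routine; the comment tags are assumed to match Studio's template.
class StringRoutinesSketch {

    /**
     * nullToDefault: returns the fallback when the value is null or blank.
     *
     * {talendTypes} String
     *
     * {Category} User Defined
     *
     * {param} string("abc") value: the value to check.
     *
     * {param} string("N/A") fallback: returned when value is null or blank.
     *
     * {example} nullToDefault(null, "N/A") # N/A
     */
    public static String nullToDefault(String value, String fallback) {
        return (value == null || value.trim().isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        System.out.println(nullToDefault("  ", "N/A"));    // prints N/A
        System.out.println(nullToDefault("abc", "N/A"));   // prints abc
    }
}
```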
Best Practice #18
Repository Schemas
 • Reusable Objects defined in the Project
   Repository Metadata provide significant
   opportunities to create reusable code
 • Repository Schemas include:
    • Files
       • Delimited / Positional / Regex
       • XML / JSON
       • Excel
    • Generic
    • WSDL
    • LDAP
    • UN/EDIFACT
 • Naming example: ‘md_{objectname}’
Best Practice #19
Apache Log4J (Studio)
 • All components are Log4J enabled (v6+)
 • ‘Enable’ in the Studio Project Settings
 • Customize Log4J scripting paradigm
 • Works with ELK:
    • Elasticsearch
    • Logstash
    • Kibana UI
 • Utilizes Talend Priorities:
    •   INFO
    •   WARNING
    •   ERROR
    •   FATAL
Best Practice #19
Apache Log4J (TAC)
 • Also ‘enable’ in the TAC for each Task
 • Ensure to set appropriately for each
   environment:
    • DEV / TEST / UAT / PROD
 • Use in conjunction with your error
   handler
 • Ensure to utilize the components:
    • tDie
    • tWarn
    • tAssert
Best Practice #20
Activity Monitoring Console: AMC (Studio)
 • ‘Enable’ database logging in the
   Studio Project Settings
 • Specify Database Connection
 • ‘Enable’ which information to catch
    • Java Runtime Errors
    • Job Errors
    • Job Warnings
 • Select AMC tables to use:
    • tStatCatcher
    • tLogCatcher
    • tFlowMeterCatcher
Best Practice #20
Activity Monitoring Console: AMC (TAC)
 • Visualization available in
   both Talend Studio &
   the TAC
 • Establish ‘Return Codes’
   as discussed in #5 which
   provide a mechanism to
   query the tLogCatcher
   table externally
Best Practice #21
Recovery Checkpoints (Studio)
 • When long running jobs or jobs with
   critical steps fail, starting over can be
   problematic
 • Such jobs can instead be restarted/recovered
   from a specified checkpoint
 • With Talend you can set one or more
   Checkpoints on ‘OnSubJobOk’ links
    • ‘Enable’ Recovery Checkpoint
    • Give it a name
    • Document recovery information
Best Practice #21
Recovery Checkpoints (TAC)
 • Tasks define how to recover
   a Job automatically on a Job
   Server
    •   Wait
    •   Reset Task
    •   Restart Task
    •   Recover Task
 • Error Recovery Manager
   provides the ability to
   manually restart at a
   selected checkpoint
Best Practice #22
Joblets
 • We looked at Joblets in #3 & #5
 • ‘Included’ in Jobs, not called
 • Most have ‘Input’ & ‘Output’
   components to pass data flow
   through
 • Reusable Code within single job
   or across many jobs
 • Not all components should be
   used in a Joblet, like:
    • DB Connections
    • tJavaFlex, unless fully contained
      in Joblet
Best Practice #23
Component Test Cases
 • Available since v6.0.1
 • Components allow creation of a
   ‘Test Case’ where data flow is
   involved
 • Test case is tied to component
 • Right click on component under
   test to generate a ‘Test Case’ job
 • When component schema
   changes the test case changes
   automatically
 • Generated but can be modified
Best Practice #23
Test Case Job
 • Reads an ‘input data file’
 • Processes the data through the
   component under test
 • Writes out a ‘result file’
 • Compares to an expected result
   or ‘reference file’ for a match
       • PASS / FAIL
 • A test case ‘instance’ can support
   multiple ‘input’ and ‘reference’
   files
       • GOOD / BAD / UGLY
       • SMALL / MEDIUM / LARGE
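The read, process, compare cycle of a test case job can be sketched as follows. In-memory lists stand in for the ‘input data file’ and ‘reference file’, and the upper-casing function is a hypothetical component under test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch of the test-case cycle; lists replace the input and reference files.
class ReferenceCompareSketch {

    // Runs every input row through the component and compares with the reference.
    static String runTestCase(List<String> input,
                              Function<String, String> componentUnderTest,
                              List<String> reference) {
        List<String> result = new ArrayList<>();
        for (String row : input) {
            result.add(componentUnderTest.apply(row));
        }
        return result.equals(reference) ? "PASS" : "FAIL";
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1;alice", "2;bob");
        Function<String, String> upperCase = row -> row.toUpperCase();

        // prints PASS
        System.out.println(runTestCase(input, upperCase, Arrays.asList("1;ALICE", "2;BOB")));
        // prints FAIL
        System.out.println(runTestCase(input, upperCase, Arrays.asList("1;ALICE", "2;ROB")));
    }
}
```

Supporting several input and reference pairs (GOOD / BAD / UGLY) then amounts to calling the same check once per pair.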
Best Practice #24
Data Flow Iterations
 • Normal link between components
   is either a ‘trigger’ or a ‘row’
 • Data Flow generally processes a
   ‘pipeline’ of records or list of files
 • Each pipeline between two
   components has a unique object
   name
        • e.g.: row1; row2; etc.
 • Pipelines usually process all
   rows until done, but sometimes
   logic requires direct control, row
   by row, called ‘iterations’
 • Related components: tFlowToIterate, tIterateToFlow
Best Practice #25
tMap Lookups
 • The highly essential tMap
   component is used for processing a
   data flow from a ’source’ to a
   ‘target’ where some remapping
   and/or transformation takes place
 • A compelling use for the tMap is for
   data Lookups that ‘join’ the primary
   data flow with one or more other
   data flows
 • These ‘lookups’ can originate from
   many kinds of source data
 • Notice the ‘Lookup Model’
Best Practice #25
tMap Lookup Considerations
 • How you set up ‘joins’ for lookups
   impacts both performance and
   memory
 • Choose the right ‘Lookup Model’
    • Load Once
    • Reload at each Row
    • Reload at each Row (cache)
 • In-memory lookups will likely be
   much faster but may
   oversubscribe available RAM
 • Row-by-row lookups will use far less
   memory but will be slower
Best Practice #25
tMap Row-by-Row Lookups
 • The ‘key’ required for row-by-row
   lookups is set in the tMap editor
   shown previously
 • The ‘lookup’ data flow then needs
   to use the variable in the join
   logic using the method
    (datatype)globalMap.get("key")
 • This method is limited to SQL
   database lookups
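A lookup query that uses the key for the current row can be sketched as below. The `HashMap` is a stand-in for the globalMap a Talend job provides, and the key name, table, and columns are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a row-by-row lookup query; the map stands in for Talend's globalMap.
class RowByRowLookupSketch {

    // Builds the lookup query for the current main-flow row.
    // "customerId" is a hypothetical key name set in the tMap editor.
    static String lookupQuery(Map<String, Object> globalMap) {
        return "SELECT id, name FROM customers WHERE id = "
                + (Integer) globalMap.get("customerId");
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("customerId", 42);   // value of the current row's key
        // prints SELECT id, name FROM customers WHERE id = 42
        System.out.println(lookupQuery(globalMap));
    }
}
```

Because the query is rebuilt and executed for each main-flow row, only the matching rows are held in memory, which is the trade-off against ‘Load Once’.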
Best Practice #26
Global Variables
 • ‘Context Variables’ are used in jobs
   to control values programmatically
   at runtime; referenced as:
        context.variable
 • ‘Built In’ variables are available only
   within the job they’re created in
 • ‘Project Repository’ variables are
   available across all jobs in a project;
   this is the recommended practice
 • The tSetGlobalVar component
   defines them within a job at runtime
Best Practice #26
More on Global Variables
 • The tGlobalVarLoad component is
   used for the same purpose in Big
   Data jobs
 • Use the globalMap to access them
   (datatype)globalMap.get("gVar")
 • Using global variables with
   components provides better
   memory management but
   requires more code
Best Practice #26
System Global Variables
 • Use them where needed:
       ERROR_MESSAGE
       DIE_MESSAGE
       WARN_MESSAGE
       CHILD_RETURN_CODE
       DIE_CODE
       WARN_CODE
       NB_LINE
       NB_LINE_OK
       NB_LINE_REJECT
       NB_LINE_INSERTED
       NB_LINE_UPDATED
       NB_LINE_DELETED
 • Two more include:
       global.projectName
       global.jobName
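Component-level variables such as NB_LINE are read from the globalMap under a key prefixed with the component's unique name, for example "tFileInputDelimited_1_NB_LINE". The sketch below assumes that key pattern; the map is a stand-in for Talend's globalMap and the helper name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of reading a component's row count; the map stands in for Talend's globalMap.
class SystemGlobalsSketch {

    // Reads "<component>_NB_LINE" and treats a missing entry as zero rows.
    static int rowCount(Map<String, Object> globalMap, String componentName) {
        Object value = globalMap.get(componentName + "_NB_LINE");
        return value == null ? 0 : (Integer) value;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileInputDelimited_1_NB_LINE", 1250);

        System.out.println(rowCount(globalMap, "tFileInputDelimited_1")); // prints 1250
        System.out.println(rowCount(globalMap, "tMap_1"));                // prints 0
    }
}
```

The null check matters because a variable is only present after its component has run.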
Best Practice #27
Loading Contexts
 • Context Group variables
   can be loaded at runtime
   using the tContextLoad
   component
 • Storing them externally in a
   file can be highly effective
   and even support some
   security concerns
 • A corresponding
   tContextDump component
   can write out values from a
   database first
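The idea of externally stored context values can be sketched with key=value text and `java.util.Properties`. This only mimics the concept: tContextLoad itself is fed key/value rows by an input component, and the variable names here are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Concept sketch: external key=value text loaded into a set of context values.
class ContextLoadSketch {

    // Parses key=value lines, a common shape for an external context file.
    static Properties load(String keyValueText) {
        Properties context = new Properties();
        try {
            context.load(new StringReader(keyValueText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read context values", e);
        }
        return context;
    }

    public static void main(String[] args) {
        Properties context = load("dbHost=localhost\ndbPort=5432\n");
        // prints localhost:5432
        System.out.println(context.getProperty("dbHost") + ":" + context.getProperty("dbPort"));
    }
}
```

Keeping one such file per environment (DEV / TEST / UAT / PROD) outside the repository lets the same job run unchanged in each, with access to the file restricted as needed.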
Best Practice #28
Using Dynamic Schemas
 • Can a single job design cope with
   dynamic schemas?
 • 100 tables all need the same job
   design; is it possible to build one job
   for them all? NO!
 • But you can do it in TWO jobs!
        One to DUMP Schema
        One to LOAD Schema
 • Here we use the ‘Information
   Schema’ of the database to retrieve
   a list of Tables & Columns;
   processing each through the
   tSetDynamicSchema component
Best Practice #29
Dynamic SQL Components
 • Instead of using the ‘Information
   Schema’ to pull the list of Tables and
   Columns, specialized components
   for each DB are available:
        t{db}TableList
        t{db}ColumnList
 • TWO jobs perform the same process
   shown previously, yet differently
        One to DUMP Schema
        One to LOAD Schema
 • These components can be used for
   other job designs as well
Best Practice #30
CDC – Change Data Capture
 • How a Job Design handles CDC is very important when data
   synchronization is required
 • Talend job designs can use the ‘Publish/Subscribe’ mechanism
   tied directly with the host database system involved, including:
   ✓Oracle                ✓MySQL
   ✓MS SQL Server         ✓PostgreSQL
   ✓Sybase                ✓Informix
   ✓Ingres                ✓DB2
   ✓Teradata              ✓AS/400
 • Each of these has a corresponding t{db}CDC component to use
Best Practice #30
How does CDC work in Talend?
 • Three CDC modes are available:
  ➢Trigger (default) - Uses DB Host triggers that track Inserts, Updates, & Deletes
  ➢Redo/Archive Log - Used with Oracle 11g and earlier versions only
  ➢XStream           - Used with Oracle 12 and OCI only
                      Talend User Guide, Chapter 11
Best Practice #31
Custom Components
 • With 1000+ components in
   the Data Fabric Platform, is that
   enough?
 • Not if you want to incorporate
   specialized business logic into a
   repeatable object for jobs
 • Many 3rd Party components are
   available at exchange.talend.com
 • Setup the Preferences Dialog box
 • Install custom components from
   the Exchange menu link
Best Practice #31
Custom Components – Build your Own
 • Building your own custom
   components is another choice
 • Most custom components are
   built on the ‘JavaJet’ framework
 • Use help.talend.com to find
   the tutorial on how to build
   custom components
 • The new Talend Component
   Framework ‘TCOMP’, which is Java
   based, is planned for release soon
Best Practice #32
JobScript API
 • Normally we create Jobs in the
   ‘Designer’ tab of the Studio
 • Can a Job be GENERATED?
 • YES – with JobScripts!
 • All jobs created in the ‘Designer’
   have a corresponding ‘JobScript’
 • Go to help.talend.com for
   instructions on how to use this
   feature
Some Parting Do’s & Don’ts
➢   Do Use Both The tPreJob & tPostJob Components
➢   Do Not Clutter Canvas With Tightly Grouped Components; Spread it out a bit
➢   Do Lay Out Your Code Nicely; Top-2-Bottom & Left-2-Right
➢   Do Not Expect To Get It Just Right The 1st Time You Code It
➢   Do Identify Your Main Job Loop & Control Your Exit
➢   Do Not Ignore Error Handling Techniques
➢   Do Use Context Groups Extensively (DEV/QA/UAT/PROD) & Wisely
➢   Do Not Create Massive Single Job Layouts
➢   Do Create Atomic Job Modules
➢   Do Not Force Complexity; Simplify
➢   Do Use Generic Schemas Everywhere (arguable exception is the single column schema)
➢   Do Not Forget To Name Your Objects
➢   Do Use Joblets Where Appropriate (there may only be a few)
➢   Do Not Overutilize The tJavaFlex Component; tJava or tJavaRow is likely enough
➢   Do Generate/Publish The Project Documentation When Done
➢   Do Not Skip Setting The Runtime Memory Heap