Data migration & data remediation for corporate acquisition in the Life Science industry
About BlueSoft
BlueSoft specializes in bespoke modern IT solutions, focused on building business value together with our customers. One of BlueSoft’s core areas is data
integration and migration. BlueSoft has 20 years of experience with projects delivered for international clients of different scales. Hundreds of qualified specialists
in this domain and a wide range of modern ETL technologies make BlueSoft a solid choice for a vendor for any data project.
Introduction
BlueSoft engineers took up the challenge of global data migration for two international companies recognizable in the Life Science sector.
The project was complex because these two companies processed and stored their data differently, using different tools and approaches to customers, sales
models, services, and regionalization.
Those challenges called for a modern and sophisticated set of reliable tools capable of secure and efficient data processing, offering a wide range of
external system connectors, and allowing secure and reliable data manipulation and cleansing with a high level of scalability.
This is why the Talend suite, supported by AWS RDS services, was chosen as the core ETL toolset for this challenge.
Challenge
BlueSoft consultants faced the challenge of a huge data volume and unknown data quality distributed among multiple sources in an initially unknown manner
(about 1.2 billion records to be processed in each project phase).
Knowledge about the source data shape and quality was limited. For business analysts to access raw data, several checks and metrics had to be built at the very
beginning of the design and analysis process.
The data had to be processed in a safe and efficient manner with extensive monitoring and troubleshooting capabilities, as well as process orchestration and
reporting at every stage of the project. Complex data processing mechanisms (ETL), non-trivial data matching (fuzzy matching), data cleansing and validation,
manual data stewardship, error reporting, and processing automation were all needed as well.
Choosing the proper tool to address this complex task was one of the most important success factors.
A tool with the following capabilities had to be leveraged:
  Processing billions of records / TBs of data
  Data security and redundancy
  Accessing multiple systems with different interface technologies
  Core ETL capabilities easy/quick to set up
  Out of the box connectors (SFDC, SAP, file, MSSQL etc.)
  Data processing orchestration, scheduling, logging, management
  Custom data manipulation using different programming languages
  Data matching including fuzzy matching algorithms
  Parallel processing and processing optimization
  Data reporting and cleansing
Main systems on Source Side:
  SAP ECC (ERP)
  SAP C4C (CRM)
  SmartSolve
  Topa
  Marketo
Main systems on Target Side:
  ORACLE ERP
  SalesForceDC (CRM)
  Magic
  Marketo
  Hybris
Solution
To meet those requirements, Talend products supported by AWS RDS were chosen as the main ETL toolset after a series of POC sprints. The high-level technical scope can be
summarized as follows:
  Talend Cloud with a Talend Remote Engine implementation (main ETL and process orchestration tool)
  Amazon Web Services (RDS - database servers in SaaS model)
  Microsoft SQL Server (main database technology)
  JIRA + Confluence + SharePoint (for project management, test management, and documentation)
  Qlik (data reporting and test supporting extracts)
First, the BlueSoft team acquired core datasets from the source systems and prepared a set of detailed data quality reports to support the Company’s business
analysts in deciding which data was eligible for migration and in formulating detailed requirements.
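As an illustration of what a basic data quality report can measure, here is a minimal Python sketch that computes a fill rate and a distinct-value count per column. It is not the project's Talend implementation; the function name `profile_records`, the column names, and the sample rows are all hypothetical.

```python
def profile_records(records, columns):
    """Return per-column fill rate and distinct-value count for a list of dicts."""
    total = len(records)
    report = {}
    for col in columns:
        values = [row.get(col) for row in records]
        # Treat None and blank strings as missing values.
        filled = [v for v in values if v is not None and str(v).strip() != ""]
        report[col] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
        }
    return report


# Hypothetical sample rows, not real project data.
sample = [
    {"customer_id": "C1", "country": "PL"},
    {"customer_id": "C2", "country": ""},
    {"customer_id": "C3", "country": "PL"},
    {"customer_id": "C4", "country": None},
]
report = profile_records(sample, ["customer_id", "country"])
```

Metrics of this kind give analysts a quick view of which attributes are complete enough to migrate and which need remediation first.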
The following challenges were addressed during the project’s lifetime.
Connectivity & throughput
Connectivity to multiple systems was established using different technical means. The majority of the data from all the source systems had to be downloaded
to make data analysis, verification, and potential cleansing possible. This operation had to be repeated at every stage of the project because of the variety of
environments and ongoing data modifications. Billions of data records (terabytes) were actively pulled from the sources multiple times and stored on RDS servers.
Data matching
In addition to quite common data transformations, which most modern tools can handle, we’ve extensively utilized Talend data matching components and a wide
range of matching algorithms to achieve maximum data pairing and merging efficiency. In some cases the Talend Data Stewardship module came in handy as well.
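To make the idea of fuzzy matching concrete, the sketch below pairs a name with its closest candidate using the similarity ratio from Python's standard `difflib` module. This is only an illustration of the general technique, not the algorithm used by the Talend matching components; the function names and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate at or above the threshold, else None."""
    scored = [(candidate, similarity(name, candidate)) for candidate in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

For example, `best_match("Acme Pharma  Ltd", ["ACME Pharma Ltd", "Zenith Labs"])` pairs the two spellings of the same (hypothetical) company, while a name with no sufficiently similar candidate returns `None` and can be routed to manual stewardship.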
Development process
In parallel with the general ETL process development and testing (multiple trial runs and test cycles), a wide range of error reports was prepared to
identify and bring attention to the most important data problems. The processed data was sensitive in nature, falling under GxP and personal data processing
regulations, which is why in the majority of cases data fixing had to be addressed by business users in the source/target systems instead of the middleware.
Technical data cleansing, such as encoding changes, unwanted character removal, and data re-formatting, took place on the fly. Production data was pulled
and dozens of error reports were recalculated and loaded into SharePoint on a weekly basis by Talend.
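The kind of technical cleansing described above (encoding change, unwanted character removal, re-formatting) can be sketched in a few lines of Python. This is a hedged illustration rather than the project's actual Talend logic; the function name `clean_text` and the default `latin-1` source encoding are assumptions.

```python
import re
import unicodedata


def clean_text(raw, source_encoding="latin-1"):
    """Decode bytes if needed, normalize Unicode, drop control characters, tidy whitespace."""
    text = raw.decode(source_encoding) if isinstance(raw, bytes) else raw
    # Compose characters so that "e" + combining accent becomes a single code point.
    text = unicodedata.normalize("NFC", text)
    # Turn every whitespace character (tabs, newlines) into a plain space.
    text = re.sub(r"\s", " ", text)
    # Remove remaining control and format characters (Unicode categories starting with "C").
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()
```

Keeping such rules purely technical matches the approach described here: business-level corrections stay with the users in the source and target systems.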
Thanks to the SaaS-based architecture of both Talend Cloud and AWS services, the migration team was able to scale the solution seamlessly, avoiding bottlenecks and
improving efficiency while shortening migration execution time.
Platform scaling in the project
  Initial 4 Talend Remote Engines scaled to 10
  1 RDS medium server with 5 DB scaled to 4 RDS large instances with more than 60 MSSQL DBs
  Team itself scaled from 5 to 2+ 1 people
On top of scalability, which improves the performance and possibilities of the ETL process, significant emphasis was placed on parallel processing. Both
the use of multiple Talend Remote Engines and multi-threaded processing in the Talend code itself added to the solution’s quality.
Moreover, the project team leveraged custom component creation capabilities to inject Python and Java code in order to optimize task execution even further.
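The general pattern behind multi-threaded processing, splitting the data into chunks and handling several chunks at once, can be sketched with Python's standard `concurrent.futures` module. This is an assumption-laden illustration, not the project's code: `transform_chunk` is a hypothetical placeholder for a real transformation, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transform_chunk(chunk):
    """Hypothetical placeholder transformation: trim and uppercase each value."""
    return [value.strip().upper() for value in chunk]


def process_in_parallel(items, chunk_size=1000, workers=4):
    """Transform chunks on a thread pool and return the rows in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were submitted.
        results = pool.map(transform_chunk, chunked(items, chunk_size))
        return [row for chunk in results for row in chunk]
```

In CPython, threads mainly help when each chunk spends its time waiting on I/O such as database reads and writes; CPU-heavy transformations would typically use processes or separate engines instead.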
The flexibility of database setup and replication on RDS combined with easy connectivity and job execution orchestration on Talend Cloud made data separation
between environments straightforward. The main aim was to keep data tidy and independent through all the data migration testing and rehearsal phases.
Talend Components/Services used in the project:
  Talend Studio
  Talend Remote Engines
  Talend Management Console
  Talend Stewardship Module
  Talend Academy
  Talend Professionals
Results
Over the two years of the project, a Talend-based solution was successfully implemented for this customer by a team including BlueSoft consultants and
employees of the merged companies. More than 6 TB of data have been migrated via AWS & Talend Cloud, while more than 14 Talend-knowledgeable and certified
engineers worked on the migration and analysis of a database of over a billion records.
Once this project is finalized, the client plans to leverage Talend to:
  Improve data quality and perform data cleansing in core systems
  Automate manual processes concerning verification across systems
  Build several system integrations
  Perform further data migrations of different scales
  6 TB of cloud stored data at the moment
  2 years
  1.2 billion records
  10 Talend Remote Engines
  2 projects
                                        At BlueSoft, we don’t just code. We understand your business.
Reach out to us for collaboration with a trusted partner who supports your IT growth and your business goals.
Łukasz Bober
Business Unit Director
lukasz.bober@bluesoft.com | +48 603 911 131
www.bluesoft.com | Powered by BlueSoft