KetulKumar Polara

Data Scientist
Email: sravya@silverxis.com
Phone: 214-903-0242
Data Scientist with 3+ years of professional experience performing Statistical Modelling, Data Mining, Data
Exploration, and Data Visualization on structured and unstructured datasets, and implementing Machine
Learning and Deep Learning models grounded in business understanding to deliver insights that drive key
business decisions.
PROFESSIONAL SUMMARY:
• Experience working across various domains, facilitating the entire lifecycle of a data science project: Data
Extraction, Data Pre-Processing, Feature Engineering, Dimensionality Reduction, Algorithm Implementation,
Back-Testing, and Validation.
• Good practical knowledge of Unsupervised Learning techniques such as clustering, dimensionality reduction,
recommender systems, and deep learning, and of applying regression, classification, and clustering techniques.
• Good practical knowledge of Statistical Analysis and Machine Learning techniques, including supervised
learning methods such as k-NN, Support Vector Machines, kernel methods, and neural networks (a minimal
workflow sketch follows this list).
• Experience in data wrangling and loading on Big Data platforms such as Apache Spark, and in working
efficiently with SQL Server: extracting data from several sources, transforming it in transit, and loading it
into the relevant platform for analysis.
• Experience in cloud deployment with Azure ML Studio, AWS SageMaker, and Docker containers.
• Expertise in Machine Learning techniques in Python and R (RStudio), including regression models such as
Linear, Polynomial, and Support Vector Regression; classification models such as Logistic Regression, Decision
Trees, Support Vector Machines, and K-NN (K Nearest Neighbors); and clustering methods such as K-means.
• Knowledge of ML frameworks such as TensorFlow, Keras, Scikit-Learn, and PyTorch, and expertise in
coding platforms such as Spyder, Jupyter Notebook, and RStudio offered through Anaconda
Navigator.
• Hands on experience in Predictive Modeling of large Structured and Unstructured data.
• Strong experience and knowledge in provisioning virtual clusters on the AWS cloud, including services
such as EC2, S3, VPC, RDS, Glacier, Redshift, and EMR.
• Experience with the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce,
HiveQL, and PySpark.
• Experience in text mining and topic modeling using NLP and Neural Networks: tokenizing, stemming,
lemmatizing, and part-of-speech tagging with TextBlob, the Natural Language Toolkit (NLTK), and SpaCy
while building Sentiment Analysis.
• Knowledge of AI & Deep Learning techniques such as Convolutional Neural Networks (CNN) for Computer
Vision, Recurrent Neural Networks (RNN), and Deep Neural Networks, with applications of Backpropagation,
Stochastic Gradient Descent (SGD), Long Short-Term Memory (LSTM), Continuous Bag-of-Words (CBOW),
Text Analytics, etc.
• Hands-on working experience with TensorFlow for Deep Learning (Deep Neural Networks and Convolutional
Neural Networks (CNN)).
• Proficient in using PostgreSQL, Microsoft SQL Server, and MySQL databases to extract data using multiple
types of SQL queries, including Create Table, Join, Conditionals, Drop, Case, etc.
• Good knowledge of Data Modelling, including Star Schema and Snowflake models, Fact Tables,
Dimension Tables, E-R modelling, and Dimensional modelling.
• Skilled in creating executive Tableau dashboards for Data Visualization and deploying them to servers.
• Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables, and
OLAP reporting.
• Hands-on experience with Apache Hive and Apache Spark using Python for Big Data; collected insights
from data using Hive queries to support business decisions.
• Proficient in Data Visualization tools such as Tableau and PowerBI, Big Data tools such as Hadoop HDFS,
Spark (PySpark), and MapReduce.
• Experience using Matplotlib and Seaborn in Python for visualization and Pandas in Python for performing
exploratory data analysis.
• Experience with NoSQL databases such as MongoDB, Cassandra, and HBase, combined with SQL,
Python programming, and API integration.
• Excellent hands-on experience building time series models such as ARMA and ARIMA for predictive
analytics and forecasting.
• Experience in Web Data Mining using Python’s NLTK, Scrapy, and Beautiful Soup packages, and REST APIs,
along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
• Knowledge in creating and developing rich, polished Power BI and Tableau dashboards.
• Experience with Python libraries including NumPy, Pandas, SciPy, Scikit-learn, NLTK, and SpaCy.
• Experienced in A/B testing design and execution, and in deploying machine learning models into
production for the teams.
• Deep knowledge of SQL for writing Queries, Stored Procedures, User-Defined Functions, Views,
Triggers, and Indexes.
• Excellent communication skills and experience in daily scrum meetings with cross teams.
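As a minimal illustration of the supervised-learning workflow summarized in the bullets above, the sketch below fits and evaluates a k-NN classifier with scikit-learn. It uses scikit-learn's bundled iris dataset purely as a stand-in for real project data; it is a sketch, not any specific client implementation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Small built-in dataset, used purely for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling matters for distance-based models such as k-NN; the pipeline
# keeps the preprocessing inside each cross-validation fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Hold-out evaluation plus 5-fold cross-validation on the training set.
print(classification_report(y_test, model.predict(X_test)))
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
```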

EDUCATION:
• Bachelor's in Information Technology, Florida International University, Miami, FL

CERTIFICATIONS:
1. Coursera – IBM Data Analysis with Python
2. Coursera – IBM Machine Learning with Python
3. Coursera – IBM Statistics for Data Science with Python
4. Coursera – IBM Data Science Methodology
5. Coursera – IBM What is Data Science
6. Coursera – IBM Tools for Data Science
7. Coursera – UCSanDiego Machine Learning With Big Data
8. Coursera – UCSanDiego Big Data Modeling and Management Systems
9. Coursera – UCSanDiego Introduction to Big Data
10. Coursera – UCSanDiego Big Data Integration and Processing

TECHNICAL SKILLS:
Operating Systems: Windows, Linux
Methodologies: Waterfall, Agile/Scrum
Regression Methods: Linear, Polynomial, Decision Trees
Classification: Logistic Regression, K-NN, Decision Trees, Naïve Bayes, Support Vector Machines (SVM)
Clustering: K-means Clustering, Hierarchical Clustering
Deep Learning: Artificial Neural Networks, Computer Vision (Convolutional Neural Networks); PyTorch, TensorFlow
Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
Machine Learning (ML)/Deep Learning (DL): TensorFlow, Keras, Scikit-Learn, Classification, Regression, Feature Engineering, One-Hot Encoding, Clustering, Regression Analysis, Naïve Bayes, Decision Trees, Random Forest, Support Vector Machines, KNN, Ensemble Methods, K-Means Clustering, Time Series Analysis, Confidence Intervals, Principal Component Analysis, NLP (LSTM), Dimensionality Reduction
Recommendation Engines: Association Rule Learning (Market Basket Analysis), Collaborative Filtering, Segmentation
Natural Language Processing (NLP): Stemming, NLTK, SpaCy, TF-IDF, Word2Vec, Doc2Vec, Topic Modelling, Sentiment Analysis
Ensemble Learning: Random Forests, Gradient Boosting, etc.
Statistical Analysis: Hypothesis Testing, A/B Analysis, ANOVA, MANOVA, Normal Distribution, Mean, Median, Mode, Standard Deviation, Regression, F-test
Time Series: ARIMA, SARIMA, Multiplicative & Additive Decomposition
Data Visualization/ETL Tools: Tableau, ggplot2, Plotly, Power BI, Matplotlib, Seaborn, SSIS, Informatica
Languages: Python (Jupyter Notebook, Spyder, Google Colab), R (Shiny, Statistical Analysis), RStudio, PostgreSQL, Hive, MySQL
Database Systems: SQL Server, Oracle, MySQL, Teradata, NoSQL (MongoDB, HBase, Cassandra), AWS (DynamoDB, ElastiCache)
Big Data Analysis: Apache Spark (PySpark), Hadoop (HDFS, MapReduce), Hive, Sqoop, Spark MLlib
Cloud Services: Google Cloud Platform (GCP), AWS (S3, EC2, SageMaker)

PROFESSIONAL EXPERIENCE:

Client: Florence Healthcare, Atlanta, GA Jan 2020 – Present


Role: Data Scientist

Description:
Florence Healthcare advances clinical trials through software for managing document and data flow between research
sites and sponsors. Florence Healthcare specializes in Clinical Trials, Electronic Trial Master File (eTMF), Clinical
Trial Sites, Clinical Trial Management, 21 CFR Part 11, eRegulatory, eSource, Remote Monitoring, Cancer Centers,
Academic Medical Centers, CROs, Pharmaceuticals, and Medical Devices.

Responsibilities:
• Analyzed business requirements and developed applications and models, applying appropriate algorithms
to arrive at the required insights.
• Designed application components in an Agile environment utilizing a test-driven development approach
• Developed and implemented predictive models using machine learning algorithms such as linear regression,
classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering and KNN.
• Utilized PySpark, Spark Streaming, and MLlib with a broad variety of machine learning methods, including
classification, regression, and dimensionality reduction.
• Tested survival models, including state-of-the-art neural networks for survival analysis, using the Python
deep learning packages TensorFlow and Keras.
• Built linear regression models for prediction and used linear methods for statistical significance tests and
correlations in R.
• Collaborated with data engineers and operation team to implement the ETL process, wrote and optimized
SQL queries to perform data extraction to fit the analytical requirements.
• Implemented and tested the model on AWS EC2 and collaborated with the development team to find the best
algorithms and parameters.
• Built a machine learning model for market segmentation using k-means clustering in Python (see the
segmentation sketch after this list).
• Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data
from Redshift.
• Explored and analyzed the customer-specific features by using Spark SQL.
• Used AWS SageMaker to train models on protobuf-formatted data and deploy them, owing to its relative
simplicity and computational efficiency compared with AWS Elastic Beanstalk.
• Involved in various pre-processing phases of text data, such as Tokenization, Stemming, and Lemmatization,
converting raw text into structured data.
• Worked on statistical methods such as data-driven Hypothesis Testing and A/B Testing to draw inferences
and determine significance levels.
• Developed time series, regression, decision tree, and artificial neural network algorithms using Python and R.
• Performed Data Visualization in RStudio using ggplot2, lattice, highcharter, and Leaflet.
• Evaluated and tested statistical and Machine Learning models using residual graphical analysis, test harnesses,
and k-fold cross-validation techniques.
• Developed forecast models using statistical methods such as Auto Regressive Integrated Moving Average
(ARIMA) and the Auto-correlation Function (ACF).
• Worked on AWS which includes Amazon Kinesis, Amazon Simple Storage Service (Amazon S3),
Spark Streaming, PySpark and Spark SQL on top of an Amazon EMR cluster.
• Performed Data Cleaning, Data Exploration, Data Visualization, Feature Selection, and Engineering using
Python libraries such as Pandas, Numpy, Sklearn, Matplotlib, and Seaborn.
• Worked on text parsing, NLP, and stemming using the Python package NLTK.
• Prepared data visualizations and designed dashboards with Tableau, and generated complex reports including
summaries and graphs to communicate findings to the team and stakeholders.
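A minimal sketch of the k-means segmentation approach described above, using Spark MLlib through PySpark. The input path and the feature columns ("age", "visits", "spend") are hypothetical placeholders, not the actual clinical-trial schema:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation-sketch").getOrCreate()

# Hypothetical input path and columns, standing in for the real data.
df = spark.read.parquet("s3://example-bucket/customers/")

assembler = VectorAssembler(inputCols=["age", "visits", "spend"],
                            outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
kmeans = KMeans(featuresCol="features", k=4, seed=1)

# Fit the whole assemble -> scale -> cluster pipeline in one pass.
model = Pipeline(stages=[assembler, scaler, kmeans]).fit(df)

# transform() appends a "prediction" column holding each row's cluster id.
segments = model.transform(df)
segments.groupBy("prediction").count().show()
```

Keeping the assembler, scaler, and clusterer in one Pipeline means the same fitted transformations can be reapplied verbatim when scoring new data.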
Environment: SDLC, Agile, Scrum, AWS Redshift, EC2, EMR, Hadoop, S3, HDFS, Spark (PySpark, MLlib,
Spark SQL), Python (Scikit-Learn/ SciPy/ NumPy/ Pandas/ Matplotlib/ Seaborn), R, RStudio, ARIMA, ACF,
Tableau Desktop, Tableau Server, Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest,
XGBoost, LightGBM, Collaborative Filtering, Ensemble), AWS SageMaker, ETL, SQL, AWS Elastic Beanstalk,
Teradata, Git, TensorFlow, Keras, A/B Testing, NLTK.

Client: Tower Hill Insurance, Gainesville, FL Oct 2018 - Dec 2019


Role: Data Scientist

Description:
Tower Hill Insurance is a leader among residential and commercial property insurers in the Southeast. Financial
strength, product expertise, a comprehensive reinsurance program, and exceptional claims service are core business
strategies of the organization. Tower Hill offers homeowners, mobile homeowners, rental property, renters,
commercial, flood, and equipment breakdown coverage.

Responsibilities:
• Developed Sentiment Analysis using Machine Learning and NLP by training on historical data provided by
the organization to understand the sentiment of end-users.
• Performed Data Collection, Data Cleaning, Data Visualization, and Text Feature Extraction, and surfaced
key statistical findings to develop business strategies.
• Used various Python libraries such as Pandas, NumPy, Scikit-learn, SciPy, Seaborn, Matplotlib, SpaCy, and
Keras to perform dataset manipulation, data mapping, data cleansing, and feature engineering.
• Employed NLP to classify text within the dataset. Categorization involved labeling natural language texts with
relevant categories from a predefined set.
• Trained a model in Python to predict sentiment on word embeddings of the reviews using Word2Vec.
• Built classification models based on Logistic Regression, Decision Trees, and Random Forests to classify
texts by label (a minimal sketch of this text-classification approach follows this list).
• Analyzed large, noisy datasets and identified meaningful patterns that provided actionable results.
• Built Machine Learning models in Python, such as Logistic Regression, Naïve Bayes, and Random Forests,
and built Deep Learning networks such as Recurrent Neural Networks (LSTM) and Artificial Neural Networks
(ANN) to predict sentiment.
• Performed Feature Engineering techniques in Natural Language Processing (NLP) such as TF-IDF, Word2Vec,
and Doc2Vec.
• Performed univariate, bivariate, and multivariate analysis to check how the features were related in conjunction
with each other and the risk factor.
• Worked with deep learning frameworks such as TensorFlow and Keras to build models such
as ANN, CNN, and LSTM.
• Applied PCA to reduce the correlation between features and high dimensionality of the standardized data so
that maximum variance is preserved along with relevant features.
• Used Logistic Regression, Support Vector Classifiers, and ensemble methods such as Random Forests, Gradient
Boosting Machines, and XGBoost to train models; optimized the models using Grid Search and made
predictions on the test set with each trained model.
• Generated confusion matrices and classification reports to evaluate the accuracy and performance of the
different models used.
• Involved in extracting data from various sources and performed data cleansing, data integration, data
transformation, and data mapping, loading data into Hadoop with Apache Spark using PySpark and
Spark SQL.
• Evaluated model performance using A/B Testing, k-fold cross-validation, R-squared, and Confusion
Matrices.
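A minimal sketch of the text-classification approach described above: TF-IDF features feeding a logistic-regression sentiment classifier in scikit-learn. The toy reviews and labels are invented for illustration and are not client data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews and labels, invented for illustration (1 = positive, 0 = negative).
reviews = [
    "great service, quick claim payout",
    "terrible experience, slow and unhelpful",
    "friendly agents and fair pricing",
    "claim denied with no explanation",
]
labels = [1, 0, 1, 0]

# TF-IDF unigrams and bigrams feeding a logistic-regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(reviews, labels)

print(clf.predict(["very helpful and fast service"]))
```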
Environment: AWS Redshift, EC2, EMR, Hadoop Framework, S3, HDFS, Spark (PySpark, MLlib, Spark SQL),
Python (Scikit-Learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn), Mlxtend, Tableau Desktop, Tableau Server,
Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGBoost, LightGBM, Ensemble),
Teradata, Git, Agile/SCRUM, NLP, Word2Vec, TensorFlow, Keras, ANN, CNN, LSTM, PCA.

Client: Odyssey Manufacturing, Tampa, FL Aug 2017 - Sep 2018


Role: Junior Data Scientist

Description: Odyssey Manufacturing Co. manufactures and sells bulk sodium hypochlorite products. It also stocks
and sells various disinfection equipment and bulk storage tanks; chemical feed pumps, as well as supporting piping
products, valves, and equipment; equipment to support disinfection, including chlorine analyzers; and on-site
hypochlorite generation systems for municipal applications.

Responsibilities:
• Participated in all phases of the project life cycle including data mining, data cleaning, Data Exploration and
developing models, validation, and creating reports.
• Performed data cleansing on a huge dataset that had missing values and extreme outliers, and explored the
data to draw out relationships and correlations between variables.
• Tackled a highly imbalanced dataset using under-sampling, oversampling with SMOTE, and cost-sensitive
algorithms with Python Scikit-learn (a minimal SMOTE sketch follows this list).
• Wrote complex Spark SQL queries for data analysis to meet the business requirement.
• Developed MapReduce/ Spark Python modules for predictive analytics & machine learning in Hadoop on
AWS.
• Worked on data cleaning and ensured data quality, consistency, integrity using Python - Pandas, Numpy.
• Participated in feature engineering such as feature intersection generation, feature normalization, and label
encoding with Scikit-learn preprocessing.
• Performed data-preprocessing on messy data including imputation, normalization, scaling, and feature
engineering using Pandas.
• Utilized random under-sampling to create a training dataset with a balanced class distribution.
• Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns
and correlations between features.
• Used Linear Discriminant Analysis (LDA) as a dimensionality reduction technique in the pre-processing
step for pattern classification and machine learning models.
• Used t-SNE to project the higher-dimensional distributions into lower-dimensional visualizations.
• Built classification models based on KNN, Random Forest, and an XGBoost Classifier to predict loan
defaults.
• Used Neural Network as a classification model using Keras with TensorFlow backend on Google Colab
GPUs.
• Used metrics such as F-Score, ROC, and AUC to evaluate the performance of each model, and Cross-
Validation to test the models on different batches of data and optimize them.
• Worked on an ETL package that included data conversions, dynamic variable expressions, sequence containers,
and conditional data flow using SSIS.
• Applied comprehensive tuning of the Random Forest algorithm and found the specific factors most important
for detecting fraudulent transactions.
• Implemented and tested the model and collaborated with the development team to get the best algorithms and
parameters.
• Used big data tools in Spark (PySpark, Spark SQL, MLlib) to conduct real-time analysis of loan defaults
on AWS.
• Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published
data sources to Tableau Server.
• Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau
dashboards. Queries involved retrieving data from multiple tables using various join conditions, enabling
efficiently optimized data extracts for Tableau workbooks.
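A minimal sketch of the SMOTE oversampling step described above, using the imbalanced-learn package. The synthetic dataset generated here stands in for the real (confidential) loan data:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic ~95/5 imbalanced binary dataset, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class rows by interpolating between
# existing minority-class neighbors. In practice, resample only the
# training split so synthetic rows never leak into the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```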

Environment: AWS, MS SQL Server, Teradata, ETL, SSIS, Tableau (Desktop/ Server), Python (Scikit-Learn/
SciPy/ NumPy/ Pandas), Linear Discriminant Analysis (LDA), Machine Learning (Naïve Bayes, KNN, Regressions,
Random Forest, SVM, XGBoost, Ensemble, Neural Network), AWS Redshift, Spark (PySpark, Spark SQL, MLlib),
Hadoop, MapReduce, HDFS, SharePoint.
