How to win a Kaggle competition in
Data Science (via Coursera): part 1/5
Eric Perbos-Brinck
Apr 25, 2018 · 9 min read
[Figure: model example of multi-level stacking in the Homesite competition. Source: Coursera]
These are my notes from the 5-week course on Coursera, as taught by a team of data scientists and Kaggle Grandmasters.
## Week 1 ##
by Alexander Guschin, GM #5, Yandex, lecturer at MIPT,
and Mikhail Trofimov, PhD student at CCAS
Learning Objectives
- Describe competition mechanics
- Compare real-life applications and competitions
- Summarize reasons to participate in data science competitions
- Describe main types of ML algorithms
- Describe typical hardware and software requirements
- Analyze decision boundaries of different classifiers
- Use standard ML libraries
1. Introduction and course overview
Among all topics of data science, competitive data analysis is especially interesting. For an experienced specialist it is a great area to test their skills against other people and learn some new tricks; for a novice it is a good way to quickly and playfully learn the basics of practical data science. For both, engaging in a competition is a good chance to expand their knowledge and get acquainted with new people.
- Week #1:
  - Describe competition mechanics
  - Compare real-life applications and competitions
  - Summarize reasons to participate in data science competitions
  - Describe main types of ML algorithms
  - Describe typical hardware and software requirements
  - Analyze decision boundaries of different classifiers
  - Feature preprocessing and generation with respect to models
  - Feature extraction from text and images
- Week #2:
  - Exploratory Data Analysis (EDA)
  - EDA examples and visualizations
  - Inspect the data and find golden features
  - Validation: risk of overfitting, strategies and problems
  - Data leakages
- Week #3:
  - Metrics optimization in a competition, new metrics
  - Advanced Feature Engineering I: mean encoding, regularization, generalizations
- Week #4:
  - Hyperparameter Optimization
  - Tips and Tricks
  - Advanced Feature Engineering II: matrix factorization for feature extraction, t-SNE, feature interactions
  - Ensembling
- Week #5:
  - Competition "walk-through" examples
  - Final project
2. Competition mechanics
2.1. There is a great variety of competitions: NLP, Time Series, Computer Vision...
But they all share the same structure:
- Data is supplied with a description.
- An evaluation function (metric) is given.
- You build a model and fill in the submission file.
- Your submission is scored on a Leaderboard with Public and Private test sets.
The Public set is used during the competition, the Private one for the final ranking.
- You can submit between 2 and 5 files per day.

Why participate in a competition?
- Great opportunity for learning and networking
- Interesting, non-trivial tasks and state-of-the-art approaches
- A way to get recognition inside the Data Science community, and possibly job offers
2.2. Kaggle overview
Walkthrough of a Kaggle competition (Zillow home value estimation):
- Overview with description, evaluation, prizes and timeline
- Data provided by the organizer, with a description
- Public kernels created by participants, which can be used as a starting point, especially for EDA
- Discussion: the organizer can provide additional information and answer questions
- Leaderboard: shows the best score of each participant and the number of submissions; calculated on the Public set during the competition
- Rules
- Team: you can create a team with other participants; check the rules and beware of the maximum number of submissions allowed (individual participants vs teams)
2.3. Real-World Applications vs Competitions
- A real-world ML pipeline is a complicated process, including:
  - Understand the business problem
  - Formalize the problem (what is spam?)
  - Collect the data
  - Clean and preprocess the data
  - Choose a model
  - Define how the model will be evaluated in real life
  - Consider inference speed
  - Deploy the model to users
Competitions focus only on:
- Clean and preprocess the data
- Choose a model
ML competitions are a great way to learn but they don’t address the questions of
formalization, deployment and testing.
Don’t limit yourself: it’s OK to use Heuristics and Manual Data Analysis.
Don't be afraid of complex solutions, advanced feature engineering, huge calculations,
ensembling.
The ultimate goal is to achieve the highest score on the evaluation metric.
3. Recap of main ML algorithms
3.1. Main ML algorithms
- Linear models: try to separate data points with a plane (hyperplane) into 2 subspaces.
Examples: Logistic Regression, Support Vector Machines (SVM).
Available in Scikit-Learn or Vowpal Wabbit.
- Tree-based: use Decision Trees (DT), like Random Forest and Gradient Boosted Decision Trees (GBDT).
They apply a "divide and conquer" approach by splitting the data into sub-spaces or boxes based on probabilities of outcome.
In general, DT models are very powerful for tabular data, but rather weak at capturing linear dependencies, as that requires a lot of splits.
Available in Scikit-Learn, XGBoost, LightGBM.
- kNN: k-Nearest Neighbors, looks for the nearest data points. Close objects are likely to have the same labels.
- Neural Networks: often seen as a "black box", can be very efficient for images, sound, text and sequences.
Available in TensorFlow, PyTorch, Keras.

No Free Lunch Theorem: there is no single method that outperforms all the others for all tasks.
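As a hedged illustration of the first two families (a toy sketch, not part of the course material), the snippet below fits a linear model and a tree ensemble on a clearly non-linear toy dataset with Scikit-Learn:

```python
# Minimal sketch (toy data, not from the course): a linear model vs. a tree ensemble.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A toy dataset whose class boundary is clearly non-linear
X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))

# The tree ensemble usually scores higher here because the boundary is non-linear;
# on linearly separable data the picture can reverse: no free lunch.
```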
3.2. Disclaimer
If you don't know much about basic ML algorithms, check these links before taking the quiz.
- Random Forest: http://www.datasciencecentral.com/profiles/blogs/random-forests-explained-intuitively
- Gradient Boosting: http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
- kNN: https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/
3.3. Additional Materials and Links
Covers the Scikit-Learn library with kNN, Linear Models, Decision Trees.
Plus H2O documentation on algorithms and parameters.
- Vowpal Wabbit
- XGBoost
- LightGBM
- Neural Nets with Keras, PyTorch, TensorFlow, MXNet & Lasagne
https://www.coursera.org/learn/competitive-data-science/supplement/AgAOD/additional-materials-and-links
4. Software and Hardware requirements
4.1. Hardware
Get a PC with a recent Nvidia GPU, a CPU with 6 cores and 32 GB of RAM.
Fast storage is critical, especially for Computer Vision, so an SSD is a must and an NVMe drive is even better.
Otherwise, use cloud services like AWS, but beware of the operating costs vs. a dedicated PC.
4.2. Software
- Linux (Ubuntu with Anaconda) is best; some key libraries aren't available on Windows.
- Python is today's favorite as it supports a massive pool of libraries for ML.
- NumPy for linear algebra, Pandas for dataframes (like SQL), Scikit-Learn for classic ML algorithms.
- Matplotlib for plotting.
- Jupyter Notebook as an IDE (Integrated Development Environment).
- XGBoost and LightGBM for gradient-boosted decision trees.
- TensorFlow/Keras and PyTorch for Neural Networks.
4.3. Links for installation and documentations
https://www.coursera.org/learn/competitive-data-science/supplement/Djqi7/additional-material-and-links
5. Feature preprocessing and generation with respect to
models
5.1. Overview with Titanic on Kaggle
- Feature types: numeric, categorical (Red, Green, Blue), ordinal (ordered categories), datetime, coordinates

5.2. Numeric features
- Feature generation example: the fractional part of a price (e.g. 2.99 € -> 0.99; 2.49 € -> 0.49) can capture how buyers perceive prices
- Advanced one: generating the time intervals between a user's keystrokes while typing a message (for spambot detection)
Conclusion: DT models don't depend on scaling, but non-DT models hugely depend on it.
Most used preprocessings: MinMaxScaler, StandardScaler, rank transform, np.log(1 + x) and np.sqrt(1 + x).
Generation is powered by EDA and business knowledge.
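A minimal sketch of these preprocessings on a toy numeric column (the column name and values are made up for the example):

```python
# Sketch (toy values, not from the course) of the most common numeric preprocessings.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

prices = pd.DataFrame({"price": [1.0, 2.5, 7.0, 120.0]})  # hypothetical numeric feature

minmax = MinMaxScaler().fit_transform(prices)       # rescales to [0, 1]
standard = StandardScaler().fit_transform(prices)   # mean 0, std 1
ranked = prices["price"].rank()                     # rank transform, robust to outliers
logged = np.log1p(prices["price"])                  # np.log(1 + x), shrinks large values
rooted = np.sqrt(1 + prices["price"])               # np.sqrt(1 + x)
```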
5.3. Categorical and ordinal features
5.3.1. Feature Preprocessing:
There are three categorical features in the Titanic dataset: Sex, Cabin, Embarked (port name).
Reminder on ordinal feature examples:
- Pclass (1, 2, 3) as an ordered categorical feature, or
- Driver's license type (A, B, C, D), or
- Education level (kindergarten, school, college, bachelor, master, doctoral)
A. One technique is Label Encoding (replacing categories with numbers).
Good for DT, not so much for non-DT.
For Embarked (S for Southampton, C for Cherbourg, Q for Queenstown):
- Alphabetical (sorted): [S, C, Q] -> [2, 1, 3] with sklearn.preprocessing.LabelEncoder
- Order of appearance: [S, C, Q] -> [1, 2, 3] with pandas.factorize
- Frequency encoding: [S, C, Q] -> [0.5, 0.3, 0.2], better for non-DT as it preserves information about the value distribution, but still great for DT
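A minimal sketch of the three encodings on a toy Embarked column (the exact frequencies of course depend on the data):

```python
# Sketch (toy Titanic-like data): the three label-encoding flavours described above.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "S", "C"]})

# Alphabetical (sorted) label encoding
df["embarked_le"] = LabelEncoder().fit_transform(df["Embarked"])

# Order-of-appearance encoding (pandas.factorize is 0-based)
df["embarked_factorized"] = pd.factorize(df["Embarked"])[0]

# Frequency encoding: share of rows carrying each category
freq = df["Embarked"].value_counts(normalize=True)
df["embarked_freq"] = df["Embarked"].map(freq)
print(df)
```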
B. Another technique is One-Hot Encoding: (0, 0, 1) or (0, 1, 0) for each row, with pandas.get_dummies or sklearn.preprocessing.OneHotEncoder.
Great for non-DT, plus it is already scaled (min = 0, max = 1).
Warning: if a category has too many unique values, one-hot encoding generates too many columns with lots of zero values.
To save RAM, consider sparse matrices that store only the non-zero elements (tip: useful when non-zero values are far less than 50% of the total).
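A minimal sketch of one-hot encoding with both libraries on toy data; note that OneHotEncoder returns a scipy sparse matrix by default, which is exactly the RAM-saving trick mentioned above:

```python
# Sketch: one-hot encoding with pandas and scikit-learn, sparse output to save RAM.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# pandas: dense dummy columns
dummies = pd.get_dummies(df, columns=["Embarked"])

# scikit-learn: sparse matrix by default, useful when a category has many unique values
ohe = OneHotEncoder()
sparse_matrix = ohe.fit_transform(df[["Embarked"]])
print(dummies)
print(sparse_matrix.shape, sparse_matrix.nnz, "non-zero values")
```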
5.3.2. Feature Generation for categorical features:
(more in later lessons)
5.4. Datetime and Coordinates features
A. Date & Time:
- 'Periodicity' (day number in week, month, year; season) is used to capture repetitive patterns.
- 'Time since': e.g. time since a drug was taken, since the last holidays, or the number of days left before an event.
Can be row-independent (e.g. since 00:00:00 UTC, 1 January 1970) or row-dependent (since the last drug taken, the last holidays, the number of days left before an event, etc.)
- 'Difference between dates' for churn prediction, like Last_purchase_date - Last_call_date = Date_diff
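A minimal sketch of these datetime features with Pandas; the column names (last_purchase_date, last_call_date) are hypothetical:

```python
# Sketch (hypothetical columns): periodicity, "time since", and date-difference features.
import pandas as pd

df = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2018-01-05", "2018-03-20"]),
    "last_call_date": pd.to_datetime(["2018-01-01", "2018-02-14"]),
})

# Periodicity: repetitive patterns by day of week, month, etc.
df["purchase_dayofweek"] = df["last_purchase_date"].dt.dayofweek
df["purchase_month"] = df["last_purchase_date"].dt.month

# Row-independent "time since": days since a fixed moment (here the Unix epoch)
df["days_since_epoch"] = (df["last_purchase_date"] - pd.Timestamp("1970-01-01")).dt.days

# Row-dependent difference between two dates, as in the churn example above
df["date_diff"] = (df["last_purchase_date"] - df["last_call_date"]).dt.days
print(df)
```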
B. Coordinates:
- Distance to the nearest POI (subway, school, hospital, police, etc.)
- You can also build clusters from the coordinates and use "distance to the cluster's center" as a feature.
- Or create aggregate stats, such as "number of flats in the area" or "mean realty price in the area".
Advanced tip: look into rotating the coordinates.
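A hedged sketch of the first two coordinate features with Scikit-Learn; the flat and school coordinates are made up for illustration:

```python
# Sketch (made-up coordinates): distance to nearest POI and distance to cluster centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

flats = np.random.RandomState(0).rand(100, 2)   # hypothetical (x, y) of flats
schools = np.array([[0.2, 0.8], [0.7, 0.1]])    # hypothetical POI coordinates

# Distance from each flat to its nearest school
dist_to_school = pairwise_distances(flats, schools).min(axis=1)

# Cluster the flats and use the distance to the assigned cluster center as a feature
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(flats)
dist_to_center = np.linalg.norm(flats - kmeans.cluster_centers_[kmeans.labels_], axis=1)
```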
5.5. Handling missing values
Types of missing values: NaN, empty string, "-1" (e.g. replacing missing values in a [0, 1] feature), a very large number ("99999", "-999"), etc.
- Fillna approaches:
  - -999, -1, etc., or
  - mean & median, or
  - an "isnull" binary feature can be beneficial, or
  - reconstruct the missing value if possible (best approach)
Do not fill NaNs before feature generation: this can pollute the data (e.g. "time since" or frequency/label encoding) and hurt the model.
XGBoost can handle NaN natively; worth trying.
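A minimal sketch of the fillna approaches and the "isnull" indicator on a toy column:

```python
# Sketch (toy column): the fillna approaches listed above, plus an "isnull" flag.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 60.0]})

df["age_isnull"] = df["age"].isnull().astype(int)            # binary missing-value indicator
df["age_fill_999"] = df["age"].fillna(-999)                  # out-of-range constant, fine for DT
df["age_fill_mean"] = df["age"].fillna(df["age"].mean())     # mean, often safer for non-DT
df["age_fill_median"] = df["age"].fillna(df["age"].median()) # median, robust to outliers
print(df)
```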
Treating test values not present in the train data: frequency encoding can help, as the frequency can be computed on the test set as well.
6. Feature extraction from text and images
6.1. Bag of Words (BOW)
[Figure: bag-of-words example. Source: Coursera]
For Titanic, we can extract information/patterns from the passengers’ names such as
their family members/siblings or their titles (Lord, Princess)
How-to: sklearn.feature_extraction.text.CountVectorizer
Creates 1 column per unique word and counts its occurrences per row (phrase).
A. Text preprocessing
- Lowercase: Very -> very
- Lemmatization: democracy, democratic, democratization -> democracy (requires a good dictionary/corpus)
- Stemming: democracy, democratic, democratization -> democr
- Stopwords: get rid of articles, prepositions and very common words; use NLTK (Natural Language ToolKit) or sklearn.feature_extraction.text.CountVectorizer with max_df

B. N-grams, for sequences of words or characters, can help to use local context:
sklearn.feature_extraction.text.CountVectorizer with ngram_range and analyzer (see the sketch below)
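A hedged sketch of CountVectorizer with the knobs mentioned above (lowercase, stop words, max_df, ngram_range, analyzer); the passenger names are made up for illustration:

```python
# Sketch: bag-of-words on Titanic-style names with lowercasing, stop words and n-grams.
from sklearn.feature_extraction.text import CountVectorizer

names = ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"]

# lowercase=True is the default; max_df drops words that appear in too many documents
bow = CountVectorizer(lowercase=True, stop_words="english", max_df=0.95)
X_bow = bow.fit_transform(names)

# Word bigrams (or analyzer="char" for character n-grams) keep some local context
bigrams = CountVectorizer(ngram_range=(1, 2), analyzer="word")
X_bigrams = bigrams.fit_transform(names)

print(bow.get_feature_names_out())  # get_feature_names() on older scikit-learn versions
```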
C. TF-iDF for postprocessing (required to scale features for non-DT models):
- TF: Term Frequency (in % per row, sums to 1), followed by
- iDF: Inverse Document Frequency (to boost rare words vs frequent words)
sklearn.feature_extraction.text.TfidfVectorizer
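A minimal sketch of this TF-iDF postprocessing with Scikit-Learn (toy documents, not from the course):

```python
# Sketch: TF-iDF, which re-weights bag-of-words counts and scales them for non-DT models.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the queen met the king", "the king spoke", "a rare democratization event"]

tfidf = TfidfVectorizer()       # term frequency re-weighted by inverse document frequency
X = tfidf.fit_transform(docs)   # sparse matrix, one row per document
print(X.shape)
```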
6.2. Using Word Vectors and ConvNets
A. Word Vectors
- Word2vec converts each word to a vector in a space with hundreds of dimensions, creating embeddings where words often used together in the same context end up close to each other.
King with Man, Queen with Woman.
King - Queen = Man - Woman (as vectors)
- Other word vectors: GloVe, FastText
- Sentences: Doc2vec
There are pretrained models, e.g. trained on Wikipedia.
Note: preprocessing can be applied BEFORE using Word2vec.
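The course does not prescribe a specific library; below is a hedged sketch using gensim (parameter names follow gensim 4.x) on a toy, already-tokenized corpus:

```python
# Sketch with gensim (not prescribed by the course); parameter names follow gensim 4.x.
from gensim.models import Word2Vec

# Tokenized, preprocessed sentences (toy corpus; real use needs far more text)
sentences = [["the", "king", "rules"], ["the", "queen", "rules"], ["a", "man", "walks"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)
vector = model.wv["king"]                        # embedding for one word
similar = model.wv.most_similar("king", topn=3)  # nearest words in the embedding space
```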
B. Comparing BOW vs w2v (Word2vec)
- BOW: very large vectors; the meaning of each value in the vector is known
- w2v: smaller vectors; values in the vector are rarely interpretable; words with similar meaning often have similar embeddings
C. Quick intro on extracting features from Images with CNNs
(covered in detail in later lessons)
- Finetuning or transfer learning
- Data augmentation
Next week: Exploratory Data Analysis (EDA) and Data Leakages
Tags: Machine Learning, Kaggle, Data Science, Artificial Intelligence, Deep Learning