
randylaosat / Price-Recommendation-Mercari-Challenge

Create a machine learning model that will suggest prices for certain items


Mercari Price Suggestion Challenge

Can you automatically suggest product prices to online sellers?


Product pricing gets even harder at scale, considering just how many products
are sold online. Clothing has strong seasonal pricing trends and is heavily
influenced by brand names, while electronics have fluctuating prices based on
product specs.

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to build an algorithm that automatically suggests the right product prices. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

Dataset Features
ID: the id of the listing
Name: the title of the listing
Item Condition: the condition of the items provided by the seller
Category Name: category of the listing
Brand Name: brand of the listing
Shipping: whether or not shipping cost was provided
Item Description: the full description of the item
Price: the price that the item was sold for. This is the target variable that you
will predict. The unit is USD.

Pricing also works on supply and demand.

Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge

Representing and Mining Text

Since text is the most unstructured form of all the available data, it contains many kinds of noise and is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text pre-processing.

Fundamental Concepts
The importance of constructing mining-friendly data representations;
Representation of text for data mining.

Important Terminologies
Document: one piece of text. It could be a single sentence, a paragraph, or even a full-page report.
Tokens: also known as terms. A token is simply a word; many tokens together form a document.
Corpus: a collection of documents.
Term Frequency (TF): measures how often a term appears in a single document.
Inverse Document Frequency (IDF): measures how a term is distributed across the corpus; rare terms receive higher IDF.

Pre-Processing Techniques
Stop Word Removal: stop words are terms that carry little to no meaning in a given text. Think of them as the "noise" of the data. Such terms include "the", "a", "an", "to", etc.

Bag of Words Representation: treats each word as a feature of the document.

TF-IDF: a common value representation of terms. It boosts or weights words that have low occurrence. For example, if the word "play" is common, it gets little to no boost; but if the word "mercari" is rare, it receives more weight.

N-grams: sequences of adjacent words treated as terms. A single word may carry little meaning by itself, but analyzing two adjacent words as a pair can add meaning (e.g., "brand new").

Stemming and Lemmatization: reduce words to a common base form. Stemming chops off word endings (e.g., "running" -> "run"), while lemmatization maps a word to its dictionary form.

Named Entity Extraction: a pre-processing technique used to recognize when word sequences constitute proper names. For example, "HP", "H-P", and "Hewlett-Packard" all represent the Hewlett-Packard Corporation.

Topic Models: a type of model that represents a set of topics derived from sequences of words.
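As a quick, hypothetical illustration of the bag-of-words and TF-IDF representations (a minimal sketch using scikit-learn, not part of the original notebook):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A toy corpus of three documents
docs = ["brand new razer keyboard",
        "new pink blouse size small",
        "brand new blouse never worn"]

# Bag of words: one column per term, each cell a raw count
bow = CountVectorizer().fit_transform(docs)

# TF-IDF with unigrams and bigrams: rarer terms receive higher weight
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

print(bow.shape, tfidf.shape)  # both are sparse (3 x vocabulary_size) matrices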
Milestone Report

A. Define the objective in business terms: The objective is to come up with a pricing algorithm that we can use to recommend prices to users.

B. How will your solution be used?: Letting users see a suggested price before buying or selling will hopefully drive more transactions within Mercari's business.

C. How should you frame this problem?: This problem can be solved using a supervised learning approach, and possibly some unsupervised learning methods as well for clustering analysis.

D. How should performance be measured?: Since it's a regression problem, a natural evaluation metric is RMSE (Root Mean Squared Error). For this competition, though, we'll be using RMSLE (Root Mean Squared Logarithmic Error), i.e. RMSE computed on log-transformed prices, which is what the rmsle function defined later implements.
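For reference, the standard definition of RMSLE for predicted prices $\hat{p}_i$ and actual prices $p_i$ over $n$ items (matching the rmsle function defined later) is:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{p}_i) - \log(1+p_i)\bigr)^2}$$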

E. Are there any other data sets that you could use?: To get a more accurate understanding and prediction for this problem, a potential dataset we could gather would include more information about the user. Features such as user location, user gender, and time could affect the price.

General Steps

1. Handle missing values: replaced "missing" values with NA.

2. Lemmatization performed on item_description: aiming to remove inflectional endings only and return the base or dictionary form of a word (a minimal sketch is shown after this list).

3. Label encoding performed on categorical values: encode labels with values between 0 and n_classes-1.

4. Tokenization: given a character sequence, tokenization is the task of chopping it up into pieces, called tokens, and removing punctuation.

5. Maximum length of all sequences has been specified.

6. Scaling performed on the target variable (price).

7. Sentiment score computed on item_description.

8. Scaling performed on item description length as well.
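The code below actually uses a Porter stemmer rather than a lemmatizer; a minimal lemmatization sketch with NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded) might look like this:

from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each token individually, then rejoin into a single string
    return ' '.join(lemmatizer.lemmatize(token) for token in text.split())

print(lemmatize_text("brand new shoes never worn boxes"))  # e.g. "brand new shoe never worn box"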

Import Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc

from string import punctuation

# vstack - adds rows, hstack - adds columns
# csr_matrix - used to handle sparse matrices
from scipy.sparse import vstack, hstack, csr_matrix

# CountVectorizer - simply counts word frequencies
# TfidfVectorizer - more importance/weight on "rare" words, less on "frequent" words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# LabelBinarizer - converts labels into a numerical representation, e.g. "G","B","R" -> [1, 2, 3]
from sklearn.preprocessing import LabelBinarizer

# Ridge - reduces multicollinearity in regression; applies L2 regularization
from sklearn.linear_model import Ridge

Import Train / Test Data

# Create training set
train = pd.read_csv('C:/Users/Randy/Desktop/training/train.tsv', sep='\t')
train.head()
train_id  name                                  item_condition_id  category_name
0         MLB Cincinnati Reds T Shirt Size XL   3                  Men/Tops/T-shirts
1         Razer BlackWidow Chroma Keyboard      3                  Electronics/Computers & Tablets/Components & P...
2         AVA-VIV Blouse                        1                  Women/Tops & Blouses/Blouse
3         Leather Horse Statues                 1                  Home/Home Décor/Home Décor Accents
4         24K GOLD plated rose                  1                  Women/Jewelry/Necklaces

(remaining columns are cut off in the original output)
# Create testing set
test = pd.read_csv('C:/Users/Randy/Desktop/training/test.tsv', sep='\t')

# Create log price variable (transformation)
y = np.log1p(train['price'])
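Prices are heavily right-skewed (many cheap items, a few very expensive ones), so the target is modeled in log space with np.log1p and later mapped back with np.expm1. A tiny round-trip check:

price = np.array([10.0, 17.0, 2000.0])
log_price = np.log1p(price)      # log(1 + price); well defined even for price = 0
back = np.expm1(log_price)       # inverse transform: exp(x) - 1
print(np.allclose(back, price))  # True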
Combine Test and Train Set

# Create combined set. You would want to apply the count vectorizer on the combined set
combined = pd.concat([train, test])

# Create the submission set (only contains the test ID)
submission = test[['test_id']]

# Create size of train
train_size = len(train)

combined.shape

(1286735, 9)

combined_ML = combined.sample(frac=0.1).reset_index(drop=True)

combined_ML.shape

(128674, 9)

Part 2: Preparing the Corpus for Analysis

a. Remove Punctuation

b. Remove Digits

c. Remove stop words

d. Lower case words

e. Lemmatization or Stemming
Remove Punctuation

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Create a list of punctuation replacements
punctuation_symbols = []
for symbol in punctuation:
    punctuation_symbols.append((symbol, ''))

punctuation_symbols

[('!', ''),
('"', ''),
('#', ''),
('$', ''),
('%', ''),
('&', ''),
("'", ''),
('(', ''),
(')', ''),
('*', ''),
('+', ''),
(',', ''),
('-', ''),
('.', ''),
('/', ''),
(':', ''),
(';', ''),
('<', ''),
('=', ''),
('>', ''),
('?', ''),
('@', ''),
('[', ''),
('\\', ''),
(']', ''),
('^', ''),
('_', ''),
('`', ''),
('{', ''),
('|', ''),
('}', ''),
('~', '')]

Create a remove punctuation method

import string
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

Remove Digits

def remove_digits(x):
    x = ''.join([i for i in x if not i.isdigit()])
    return x

Remove Stop Words

from nltk.corpus import stopwords

stop = stopwords.words('english')

def remove_stop_words(x):
    x = ' '.join([i for i in x.lower().split(' ') if i not in stop])
    return x

Lower Case Words

def to_lower(x):
    return x.lower()
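A quick, hypothetical check of these helpers chained together on a made-up description (not part of the original notebook), applied in the same order used later in the notebook:

sample = "Brand NEW Razer keyboard!! Only used 2 times."
cleaned = to_lower(remove_stop_words(remove_punctuation(remove_digits(sample))))
print(cleaned)  # digits, punctuation and stop words stripped, lower-cased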

Part 3: Explore Training Set

Missing Values:

Category_name
Brand_name
Item_description

Categorical Variables (Need to do Encoding):

name
category_name
brand_name
item_description

Check Missing Values

train.count()

train_id 593376
name 593376
item_condition_id 593376
category_name 590835
brand_name 340359
price 593376
shipping 593376
item_description 593375
dtype: int64

Check Data Types

train.dtypes

train_id int64
name object
item_condition_id int64
category_name object
brand_name object
price float64
shipping int64
item_description object
dtype: object

3a. Price Distribution


Why Do Prices Vary?

Supply and demand
Brand name
Fabric terms
"Quality"-type words (check whether quality plays a role in price)
Condition

Summary:

The mean price in the dataset is 26 dollars.
The median price in the dataset is 17 dollars.
The max price in the dataset is 2000 dollars.
Because the distribution is skewed, the median is the more reliable gauge of a typical price.

train.price.describe()

count 593376.000000
mean 26.689003
std 38.340061
min 0.000000
25% 10.000000
50% 17.000000
75% 29.000000
max 2000.000000
Name: price, dtype: float64

# Could we use these as features? Look at median price for each quantile
bins = [0, 10, 17, 29, 2001]
labels = ['q1','q2','q3','q4']
train['price_bin'] = pd.cut(train['price'], bins=bins, labels=labels)
train.groupby('price_bin')['price'].describe()

              count       mean        std   min   25%   50%   75%
price_bin
q1         149944.0   7.710178   2.083100   3.0   6.0   8.0  10.0
q2         151863.0  13.834845   1.795258  10.5  12.0  14.0  15.0
q3         144043.0  22.539551   3.335075  17.5  20.0  22.0  25.0
q4         147215.0  63.396077  63.271190  30.0  35.0  45.0  66.0

plt.figure(figsize=(12, 7))
plt.hist(train['price'], bins=50, range=[0,250], label='price')
plt.title('Price Distribution', fontsize=15)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Samples', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
shipping = train[train['shipping']==1]['price']
no_shipping = train[train['shipping']==0]['price']

plt.figure(figsize=(12,7))
plt.hist(shipping, bins=50, normed=True, range=[0,250], alpha=0.7, label='Price With Shipping')
plt.hist(no_shipping, bins=50, normed=True, range=[0,250], alpha=0.7, label='Price Without Shipping')
plt.title('Price Distribution With/Without Shipping', fontsize=15)
plt.xlabel('Price')
plt.ylabel('Normalized Samples')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()

3b. Brand Analysis

# Amount of unique brand names
train['brand_name'].nunique()

3751

# Top 20 brand distribution
b20 = train['brand_name'].value_counts()[0:20].reset_index().rename(columns={'index': 'brand_name', 'brand_name': 'count'})
ax = sns.barplot(x="brand_name", y="count", data=b20)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_title('Top 20 Brand Distribution', fontsize=15)
plt.show()

# Display top 20 expensive brands by mean price
top20_brand = train.groupby('brand_name', axis=0).mean()
df_expPrice = pd.DataFrame(top20_brand.sort_values('price', ascending=False).head(20)).reset_index()

ax = sns.barplot(x="brand_name", y="price", data=df_expPrice)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=15)
ax.set_title('Top 20 Expensive Brands', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

# Display frequency ratio of brand names
train['brand_name'].value_counts(1)
PINK 0.063659
Nike 0.062687
Victoria's Secret 0.056910
LuLaRoe 0.036462
Apple 0.020337
Nintendo 0.017958
FOREVER 21 0.017649
Lululemon 0.017047
Michael Kors 0.016415
American Eagle 0.015604
Rae Dunn 0.014676
Sephora 0.014303
Coach 0.012372
Adidas 0.012311
Bath & Body Works 0.012167
Disney 0.012014
Funko 0.011030
Under Armour 0.009807
Sony 0.009587
Old Navy 0.009208
Hollister 0.008106
Carter's 0.007698
Urban Decay 0.007383
The North Face 0.007110
Too Faced 0.006828
Xbox 0.006728
Independent 0.006711
MAC 0.006514
Brandy Melville 0.006464
Kate Spade 0.006396
...
Sock It to Me 0.000003
Cocomo 0.000003
Onque Casuals 0.000003
Kaii 0.000003
GoGroove Pal Com 0.000003
Honda 0.000003
White + Warren 0.000003
Elomi 0.000003
Mecca 0.000003
Lulu Frost 0.000003
True Rock 0.000003
Christian Lacroix 0.000003
Acne Jeans 0.000003
First Act 0.000003
Neil Allyn 0.000003
Foundry 0.000003
Dog MD 0.000003
Armani Exchange 0.000003
Oxford Golf 0.000003
Yakima 0.000003
Bacco Bucci 0.000003
Bostonian 0.000003
BedHead 0.000003
Moose Mountain 0.000003
Tootsie 0.000003
Catit 0.000003
Lisa Maree 0.000003
Ecco Bella 0.000003
Custo Barcelona 0.000003
Name: brand_name, Length: 3751, dtype: float64

3c. Category Distribution

def transform_category_name(category_name):
    try:
        main, sub1, sub2 = category_name.split('/')
        return main, sub1, sub2
    except:
        return np.nan, np.nan, np.nan

train['category_main'], train['category_sub1'], train['category_sub2'] = zip(*train['category_name'].apply(transform_category_name))

cat_train = train[['category_main', 'category_sub1', 'category_sub2', 'price']]

cat_train.head()

category_main category_sub1 category_sub2 price

0 Men Tops T-shirts 10.0

1 Electronics Computers & Tablets Components & Parts 52.0

2 Women Tops & Blouses Blouse 10.0

3 Home Home Décor Home Décor Accents 35.0

4 Women Jewelry Necklaces 44.0

3c. Main Category

Interesting findings:

Women and Beauty make up the majority of the distribution, roughly 59% combined (about 45% Women and 14% Beauty, per the output below).

Questions to ask:

Can we create a gender category (Female, Male, Neutral)? Each listing has a three-level category path; if two of the three levels point to female items we could classify the listing as female, if two point to male items we classify it as male, and otherwise we mark it neutral.
Does gender play a role in price?
Can we create an age category?

# Electronics have the highest std
train.groupby('category_main')['price'].describe()

                           count       mean        std  min   25%   50%
category_main
Beauty                   83315.0  19.727468  20.708703  0.0  10.0  15.0
Electronics              47986.0  33.763889  63.485958  0.0   9.0  15.0
Handmade                 12257.0  18.325365  27.484725  0.0   6.0  12.0
Home                     27331.0  24.845798  25.203925  0.0  12.0  18.0
Kids                     68404.0  20.664983  22.877467  0.0  10.0  14.0
Men                      37382.0  34.532369  39.729618  0.0  14.0  21.0
Other                    18251.0  20.821434  31.046225  0.0   9.0  13.0
Sports & Outdoors         9632.0  25.140365  27.388032  0.0  11.0  16.0
Vintage & Collectibles   18673.0  27.158732  52.338051  0.0  10.0  16.0
Women                   265870.0  28.843331  39.435913  0.0  12.0  19.0


# Display distribution
train['category_main'].value_counts(1)

Women 0.451315
Beauty 0.141427
Kids 0.116116
Electronics 0.081456
Men 0.063456
Home 0.046394
Vintage & Collectibles 0.031697
Other 0.030981
Handmade 0.020806
Sports & Outdoors 0.016350
Name: category_main, dtype: float64

plt.figure(figsize=(17,10))
sns.countplot(y=train['category_main'], order=train['category_main'].value_counts().index)
plt.title('Top 10 Categories', fontsize=25)
plt.ylabel('Main Category', fontsize=20)
plt.xlabel('Number of Items in Main Category', fontsize=20)
plt.show()

#main = pd.DataFrame(cat_train['category_main'].value_counts()).reset_index().re...
fig, axes = plt.subplots(figsize=(12, 7))
main = cat_train[cat_train["price"] < 100]
# Use a color palette
ax = sns.boxplot(x=main["category_main"], y=main["price"], palette="Blues")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=12)
plt.show()

# Create a "no_brand" column
train['no_brand'] = train['brand_name'].isnull()

f, ax = plt.subplots(figsize=(15, 4))
sns.countplot(y='category_main', hue='no_brand', data=train).set_title('Categories With and Without Brand Name')
plt.show()

3c. Category_2 Distribution

df = cat_train.groupby(['category_sub2'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False).head(20)

plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='r')
plt.yticks(range(0, len(df)), df['category_sub2'], fontsize=15)

plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 2', fontsize=15)
plt.title('Top 20 2nd Category (Mean Price)', fontsize=20)
plt.show()

3c. Category_1 Distribution

df = cat_train.groupby(['category_sub1'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False)[0:20]

plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='b')
plt.yticks(range(0, len(df)), df['category_sub1'], fontsize=15)

plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 1', fontsize=15)
plt.title('Top 20 1st Category (Mean Price)', fontsize=20)
plt.show()

3d. Item Description Analysis

Hypothesis:

Does length play a role in price?
Do certain descriptions signal a fake item?
Do lengthier descriptions mean more effort put into the item, more authenticity, more value?

# Remove punctuation
combined.item_description = combined.item_description.astype(str)

descr = combined[['item_description', 'price']]

descr['count'] = descr['item_description'].apply(lambda x: len(str(x)))

descr['item_description'] = descr['item_description'].apply(remove_digits)
descr['item_description'] = descr['item_description'].apply(remove_punctuation)
descr['item_description'] = descr['item_description'].apply(remove_stop_words)
descr.head(3)
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(the same SettingWithCopyWarning is emitted for each of the following assignments to descr)

   item_description                                     price  count
0  description yet                                       10.0     18
1  keyboard great condition works like came box p...     52.0    188
2  adorable top hint lace key hole back pale pink...     10.0    124

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

descr['item_description'] = descr['item_description'].apply(porter.stem)


descr.tail(10)

        item_description                                    price  count
693349  listing brand new shades piu scar...                  NaN     71
693350  rm free shipping new highly addict                    NaN     47
693351  purple boys polo shirt size old navy never worn       NaN     59
693352  express deep olive green cardigan ultra thin ...      NaN    121
693353  shade medium neutral barley us                        NaN     41
693354  flintquartz cluster self mined ✨measures xin...       NaN    243
693355  cosmetics travel bundle includes brow power un...     NaN    968
693356  new free shipping basstop cas                         NaN     31
693357  floral kimono tropical print open front hi low...     NaN     94
693358  floral scrub tops worn less times brown belt ti       NaN     71

df = descr.groupby('count')['price'].mean().reset_index()
sns.regplot(x=df["count"], y=(df["price"]))
plt.xlabel("word count")
plt.show()


Create Pre-Processing Functions

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  NaN        Men/Tops/T-shirts                                   3  No description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  This keyboard in great condition and works ...
2  Target     Women/Tops & Blouses/Blouse                         1  Adorable top with a hint of lace and a key hol...
3  NaN        Home/Home Décor/Home Décor Accents                  1  New with tags. Leather horses. Retail for [rm]...
4  NaN        Women/Jewelry/Necklaces                             1  Complete with certificate of authenticity

(remaining columns are cut off in the original output)

# handle_missing_values - Basic data imputation of missing values
def handle_missing_values(df):
    df['category_name'].fillna(value='missing', inplace=True)
    df['brand_name'].fillna(value='None', inplace=True)
    df['item_description'].fillna(value='None', inplace=True)

# to_categorical - Converts categorical features
def to_categorical(df):
    df['brand_name'] = df['brand_name'].astype('category')
    df['category_name'] = df['category_name'].astype('category')
    df['item_condition_id'] = df['item_condition_id'].astype('category')

Apply Pre-Processing Functions

# Applying the pre-processing functions
handle_missing_values(combined)
to_categorical(combined)

# Apply the pre-processing functions to the ML combined set
handle_missing_values(combined_ML)
to_categorical(combined_ML)

# Remove punctuation
combined_ML.item_description = combined_ML.item_description.astype(str)

combined_ML['item_description'] = combined_ML['item_description'].apply(remove_digits)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_punctuation)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_stop_words)
combined_ML['item_description'] = combined_ML['item_description'].apply(to_lower)

combined_ML['name'] = combined_ML['name'].apply(remove_digits)
combined_ML['name'] = combined_ML['name'].apply(remove_punctuation)
combined_ML['name'] = combined_ML['name'].apply(remove_stop_words)
combined_ML['name'] = combined_ML['name'].apply(to_lower)

combined_ML.head(3)

  brand_name  category_name                                    item_condition_id  item_description
0  Rue21      Women/Skirts/Maxi                                2  wore big size small normally wear extra small ...
1  PINK       Women/Tops & Blouses/Tank, Cami                  3  pink size xs racerback free shipping
2  LuLaRoe    Women/Athletic Apparel/Pants, Tights, Leggings   1  description yet

# Remove punctuation
combined.item_description = combined.item_description.astype(str)

combined['item_description'] = combined['item_description'].apply(remove_digits)
combined['item_description'] = combined['item_description'].apply(remove_punctuation)
combined['item_description'] = combined['item_description'].apply(remove_stop_words)
combined['item_description'] = combined['item_description'].apply(to_lower)

combined['name'] = combined['name'].apply(remove_digits)
combined['name'] = combined['name'].apply(remove_punctuation)
combined['name'] = combined['name'].apply(remove_stop_words)
combined['name'] = combined['name'].apply(to_lower)

Create three new features from Categories (Main, Sub1, Sub2)

combined.isnull().any()

brand_name False
category_name False
item_condition_id False
item_description False
name False
price True
shipping False
test_id True
train_id True
dtype: bool

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

combined_ML.head()

  brand_name  category_name                                    item_condition_id  item_description
0  Rue21      Women/Skirts/Maxi                                2  wore big size small normally wear extra small ...
1  PINK       Women/Tops & Blouses/Tank, Cami                  3  pink size xs racerback free shipping
2  LuLaRoe    Women/Athletic Apparel/Pants, Tights, Leggings   1  description yet
3  None       Beauty/Hair Care/Shampoo & Conditioner Sets      1  silk express shampoo silk conditioner leave co...
4  Sephora    Beauty/Makeup/Face                               1  deluxe samples ysl smashbox hourglass biossance

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

Apply CountVectorizer / TfidfVectorizer / LabelBinarizer

Encode labels into categorical variables: pandas factorize and scikit-learn LabelEncoder. The result has 1 dimension.

Encode a categorical variable into dummy/indicator (binary) variables: pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.
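A small, hypothetical side-by-side of the two encoding styles (the notebook itself uses LabelBinarizer and get_dummies below):

s = pd.Series(['Nike', 'PINK', 'Nike', 'Apple'])

# Label encoding: a single column of integer codes (1 dimension)
codes, uniques = pd.factorize(s)
print(codes)    # e.g. [0 1 0 2]
print(uniques)  # Index(['Nike', 'PINK', 'Apple'], dtype='object')

# One-hot / dummy encoding: one binary column per distinct value (n dimensions)
print(pd.get_dummies(s))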

Create New Feature (Binning Price Into Two Categories)

#bins = [0, 64, 5000]
#labels = ['less','more']
#combined['lt65'] = pd.cut(combined['price'], bins=bins, labels=labels)

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

# Apply CountVectorizer to "name"; this converts it into a sparse matrix
cv = CountVectorizer(min_df=10)
X_name = cv.fit_transform(combined['name'])
X_name

<1286735x15973 sparse matrix of type '<class 'numpy.int64'>'
	with 4789374 stored elements in Compressed Sparse Row format>
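min_df=10 here means terms that appear in fewer than 10 listings are dropped from the vocabulary, which is what keeps the name matrix down to roughly 16k columns. A toy illustration (hypothetical data):

names = ["razer keyboard", "pink blouse", "razer mouse", "rare brandxyz item"]

print(len(CountVectorizer().fit(names).vocabulary_))          # 8 terms: everything kept
print(len(CountVectorizer(min_df=2).fit(names).vocabulary_))  # 1 term: only "razer" appears in >= 2 documents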

# Apply CountVectorizer to "category_name"; this converts it into a sparse matrix
cv = CountVectorizer()
X_category = cv.fit_transform(combined['category_name'])
#X_sub1 = cv.fit_transform(combined['sub_category_1'])
#X_sub2 = cv.fit_transform(combined['sub_category_2'])
X_category

<1286735x1007 sparse matrix of type '<class 'numpy.int64'>'
	with 5165431 stored elements in Compressed Sparse Row format>

# Apply TF-IDF to "item_description"
tv = TfidfVectorizer(max_features=55000, ngram_range=(1, 2), stop_words='english')
X_description = tv.fit_transform(combined['item_description'])

# Apply LabelBinarizer to "brand_name"
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(combined['brand_name'])

Create CSR Matrix & Merge the Sparse Matrices

# Create our final sparse matrix
X_dummies = csr_matrix(pd.get_dummies(combined[['item_condition_id', 'shipping']]).values)

# Combine everything together
sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()

Train and Test Split

X_train_sparse = sparse_merge[:train_size]
X_test = sparse_merge[train_size:]

#X_train = sparse_merge[:len(combined_ML)]
#X_test = sparse_merge[len(combined_ML):]

combined.columns

Index(['brand_name', 'category_name', 'item_condition_id', 'item_description',
       'name', 'price', 'shipping', 'test_id', 'train_id'],
      dtype='object')

Cross Validation

from sklearn.cross_validation import KFold

eval_size = .10
kf = KFold(len(y), round(1. / eval_size))
train_indicies, valid_indicies = next(iter(kf))
X_train, y_train = X_train_sparse[train_indicies], y[train_indicies]
X_valid, y_valid = X_train_sparse[valid_indicies], y[valid_indicies]

C:\Users\Randy\Anaconda3\lib\site-
packages\sklearn\cross_validation.py:44: DeprecationWarning: This
module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved.
Also note that the interface of the new CV iterators are different from
that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

Create RMSLE Function

Since the errors are squared before they are averaged, RMSE gives relatively high weight to large errors. This means RMSE is most useful when large errors are particularly undesirable.

RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases, for example if being off by 10 is more than twice as bad as being off by 5. But if being off by 10 is just twice as bad as being off by 5, then MAE is more appropriate.

def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2)))

Train with LGBM

The reason I used this algorithm is that it's a good model to use on big data sets.

It has:

fast training speed and high efficiency
low memory usage
good accuracy
good compatibility with large datasets

The RMSLE of LGBM is: 0.5406

import lightgbm as lgb

d_train = lgb.Dataset(X_train, label=y_train)

params = {}
#params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['metric'] = 'rmse'

clf = lgb.train(params, d_train, 100)

# Prediction
lgbm_pred = clf.predict(X_valid)

# Transform the prices back to their original scale
np.expm1(lgbm_pred)

array([ 10.7214129 ,  29.60000413,  13.02518987, ...,  12.85446125,
        19.31981685,  15.34901857])

import time
start_time = time.time()
print('[{}] LGBM completed.'.format(time.time() - start_time))
print("LGBM rmsle: "+str(rmsle(np.expm1(y_valid), np.expm1(lgbm_pred))))

[0.0] LGBM completed.
LGBM rmsle: 0.540597319376
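If you want to watch validation error while boosting, lgb.train also accepts validation sets; a possible variation (same params as above, purely optional) is:

d_valid = lgb.Dataset(X_valid, label=y_valid, reference=d_train)

# Track RMSE on the held-out fold during the 100 boosting rounds
clf = lgb.train(params, d_train, num_boost_round=100,
                valid_sets=[d_train, d_valid], valid_names=['train', 'valid'])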

Train with Ridge Regression

The RMSLE of Ridge Regression is: 0.4829

import time

start_time = time.time()

model = Ridge(solver="sag", fit_intercept=False)

print("Fitting Ridge Model")
model.fit(X_train, y_train)

preds_valid = model.predict(X_valid)

print('[{}] Ridge completed.'.format(time.time() - start_time))
print("Ridge rmsle: " + str(rmsle(np.expm1(y_valid), np.expm1(preds_valid))))

Fitting Ridge Model
[32.998536586761475] Ridge completed.
Ridge rmsle: 0.482907420753

np.expm1(preds_valid)

array([  9.09673618,  83.84303118,  11.78868638, ...,  12.66779351,
        23.59042071,  11.21630299])

Interesting Note
The feature 'lt65' that I created made a significant impact on the model's performance. I binned the items into one of two categories based on their price: 'Less than 65' or 'More than 65'.

The Ridge Regression model's RMSLE dropped from .4829 to .4215 with the addition of this feature.

The LGBM model's RMSLE dropped from .5406 to .4533.

Predict on Test Set

# Predicting on the never-seen test set
preds = model.predict(X_test)

submission["price"] = np.expm1(preds)
submission.to_csv("submission_ridge.csv", index=False)

C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:4:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.

submission

test_id price

0 0 11.162749

1 1 12.555600

2 2 53.157534

3 3 17.925542

4 4 7.363347

5 5 9.959583

6 6 9.521093

7 7 33.185204

8 8 45.666661

9 9 6.283195

10 10 52.478731

11 11 9.582656

12 12 33.508056
13 13 49.353728

14 14 24.605690

15 15 8.701512

16 16 24.601817

17 17 17.042603

18 18 41.234336

19 19 7.499206

20 20 6.479075

21 21 10.071822

22 22 11.011129

23 23 13.974448

24 24 43.893530

25 25 7.502994

26 26 20.210556

27 27 8.181354

28 28 53.241879

29 29 7.257089

... ... ...

693329 693329 28.354434

693330 693330 20.699427

693331 693331 14.273625

693332 693332 43.814946

693333 693333 16.261933

693334 693334 81.738151

693335 693335 6.648241

693336 693336 15.763434

693337 693337 17.041259


693338 693338 14.168938

693339 693339 13.545883

693340 693340 3.973624

693341 693341 219.385862

693342 693342 4.910538

693343 693343 17.301045

693344 693344 21.939166

693345 693345 13.895224

693346 693346 29.366304

693347 693347 45.978350

693348 693348 67.798859

693349 693349 10.384145

693350 693350 8.716605

693351 693351 9.640998

693352 693352 14.098605

693353 693353 14.731720

693354 693354 19.804941

693355 693355 26.429490

693356 693356 6.252158

693357 693357 15.377224

693358 693358 9.742211

693359 rows × 2 columns

LeaderBoard Result (Top 36%)

<img src="http://i63.tinypic.com/14ccuv6.jpg" />
Conclusion

I am happy to have done this competition because it opened up my mind to the realm of NLP and showed me how many pre-processing steps are involved for text data. I learned the most common steps for text pre-processing, which prepares me for future work whenever I'm up against text data again. Another concept I really learned to value more is the choice of algorithm and how important computation is when you're dealing with large datasets. It took a couple of minutes just to run some of the data visualizations and modeling. Text data is everywhere and it can get messy. Understanding the fundamentals of how to tackle these problems will definitely help me in the future.
