Pricing Mercari
Create a machine learning model that will suggest prices for certain items
Dataset Features
ID: the id of the listing
Name: the title of the listing
Item Condition: the condition of the items provided by the seller
Category Name: category of the listing
Brand Name: brand of the listing
Shipping: whether or not shipping cost was provided
Item Description: the full description of the item
Price: the price that the item was sold for. This is the target variable that you
will predict. The unit is USD.
Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge
Since text is the most unstructured form of all the available data, various types of noise are present in it, and the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text pre-processing.
Fundamental Concepts
The importance of constructing mining-friendly data representations;
Representation of text for data mining.
Important Terminologies
Document: One piece of text. It could be a single sentence, a paragraph, or even a full-page report.
Tokens: Also known as terms; a token is simply a word. Many tokens together form a document.
Corpus: A collection of documents.
Term Frequency (TF): measures how often a term appears in a single document.
Inverse Document Frequency (IDF): measures how a term is distributed across the corpus; terms that appear in fewer documents get higher IDF scores. The two are commonly combined into TF-IDF, as in the sketch below.
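A minimal sketch of how TF and IDF combine in practice, using scikit-learn's TfidfVectorizer (the toy corpus here is an illustration, not part of the project data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of three "documents" (illustrative only)
corpus = [
    "blue shirt size xl",
    "blue leather jacket",
    "gold plated necklace",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())  # learned vocabulary (sklearn >= 1.0)
print(X.toarray().round(2))                # TF-IDF weight of each term per document
```

"blue" appears in two of the three documents, so it receives a lower IDF weight than "gold" or "necklace".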
Pre-Processing Techniques
Stop Word Removal: stop words are terms that carry little to no meaning in a given text. Think of them as the "noise" of the data. Such terms include "the", "a", "an", and "to".
Topic Models: a type of model that infers a set of topics from a sequence of words; see the sketch below.
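A minimal topic-model sketch with scikit-learn's LatentDirichletAllocation (the corpus and component count are illustrative assumptions, not the project's setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus (not from the dataset)
docs = [
    "gold necklace jewelry gift",
    "mechanical keyboard rgb gaming",
    "necklace gold plated chain",
    "gaming keyboard mouse bundle",
]

cv = CountVectorizer()
dtm = cv.fit_transform(docs)  # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Print the top three words for each inferred topic
terms = cv.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]
    print("topic", k, ":", [terms[i] for i in top])
```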
Milestone Report
A. Define the objective in business terms: The objective is to come up with the right pricing algorithm that we can use to recommend prices to users.
B. How will your solution be used?: Allowing users to see a suggested price before purchasing or selling will hopefully drive more transactions within Mercari's business.
C. How should you frame this problem?: This problem can be solved using a supervised learning approach, and possibly some unsupervised learning methods as well, such as clustering analysis.
E. Are there any other data sets that you could use?: To get a more accurate understanding and prediction for this problem, a potential dataset we could gather would contain more information about the user. Features such as user location, user gender, and time of listing could affect the price.
General Steps
Import Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
| | train_id | name | item_condition_id | category_name |
|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces |
# Create combined set. Fit the count vectorizer on the combined set so train and test share one vocabulary
combined = pd.concat([train,test])
combined.shape
(1286735, 9)
combined_ML = combined.sample(frac=0.1).reset_index(drop=True)
combined_ML.shape
(128674, 9)
a. Remove Punctuation
b. Remove Digits
e. Lemmatization or Stemming
Remove Punctuation
punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
punctuation_symbols
[('!', ''), ('"', ''), ('#', ''), ('$', ''), ('%', ''), ..., ('~', '')]
(one (symbol, '') pair for each character in string.punctuation)
import string

def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))
Remove Digits
def remove_digits(x):
    x = ''.join([i for i in x if not i.isdigit()])
    return x
from nltk.corpus import stopwords

stop = stopwords.words('english')

def remove_stop_words(x):
    x = ' '.join([i for i in x.lower().split(' ') if i not in stop])
    return x
def to_lower(x):
    return x.lower()
Missing Values: the category_name, brand_name, and item_description columns contain missing values.
train.count()
train_id 593376
name 593376
item_condition_id 593376
category_name 590835
brand_name 340359
price 593376
shipping 593376
item_description 593375
dtype: int64
train.dtypes
train_id int64
name object
item_condition_id int64
category_name object
brand_name object
price float64
shipping int64
item_description object
dtype: object
Summary:
train.price.describe()
count 593376.000000
mean 26.689003
std 38.340061
min 0.000000
25% 10.000000
50% 17.000000
75% 29.000000
max 2000.000000
Name: price, dtype: float64
# Could we use these as features? Look at summary statistics for each price quartile
bins = [0, 10, 17, 29, 2001]
labels = ['q1','q2','q3','q4']
train['price_bin'] = pd.cut(train['price'], bins=bins, labels=labels)
train.groupby('price_bin')['price'].describe()
| price_bin | count | mean | std | min | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|
| q3 | 144043.0 | 22.539551 | 3.335075 | 17.5 | 20.0 | 22.0 | 25.0 |
plt.figure(figsize=(12, 7))
plt.hist(train['price'], bins=50, range=[0,250], label='price')
plt.title('Price Distribution', fontsize=15)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Samples', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
shipping = train[train['shipping']==1]['price']
no_shipping = train[train['shipping']==0]['price']
plt.figure(figsize=(12,7))
plt.hist(shipping, bins=50, density=True, range=[0,250], alpha=0.7, label='Price with shipping')
plt.hist(no_shipping, bins=50, density=True, range=[0,250], alpha=0.7, label='Price without shipping')
plt.title('Price Distribution With/Without Shipping', fontsize=15)
plt.xlabel('Price')
plt.ylabel('Normalized Samples')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
def transform_category_name(category_name):
    try:
        main, sub1, sub2 = category_name.split('/')
        return main, sub1, sub2
    except (AttributeError, ValueError):
        # NaN categories or categories without exactly three levels
        return np.nan, np.nan, np.nan
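cat_train used below is not constructed anywhere in this excerpt; a plausible sketch using the helper above (the column names category_main/sub1/sub2 are inferred from the later plots):

```python
# Split the three-level category into separate columns (construction assumed)
train['category_main'], train['category_sub1'], train['category_sub2'] = \
    zip(*train['category_name'].apply(transform_category_name))

cat_train = train[['category_main', 'category_sub1', 'category_sub2', 'price']]
```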
cat_train.head()
(output preview: category_main, category_sub1, category_sub2, price)
Interesting findings:
Questions to ask:
| category_main | count | mean | std | min | 25% | 50% |
|---|---|---|---|---|---|---|
| Sports & Outdoors | 9632.0 | 25.140365 | 27.388032 | 0.0 | 11.0 | 16.0 |
| Vintage & Collectibles | 18673.0 | 27.158732 | 52.338051 | 0.0 | 10.0 | 16.0 |
Women 0.451315
Beauty 0.141427
Kids 0.116116
Electronics 0.081456
Men 0.063456
Home 0.046394
Vintage & Collectibles 0.031697
Other 0.030981
Handmade 0.020806
Sports & Outdoors 0.016350
Name: category_main, dtype: float64
plt.figure(figsize=(17,10))
sns.countplot(y=train['category_main'], order=train['category_main'].value_counts().index)
plt.title('Top 10 Categories', fontsize = 25)
plt.ylabel('Main Category', fontsize = 20)
plt.xlabel('Number of Items in Main Category', fontsize = 20)
plt.show()
fig, axes = plt.subplots(figsize=(12, 7))
main = cat_train[cat_train["price"] < 100]
# Use a color palette
ax = sns.boxplot(x=main["category_main"], y=main["price"], palette="Blues")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=12)
plt.show()
# Create a "no_brand" column
train['no_brand'] = train['brand_name'].isnull()
f, ax = plt.subplots(figsize=(15, 4))
sns.countplot(y='category_main', hue='no_brand', data=train).set_title('Category Counts With/Without Brand Name')
plt.show()
df = cat_train.groupby(['category_sub2'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False)[0:20]
plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='r')
plt.yticks(range(0, len(df)), df['category_sub2'], fontsize=15)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 2', fontsize=15)
plt.title('Top 20 2nd Category (Mean Price)', fontsize=20)
plt.show()
df = cat_train.groupby(['category_sub1'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False)[0:20]
plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='b')
plt.yticks(range(0, len(df)), df['category_sub1'], fontsize=15)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 1', fontsize=15)
plt.title('Top 20 1st Category (Mean Price)', fontsize=20)
plt.show()
Hypothesis:
# Remove Punctuation
combined.item_description = combined.item_description.astype(str)
descr['item_description'] = descr['item_description'].apply(remove_digits)
descr['item_description'] = descr['item_description'].apply(remove_punctuation)
descr['item_description'] = descr['item_description'].apply(remove_stop_words)
descr.head(3)
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:9:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
| | item_description | price | count |
|---|---|---|---|
| 0 | description yet | 10.0 | 18 |
| 1 | keyboard great condition works like came box p... | 52.0 | 188 |
| 2 | adorable top hint lace key hole back pale pink... | 10.0 | 124 |
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
# Stem token by token; PorterStemmer.stem operates on a single word
descr['item_description'] = descr['item_description'].apply(
    lambda s: ' '.join(porter.stem(w) for w in s.split()))
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
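Lemmatization was listed earlier as an alternative to stemming; a minimal sketch with NLTK's WordNetLemmatizer (assumes the WordNet corpus has been downloaded):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet corpus
lemmatizer = WordNetLemmatizer()

def lemmatize_sentence(sentence: str) -> str:
    # lemmatize() operates on a single word, so go token by token
    return ' '.join(lemmatizer.lemmatize(w) for w in sentence.split())

print(lemmatize_sentence("leather horse statues"))  # -> "leather horse statue"
```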
descr.tail(10)
| | item_description | price | count |
|---|---|---|---|
| 693351 | purple boys polo shirt size old navy never worn | NaN | 59 |
| 693352 | express deep olive green cardigan ultra thin ... | NaN | 121 |
| 693358 | floral scrub tops worn less times brown belt ti | NaN | 71 |
df = descr.groupby('count')['price'].mean().reset_index()
sns.regplot(x=df["count"], y=df["price"])
plt.xlabel("word count")
plt.show()
combined.head()
| | brand_name | category_name | item_condition_id | item_description |
|---|---|---|---|---|
| 0 | NaN | Men/Tops/T-shirts | 3 | No description yet |
| 1 | Razer | Electronics/Computers & Tablets/Components & P... | 3 | This keyboard is in great condit... and works ... |
| 2 | Target | Women/Tops & Blouses/Blouse | 1 | Adorable top with a hint of lace and a key hol... |
| 4 | NaN | Women/Jewelry/Necklaces | 1 | Complete with certificate of authenticity |
# Remove punctuation, digits, and stop words, then lowercase
combined_ML.item_description = combined_ML.item_description.astype(str)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_digits)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_punctuation)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_stop_words)
combined_ML['item_description'] = combined_ML['item_description'].apply(to_lower)
combined_ML['name'] = combined_ML['name'].apply(remove_digits)
combined_ML['name'] = combined_ML['name'].apply(remove_punctuation)
combined_ML['name'] = combined_ML['name'].apply(remove_stop_words)
combined_ML['name'] = combined_ML['name'].apply(to_lower)
combined_ML.head(3)
| | brand_name | category_name | item_condition_id | item_description |
|---|---|---|---|---|
| 2 | LuLaRoe | Women/Athletic Apparel/Pants, Tights, Leggings | 1 | description yet |
# Remove punctuation, digits, and stop words, then lowercase
combined.item_description = combined.item_description.astype(str)
combined['item_description'] = combined['item_description'].apply(remove_digits)
combined['item_description'] = combined['item_description'].apply(remove_punctuation)
combined['item_description'] = combined['item_description'].apply(remove_stop_words)
combined['item_description'] = combined['item_description'].apply(to_lower)
combined['name'] = combined['name'].apply(remove_digits)
combined['name'] = combined['name'].apply(remove_punctuation)
combined['name'] = combined['name'].apply(remove_stop_words)
combined['name'] = combined['name'].apply(to_lower)
combined.isnull().any()
brand_name False
category_name False
item_condition_id False
item_description False
name False
price True
shipping False
test_id True
train_id True
dtype: bool
combined.head()
| | brand_name | category_name | item_condition_id | item_description |
|---|---|---|---|---|
| 1 | Razer | Electronics/Computers & Tablets/Components & P... | 3 | keyboard great condition works like came box |
| 2 | Target | Women/Tops & Blouses/Blouse | 1 | adorable top hint lace key hole back pale pink |
| 4 | None | Women/Jewelry/Necklaces | 1 | complete certificate authenticity |
combined_ML.head()
| | brand_name | category_name | item_condition_id | item_description |
|---|---|---|---|---|
| 1 | PINK | Women/Tops & Blouses/Tank, Cami | 3 | pink size xs racerback free shipping |
| 2 | LuLaRoe | Women/Athletic Apparel/Pants, Tights, Leggings | 1 | description yet |
| 3 | None | Beauty/Hair Care/Shampoo & Conditioner Sets | 1 | silk express shampoo silk conditioner leave co... |
| 4 | Sephora | Beauty/Makeup/Face | 1 | deluxe samples ysl smashbox hourglass biossance |
combined.head()
| | brand_name | category_name | item_condition_id | item_description |
|---|---|---|---|---|
| 0 | None | Men/Tops/T-shirts | 3 | description yet |
| 1 | Razer | Electronics/Computers & Tablets/Components & P... | 3 | keyboard great condition works like came box |
| 2 | Target | Women/Tops & Blouses/Blouse | 1 | adorable top hint lace key hole back pale pink |
| 4 | None | Women/Jewelry/Necklaces | 1 | complete certificate authenticity |
The result will have n dimensions, one per distinct value of the encoded categorical variable.
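For example, with pandas' get_dummies (the toy frame is an illustration, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'shipping': [0, 1, 0],
                    'category_main': ['Men', 'Women', 'Men']})

# category_main has 2 distinct values, so it expands into 2 indicator columns
print(pd.get_dummies(toy, columns=['category_main']))
```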
X_train = sparse_merge[:train_size]
X_test = sparse_merge[train_size:]
#X_train = sparse_merge[:len(combined_ML)]
#X_test = sparse_merge[len(combined_ML):]
combined.columns
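sparse_merge used above is not built in this excerpt; a plausible construction vectorizes the text columns and stacks them with the remaining features, in the spirit of the earlier comment about fitting a count vectorizer on the combined set. Every name and parameter here is an assumption, not the author's exact pipeline:

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# Bag-of-words on the cleaned name, TF-IDF on the cleaned description
X_name = CountVectorizer(min_df=10).fit_transform(combined['name'])
X_descr = TfidfVectorizer(max_features=50000, ngram_range=(1, 2)).fit_transform(
    combined['item_description'])

# One sparse indicator column per brand
X_brand = LabelBinarizer(sparse_output=True).fit_transform(
    combined['brand_name'].fillna('missing'))

# One-hot encode condition and shipping
X_dummies = csr_matrix(pd.get_dummies(
    combined[['item_condition_id', 'shipping']],
    columns=['item_condition_id', 'shipping']).values)

sparse_merge = hstack((X_dummies, X_descr, X_brand, X_name)).tocsr()
train_size = len(train)  # rows of combined are train followed by test
```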
Cross Validation
C:\Users\Randy\Anaconda3\lib\site-
packages\sklearn\cross_validation.py:44: DeprecationWarning: This
module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved.
Also note that the interface of the new CV iterators are different from
that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
Since the errors are squared before they are averaged, RMSE gives relatively high weight to large errors. This means RMSE is more useful when large errors are particularly undesirable.
RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases: for example, if being off by 10 is more than twice as bad as being off by 5. But if being off by 10 is just twice as bad as being off by 5, then MAE is more appropriate.
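The competition itself scores on RMSLE (root mean squared logarithmic error). The rmsle helper used below is not defined in this excerpt; a minimal version consistent with how it is called (on raw prices, after np.expm1):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error on raw (untransformed) prices
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```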
The reason I used this algorithm (LightGBM) is that it is a good model for big datasets. It has fast training speed and low memory usage.
params = {}
#params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['metric'] = 'rmse'
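clf in the prediction cell below is never defined in this excerpt; a minimal LightGBM training sketch with the params above (dataset names, round count, and early stopping are assumptions, using the pre-3.x LightGBM API that matches this notebook's era):

```python
import lightgbm as lgb

# Wrap the matrices in LightGBM datasets (X_tr/y_tr etc. assumed from the split above)
d_train = lgb.Dataset(X_tr, label=y_tr)
d_valid = lgb.Dataset(X_valid, label=y_valid)

clf = lgb.train(params, d_train,
                num_boost_round=3000,
                valid_sets=[d_train, d_valid],
                early_stopping_rounds=50,   # pre-3.x API; newer versions use callbacks
                verbose_eval=500)
```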
# Prediction
import time

start_time = time.time()
lgbm_pred = clf.predict(X_valid)
print('[{}] LGBM completed.'.format(time.time() - start_time))
print("LGBM rmsle: " + str(rmsle(np.expm1(y_valid), np.expm1(lgbm_pred))))
start_time = time.time()
preds_valid = model.predict(X_valid)
np.expm1(preds_valid)  # map log1p-scale predictions back to prices
Interesting Note
The feature 'lt65' that I created made a significant impact on the model's performance. I binned items into one of two categories based on price: 'Less than 65' or 'More than 65'.
The Ridge Regression model's RMSLE dropped from 0.4829 to 0.4215 with the addition of this feature; a sketch of the bin is shown below.
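The 65 threshold comes from the note above; the exact construction is an assumption:

```python
# 1 if the item sold for less than 65 USD, else 0 (construction assumed)
train['lt65'] = (train['price'] < 65).astype(int)
```

Note that this bin is derived from the target itself, so on unseen test data it would have to be predicted or approximated rather than computed directly.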
submission["price"] = np.expm1(preds)
submission.to_csv("submission_ridge.csv", index = False)
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:4:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
submission
test_id price
0 0 11.162749
1 1 12.555600
2 2 53.157534
3 3 17.925542
4 4 7.363347
5 5 9.959583
6 6 9.521093
7 7 33.185204
8 8 45.666661
9 9 6.283195
10 10 52.478731
11 11 9.582656
12 12 33.508056
13 13 49.353728
14 14 24.605690
15 15 8.701512
16 16 24.601817
17 17 17.042603
18 18 41.234336
19 19 7.499206
20 20 6.479075
21 21 10.071822
22 22 11.011129
23 23 13.974448
24 24 43.893530
25 25 7.502994
26 26 20.210556
27 27 8.181354
28 28 53.241879
29 29 7.257089
I am happy to have done this competition because it opened my mind to the realm of NLP and showed me how many pre-processing steps are involved for text data. I learned the most common steps for text pre-processing, which prepared me for future work whenever I face text data again. Another concept I learned to value more is the choice of algorithm and how important computation is when dealing with large datasets; it took several minutes just to run some of the data visualizations and modeling. Text data is everywhere and it can get messy. Understanding the fundamentals of how to tackle these problems will definitely help me in the future.