
randylaosat / Price-Recommendation-Mercari-Challenge

Create a machine learning model that will suggest prices for certain items


Mercari Price Suggestion Challenge

Can you automatically suggest product prices to online sellers?


Product pricing gets even harder at scale, considering just how many products
are sold online. Clothing has strong seasonal pricing trends and is heavily
influenced by brand names, while electronics have fluctuating prices based on
product specs.

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to build an algorithm that automatically suggests the right product prices. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

Dataset Features
ID: the id of the listing
Name: the title of the listing
Item Condition: the condition of the items provided by the seller
Category Name: category of the listing
Brand Name: brand of the listing
Shipping: whether or not shipping cost was provided
Item Description: the full description of the item
Price: the price that the item was sold for. This is the target variable that you
will predict. The unit is USD.

Pricing also works on supply and demand.

Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge

Representing and Mining Text

Since text is the most unstructured form of all the available data, it contains many kinds of noise and is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text pre-processing.

Fundamental Concepts
The importance of constructing mining-friendly data representations;
Representation of text for data mining.

Important Terminologies
Document: one piece of text. It could be a single sentence, a paragraph, or even a full-page report.
Tokens: also known as terms. A token is simply a word; many tokens together form a document.
Corpus: a collection of documents.
Term Frequency (TF): measures how often a term appears in a single document.
Inverse Document Frequency (IDF): measures how a term is distributed across the corpus; rare terms receive higher IDF.

Pre-Processing Techniques
Stop Word Removal: stop words are terms that carry little to no meaning in a given text. Think of them as the "noise" of the data. Such terms include "the", "a", "an", "to", etc.

Bag of Words Representation: treats each word as a feature of the document.

TF-IDF: a common value representation of terms. It boosts or weights words that have low occurrence. For example, if the word "play" is common, it gets little to no boost; but if the word "mercari" is rare, it receives more weight.

N-grams: sequences of adjacent words treated as terms. A single word may carry little meaning by itself, but analyzing two adjacent words as a pair can add meaning (e.g., "brand new").

Stemming and Lemmatization: reduce words to a common base form. Stemming chops off word endings (e.g., "running" -> "run"), while lemmatization maps a word to its dictionary form.

Named Entity Extraction: a pre-processing technique used to recognize when word sequences constitute proper names. For example, "HP", "H-P", and "Hewlett-Packard" all represent the Hewlett-Packard Corporation.

Topic Models: a type of model that represents a set of topics derived from sequences of words.
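As a quick, hypothetical illustration of the bag-of-words and TF-IDF representations (a minimal sketch using scikit-learn, not part of the original notebook):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A toy corpus of three documents
docs = ["brand new razer keyboard",
        "new pink blouse size small",
        "brand new blouse never worn"]

# Bag of words: one column per term, each cell a raw count
bow = CountVectorizer().fit_transform(docs)

# TF-IDF with unigrams and bigrams: rarer terms receive higher weight
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

print(bow.shape, tfidf.shape)  # both are sparse (3 x vocabulary_size) matrices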
Milestone Report

A. Define the objective in business terms: The objective is to come up with a pricing algorithm that we can use to recommend prices to users.

B. How will your solution be used?: Letting users see a suggested price before buying or selling will hopefully drive more transactions within Mercari's business.

C. How should you frame this problem?: This problem can be solved using a supervised learning approach, and possibly some unsupervised learning methods as well for clustering analysis.

D. How should performance be measured?: Since it's a regression problem, a natural evaluation metric is RMSE (Root Mean Squared Error). For this competition, though, we'll be using RMSLE (Root Mean Squared Logarithmic Error), i.e. RMSE computed on log-transformed prices, which is what the rmsle function defined later implements.
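For reference, the standard definition of RMSLE for predicted prices $\hat{p}_i$ and actual prices $p_i$ over $n$ items (matching the rmsle function defined later) is:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{p}_i) - \log(1+p_i)\bigr)^2}$$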

E. Are there any other data sets that you could use?: To get a more accurate understanding and prediction for this problem, a potential dataset we could gather would include more information about the user. Features such as user location, user gender, and time could affect the price.

General Steps

1. Handle missing values: replaced "missing" values with NA.

2. Lemmatization performed on item_description: aiming to remove inflectional endings only and return the base or dictionary form of a word (a minimal sketch is shown after this list).

3. Label encoding performed on categorical values: encode labels with values between 0 and n_classes-1.

4. Tokenization: given a character sequence, tokenization is the task of chopping it up into pieces, called tokens, and removing punctuation.

5. Maximum length of all sequences has been specified.

6. Scaling performed on the target variable (price).

7. Sentiment score computed on item_description.

8. Scaling performed on item description length as well.
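The code below actually uses a Porter stemmer rather than a lemmatizer; a minimal lemmatization sketch with NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded) might look like this:

from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each token individually, then rejoin into a single string
    return ' '.join(lemmatizer.lemmatize(token) for token in text.split())

print(lemmatize_text("brand new shoes never worn boxes"))  # e.g. "brand new shoe never worn box"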

Import Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc

from string import punctuation

# vstack - adds rows, hstack - adds columns
# csr_matrix - used to handle sparse matrices
from scipy.sparse import vstack, hstack, csr_matrix

# CountVectorizer - simply counts word frequencies
# TfidfVectorizer - more importance/weight on "rare" words, less on "frequent" words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# LabelBinarizer - converts labels into a numerical representation, e.g. "G","B","R" -> [1, 2, 3]
from sklearn.preprocessing import LabelBinarizer

# Ridge - reduces multicollinearity in regression; applies L2 regularization
from sklearn.linear_model import Ridge

Import Train / Test Data

# Create training set
train = pd.read_csv('C:/Users/Randy/Desktop/training/train.tsv', sep='\t')
train.head()
train_id  name                                  item_condition_id  category_name
0         MLB Cincinnati Reds T Shirt Size XL   3                  Men/Tops/T-shirts
1         Razer BlackWidow Chroma Keyboard      3                  Electronics/Computers & Tablets/Components & P...
2         AVA-VIV Blouse                        1                  Women/Tops & Blouses/Blouse
3         Leather Horse Statues                 1                  Home/Home Décor/Home Décor Accents
4         24K GOLD plated rose                  1                  Women/Jewelry/Necklaces

(remaining columns are cut off in the original output)
# Create testing set
test = pd.read_csv('C:/Users/Randy/Desktop/training/test.tsv', sep='\t')

# Create log price variable (transformation)
y = np.log1p(train['price'])
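Prices are heavily right-skewed (many cheap items, a few very expensive ones), so the target is modeled in log space with np.log1p and later mapped back with np.expm1. A tiny round-trip check:

price = np.array([10.0, 17.0, 2000.0])
log_price = np.log1p(price)      # log(1 + price); well defined even for price = 0
back = np.expm1(log_price)       # inverse transform: exp(x) - 1
print(np.allclose(back, price))  # True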
Combine Test and Train Set

# Create combined set. You would want to apply the count vectorizer on the combined set
combined = pd.concat([train, test])

# Create the submission set (only contains the test ID)
submission = test[['test_id']]

# Create size of train
train_size = len(train)

combined.shape

(1286735, 9)

combined_ML = combined.sample(frac=0.1).reset_index(drop=True)

combined_ML.shape

(128674, 9)

Part 2: Preparing the Corpus for Analysis

a. Remove Punctuation

b. Remove Digits

c. Remove stop words

d. Lower case words

e. Lemmatization or Stemming
Remove Punctuation

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Create a list of punctuation replacements
punctuation_symbols = []
for symbol in punctuation:
    punctuation_symbols.append((symbol, ''))

punctuation_symbols

[('!', ''),
('"', ''),
('#', ''),
('$', ''),
('%', ''),
('&', ''),
("'", ''),
('(', ''),
(')', ''),
('*', ''),
('+', ''),
(',', ''),
('-', ''),
('.', ''),
('/', ''),
(':', ''),
(';', ''),
('<', ''),
('=', ''),
('>', ''),
('?', ''),
('@', ''),
('[', ''),
('\\', ''),
(']', ''),
('^', ''),
('_', ''),
('`', ''),
('{', ''),
('|', ''),
('}', ''),
('~', '')]

Create a remove punctuation method

import string
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

Remove Digits

def remove_digits(x):
    x = ''.join([i for i in x if not i.isdigit()])
    return x

Remove Stop Words

from nltk.corpus import stopwords

stop = stopwords.words('english')

def remove_stop_words(x):
    x = ' '.join([i for i in x.lower().split(' ') if i not in stop])
    return x

Lower Case Words

def to_lower(x):
    return x.lower()
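A quick, hypothetical check of these helpers chained together on a made-up description (not part of the original notebook), applied in the same order used later in the notebook:

sample = "Brand NEW Razer keyboard!! Only used 2 times."
cleaned = to_lower(remove_stop_words(remove_punctuation(remove_digits(sample))))
print(cleaned)  # digits, punctuation and stop words stripped, lower-cased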

Part 3: Explore Training Set

Missing Values:

Category_name
Brand_name
Item_description

Categorical Variables (Need to do Encoding):

name
category_name
brand_name
item_description

Check Missing Values

train.count()

train_id 593376
name 593376
item_condition_id 593376
category_name 590835
brand_name 340359
price 593376
shipping 593376
item_description 593375
dtype: int64

Check Data Types

train.dtypes

train_id int64
name object
item_condition_id int64
category_name object
brand_name object
price float64
shipping int64
item_description object
dtype: object

3a. Price Distribution


Why Do Prices Vary?

Supply and demand
Brand name
Fabric terms
"Quality"-type words (check whether quality plays a role in price)
Condition

Summary:

The mean price in the dataset is 26 dollars.
The median price in the dataset is 17 dollars.
The max price in the dataset is 2000 dollars.
Because the distribution is skewed, the median is the more reliable gauge of a typical price.

train.price.describe()

count 593376.000000
mean 26.689003
std 38.340061
min 0.000000
25% 10.000000
50% 17.000000
75% 29.000000
max 2000.000000
Name: price, dtype: float64

# Could we use these as features? Look at median price for each quantile
bins = [0, 10, 17, 29, 2001]
labels = ['q1','q2','q3','q4']
train['price_bin'] = pd.cut(train['price'], bins=bins, labels=labels)
train.groupby('price_bin')['price'].describe()

              count       mean        std   min   25%   50%   75%
price_bin
q1         149944.0   7.710178   2.083100   3.0   6.0   8.0  10.0
q2         151863.0  13.834845   1.795258  10.5  12.0  14.0  15.0
q3         144043.0  22.539551   3.335075  17.5  20.0  22.0  25.0
q4         147215.0  63.396077  63.271190  30.0  35.0  45.0  66.0

plt.figure(figsize=(12, 7))
plt.hist(train['price'], bins=50, range=[0,250], label='price')
plt.title('Price Distribution', fontsize=15)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Samples', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()
shipping = train[train['shipping']==1]['price']
no_shipping = train[train['shipping']==0]['price']

plt.figure(figsize=(12,7))
plt.hist(shipping, bins=50, normed=True, range=[0,250], alpha=0.7, label='Price With Shipping')
plt.hist(no_shipping, bins=50, normed=True, range=[0,250], alpha=0.7, label='Price Without Shipping')
plt.title('Price Distribution With/Without Shipping', fontsize=15)
plt.xlabel('Price')
plt.ylabel('Normalized Samples')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize=15)
plt.show()

3b. Brand Analysis

# Amount of unique brand names
train['brand_name'].nunique()

3751

# Top 20 brand distribution
b20 = train['brand_name'].value_counts()[0:20].reset_index().rename(columns={'index': 'brand_name', 'brand_name': 'count'})
ax = sns.barplot(x="brand_name", y="count", data=b20)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_title('Top 20 Brand Distribution', fontsize=15)
plt.show()

# Display top 20 expensive brands by mean price
top20_brand = train.groupby('brand_name', axis=0).mean()
df_expPrice = pd.DataFrame(top20_brand.sort_values('price', ascending=False).head(20)).reset_index()

ax = sns.barplot(x="brand_name", y="price", data=df_expPrice)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=15)
ax.set_title('Top 20 Expensive Brands', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

# Display frequency ratio of brand names
train['brand_name'].value_counts(1)
PINK 0.063659
Nike 0.062687
Victoria's Secret 0.056910
LuLaRoe 0.036462
Apple 0.020337
Nintendo 0.017958
FOREVER 21 0.017649
Lululemon 0.017047
Michael Kors 0.016415
American Eagle 0.015604
Rae Dunn 0.014676
Sephora 0.014303
Coach 0.012372
Adidas 0.012311
Bath & Body Works 0.012167
Disney 0.012014
Funko 0.011030
Under Armour 0.009807
Sony 0.009587
Old Navy 0.009208
Hollister 0.008106
Carter's 0.007698
Urban Decay 0.007383
The North Face 0.007110
Too Faced 0.006828
Xbox 0.006728
Independent 0.006711
MAC 0.006514
Brandy Melville 0.006464
Kate Spade 0.006396
...
Sock It to Me 0.000003
Cocomo 0.000003
Onque Casuals 0.000003
Kaii 0.000003
GoGroove Pal Com 0.000003
Honda 0.000003
White + Warren 0.000003
Elomi 0.000003
Mecca 0.000003
Lulu Frost 0.000003
True Rock 0.000003
Christian Lacroix 0.000003
Acne Jeans 0.000003
First Act 0.000003
Neil Allyn 0.000003
Foundry 0.000003
Dog MD 0.000003
Armani Exchange 0.000003
Oxford Golf 0.000003
Yakima 0.000003
Bacco Bucci 0.000003
Bostonian 0.000003
BedHead 0.000003
Moose Mountain 0.000003
Tootsie 0.000003
Catit 0.000003
Lisa Maree 0.000003
Ecco Bella 0.000003
Custo Barcelona 0.000003
Name: brand_name, Length: 3751, dtype: float64

3c. Category Distribution

def transform_category_name(category_name):
    try:
        main, sub1, sub2 = category_name.split('/')
        return main, sub1, sub2
    except:
        return np.nan, np.nan, np.nan

train['category_main'], train['category_sub1'], train['category_sub2'] = zip(*train['category_name'].apply(transform_category_name))

cat_train = train[['category_main', 'category_sub1', 'category_sub2', 'price']]

cat_train.head()

category_main category_sub1 category_sub2 price

0 Men Tops T-shirts 10.0

1 Electronics Computers & Tablets Components & Parts 52.0

2 Women Tops & Blouses Blouse 10.0

3 Home Home Décor Home Décor Accents 35.0

4 Women Jewelry Necklaces 44.0

3c. Main Category

Interesting findings:

Women and Beauty make up the majority of the distribution, roughly 59% combined (about 45% Women and 14% Beauty, per the output below).

Questions to ask:

Can we create a gender category (Female, Male, Neutral)? Each listing has a three-level category path; if two of the three levels point to female items we could classify the listing as female, if two point to male items we classify it as male, and otherwise we mark it neutral.
Does gender play a role in price?
Can we create an age category?

# Electronics have the highest std
train.groupby('category_main')['price'].describe()

                           count       mean        std  min   25%   50%
category_main
Beauty                   83315.0  19.727468  20.708703  0.0  10.0  15.0
Electronics              47986.0  33.763889  63.485958  0.0   9.0  15.0
Handmade                 12257.0  18.325365  27.484725  0.0   6.0  12.0
Home                     27331.0  24.845798  25.203925  0.0  12.0  18.0
Kids                     68404.0  20.664983  22.877467  0.0  10.0  14.0
Men                      37382.0  34.532369  39.729618  0.0  14.0  21.0
Other                    18251.0  20.821434  31.046225  0.0   9.0  13.0
Sports & Outdoors         9632.0  25.140365  27.388032  0.0  11.0  16.0
Vintage & Collectibles   18673.0  27.158732  52.338051  0.0  10.0  16.0
Women                   265870.0  28.843331  39.435913  0.0  12.0  19.0


# Display distribution
train['category_main'].value_counts(1)

Women 0.451315
Beauty 0.141427
Kids 0.116116
Electronics 0.081456
Men 0.063456
Home 0.046394
Vintage & Collectibles 0.031697
Other 0.030981
Handmade 0.020806
Sports & Outdoors 0.016350
Name: category_main, dtype: float64

plt.figure(figsize=(17,10))
sns.countplot(y=train['category_main'], order=train['category_main'].value_counts().index)
plt.title('Top 10 Categories', fontsize=25)
plt.ylabel('Main Category', fontsize=20)
plt.xlabel('Number of Items in Main Category', fontsize=20)
plt.show()

#main = pd.DataFrame(cat_train['category_main'].value_counts()).reset_index().re...
fig, axes = plt.subplots(figsize=(12, 7))
main = cat_train[cat_train["price"] < 100]
# Use a color palette
ax = sns.boxplot(x=main["category_main"], y=main["price"], palette="Blues")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=12)
plt.show()

# Create a "no_brand" column
train['no_brand'] = train['brand_name'].isnull()

f, ax = plt.subplots(figsize=(15, 4))
sns.countplot(y='category_main', hue='no_brand', data=train).set_title('Categories With and Without Brand Name')
plt.show()

3c. Category_2 Distribution

df = cat_train.groupby(['category_sub2'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False).head(20)

plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='r')
plt.yticks(range(0, len(df)), df['category_sub2'], fontsize=15)

plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 2', fontsize=15)
plt.title('Top 20 2nd Category (Mean Price)', fontsize=20)
plt.show()

3c. Category_1 Distribution

df = cat_train.groupby(['category_sub1'])['price'].agg(['mean']).reset_index()
df = df.sort_values('mean', ascending=False)[0:20]

plt.figure(figsize=(20, 15))
plt.barh(range(0, len(df)), df['mean'], align='center', alpha=0.5, color='b')
plt.yticks(range(0, len(df)), df['category_sub1'], fontsize=15)

plt.xlabel('Price', fontsize=15)
plt.ylabel('Sub Category 1', fontsize=15)
plt.title('Top 20 1st Category (Mean Price)', fontsize=20)
plt.show()

3d. Item Description Analysis

Hypothesis:

Does length play a role in price?
Do certain descriptions signal a fake item?
Do lengthier descriptions mean more effort put into the item, more authenticity, more value?

# Remove punctuation
combined.item_description = combined.item_description.astype(str)

descr = combined[['item_description', 'price']]

descr['count'] = descr['item_description'].apply(lambda x: len(str(x)))

descr['item_description'] = descr['item_description'].apply(remove_digits)
descr['item_description'] = descr['item_description'].apply(remove_punctuation)
descr['item_description'] = descr['item_description'].apply(remove_stop_words)
descr.head(3)
C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:5:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(the same SettingWithCopyWarning is emitted for each of the following assignments to descr)

   item_description                                     price  count
0  description yet                                       10.0     18
1  keyboard great condition works like came box p...     52.0    188
2  adorable top hint lace key hole back pale pink...     10.0    124

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

descr['item_description'] = descr['item_description'].apply(porter.stem)


descr.tail(10)

        item_description                                    price  count
693349  listing brand new shades piu scar...                  NaN     71
693350  rm free shipping new highly addict                    NaN     47
693351  purple boys polo shirt size old navy never worn       NaN     59
693352  express deep olive green cardigan ultra thin ...      NaN    121
693353  shade medium neutral barley us                        NaN     41
693354  flintquartz cluster self mined ✨measures xin...       NaN    243
693355  cosmetics travel bundle includes brow power un...     NaN    968
693356  new free shipping basstop cas                         NaN     31
693357  floral kimono tropical print open front hi low...     NaN     94
693358  floral scrub tops worn less times brown belt ti       NaN     71

df = descr.groupby('count')['price'].mean().reset_index()
sns.regplot(x=df["count"], y=(df["price"]))
plt.xlabel("word count")
plt.show()


Create Pre-Processing Functions

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  NaN        Men/Tops/T-shirts                                   3  No description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  This keyboard in great condition and works ...
2  Target     Women/Tops & Blouses/Blouse                         1  Adorable top with a hint of lace and a key hol...
3  NaN        Home/Home Décor/Home Décor Accents                  1  New with tags. Leather horses. Retail for [rm]...
4  NaN        Women/Jewelry/Necklaces                             1  Complete with certificate of authenticity

(remaining columns are cut off in the original output)

# handle_missing_values - Basic data imputation of missing values
def handle_missing_values(df):
    df['category_name'].fillna(value='missing', inplace=True)
    df['brand_name'].fillna(value='None', inplace=True)
    df['item_description'].fillna(value='None', inplace=True)

# to_categorical - Converts categorical features
def to_categorical(df):
    df['brand_name'] = df['brand_name'].astype('category')
    df['category_name'] = df['category_name'].astype('category')
    df['item_condition_id'] = df['item_condition_id'].astype('category')

Apply Pre-Processing Functions

# Applying the pre-processing functions
handle_missing_values(combined)
to_categorical(combined)

# Apply the pre-processing functions to the ML combined set
handle_missing_values(combined_ML)
to_categorical(combined_ML)

# Remove punctuation
combined_ML.item_description = combined_ML.item_description.astype(str)

combined_ML['item_description'] = combined_ML['item_description'].apply(remove_digits)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_punctuation)
combined_ML['item_description'] = combined_ML['item_description'].apply(remove_stop_words)
combined_ML['item_description'] = combined_ML['item_description'].apply(to_lower)

combined_ML['name'] = combined_ML['name'].apply(remove_digits)
combined_ML['name'] = combined_ML['name'].apply(remove_punctuation)
combined_ML['name'] = combined_ML['name'].apply(remove_stop_words)
combined_ML['name'] = combined_ML['name'].apply(to_lower)

combined_ML.head(3)

  brand_name  category_name                                    item_condition_id  item_description
0  Rue21      Women/Skirts/Maxi                                2  wore big size small normally wear extra small ...
1  PINK       Women/Tops & Blouses/Tank, Cami                  3  pink size xs racerback free shipping
2  LuLaRoe    Women/Athletic Apparel/Pants, Tights, Leggings   1  description yet

# Remove punctuation
combined.item_description = combined.item_description.astype(str)

combined['item_description'] = combined['item_description'].apply(remove_digits)
combined['item_description'] = combined['item_description'].apply(remove_punctuation)
combined['item_description'] = combined['item_description'].apply(remove_stop_words)
combined['item_description'] = combined['item_description'].apply(to_lower)

combined['name'] = combined['name'].apply(remove_digits)
combined['name'] = combined['name'].apply(remove_punctuation)
combined['name'] = combined['name'].apply(remove_stop_words)
combined['name'] = combined['name'].apply(to_lower)

Create three new features from Categories (Main, Sub1, Sub2)

combined.isnull().any()

brand_name False
category_name False
item_condition_id False
item_description False
name False
price True
shipping False
test_id True
train_id True
dtype: bool

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

combined_ML.head()

  brand_name  category_name                                    item_condition_id  item_description
0  Rue21      Women/Skirts/Maxi                                2  wore big size small normally wear extra small ...
1  PINK       Women/Tops & Blouses/Tank, Cami                  3  pink size xs racerback free shipping
2  LuLaRoe    Women/Athletic Apparel/Pants, Tights, Leggings   1  description yet
3  None       Beauty/Hair Care/Shampoo & Conditioner Sets      1  silk express shampoo silk conditioner leave co...
4  Sephora    Beauty/Makeup/Face                               1  deluxe samples ysl smashbox hourglass biossance

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

Apply CountVectorizer / TfidfVectorizer / LabelBinarizer

Encode labels into categorical variables: pandas factorize and scikit-learn LabelEncoder. The result has 1 dimension.

Encode a categorical variable into dummy/indicator (binary) variables: pandas get_dummies and scikit-learn OneHotEncoder. The result has n dimensions, one per distinct value of the encoded categorical variable.
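A small, hypothetical side-by-side of the two encoding styles (the notebook itself uses LabelBinarizer and get_dummies below):

s = pd.Series(['Nike', 'PINK', 'Nike', 'Apple'])

# Label encoding: a single column of integer codes (1 dimension)
codes, uniques = pd.factorize(s)
print(codes)    # e.g. [0 1 0 2]
print(uniques)  # Index(['Nike', 'PINK', 'Apple'], dtype='object')

# One-hot / dummy encoding: one binary column per distinct value (n dimensions)
print(pd.get_dummies(s))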

Create New Feature (Binning Price Into Two Categories)

#bins = [0, 64, 5000]
#labels = ['less','more']
#combined['lt65'] = pd.cut(combined['price'], bins=bins, labels=labels)

combined.head()

  brand_name  category_name                                       item_condition_id  item_description
0  None       Men/Tops/T-shirts                                   3  description yet
1  Razer      Electronics/Computers & Tablets/Components & P...   3  keyboard great condition works like came box ...
2  Target     Women/Tops & Blouses/Blouse                         1  adorable top hint lace key hole back pale pink...
3  None       Home/Home Décor/Home Décor Accents                  1  new tags leather horses retail rm stand foot h...
4  None       Women/Jewelry/Necklaces                             1  complete certificate authenticity

(remaining columns are cut off in the original output)

# Apply CountVectorizer to "name"; this converts it into a sparse matrix
cv = CountVectorizer(min_df=10)
X_name = cv.fit_transform(combined['name'])
X_name

<1286735x15973 sparse matrix of type '<class 'numpy.int64'>'
	with 4789374 stored elements in Compressed Sparse Row format>
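min_df=10 here means terms that appear in fewer than 10 listings are dropped from the vocabulary, which is what keeps the name matrix down to roughly 16k columns. A toy illustration (hypothetical data):

names = ["razer keyboard", "pink blouse", "razer mouse", "rare brandxyz item"]

print(len(CountVectorizer().fit(names).vocabulary_))          # 8 terms: everything kept
print(len(CountVectorizer(min_df=2).fit(names).vocabulary_))  # 1 term: only "razer" appears in >= 2 documents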

# Apply CountVectorizer to "category_name"; this converts it into a sparse matrix
cv = CountVectorizer()
X_category = cv.fit_transform(combined['category_name'])
#X_sub1 = cv.fit_transform(combined['sub_category_1'])
#X_sub2 = cv.fit_transform(combined['sub_category_2'])
X_category

<1286735x1007 sparse matrix of type '<class 'numpy.int64'>'
	with 5165431 stored elements in Compressed Sparse Row format>

# Apply TF-IDF to "item_description"
tv = TfidfVectorizer(max_features=55000, ngram_range=(1, 2), stop_words='english')
X_description = tv.fit_transform(combined['item_description'])

# Apply LabelBinarizer to "brand_name"
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(combined['brand_name'])

Create CSR Matrix & Merge the Sparse Matrices

# Create our final sparse matrix
X_dummies = csr_matrix(pd.get_dummies(combined[['item_condition_id', 'shipping']]).values)

# Combine everything together
sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()

Train and Test Split

X_train_sparse = sparse_merge[:train_size]
X_test = sparse_merge[train_size:]

#X_train = sparse_merge[:len(combined_ML)]
#X_test = sparse_merge[len(combined_ML):]

combined.columns

Index(['brand_name', 'category_name', 'item_condition_id', 'item_description',
       'name', 'price', 'shipping', 'test_id', 'train_id'],
      dtype='object')

Cross Validation

from sklearn.cross_validation import KFold

eval_size = .10
kf = KFold(len(y), round(1. / eval_size))
train_indicies, valid_indicies = next(iter(kf))
X_train, y_train = X_train_sparse[train_indicies], y[train_indicies]
X_valid, y_valid = X_train_sparse[valid_indicies], y[valid_indicies]

C:\Users\Randy\Anaconda3\lib\site-
packages\sklearn\cross_validation.py:44: DeprecationWarning: This
module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved.
Also note that the interface of the new CV iterators are different from
that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

Create RMSLE Function

Since the errors are squared before they are averaged, RMSE gives relatively high weight to large errors. This means RMSE is most useful when large errors are particularly undesirable.

RMSE has the benefit of penalizing large errors more, so it can be more appropriate in some cases, for example if being off by 10 is more than twice as bad as being off by 5. But if being off by 10 is just twice as bad as being off by 5, then MAE is more appropriate.

def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2)))

Train with LGBM

The reason I used this algorithm is that it's a good model to use on big data sets.

It has:

fast training speed and high efficiency
low memory usage
good accuracy
good compatibility with large datasets

The RMSLE of LGBM is: 0.5406

import lightgbm as lgb

d_train = lgb.Dataset(X_train, label=y_train)

params = {}
#params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['metric'] = 'rmse'

clf = lgb.train(params, d_train, 100)

# Prediction
lgbm_pred = clf.predict(X_valid)

# Transform the prices back to their original scale
np.expm1(lgbm_pred)

array([ 10.7214129 ,  29.60000413,  13.02518987, ...,  12.85446125,
        19.31981685,  15.34901857])

import time
start_time = time.time()
print('[{}] LGBM completed.'.format(time.time() - start_time))
print("LGBM rmsle: "+str(rmsle(np.expm1(y_valid), np.expm1(lgbm_pred))))

[0.0] LGBM completed.
LGBM rmsle: 0.540597319376
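If you want to watch validation error while boosting, lgb.train also accepts validation sets; a possible variation (same params as above, purely optional) is:

d_valid = lgb.Dataset(X_valid, label=y_valid, reference=d_train)

# Track RMSE on the held-out fold during the 100 boosting rounds
clf = lgb.train(params, d_train, num_boost_round=100,
                valid_sets=[d_train, d_valid], valid_names=['train', 'valid'])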

Train with Ridge Regression

The RMSLE of Ridge Regression is: 0.4829

import time

start_time = time.time()

model = Ridge(solver="sag", fit_intercept=False)

print("Fitting Ridge Model")
model.fit(X_train, y_train)

preds_valid = model.predict(X_valid)

print('[{}] Ridge completed.'.format(time.time() - start_time))
print("Ridge rmsle: " + str(rmsle(np.expm1(y_valid), np.expm1(preds_valid))))

Fitting Ridge Model
[32.998536586761475] Ridge completed.
Ridge rmsle: 0.482907420753

np.expm1(preds_valid)

array([  9.09673618,  83.84303118,  11.78868638, ...,  12.66779351,
        23.59042071,  11.21630299])

Interesting Note
The feature 'lt65' that I created made a significant impact on the model's performance. I binned the items into one of two categories based on their price: 'Less than 65' or 'More than 65'.

The Ridge Regression model's RMSLE dropped from .4829 to .4215 with the addition of this feature.

The LGBM model's RMSLE dropped from .5406 to .4533.

Predict on Test Set

# Predicting on the never-seen test set
preds = model.predict(X_test)

submission["price"] = np.expm1(preds)
submission.to_csv("submission_ridge.csv", index=False)

C:\Users\Randy\Anaconda3\lib\site-packages\ipykernel_launcher.py:4:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.

submission

test_id price

0 0 11.162749

1 1 12.555600

2 2 53.157534

3 3 17.925542

4 4 7.363347

5 5 9.959583

6 6 9.521093

7 7 33.185204

8 8 45.666661

9 9 6.283195

10 10 52.478731

11 11 9.582656

12 12 33.508056
13 13 49.353728

14 14 24.605690

15 15 8.701512

16 16 24.601817

17 17 17.042603

18 18 41.234336

19 19 7.499206

20 20 6.479075

21 21 10.071822

22 22 11.011129

23 23 13.974448

24 24 43.893530

25 25 7.502994

26 26 20.210556

27 27 8.181354

28 28 53.241879

29 29 7.257089

... ... ...

693329 693329 28.354434

693330 693330 20.699427

693331 693331 14.273625

693332 693332 43.814946

693333 693333 16.261933

693334 693334 81.738151

693335 693335 6.648241

693336 693336 15.763434

693337 693337 17.041259


693338 693338 14.168938

693339 693339 13.545883

693340 693340 3.973624

693341 693341 219.385862

693342 693342 4.910538

693343 693343 17.301045

693344 693344 21.939166

693345 693345 13.895224

693346 693346 29.366304

693347 693347 45.978350

693348 693348 67.798859

693349 693349 10.384145

693350 693350 8.716605

693351 693351 9.640998

693352 693352 14.098605

693353 693353 14.731720

693354 693354 19.804941

693355 693355 26.429490

693356 693356 6.252158

693357 693357 15.377224

693358 693358 9.742211

693359 rows × 2 columns

LeaderBoard Result (Top 36%)

<img src="http://i63.tinypic.com/14ccuv6.jpg" />
Conclusion

I am happy to have done this competition because it opened up my mind to the realm of NLP and showed me how many pre-processing steps are involved for text data. I learned the most common steps for text pre-processing, which prepares me for future work whenever I'm up against text data again. Another concept I really learned to value more is the choice of algorithm and how important computation is when you're dealing with large datasets. It took a couple of minutes just to run some of the data visualizations and modeling. Text data is everywhere and it can get messy. Understanding the fundamentals of how to tackle these problems will definitely help me in the future.
