Huy

The document outlines a comprehensive data preparation process for machine learning, focusing on cleaning numerical features, engineering new features, handling missing values, encoding categorical variables, and performing feature scaling. It details specific steps taken for each task, including the use of pandas and scikit-learn for data manipulation and imputation, as well as exploratory data analysis (EDA) to understand feature distributions and correlations with the target variable 'price'. Finally, it discusses the selection of three regression models for predicting price, highlighting their strengths and suitability for the dataset.

Task 2, Question 1 focuses on cleaning numerical features currently stored as text/object types, often containing characters like '%' or '$', to enable their use in machine learning.
We begin by importing pandas, numpy, and re for data manipulation and text parsing, then load train.csv. Initial inspection
(df.head(), df.dtypes, df.isnull().sum()) confirms columns like host_response_rate, bathrooms, and price are objects requiring
cleaning.

Cleaning Steps:
1. Percentage Columns (host_response_rate, host_acceptance_rate): These are converted to string, '%' is stripped, then
pd.to_numeric(errors='coerce') converts them to numbers (NaN for errors), and finally, they're divided by 100.0 for
decimal representation.
2. bathrooms Column: A custom function clean_bathrooms is applied. It handles existing NaNs, converts text to
lowercase, specifically maps 'half-bath' to 0.5, and uses regex r'(\d+\.?\d*)' to extract the first numerical value
(integer or decimal), converting it to float. If no number is found, it returns NaN.
3. price Column: If price is an object type, '$' and ',' are removed using df.replace(). Then,
pd.to_numeric(errors='coerce') converts it. If already numeric, it's ensured to be float via astype(float).
4. Boolean Columns (e.g., host_is_superhost): Columns in bool_tf_cols with 't'/'f' or True/False values are mapped to
1/0 respectively and converted to float to handle potential NaNs consistently.
5. Other Numeric Columns (e.g., accommodates, review_scores_rating): A predefined list of columns
(numeric_cols_to_convert) is processed. Each existing column is converted using pd.to_numeric(errors='coerce') to
ensure a numeric type, changing non-convertible entries to NaN.
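To make these steps concrete, here is a minimal sketch of the cleaning logic described above. It assumes the column names used in this write-up; the bool_tf_cols and numeric_cols_to_convert lists shown are illustrative placeholders, not the original lists.

```python
import re
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# 1. Percentage columns: strip '%' and convert to a 0-1 decimal.
for col in ["host_response_rate", "host_acceptance_rate"]:
    df[col] = pd.to_numeric(df[col].astype(str).str.rstrip("%"), errors="coerce") / 100.0

# 2. bathrooms: map 'half-bath' to 0.5, otherwise extract the first number.
def clean_bathrooms(value):
    if pd.isna(value):
        return np.nan
    text = str(value).lower()
    if "half-bath" in text:
        return 0.5
    match = re.search(r"(\d+\.?\d*)", text)
    return float(match.group(1)) if match else np.nan

df["bathrooms"] = df["bathrooms"].apply(clean_bathrooms)

# 3. price: remove '$' and ',' before numeric conversion.
if df["price"].dtype == object:
    df["price"] = pd.to_numeric(
        df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False),
        errors="coerce",
    )
else:
    df["price"] = df["price"].astype(float)

# 4. Boolean 't'/'f' columns mapped to 1/0, kept as float so NaNs survive.
bool_tf_cols = ["host_is_superhost", "instant_bookable"]  # illustrative list
for col in bool_tf_cols:
    if col in df.columns:
        df[col] = df[col].map({"t": 1, "f": 0, True: 1, False: 0}).astype(float)

# 5. Remaining numeric columns: coerce non-convertible entries to NaN.
numeric_cols_to_convert = ["accommodates", "review_scores_rating"]  # illustrative list
for col in numeric_cols_to_convert:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
```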

Verification:
Post-cleaning, df.dtypes confirms columns like host_response_rate, bathrooms, and price are now float64. The
df.isnull().sum() output shows any new NaNs introduced by errors='coerce', which are expected and will be addressed
during imputation.
Task 2, Question 2 involves engineering at least four new features from existing data to enhance model performance.
First, date columns (host_since, first_review, last_review) are converted to datetime objects using
pd.to_datetime(errors='coerce'), handling any parsing errors by converting them to NaT (Not a Time). This enables date-
based calculations.
New Features Created:
1. host_experience_years: Calculated by subtracting the year of host_since from the current year
(pd.Timestamp.now().year). NaNs (from missing or unparseable host_since dates) are filled with 0, representing
unknown or minimal experience.
2. amenities_count: The amenities string (e.g., ‘Wifi,TV,Kitchen’) is split by commas, and the length of the resulting list
gives the amenity count. Missing or unparseable amenity strings result in a count of 0, ensured by astype(str) and
fillna(0).
3. days_since_last_review: Determined by subtracting last_review from the current timestamp (pd.Timestamp.now())
and extracting the difference in days (.dt.days). NaNs (for listings with no or unparseable last review dates) are filled
with -1 to distinctly mark them.
4. description_length: The character length of each listing's description string is calculated using .astype(str).str.len().
Missing descriptions result in a length of 0 via fillna(0).
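The following sketch shows how these four features could be built with the column names described above; it follows the intent stated here (missing values ending up as 0 or -1) rather than reproducing the original code verbatim.

```python
import pandas as pd

# Parse date columns; unparseable values become NaT.
for col in ["host_since", "first_review", "last_review"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

now = pd.Timestamp.now()

# 1. Years of hosting experience; unknown hosts get 0.
df["host_experience_years"] = (now.year - df["host_since"].dt.year).fillna(0)

# 2. Number of amenities, counted by splitting the comma-separated string;
#    missing amenities count as 0.
df["amenities_count"] = df["amenities"].fillna("").apply(
    lambda s: len([a for a in str(s).split(",") if a.strip()])
)

# 3. Days since the most recent review; listings with no review are flagged with -1.
df["days_since_last_review"] = (now - df["last_review"]).dt.days.fillna(-1)

# 4. Character length of the listing description; missing descriptions count as 0.
df["description_length"] = df["description"].fillna("").astype(str).str.len()
```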

Verification:
A sample (df[[...]].head()) and dtypes printout confirm the new features are numerical (int32/int64). isnull().sum() shows
zero missing values in these new features due to the fillna() operations during their creation, making them ready for
modeling.
Moving to Task 2, Question 3, our objective is to handle missing values in our dataset. Machine learning algorithms generally cannot work with NaNs, so we need to impute them, meaning we'll fill them in with estimated values. This step is crucial for both our training and any future test datasets.

First, we focus on the numerical features by selecting columns with df.select_dtypes(include=‘number’) and storing them in
df_num. We then create df_data from these numerical columns. Our target variable, or label, is price. Next, we split this
numerical data into training and validation sets using train_test_split. We’re using an 80-20 split. It’s critical to split the data
before imputation to prevent data leakage from the validation set into the training set. We separate the features (X) from
the target variable (y) for both sets.

Before imputation, we can see from the output that our training set has 620 missing values and the validation set has 168.
To handle these, we use SimpleImputer from scikit-learn. Key points here are:
1. We initialize the imputer with strategy=‘mean’, which means it will replace all np.nan (missing values) with the mean
of the respective column.
2. Crucially, we fit the imputer only on the training data (df_x_train). This learns the mean for each feature from the
training set.
3. Then, we use this fitted imputer to transform both the training set (df_x_train) and the validation set (df_x_valid).
This ensures that the validation set is processed using information learned solely from the training data, mimicking
how we’d handle new, unseen data.
4. The transform method returns NumPy arrays, so we convert them back to pandas DataFrames, preserving the
original column names and indices.
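A sketch of this split-then-impute workflow, using variable names that follow the write-up (the random_state for the split is chosen here purely for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Keep only the numerical columns; 'price' is the target.
df_num = df.select_dtypes(include="number")
df_x = df_num.drop(columns=["price"])
df_y = df_num["price"]

# Split before imputation so validation data cannot influence the learned means.
df_x_train, df_x_valid, df_y_train, df_y_valid = train_test_split(
    df_x, df_y, test_size=0.2, random_state=1
)

# Learn per-column means from the training set only, then apply to both sets.
imputer = SimpleImputer(strategy="mean")
df_x_train = pd.DataFrame(
    imputer.fit_transform(df_x_train), columns=df_x_train.columns, index=df_x_train.index
)
df_x_valid = pd.DataFrame(
    imputer.transform(df_x_valid), columns=df_x_valid.columns, index=df_x_valid.index
)

# Both should now report 0 missing values.
print(df_x_train.isnull().sum().sum(), df_x_valid.isnull().sum().sum())
```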

As the final output lines show, after imputation, both the training and validation feature sets have 0 missing values,
successfully addressing the task.
For Task 2, Question 4, we're encoding our categorical variables. This is necessary because most machine learning models require numerical input. We have specific rules for this encoding: handling multi-value entries and managing high-cardinality features by keeping the top 5 categories and grouping the rest into an 'other' category, before finally applying one-hot encoding.

We define a function encode_categorical that takes our DataFrame and a column name. Let’s walk through its key actions:
1. Multi-value Handling: It first checks if a string value contains a comma. If so, it’s re-labeled as ‘other’. This simplifies
features like host_verifications or amenities if they haven’t been pre-processed into counts already.
2. Cardinality Reduction: If a column has more than 5 unique values (after handling multi-values and excluding NaNs), it
identifies the top 5 most frequent values. All other values in that column are then mapped to ‘other’. This prevents
creating too many new features from highly diverse columns.
3. NaN Handling & Type Conversion: The column is converted to pandas’ category dtype. We explicitly add ‘other’ as a
category if it’s not already present and the unique count is low, ensuring consistency. Then, any remaining NaN
values are filled with ‘other’.
4. One-Hot Encoding: Finally, pd.get_dummies is used to perform one-hot encoding. This creates new binary
(True/False or 1/0) columns for each unique category (including ‘other’), and the original categorical column is
dropped. The dummy_na=False ensures we don’t create an extra column for NaNs, as we’ve already handled them.
We then identify all columns with an ‘object’ dtype, which are our categorical features. We loop through this list, applying
our encode_categorical function to each one. The print statement inside the loop shows us which column is currently being
processed.
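A sketch of an encode_categorical helper along the lines described above (comma check, top-5 cut-off, 'other' bucket, one-hot encoding); the original implementation also converts to the category dtype explicitly, which is omitted here for brevity.

```python
import pandas as pd

def encode_categorical(df, col, top_n=5):
    s = df[col].astype(object)

    # Multi-value entries (comma-separated strings) are collapsed into 'other'.
    s = s.apply(lambda v: "other" if isinstance(v, str) and "," in v else v)

    # High cardinality: keep the top 5 most frequent values, map the rest to 'other'.
    if s.nunique(dropna=True) > top_n:
        top_values = s.value_counts().head(top_n).index
        s = s.where(s.isin(top_values), "other")

    # Remaining NaNs also become 'other'.
    s = s.fillna("other")

    # One-hot encode and drop the original column.
    dummies = pd.get_dummies(s, prefix=col, dummy_na=False)
    return pd.concat([df.drop(columns=[col]), dummies], axis=1)

# Apply to every remaining object-typed column.
for col in df.select_dtypes(include="object").columns:
    print(f"Encoding {col}")
    df = encode_categorical(df, col)
```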

The output snippets ‘Data types after categorical encoding’ and ‘DataFrame head...’ demonstrate the result: the original
object columns are gone, replaced by new boolean columns (e.g., source_city scrape, property_type_Entire home,
name_other). This transformation makes our categorical data suitable for model training.
For our final data preparation step in Task 2, Question 5, we're going to perform feature scaling. This is a common and often crucial step before training many types of predictive models. The goal is to standardize the range of our independent numerical variables.

Why do we do this? Many machine learning algorithms, particularly those that rely on distance calculations (like K-Nearest
Neighbors or Support Vector Machines) or gradient descent (like linear regression, logistic regression, and neural networks),
can perform poorly or converge slowly if features are on vastly different scales. For example, a feature ranging from 0 to
100,000 might dominate another feature ranging from 0 to 1, unfairly influencing the model. StandardScaler addresses this
by transforming each feature to have a mean of 0 and a standard deviation of 1.

Looking at the code:


1. We initialize StandardScaler from scikit-learn.
2. For the training data (df_x_train), we use scaler.fit_transform(). The fit part calculates the mean and standard
deviation for each feature only from the training data. The transform part then applies the scaling using these
learned parameters.
3. For the validation data (df_x_valid), we only use scaler.transform(). This is critical: we apply the same scaling
parameters learned from the training set to the validation set. This prevents data leakage and ensures our validation
process accurately reflects how the model would perform on unseen data.
4. The results are NumPy arrays, so we convert them back into pandas DataFrames, making sure to keep the original
column names and indices.
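A sketch of this scaling step, assuming df_x_train and df_x_valid from the imputation step:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on training data only, then reuse the learned mean/std for validation.
df_x_train = pd.DataFrame(
    scaler.fit_transform(df_x_train), columns=df_x_train.columns, index=df_x_train.index
)
df_x_valid = pd.DataFrame(
    scaler.transform(df_x_valid), columns=df_x_valid.columns, index=df_x_valid.index
)

# Each training column should now have mean ~0 and standard deviation ~1.
print(df_x_train.describe().round(2))
```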

As you can see in the ‘Training data after scaling’ and ‘Validation data after scaling’ outputs, the values in our feature
columns are now centered around zero and have been scaled. For instance, the host_listings_count and accommodates
columns now have values that are much closer in magnitude, ensuring no single feature will unduly dominate the learning
process due to its original scale.
Welcome to Task 3. Before we dive into building and tuning our predictive models for price, it's essential to perform some Exploratory Data Analysis (EDA). The goal here is to understand our data better, particularly the distributions of our features and, crucially, how these features relate to our target variable, 'price'. This understanding will help us make informed decisions during modeling.

We start by generating descriptive statistics for all our numerical features using df_data.describe(). The pd.set_option line
simply formats our float output for better readability.

Looking at the output table:

1. Count: Most features have 7000 entries, matching the dataset size. However, columns like host_acceptance_rate, bathrooms, and bedrooms have slightly fewer, meaning the full df_data used for this EDA still contains some NaNs; describe() simply excludes them from calculations such as the mean and standard deviation. Our earlier imputation was applied only to df_x_train and df_x_valid, not to this global df_data.
2. Price: For our target variable ‘price’, the mean is around 285, but the standard deviation is very high (2325), and the
maximum value is 145,160. The 75th percentile is only 268. This large difference between the mean/median and the
max, along with a high standard deviation, strongly suggests that the ‘price’ distribution is heavily right-skewed, with
some very expensive listings. This is a common characteristic of price data.
3. Other Features: We can also observe ranges and central tendencies for other features. For instance, accommodates
ranges from 1 to 16. Features like minimum_maximum_nights and maximum_nights_avg_ntm show extremely large
maximum values, which might indicate outliers or specific data encoding that we handled during cleaning but are still
visible in raw stats. The feature has_availability has a standard deviation of 0.00, and its min, mean, and max are all
1.00, indicating it’s a constant value across the dataset and won’t be useful for prediction.

Next, we visualize the distribution of each numerical feature using histograms. We loop through each column in df_data,
drop any NaN values for the purpose of plotting that specific histogram, and then plot its distribution with 50 bins. While we
won’t go through every single histogram, the general findings from this step would typically reinforce what describe()
suggested:
1. Price: The histogram for ‘price’ would visually confirm its strong right skew.
2. Count-based features (like bedrooms, number_of_reviews): These often also show right skew, with many listings
having fewer counts and a long tail for higher counts.
3. Review scores: These might be left-skewed, with many listings having high scores.

Understanding these distributions is important. For instance, highly skewed features (especially the target variable) might
benefit from transformations like a log transform before modeling, particularly for linear models.
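A sketch of the describe() call and histogram loop described above, plus the kind of log1p transform a right-skewed target might benefit from; it assumes df_data holds the numerical features, as in the write-up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Descriptive statistics with more readable float formatting.
pd.set_option("display.float_format", "{:.2f}".format)
print(df_data.describe())

# One histogram per numerical feature, dropping NaNs for plotting only.
for col in df_data.columns:
    df_data[col].dropna().hist(bins=50)
    plt.title(col)
    plt.show()

# A log1p transform is one way to tame the right-skewed 'price' distribution.
np.log1p(df_data["price"].dropna()).hist(bins=50)
plt.title("log1p(price)")
plt.show()
```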

Finally, and most directly related to predicting ‘price’, we calculate the Pearson correlation matrix for all numerical features.
We then extract the correlation of each feature with our target variable ‘price’ and sort these values in descending order.
This shows us the linear relationship strength and direction between each feature and the price. From the price_correlation
output, we can observe:
1. Positive Correlations:
a. bedrooms (0.08), accommodates (0.04), bathrooms (0.03), and beds (0.03) show the highest positive (though
modest) correlations with price. This is intuitive: listings that can accommodate more people or have more
bedrooms/bathrooms tend to be more expensive.
b. availability_365 (0.03) and other availability metrics show slight positive correlations, which might suggest that
properties that are available more often (perhaps professionally managed or higher-end) command slightly
higher prices, or it could be more nuanced.
2. Near-Zero Correlations: Many features show correlations very close to zero (e.g., review_scores_rating,
description_length, host_experience_years). This suggests a weak linear relationship with price. However, non-linear
relationships might still exist, or these features could be important in combination with others.
3. Negative Correlations: Features like number_of_reviews_l30d (-0.02), number_of_reviews (-0.02), and
reviews_per_month (-0.02) have slight negative correlations. This might seem counterintuitive, but it could imply
that listings with extremely high review volumes are often more established, potentially more budget-friendly, or
that very exclusive, high-priced listings don’t accumulate reviews as rapidly.
4. has_availability: Shows NaN for correlation. This is because, as we saw in describe(), this feature has no variance (it’s
constant), so a correlation cannot be computed. Such features are not useful for predictive modeling.
These EDA findings, especially the skewness of ‘price’ and the correlations, will guide our feature selection, feature
engineering, and choice of model transformations. For instance, the modest linear correlations suggest that more complex
models or interaction terms might be needed to capture the relationship with price effectively.
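A short sketch of the price correlation computation discussed above, again assuming df_data:

```python
# Pearson correlations between every numerical feature and the target.
corr_matrix = df_data.corr(numeric_only=True)
price_correlation = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_correlation)

# Zero-variance columns (such as has_availability) produce NaN correlations
# and can simply be dropped before modelling.
constant_cols = df_data.columns[df_data.nunique(dropna=True) <= 1]
print(list(constant_cols))
```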
Now we move into the core modeling part of Task 3. We’ll first choose three distinct machine learning regression models,
explaining our rationale for each. Then, we’ll train these models on our prepared training data and evaluate their initial
predictive performance on both the training and validation sets.

For this competition, we’ve selected the following three regression models, each with different characteristics:

1. Random Forest Regressor:


a. Why we chose it: Random Forests are powerful ensemble learning methods. They work by constructing a
multitude of decision trees at training time and outputting the mean prediction of the individual trees.
b. Strengths: They are generally robust to outliers, can handle non-linear relationships well without explicit feature
transformation, and automatically perform a degree of feature selection by measuring feature importance. They
are less prone to overfitting than a single decision tree due to averaging.
c. Considerations for this problem: Given our EDA showed only modest linear correlations, a model like Random
Forest that can capture complex, non-linear interactions between features and price could be beneficial. We also
have a fair number of features after one-hot encoding, which Random Forests can handle well.
2. Elastic Net Regression:
d. Why we chose it: Elastic Net is a linear regression model that combines the penalties of both L1 (Lasso) and L2
(Ridge) regularization. The L1 penalty encourages sparsity (some feature coefficients can become zero),
performing feature selection, while the L2 penalty helps to handle multicollinearity and shrinks coefficients.
e. Strengths: It offers a good balance between Lasso and Ridge. It’s useful when there are many features, some of
which might be correlated. The l1_ratio parameter allows us to tune the balance between L1 and L2
regularization.
f. Considerations for this problem: With many features, including those derived from one-hot encoding, Elastic Net
can help in identifying the most impactful features and regularizing the model to prevent overfitting, especially if
some features are highly correlated (which is common after one-hot encoding categorical variables with many
levels). Our features were also scaled, which is beneficial for regularized linear models.
3. Stacking Regressor (with RidgeCV and LinearSVR as base, DecisionTreeRegressor as final):
g. Why we chose it: Stacking is an advanced ensemble technique where multiple different types of models (base
learners) are trained, and then a meta-model (final estimator) is trained on the outputs (predictions) of these
base learners. This can often lead to better performance than any single model because it combines the
strengths of diverse learners.
h. Strengths: Stacking can capture a more complex decision boundary or regression surface by leveraging different
modeling paradigms. It can be very powerful if the base models are diverse and their errors are somewhat
uncorrelated.
i. Our chosen configuration:
j. Base Estimators: We’ve chosen RidgeCV (Ridge Regression with built-in cross-validation for alpha) and LinearSVR
(Linear Support Vector Regressor). These are both linear models but operate on different principles (least squares
with L2 penalty vs. margin maximization).
k. Final Estimator: We’re using a DecisionTreeRegressor as the meta-model. This allows the final model to learn
non-linear combinations of the base model predictions.
l. Considerations for this problem: By stacking, we hope to combine the stabilizing effect of Ridge regression with
the different approach of SVR, and then allow a decision tree to learn how to best combine their predictions to
model the ‘price’.
First, we train our Random Forest Regressor.
a. Model Initialization: We initialize RandomForestRegressor with n_estimators=1000, meaning it will build 1000
decision trees. criterion=‘squared_error’ is the standard for regression. random_state=1 ensures reproducibility,
and n_jobs=-1 uses all available processor cores for faster training.
b. Training: We fit the model using our scaled training features (df_x_train) and the training target variable
(df_y_train). Note the .values.ravel() on df_y_train to ensure it’s passed as a 1D array, which some scikit-learn
estimators expect.
c. Prediction & Evaluation: We then make predictions on both the training and validation sets. We evaluate using R-
squared (R²), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
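A sketch of this training and evaluation loop, using the hyperparameters stated above and the metric functions from scikit-learn:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rf = RandomForestRegressor(
    n_estimators=1000, criterion="squared_error", random_state=1, n_jobs=-1
)
rf.fit(df_x_train, df_y_train.values.ravel())

# Evaluate on both splits to expose any train/validation gap.
for name, X, y in [("train", df_x_train, df_y_train), ("valid", df_x_valid, df_y_valid)]:
    pred = rf.predict(X)
    print(
        f"{name}: R2={r2_score(y, pred):.3f} "
        f"MSE={mean_squared_error(y, pred):.0f} MAE={mean_absolute_error(y, pred):.1f}"
    )
```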

R²: The R² on the training set is 0.844, which is quite high, indicating the model explains about 84.4% of the variance in the
training data. However, the validation R² is -0.495. A negative R² means the model is performing worse than a horizontal line
just predicting the mean of the target variable. This is a strong indicator of severe overfitting.
MSE & MAE: The MSE and MAE are significantly lower on the training set (MSE: ~920k, MAE: ~71) compared to the
validation set (MSE: ~5.2M, MAE: ~240). This large gap further confirms overfitting. The model has learned the training data
very well, including its noise, but fails to generalize to unseen validation data.
Hyperparameters: The n_estimators=1000 is a chosen value. Random Forests also have other hyperparameters like
max_depth, min_samples_split, min_samples_leaf, etc., which are currently at their defaults. The significant overfitting
suggests that these hyperparameters are not yet optimized and the model is too complex. We would need to use techniques
like GridSearchCV or RandomizedSearchCV for proper hyperparameter tuning, likely by restricting tree depth or increasing
minimum samples per leaf to regularize the model.
Fitted Weights (Feature Importances): While not explicitly printed here, a fitted Random Forest model has a
feature_importances_ attribute. This would show which features the ensemble of trees found most useful for making splits,
giving insight into what drives the predictions. Highly important features would have larger values.
Next, we train the Elastic Net model.
a. Model Initialization: We initialize ElasticNet with alpha=0.001 (the overall strength of regularization) and
l1_ratio=0.5 (an equal mix of L1 and L2 penalties). These are initial hyperparameter choices that would typically
be tuned.
b. Training & Evaluation: Similar to Random Forest, we fit on df_x_train and df_y_train, then predict and evaluate.
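An equivalent sketch for the Elastic Net model, with the alpha and l1_ratio values quoted above:

```python
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

enet = ElasticNet(alpha=0.001, l1_ratio=0.5)
enet.fit(df_x_train, df_y_train.values.ravel())

for name, X, y in [("train", df_x_train, df_y_train), ("valid", df_x_valid, df_y_valid)]:
    pred = enet.predict(X)
    print(
        f"{name}: R2={r2_score(y, pred):.3f} "
        f"MSE={mean_squared_error(y, pred):.0f} MAE={mean_absolute_error(y, pred):.1f}"
    )

# The L1 component may zero out some coefficients entirely.
print((enet.coef_ == 0).sum(), "features dropped by the L1 penalty")
```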

R²: The R² scores are very low for both training (0.009) and validation (-0.012). This indicates the linear Elastic Net model,
with these hyperparameters, explains very little variance in the price, performing poorly on both sets. The validation R²
being negative is again a bad sign.
MSE & MAE: The training MSE (~5.8M) is actually higher than the validation MSE (~3.5M), and the training MAE (~224) is
somewhat close to the validation MAE (~257). This unusual pattern (better validation MSE than training) might suggest
issues with the chosen alpha and l1_ratio, or that the model is heavily underfitting due to its linearity struggling with the
complexity of the data. The fact that validation MAE isn’t drastically worse than training MAE suggests less overfitting
compared to Random Forest, but the overall performance is poor.
Optimized Hyperparameters: The alpha=0.001 and l1_ratio=0.5 are not optimized here; they are pre-set. Proper tuning via
cross-validation (e.g., using ElasticNetCV or GridSearchCV) would be essential to find the best alpha and l1_ratio.
Fitted Weights (Coefficients): An Elastic Net model has a .coef_ attribute. This array would show the learned coefficient for
each feature. Due to the L1 component, some of these coefficients might be exactly zero, indicating those features were
effectively excluded by the model. Non-zero coefficients show the strength and direction of the linear relationship the model
found for each feature after regularization.
Finally, our Stacking Regressor.
a. Model Initialization: We define estimators as a list of tuples, containing our base models: RidgeCV (which tunes its
own alpha) and LinearSVR. The final_estimator is a DecisionTreeRegressor.
b. Training & Evaluation: The Stacking Regressor is fitted. Internally, it trains the base models, gets their out-of-fold
predictions (by default using 5-fold cross-validation for this), and then trains the final estimator on these predictions.
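A sketch of this stacking configuration (random_state values are illustrative):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

estimators = [
    ("ridge", RidgeCV()),               # tunes its own alpha internally
    ("svr", LinearSVR(random_state=1)),
]

# With cv left at its default, 5-fold cross-validation produces the out-of-fold
# base predictions that the final DecisionTreeRegressor is trained on.
stacking_reg = StackingRegressor(
    estimators=estimators,
    final_estimator=DecisionTreeRegressor(random_state=1),
)
stacking_reg.fit(df_x_train, df_y_train.values.ravel())

print("train R2:", stacking_reg.score(df_x_train, df_y_train))
print("valid R2:", stacking_reg.score(df_x_valid, df_y_valid))

# Inspecting fitted components, as discussed under 'Fitted Weights' below.
print(stacking_reg.named_estimators_["ridge"].coef_[:5])
print(stacking_reg.final_estimator_.feature_importances_)
```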

R²: The R² values are extremely poor and deeply negative for both training (-1.286) and validation (-2.324). This is the worst
performance so far, indicating this specific stacking configuration is failing badly.
MSE & MAE: The MSE and MAE values are very high, further confirming the poor performance. Similar to Elastic Net, the
training MSE is higher than validation MSE, which is highly unusual and points to significant problems, possibly with how the
base learners’ predictions are being combined or the suitability of the Decision Tree as a meta-learner here without further
tuning.
Optimized Hyperparameters:
For RidgeCV, the alpha is internally optimized.
For LinearSVR, parameters like C (regularization strength) and epsilon are at their defaults.
For the DecisionTreeRegressor (final estimator), parameters like max_depth are at defaults.
This entire Stacking setup would require extensive hyperparameter tuning for its components (base learners and final
estimator) using a nested cross-validation approach or a specialized grid search for stacking.
Fitted Weights: Accessing ‘weights’ in a Stacking Regressor is more complex. You can inspect the
final_estimator_.feature_importances_ if the final estimator supports it (like a Decision Tree or Random Forest), which
would tell you how important each base model’s predictions were to the final meta-model. You could also inspect the coef_
of the individual fitted base linear models (e.g., stacking_reg.named_estimators_[‘ridge’].coef_).
Based on these initial results, all three models are currently performing poorly on the validation set. The Random Forest
shows significant overfitting. The Elastic Net and Stacking Regressor are performing very badly overall, with negative R-
squared values on validation.
This clearly indicates that the default or arbitrarily chosen hyperparameters are not suitable. The next crucial step, which is
usually done via cross-validation techniques like GridSearchCV or RandomizedSearchCV, would be to systematically search
for better hyperparameter combinations for each of these models. This process involves:
a. Defining a grid of hyperparameter values to try for each model.
b. For each combination, training the model on folds of the training data and evaluating on the remaining fold.
c. Averaging the performance metric (e.g., negative MSE) across folds.
d. Selecting the hyperparameter combination that yields the best average cross-validated performance.
e. Retraining the model with these optimal hyperparameters on the entire training set.
Only after such tuning can we get a more reliable estimate of each model’s true predictive capability on unseen data. The
‘fitted weights’ or feature importances from these tuned models would then be more meaningful.
