Huy

The document outlines a comprehensive data preparation process for machine learning, focusing on cleaning numerical features, engineering new features, handling missing values, encoding categorical variables, and performing feature scaling. It details specific steps taken for each task, including the use of pandas and scikit-learn for data manipulation and imputation, as well as exploratory data analysis (EDA) to understand feature distributions and correlations with the target variable 'price'. Finally, it discusses the selection of three regression models for predicting price, highlighting their strengths and suitability for the dataset.

Task 2, Question 1 focuses on cleaning numerical features currently stored as text/object types, often containing characters like '%' or '$', to enable their use in machine learning.
We begin by importing pandas, numpy, and re for data manipulation and text parsing, then load train.csv. Initial inspection
(df.head(), df.dtypes, df.isnull().sum()) confirms columns like host_response_rate, bathrooms, and price are objects requiring
cleaning.

Cleaning Steps:
1. Percentage Columns (host_response_rate, host_acceptance_rate): These are converted to string, '%' is stripped, then
pd.to_numeric(errors='coerce') converts them to numbers (NaN for errors), and finally, they're divided by 100.0 for
decimal representation.
2. bathrooms Column: A custom function clean_bathrooms is applied. It handles existing NaNs, converts text to
lowercase, specifically maps 'half-bath' to 0.5, and uses regex r'(\d+\.?\d*)' to extract the first numerical value
(integer or decimal), converting it to float. If no number is found, it returns NaN.
3. price Column: If price is an object type, '$' and ',' are removed using df.replace(). Then,
pd.to_numeric(errors='coerce') converts it. If already numeric, it's ensured to be float via astype(float).
4. Boolean Columns (e.g., host_is_superhost): Columns in bool_tf_cols with 't'/'f' or True/False values are mapped to
1/0 respectively and converted to float to handle potential NaNs consistently.
5. Other Numeric Columns (e.g., accommodates, review_scores_rating): A predefined list of columns
(numeric_cols_to_convert) is processed. Each existing column is converted using pd.to_numeric(errors='coerce') to
ensure a numeric type, changing non-convertible entries to NaN.
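To make these steps concrete, here is a minimal sketch of the cleaning logic described above. It assumes the column names used in this write-up; the bool_tf_cols and numeric_cols_to_convert lists shown are illustrative placeholders, not the original lists.

```python
import re
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")

# 1. Percentage columns: strip '%' and convert to a 0-1 decimal.
for col in ["host_response_rate", "host_acceptance_rate"]:
    df[col] = pd.to_numeric(df[col].astype(str).str.rstrip("%"), errors="coerce") / 100.0

# 2. bathrooms: map 'half-bath' to 0.5, otherwise extract the first number.
def clean_bathrooms(value):
    if pd.isna(value):
        return np.nan
    text = str(value).lower()
    if "half-bath" in text:
        return 0.5
    match = re.search(r"(\d+\.?\d*)", text)
    return float(match.group(1)) if match else np.nan

df["bathrooms"] = df["bathrooms"].apply(clean_bathrooms)

# 3. price: remove '$' and ',' before numeric conversion.
if df["price"].dtype == object:
    df["price"] = pd.to_numeric(
        df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False),
        errors="coerce",
    )
else:
    df["price"] = df["price"].astype(float)

# 4. Boolean 't'/'f' columns mapped to 1/0, kept as float so NaNs survive.
bool_tf_cols = ["host_is_superhost", "instant_bookable"]  # illustrative list
for col in bool_tf_cols:
    if col in df.columns:
        df[col] = df[col].map({"t": 1, "f": 0, True: 1, False: 0}).astype(float)

# 5. Remaining numeric columns: coerce non-convertible entries to NaN.
numeric_cols_to_convert = ["accommodates", "review_scores_rating"]  # illustrative list
for col in numeric_cols_to_convert:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
```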

Verification:
Post-cleaning, df.dtypes confirms columns like host_response_rate, bathrooms, and price are now float64. The
df.isnull().sum() output shows any new NaNs introduced by errors='coerce', which are expected and will be addressed
during imputation.
Task 2, Question 2 involves engineering at least four new features from existing data to enhance model performance.
First, date columns (host_since, first_review, last_review) are converted to datetime objects using
pd.to_datetime(errors='coerce'), handling any parsing errors by converting them to NaT (Not a Time). This enables date-
based calculations.
New Features Created:
1. host_experience_years: Calculated by subtracting the year of host_since from the current year
(pd.Timestamp.now().year). NaNs (from missing or unparseable host_since dates) are filled with 0, representing
unknown or minimal experience.
2. amenities_count: The amenities string (e.g., ‘Wifi,TV,Kitchen’) is split by commas, and the length of the resulting list
gives the amenity count. Missing or unparseable amenity strings result in a count of 0, ensured by astype(str) and
fillna(0).
3. days_since_last_review: Determined by subtracting last_review from the current timestamp (pd.Timestamp.now())
and extracting the difference in days (.dt.days). NaNs (for listings with no or unparseable last review dates) are filled
with -1 to distinctly mark them.
4. description_length: The character length of each listing's description string is calculated using .astype(str).str.len().
Missing descriptions result in a length of 0 via fillna(0).
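The following sketch shows how these four features could be built with the column names described above; it follows the intent stated here (missing values ending up as 0 or -1) rather than reproducing the original code verbatim.

```python
import pandas as pd

# Parse date columns; unparseable values become NaT.
for col in ["host_since", "first_review", "last_review"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

now = pd.Timestamp.now()

# 1. Years of hosting experience; unknown hosts get 0.
df["host_experience_years"] = (now.year - df["host_since"].dt.year).fillna(0)

# 2. Number of amenities, counted by splitting the comma-separated string;
#    missing amenities count as 0.
df["amenities_count"] = df["amenities"].fillna("").apply(
    lambda s: len([a for a in str(s).split(",") if a.strip()])
)

# 3. Days since the most recent review; listings with no review are flagged with -1.
df["days_since_last_review"] = (now - df["last_review"]).dt.days.fillna(-1)

# 4. Character length of the listing description; missing descriptions count as 0.
df["description_length"] = df["description"].fillna("").astype(str).str.len()
```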

Verification:
A sample (df[[...]].head()) and dtypes printout confirm the new features are numerical (int32/int64). isnull().sum() shows
zero missing values in these new features due to the fillna() operations during their creation, making them ready for
modeling.
Moving to Task 2, Question 3, our objective is to handle missing values in our dataset. Machine learning algorithms generally cannot work with NaNs, so we need to impute them, meaning we'll fill them in with estimated values. This step is crucial for both our training and any future test datasets.

First, we focus on the numerical features by selecting columns with df.select_dtypes(include=‘number’) and storing them in
df_num. We then create df_data from these numerical columns. Our target variable, or label, is price. Next, we split this
numerical data into training and validation sets using train_test_split. We’re using an 80-20 split. It’s critical to split the data
before imputation to prevent data leakage from the validation set into the training set. We separate the features (X) from
the target variable (y) for both sets.

Before imputation, we can see from the output that our training set has 620 missing values and the validation set has 168.
To handle these, we use SimpleImputer from scikit-learn. Key points here are:
1. We initialize the imputer with strategy=‘mean’, which means it will replace all np.nan (missing values) with the mean
of the respective column.
2. Crucially, we fit the imputer only on the training data (df_x_train). This learns the mean for each feature from the
training set.
3. Then, we use this fitted imputer to transform both the training set (df_x_train) and the validation set (df_x_valid).
This ensures that the validation set is processed using information learned solely from the training data, mimicking
how we’d handle new, unseen data.
4. The transform method returns NumPy arrays, so we convert them back to pandas DataFrames, preserving the
original column names and indices.
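A sketch of this split-then-impute workflow, using variable names that follow the write-up (the random_state for the split is chosen here purely for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Keep only the numerical columns; 'price' is the target.
df_num = df.select_dtypes(include="number")
df_x = df_num.drop(columns=["price"])
df_y = df_num["price"]

# Split before imputation so validation data cannot influence the learned means.
df_x_train, df_x_valid, df_y_train, df_y_valid = train_test_split(
    df_x, df_y, test_size=0.2, random_state=1
)

# Learn per-column means from the training set only, then apply to both sets.
imputer = SimpleImputer(strategy="mean")
df_x_train = pd.DataFrame(
    imputer.fit_transform(df_x_train), columns=df_x_train.columns, index=df_x_train.index
)
df_x_valid = pd.DataFrame(
    imputer.transform(df_x_valid), columns=df_x_valid.columns, index=df_x_valid.index
)

# Both should now report 0 missing values.
print(df_x_train.isnull().sum().sum(), df_x_valid.isnull().sum().sum())
```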

As the final output lines show, after imputation, both the training and validation feature sets have 0 missing values,
successfully addressing the task.
For Task 2, Question 4, we're encoding our categorical variables. This is necessary because most machine learning models require numerical input. We have specific rules for this encoding: handling multi-value entries and managing high-cardinality features by keeping the top 5 categories and grouping the rest into an 'other' category, before finally applying one-hot encoding.

We define a function encode_categorical that takes our DataFrame and a column name. Let’s walk through its key actions:
1. Multi-value Handling: It first checks if a string value contains a comma. If so, it’s re-labeled as ‘other’. This simplifies
features like host_verifications or amenities if they haven’t been pre-processed into counts already.
2. Cardinality Reduction: If a column has more than 5 unique values (after handling multi-values and excluding NaNs), it
identifies the top 5 most frequent values. All other values in that column are then mapped to ‘other’. This prevents
creating too many new features from highly diverse columns.
3. NaN Handling & Type Conversion: The column is converted to pandas’ category dtype. We explicitly add ‘other’ as a
category if it’s not already present and the unique count is low, ensuring consistency. Then, any remaining NaN
values are filled with ‘other’.
4. One-Hot Encoding: Finally, pd.get_dummies is used to perform one-hot encoding. This creates new binary
(True/False or 1/0) columns for each unique category (including ‘other’), and the original categorical column is
dropped. The dummy_na=False ensures we don’t create an extra column for NaNs, as we’ve already handled them.
We then identify all columns with an ‘object’ dtype, which are our categorical features. We loop through this list, applying
our encode_categorical function to each one. The print statement inside the loop shows us which column is currently being
processed.
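A sketch of an encode_categorical helper along the lines described above (comma check, top-5 cut-off, 'other' bucket, one-hot encoding); the original implementation also converts to the category dtype explicitly, which is omitted here for brevity.

```python
import pandas as pd

def encode_categorical(df, col, top_n=5):
    s = df[col].astype(object)

    # Multi-value entries (comma-separated strings) are collapsed into 'other'.
    s = s.apply(lambda v: "other" if isinstance(v, str) and "," in v else v)

    # High cardinality: keep the top 5 most frequent values, map the rest to 'other'.
    if s.nunique(dropna=True) > top_n:
        top_values = s.value_counts().head(top_n).index
        s = s.where(s.isin(top_values), "other")

    # Remaining NaNs also become 'other'.
    s = s.fillna("other")

    # One-hot encode and drop the original column.
    dummies = pd.get_dummies(s, prefix=col, dummy_na=False)
    return pd.concat([df.drop(columns=[col]), dummies], axis=1)

# Apply to every remaining object-typed column.
for col in df.select_dtypes(include="object").columns:
    print(f"Encoding {col}")
    df = encode_categorical(df, col)
```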

The output snippets ‘Data types after categorical encoding’ and ‘DataFrame head...’ demonstrate the result: the original
object columns are gone, replaced by new boolean columns (e.g., source_city scrape, property_type_Entire home,
name_other). This transformation makes our categorical data suitable for model training.
For our final data preparation step in Task 2, Question 5, we're going to perform feature scaling. This is a common and often crucial step before training many types of predictive models. The goal is to standardize the range of our independent numerical variables.

Why do we do this? Many machine learning algorithms, particularly those that rely on distance calculations (like K-Nearest
Neighbors or Support Vector Machines) or gradient descent (like linear regression, logistic regression, and neural networks),
can perform poorly or converge slowly if features are on vastly different scales. For example, a feature ranging from 0 to
100,000 might dominate another feature ranging from 0 to 1, unfairly influencing the model. StandardScaler addresses this
by transforming each feature to have a mean of 0 and a standard deviation of 1.

Looking at the code:


1. We initialize StandardScaler from scikit-learn.
2. For the training data (df_x_train), we use scaler.fit_transform(). The fit part calculates the mean and standard
deviation for each feature only from the training data. The transform part then applies the scaling using these
learned parameters.
3. For the validation data (df_x_valid), we only use scaler.transform(). This is critical: we apply the same scaling
parameters learned from the training set to the validation set. This prevents data leakage and ensures our validation
process accurately reflects how the model would perform on unseen data.
4. The results are NumPy arrays, so we convert them back into pandas DataFrames, making sure to keep the original
column names and indices.
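A sketch of this scaling step, assuming df_x_train and df_x_valid from the imputation step:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on training data only, then reuse the learned mean/std for validation.
df_x_train = pd.DataFrame(
    scaler.fit_transform(df_x_train), columns=df_x_train.columns, index=df_x_train.index
)
df_x_valid = pd.DataFrame(
    scaler.transform(df_x_valid), columns=df_x_valid.columns, index=df_x_valid.index
)

# Each training column should now have mean ~0 and standard deviation ~1.
print(df_x_train.describe().round(2))
```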

As you can see in the ‘Training data after scaling’ and ‘Validation data after scaling’ outputs, the values in our feature
columns are now centered around zero and have been scaled. For instance, the host_listings_count and accommodates
columns now have values that are much closer in magnitude, ensuring no single feature will unduly dominate the learning
process due to its original scale.
Welcome to Task 3. Before we dive into building and tuning our predictive models for price, it's essential to perform some Exploratory Data Analysis (EDA). The goal here is to understand our data better, particularly the distributions of our features and, crucially, how these features relate to our target variable, 'price'. This understanding will help us make informed decisions during modeling.

We start by generating descriptive statistics for all our numerical features using df_data.describe(). The pd.set_option line
simply formats our float output for better readability.

Looking at the output table:

1. Count: Most features have 7000 entries, matching the dataset size. However, columns like host_acceptance_rate, bathrooms, and bedrooms have slightly fewer, meaning the full df_data used for this EDA still contains some NaNs; describe() simply excludes them from calculations such as the mean and standard deviation. Our earlier imputation was applied only to df_x_train and df_x_valid, not to this global df_data.
2. Price: For our target variable ‘price’, the mean is around 285, but the standard deviation is very high (2325), and the
maximum value is 145,160. The 75th percentile is only 268. This large difference between the mean/median and the
max, along with a high standard deviation, strongly suggests that the ‘price’ distribution is heavily right-skewed, with
some very expensive listings. This is a common characteristic of price data.
3. Other Features: We can also observe ranges and central tendencies for other features. For instance, accommodates
ranges from 1 to 16. Features like minimum_maximum_nights and maximum_nights_avg_ntm show extremely large
maximum values, which might indicate outliers or specific data encoding that we handled during cleaning but are still
visible in raw stats. The feature has_availability has a standard deviation of 0.00, and its min, mean, and max are all
1.00, indicating it’s a constant value across the dataset and won’t be useful for prediction.

Next, we visualize the distribution of each numerical feature using histograms. We loop through each column in df_data,
drop any NaN values for the purpose of plotting that specific histogram, and then plot its distribution with 50 bins. While we
won’t go through every single histogram, the general findings from this step would typically reinforce what describe()
suggested:
1. Price: The histogram for ‘price’ would visually confirm its strong right skew.
2. Count-based features (like bedrooms, number_of_reviews): These often also show right skew, with many listings
having fewer counts and a long tail for higher counts.
3. Review scores: These might be left-skewed, with many listings having high scores.

Understanding these distributions is important. For instance, highly skewed features (especially the target variable) might
benefit from transformations like a log transform before modeling, particularly for linear models.
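A sketch of the describe() call and histogram loop described above, plus the kind of log1p transform a right-skewed target might benefit from; it assumes df_data holds the numerical features, as in the write-up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Descriptive statistics with more readable float formatting.
pd.set_option("display.float_format", "{:.2f}".format)
print(df_data.describe())

# One histogram per numerical feature, dropping NaNs for plotting only.
for col in df_data.columns:
    df_data[col].dropna().hist(bins=50)
    plt.title(col)
    plt.show()

# A log1p transform is one way to tame the right-skewed 'price' distribution.
np.log1p(df_data["price"].dropna()).hist(bins=50)
plt.title("log1p(price)")
plt.show()
```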

Finally, and most directly related to predicting ‘price’, we calculate the Pearson correlation matrix for all numerical features.
We then extract the correlation of each feature with our target variable ‘price’ and sort these values in descending order.
This shows us the linear relationship strength and direction between each feature and the price. From the price_correlation
output, we can observe:
1. Positive Correlations:
a. bedrooms (0.08), accommodates (0.04), bathrooms (0.03), and beds (0.03) show the highest positive (though
modest) correlations with price. This is intuitive: listings that can accommodate more people or have more
bedrooms/bathrooms tend to be more expensive.
b. availability_365 (0.03) and other availability metrics show slight positive correlations, which might suggest that
properties that are available more often (perhaps professionally managed or higher-end) command slightly
higher prices, or it could be more nuanced.
2. Near-Zero Correlations: Many features show correlations very close to zero (e.g., review_scores_rating,
description_length, host_experience_years). This suggests a weak linear relationship with price. However, non-linear
relationships might still exist, or these features could be important in combination with others.
3. Negative Correlations: Features like number_of_reviews_l30d (-0.02), number_of_reviews (-0.02), and
reviews_per_month (-0.02) have slight negative correlations. This might seem counterintuitive, but it could imply
that listings with extremely high review volumes are often more established, potentially more budget-friendly, or
that very exclusive, high-priced listings don’t accumulate reviews as rapidly.
4. has_availability: Shows NaN for correlation. This is because, as we saw in describe(), this feature has no variance (it’s
constant), so a correlation cannot be computed. Such features are not useful for predictive modeling.
These EDA findings, especially the skewness of ‘price’ and the correlations, will guide our feature selection, feature
engineering, and choice of model transformations. For instance, the modest linear correlations suggest that more complex
models or interaction terms might be needed to capture the relationship with price effectively.
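A short sketch of the price correlation computation discussed above, again assuming df_data:

```python
# Pearson correlations between every numerical feature and the target.
corr_matrix = df_data.corr(numeric_only=True)
price_correlation = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_correlation)

# Zero-variance columns (such as has_availability) produce NaN correlations
# and can simply be dropped before modelling.
constant_cols = df_data.columns[df_data.nunique(dropna=True) <= 1]
print(list(constant_cols))
```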
Now we move into the core modeling part of Task 3. We’ll first choose three distinct machine learning regression models,
explaining our rationale for each. Then, we’ll train these models on our prepared training data and evaluate their initial
predictive performance on both the training and validation sets.

For this competition, we’ve selected the following three regression models, each with different characteristics:

1. Random Forest Regressor:


a. Why we chose it: Random Forests are powerful ensemble learning methods. They work by constructing a
multitude of decision trees at training time and outputting the mean prediction of the individual trees.
b. Strengths: They are generally robust to outliers, can handle non-linear relationships well without explicit feature
transformation, and automatically perform a degree of feature selection by measuring feature importance. They
are less prone to overfitting than a single decision tree due to averaging.
c. Considerations for this problem: Given our EDA showed only modest linear correlations, a model like Random
Forest that can capture complex, non-linear interactions between features and price could be beneficial. We also
have a fair number of features after one-hot encoding, which Random Forests can handle well.
2. Elastic Net Regression:
d. Why we chose it: Elastic Net is a linear regression model that combines the penalties of both L1 (Lasso) and L2
(Ridge) regularization. The L1 penalty encourages sparsity (some feature coefficients can become zero),
performing feature selection, while the L2 penalty helps to handle multicollinearity and shrinks coefficients.
e. Strengths: It offers a good balance between Lasso and Ridge. It’s useful when there are many features, some of
which might be correlated. The l1_ratio parameter allows us to tune the balance between L1 and L2
regularization.
f. Considerations for this problem: With many features, including those derived from one-hot encoding, Elastic Net
can help in identifying the most impactful features and regularizing the model to prevent overfitting, especially if
some features are highly correlated (which is common after one-hot encoding categorical variables with many
levels). Our features were also scaled, which is beneficial for regularized linear models.
3. Stacking Regressor (with RidgeCV and LinearSVR as base, DecisionTreeRegressor as final):
g. Why we chose it: Stacking is an advanced ensemble technique where multiple different types of models (base
learners) are trained, and then a meta-model (final estimator) is trained on the outputs (predictions) of these
base learners. This can often lead to better performance than any single model because it combines the
strengths of diverse learners.
h. Strengths: Stacking can capture a more complex decision boundary or regression surface by leveraging different
modeling paradigms. It can be very powerful if the base models are diverse and their errors are somewhat
uncorrelated.
i. Our chosen configuration:
j. Base Estimators: We’ve chosen RidgeCV (Ridge Regression with built-in cross-validation for alpha) and LinearSVR
(Linear Support Vector Regressor). These are both linear models but operate on different principles (least squares
with L2 penalty vs. margin maximization).
k. Final Estimator: We’re using a DecisionTreeRegressor as the meta-model. This allows the final model to learn
non-linear combinations of the base model predictions.
l. Considerations for this problem: By stacking, we hope to combine the stabilizing effect of Ridge regression with
the different approach of SVR, and then allow a decision tree to learn how to best combine their predictions to
model the ‘price’.
First, we train our Random Forest Regressor.
a. Model Initialization: We initialize RandomForestRegressor with n_estimators=1000, meaning it will build 1000
decision trees. criterion=‘squared_error’ is the standard for regression. random_state=1 ensures reproducibility,
and n_jobs=-1 uses all available processor cores for faster training.
b. Training: We fit the model using our scaled training features (df_x_train) and the training target variable
(df_y_train). Note the .values.ravel() on df_y_train to ensure it’s passed as a 1D array, which some scikit-learn
estimators expect.
c. Prediction & Evaluation: We then make predictions on both the training and validation sets. We evaluate using R-
squared (R²), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
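A sketch of this training and evaluation loop, using the hyperparameters stated above and the metric functions from scikit-learn:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rf = RandomForestRegressor(
    n_estimators=1000, criterion="squared_error", random_state=1, n_jobs=-1
)
rf.fit(df_x_train, df_y_train.values.ravel())

# Evaluate on both splits to expose any train/validation gap.
for name, X, y in [("train", df_x_train, df_y_train), ("valid", df_x_valid, df_y_valid)]:
    pred = rf.predict(X)
    print(
        f"{name}: R2={r2_score(y, pred):.3f} "
        f"MSE={mean_squared_error(y, pred):.0f} MAE={mean_absolute_error(y, pred):.1f}"
    )
```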

R²: The R² on the training set is 0.844, which is quite high, indicating the model explains about 84.4% of the variance in the
training data. However, the validation R² is -0.495. A negative R² means the model is performing worse than a horizontal line
just predicting the mean of the target variable. This is a strong indicator of severe overfitting.
MSE & MAE: The MSE and MAE are significantly lower on the training set (MSE: ~920k, MAE: ~71) compared to the
validation set (MSE: ~5.2M, MAE: ~240). This large gap further confirms overfitting. The model has learned the training data
very well, including its noise, but fails to generalize to unseen validation data.
Hyperparameters: The n_estimators=1000 is a chosen value. Random Forests also have other hyperparameters like
max_depth, min_samples_split, min_samples_leaf, etc., which are currently at their defaults. The significant overfitting
suggests that these hyperparameters are not yet optimized and the model is too complex. We would need to use techniques
like GridSearchCV or RandomizedSearchCV for proper hyperparameter tuning, likely by restricting tree depth or increasing
minimum samples per leaf to regularize the model.
Fitted Weights (Feature Importances): While not explicitly printed here, a fitted Random Forest model has a
feature_importances_ attribute. This would show which features the ensemble of trees found most useful for making splits,
giving insight into what drives the predictions. Highly important features would have larger values.
Next, we train the Elastic Net model.
a. Model Initialization: We initialize ElasticNet with alpha=0.001 (the overall strength of regularization) and
l1_ratio=0.5 (an equal mix of L1 and L2 penalties). These are initial hyperparameter choices that would typically
be tuned.
b. Training & Evaluation: Similar to Random Forest, we fit on df_x_train and df_y_train, then predict and evaluate.
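An equivalent sketch for the Elastic Net model, with the alpha and l1_ratio values quoted above:

```python
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

enet = ElasticNet(alpha=0.001, l1_ratio=0.5)
enet.fit(df_x_train, df_y_train.values.ravel())

for name, X, y in [("train", df_x_train, df_y_train), ("valid", df_x_valid, df_y_valid)]:
    pred = enet.predict(X)
    print(
        f"{name}: R2={r2_score(y, pred):.3f} "
        f"MSE={mean_squared_error(y, pred):.0f} MAE={mean_absolute_error(y, pred):.1f}"
    )

# The L1 component may zero out some coefficients entirely.
print((enet.coef_ == 0).sum(), "features dropped by the L1 penalty")
```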

R²: The R² scores are very low for both training (0.009) and validation (-0.012). This indicates the linear Elastic Net model,
with these hyperparameters, explains very little variance in the price, performing poorly on both sets. The validation R²
being negative is again a bad sign.
MSE & MAE: The training MSE (~5.8M) is actually higher than the validation MSE (~3.5M), and the training MAE (~224) is
somewhat close to the validation MAE (~257). This unusual pattern (better validation MSE than training) might suggest
issues with the chosen alpha and l1_ratio, or that the model is heavily underfitting due to its linearity struggling with the
complexity of the data. The fact that validation MAE isn’t drastically worse than training MAE suggests less overfitting
compared to Random Forest, but the overall performance is poor.
Optimized Hyperparameters: The alpha=0.001 and l1_ratio=0.5 are not optimized here; they are pre-set. Proper tuning via
cross-validation (e.g., using ElasticNetCV or GridSearchCV) would be essential to find the best alpha and l1_ratio.
Fitted Weights (Coefficients): An Elastic Net model has a .coef_ attribute. This array would show the learned coefficient for
each feature. Due to the L1 component, some of these coefficients might be exactly zero, indicating those features were
effectively excluded by the model. Non-zero coefficients show the strength and direction of the linear relationship the model
found for each feature after regularization.
Finally, our Stacking Regressor.
a. Model Initialization: We define estimators as a list of tuples, containing our base models: RidgeCV (which tunes its
own alpha) and LinearSVR. The final_estimator is a DecisionTreeRegressor.
b. Training & Evaluation: The Stacking Regressor is fitted. Internally, it trains the base models, gets their out-of-fold
predictions (by default using 5-fold cross-validation for this), and then trains the final estimator on these predictions.
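A sketch of this stacking configuration (random_state values are illustrative):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor

estimators = [
    ("ridge", RidgeCV()),               # tunes its own alpha internally
    ("svr", LinearSVR(random_state=1)),
]

# With cv left at its default, 5-fold cross-validation produces the out-of-fold
# base predictions that the final DecisionTreeRegressor is trained on.
stacking_reg = StackingRegressor(
    estimators=estimators,
    final_estimator=DecisionTreeRegressor(random_state=1),
)
stacking_reg.fit(df_x_train, df_y_train.values.ravel())

print("train R2:", stacking_reg.score(df_x_train, df_y_train))
print("valid R2:", stacking_reg.score(df_x_valid, df_y_valid))

# Inspecting fitted components, as discussed under 'Fitted Weights' below.
print(stacking_reg.named_estimators_["ridge"].coef_[:5])
print(stacking_reg.final_estimator_.feature_importances_)
```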

R²: The R² values are extremely poor and deeply negative for both training (-1.286) and validation (-2.324). This is the worst
performance so far, indicating this specific stacking configuration is failing badly.
MSE & MAE: The MSE and MAE values are very high, further confirming the poor performance. Similar to Elastic Net, the
training MSE is higher than validation MSE, which is highly unusual and points to significant problems, possibly with how the
base learners’ predictions are being combined or the suitability of the Decision Tree as a meta-learner here without further
tuning.
Optimized Hyperparameters:
For RidgeCV, the alpha is internally optimized.
For LinearSVR, parameters like C (regularization strength) and epsilon are at their defaults.
For the DecisionTreeRegressor (final estimator), parameters like max_depth are at defaults.
This entire Stacking setup would require extensive hyperparameter tuning for its components (base learners and final
estimator) using a nested cross-validation approach or a specialized grid search for stacking.
Fitted Weights: Accessing ‘weights’ in a Stacking Regressor is more complex. You can inspect the
final_estimator_.feature_importances_ if the final estimator supports it (like a Decision Tree or Random Forest), which
would tell you how important each base model’s predictions were to the final meta-model. You could also inspect the coef_
of the individual fitted base linear models (e.g., stacking_reg.named_estimators_[‘ridge’].coef_).
Based on these initial results, all three models are currently performing poorly on the validation set. The Random Forest
shows significant overfitting. The Elastic Net and Stacking Regressor are performing very badly overall, with negative R-
squared values on validation.
This clearly indicates that the default or arbitrarily chosen hyperparameters are not suitable. The next crucial step, which is
usually done via cross-validation techniques like GridSearchCV or RandomizedSearchCV, would be to systematically search
for better hyperparameter combinations for each of these models. This process involves:
a. Defining a grid of hyperparameter values to try for each model.
b. For each combination, training the model on folds of the training data and evaluating on the remaining fold.
c. Averaging the performance metric (e.g., negative MSE) across folds.
d. Selecting the hyperparameter combination that yields the best average cross-validated performance.
e. Retraining the model with these optimal hyperparameters on the entire training set.
Only after such tuning can we get a more reliable estimate of each model’s true predictive capability on unseen data. The
‘fitted weights’ or feature importances from these tuned models would then be more meaningful.
