DATATHON
BAI
Group No. 10
Kundan Thakan – 1811060
Konark Patel – 1811071
Anvi Gupta – 1811096
Sourabh Choraria – 1811317
Jimmitesh Singla – 1811362
Improving Lead Generation at Eureka Forbes Using Machine Learning Algorithms
Executive summary
Eureka Forbes, founded in 1982, is a current market leader with product lines such as water
purification, vacuum cleaning, air purification and security solutions. However, it faces
tough competition from new and local players in the industry. The company currently
pursues every lead physically and therefore incurs a high cost. It is now looking for a
cost-effective distribution system.
Its website is a comprehensive one-stop source of information about the different
products and services the company offers. Though millions of potential customers visit the
website regularly, they do not translate into prospects. Shashank Sinha, the Chief
Transformation Officer, is of the opinion that the team should use the data his team has
collected on visitors to predict the probability of conversion and send a representative to
the door of only those who have a high probability of making a purchase. A similar analysis
can be extended to further applications such as one-to-one marketing and short-listing leads
according to the sales budget.
Q.1 Perform descriptive analytics on the training data, write the insights based on the
descriptive analytics (3 points)
Averages of different variables were computed for both converted and non-converted consumers to
understand which variables could influence the chances of a customer's conversion. All the numbers
mentioned below are averages.
   1. Time spent on air purifier & water purifier page
   Converted consumers spent significantly more time (26.75) on the air purifier page than
   non-converted consumers (2.886). Converted consumers also spent more time (183) on the
   water purifier page than non-converted consumers (83). In short, converted customers spent
   more time on both the air and water purifier product pages than non-converted customers.
   Converted consumers also spend more time on average on the water purifier page than on the
   air purifier page. Given that Eureka Forbes is more widely known for water purifiers, this
   insight seems logical.
   2. Time spent on checkout page
   Consumers who purchased the product spent less time on the checkout page than those
   who did not convert. Converted consumers spent 9 units, compared with 3.7 for non-converted
   users. (The reason is explained in Q2-c.) This is also logical, as consumers with a
   higher intent to buy did so immediately, whereas consumers who were unsure about buying
   spent more time on the page and eventually dropped the idea of
   purchasing the product.
   3. Time spent on customer service amc login
   Consumers who purchased the product spent less time (1.9) on the customer service amc login
   page than those who did not convert (11.2). This may suggest that consumers who have
   already bought a product know their login details and log in directly, while those who have
   not purchased any product might just want to check the features of the annual maintenance
   contract (amc). Another reason could be that consumers who bought the product had
   evaluated the product features in advance and hence did not spend time on the login page,
   whereas the non-converted customers were still evaluating the features of the products.
   4. Day since last session
   Consumers who purchased the product visited the webpage more often (6.6 days since last
   session) than those who did not convert (11 days since last session). Intuitively, this
   suggests that consumers who convert make their decision fast and often visit the webpage
   either to order or to get more information about the product. The converted consumers might
   also have been evaluating other products and hence felt the need to visit the page more
   frequently.
   5. Time spent on offer page
   Consumers who purchased the product spent more time (13.8) on the offer page than those who
   did not convert (5.6). Consumers who are interested in buying the product would want
   to know the offers available for it and would spend more time on the offer page to
   get the complete value of their offer. Those who are not interested would just glance at the
   page and not spend more time, as they do not intend to buy the product.
   6. Users' 30-day pageviews history
   Consumers who purchased the product have a higher 30-day pageviews history (42.5) than those
   who did not convert (27.5). This shows that consumers who converted visited more
   pages than those who did not, as they would be checking out multiple products to
   compare before purchasing. Those who did not convert might, for example, have just
   reached the webpage through a Google search for a generic product rather than for an
   Eureka Forbes specific product.
   7. Time spent on security solutions page
   Consumers who purchased the product spent less time on the security solutions page (avg 0)
   than those who did not convert (0.99). Eureka Forbes is famous for air and water purifiers,
   and most consumers who converted might want to buy only these products and so do not visit
   the security solutions product page, while those who did not convert might want to check
   out all the products available with the company (window shopping).
   8. Total duration (in seconds) of users' sessions
   Consumers who purchased the product have a higher average total session duration (511)
   than those who did not convert (372.4). This is intuitive, as those who converted
   would spend more time on different pages such as product, checkout and amc login.
   9. Time spent on store locator page
   Consumers who purchased the product spent less time on the store locator page (0) than those
   who did not convert (3). Consumers who converted purchased the product online directly
   and did not need to visit the store locator page. The more indecisive consumers wanted to
   visit a store to get more details and compare the products, and hence remained
   non-converted on the platform.
   10. Time spent on success book demo page
   Consumers who purchased the product spent considerably more time on the success-book-demo
   page (9.4) than those who did not convert (0.6). This shows that consumers who
   purchased the product wanted a demo of it and spent more time on
   booking the demo.
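The comparison of averages above can be sketched as a group-by-mean over the conversion flag. This is a minimal illustration, not the code used for the report: the `converted` key, the two column names and all values are made-up toy data.

```python
from collections import defaultdict
from statistics import mean

def mean_by_conversion(rows, target="converted"):
    """Return {target_value: {column: mean}} for every non-target column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row)
    summary = {}
    for label, members in groups.items():
        columns = [c for c in members[0] if c != target]
        summary[label] = {c: mean(r[c] for r in members) for c in columns}
    return summary

# Toy rows (hypothetical values, NOT taken from the Eureka Forbes data)
visitors = [
    {"converted": 1, "offer_page_top": 12.0, "dsls": 6},
    {"converted": 1, "offer_page_top": 16.0, "dsls": 8},
    {"converted": 0, "offer_page_top": 5.0, "dsls": 10},
    {"converted": 0, "offer_page_top": 7.0, "dsls": 12},
]
group_means = mean_by_conversion(visitors)
for label, means in sorted(group_means.items()):
    print(label, means)
```

Because most of these variables are heavily skewed (see the Appendix), comparing group medians alongside the means may give a more robust picture.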
Q.2 How will Kashif test the following claims? (3 points)
   a. Customers using mobile, desktop, and tablet are equally distributed
Using a chi-square test on the device-by-conversion table (0 = not converted, 1 = converted)
Observed value:
     Row Labels                desktop           mobile        tablet   Grand Total
          0                     1801              9152           82       11035
          1                       4                61                       65
    Grand Total                 1805              9213           82       11100
Corresponding distribution:
 Row Labels        desktop      mobile      tablet
 0                    100%         99%        100%
 1                      0%          1%          0%
 Grand Total          100%        100%        100%
Expected value:
 Row Labels        desktop      mobile      tablet   Grand Total
 0                 1794.43     9159.05    81.51982         11035
 1                10.56982       53.95     0.48018            65
 Grand Total          1805        9213          82         11100
Corresponding distribution:
 Row Labels        desktop      mobile      tablet
 0                     99%         99%         99%
 1                      1%          1%          1%
 Grand Total          100%        100%        100%
Chi-square statistic = 0.008 for
H0: Devices are equally distributed
H1: Devices are not equally distributed
Since the statistic is below the critical value, we cannot reject the null hypothesis; the data are
consistent with devices being equally distributed.
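As a rough cross-check, the Pearson chi-square computation can be sketched directly on the observed counts. This is a minimal illustration under two assumptions: the blank tablet cell in the observed table is read as 0, and the test is run on raw counts rather than percentages. The figure it produces need not match the statistic quoted above, which may have been obtained differently.

```python
import math

def chi_square_independence(observed):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Rows: not converted (0), converted (1); columns: desktop, mobile, tablet.
# Assumption: the empty tablet cell for converted visitors means 0.
observed = [[1801, 9152, 82], [4, 61, 0]]
statistic, dof, expected = chi_square_independence(observed)

# With exactly 2 degrees of freedom the chi-square tail probability is exp(-x/2).
p_value = math.exp(-statistic / 2) if dof == 2 else None
CRITICAL_5PCT_DOF2 = 5.991  # standard table value for alpha = 0.05, dof = 2
reject_h0 = statistic > CRITICAL_5PCT_DOF2
print(round(statistic, 3), dof, reject_h0)
```

On these counts the statistic stays below the 5% critical value, so the sketch also fails to reject H0. One expected count (tablet, converted) is below 5, so the chi-square approximation should be read with caution.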
   b. Repeat visitors are as likely to convert as new customers
Using the full data set downloaded from the link in the case
Observed value
Corresponding distribution
 Row Labels         New               Repeated
 0                              99.4%    99.7%
 1                               0.6%      0.3%
 Grand Total                    100%      100%
Expected value
 Row Labels          New                 Repeated
 0                              99.6%       99.6%
 1                               0.4%         0.4%
 Grand Total                    100%         100%
Chi-square statistic = 0.000 for
H0: Repeat visitors are as likely to convert as new customers
H1: Repeat visitors are not as likely to convert as new customers
Since the statistic is below the critical value, we cannot reject the null hypothesis; the data are
consistent with repeat visitors being as likely to convert as new customers.
   c. Customers who convert spend more time on the website.
From the logistic regression explained in Q3, session_duration has a positive coefficient
whereas session_duration_hist has a negative one. This means that the more decisive the customer
is, the more time they spend in the particular session in which they make the purchase
decision. However, if the previous (hist) time spent is large, it suggests that the customer is just
exploring and does not have a serious intent to buy the product.
The output of the logistic regression is listed under Coefficients at the end of this report.
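To make the signs concrete, the sessionDuration and sessionDuration_hist estimates from the first coefficient listing in this report can be turned into odds multipliers. This is a hedged illustration only: it assumes both variables are measured in seconds and that all other predictors (including the per-session ratios derived from them) are held fixed, which is an idealization.

```python
import math

# Estimates copied from the first logistic regression listing in this report
BETA_SESSION_DURATION = 0.00125844
BETA_SESSION_DURATION_HIST = -0.00021122

def odds_multiplier(beta, delta):
    """Factor by which the odds of conversion change when a predictor rises by delta."""
    return math.exp(beta * delta)

# Effect of one extra minute (60 seconds, an assumed unit) in each variable
current_minute = odds_multiplier(BETA_SESSION_DURATION, 60)
history_minute = odds_multiplier(BETA_SESSION_DURATION_HIST, 60)
print(round(current_minute, 4), round(history_minute, 4))
```

A multiplier above 1 means the odds of conversion rise with the variable; a multiplier below 1 means they fall.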
Q.3 Build machine learning models that can be used for improving the lead generation
at Eureka Forbes. State model accuracies and insights gained from each model. (10
points)
       Imputation of missing values:
To impute blank values in the dataset, we used the k-nearest neighbor (KNN) algorithm
with k = 5. Each missing value was replaced with the median of its nearest neighbors rather
than their mean, because a few very large values skew the mean, whereas the first quartile,
median and third quartile of most variables are zero. (Refer to the Appendix.)
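The imputation step can be sketched as follows. This is a minimal illustration, not the routine used for the report: it assumes Euclidean distance on a hand-picked set of fully observed feature columns, applies no scaling, and runs on made-up toy rows.

```python
import math
from statistics import median

def knn_median_impute(rows, target, features, k=5):
    """Fill None in row[target] with the median of the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        if row[target] is not None:
            imputed.append(dict(row))
            continue
        point = [row[f] for f in features]
        # Rank complete rows by Euclidean distance on the feature columns
        nearest = sorted(
            complete,
            key=lambda other: math.dist(point, [other[f] for f in features]),
        )[:k]
        filled = dict(row)
        filled[target] = median(r[target] for r in nearest)
        imputed.append(filled)
    return imputed

# Toy rows (hypothetical values); the last row has a missing paid_hist
rows = [
    {"sessions": 1, "pageviews": 2, "paid_hist": 0},
    {"sessions": 1, "pageviews": 3, "paid_hist": 0},
    {"sessions": 2, "pageviews": 3, "paid_hist": 1},
    {"sessions": 2, "pageviews": 4, "paid_hist": 0},
    {"sessions": 3, "pageviews": 4, "paid_hist": 2},
    {"sessions": 40, "pageviews": 90, "paid_hist": 27},
    {"sessions": 2, "pageviews": 3, "paid_hist": None},
]
filled_rows = knn_median_impute(rows, "paid_hist", ["sessions", "pageviews"], k=5)
print(filled_rows[-1]["paid_hist"])
```

In the toy data the five nearest neighbors have values 0, 0, 1, 0 and 2, so the median fill is 0 while a mean fill would be 0.6, which shows why the median is less sensitive to a few large values.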
       Data balancing
The data is highly imbalanced: 99% of the website visitors did not convert
and only 1% converted. We therefore used both under-sampling
and over-sampling to balance the data. Over-sampling alone would lead to heavy repetition,
because the same 65 records of converted visitors would have to be replicated many times.
Under-sampling alone would significantly reduce the size of the training data: assuming an
ideal ratio of 3 non-converts to each convert, we would have only 260 records in total (65
converts and 195 non-converts). Finally, we took an equal percentage of
converts and non-converts in the training data.
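One way to combine the two sampling directions is sketched below. This is an assumption-laden illustration rather than the procedure actually used: the class sizes come from the observed table in Q2, while the `per_class` target of 500, the seed and the `id` field are hypothetical.

```python
import random

def balance_classes(rows, target, per_class, seed=0):
    """Under-sample large classes and over-sample small ones to per_class rows each."""
    rng = random.Random(seed)
    balanced = []
    for label in sorted({r[target] for r in rows}):
        members = [r for r in rows if r[target] == label]
        if len(members) >= per_class:
            # Under-sampling: draw without replacement
            balanced.extend(rng.sample(members, per_class))
        else:
            # Over-sampling: keep every row, then draw the shortfall with replacement
            balanced.extend(members)
            balanced.extend(rng.choices(members, k=per_class - len(members)))
    rng.shuffle(balanced)
    return balanced

# Class sizes from the observed table: 11035 non-converts and 65 converts
visitors = [{"id": i, "converted": 0} for i in range(11035)]
visitors += [{"id": 11035 + i, "converted": 1} for i in range(65)]
train = balance_classes(visitors, "converted", per_class=500)
print(len(train), sum(r["converted"] for r in train))
```

A moderate `per_class` keeps the repetition of the 65 converted records limited while still retaining more non-converts than pure under-sampling would.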
      Feature engineering
We have listed the engineered features below. They are all significant in the final model
and have significantly reduced the null deviance.
   1. Session duration per session (sd.s) – This feature was derived by dividing the total
      duration of the user's sessions by the total number of sessions. If the potential
      customer is spending less time per session, he/she is more certain of the product he/she
      wants to purchase and therefore the conversion probability is higher.
   2. Session duration history per session (sd.s_hist) – This feature was derived by dividing
      the 30-day session duration history by the total number of sessions. If recently (i.e. in
      the last 30 days) the customer has been spending more time per session, he/she might be
      getting more serious about the purchase.
   3. Number of pages viewed per session (p.s) – This feature was derived by dividing the
      number of pageviews per user by the total number of sessions. The base variable given in
      the data, Pageviews, was insignificant in the model. However, when divided by the
      number of sessions, it became significant. If the customer is exploring the website
      more, he/she might be collecting information more holistically and is therefore more
      likely to convert.
   4. Number of pages viewed in the last 30 days per session (p.s_hist) – This feature was
      derived by dividing the number of pageviews in the last 30 days per user by the total
      number of sessions. The base variable given in the data, Pageviews_hist, was
      insignificant in the model. However, when divided by the number of sessions, it became
      significant.
   5. Access by mobile or non-mobile device type (device.enon_mobile) – We
      developed a CART tree model to determine the importance of variables. Access by
      device type was used for branching, so the data was divided into only 2 types, i.e.
      mobile and non-mobile (desktop and tablet).
   6. Access by referral source channel or any other source channel (sourceMedium.e) – We
      developed a CART tree model to determine the importance of variables. Access by
      referral source was used for branching, so the data was divided into only 2 types, i.e.
      referral and non-referral (google_cpc, google_organic, direct-none, facebook_social).
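The six engineered features above can be sketched as a single row-level transformation. This is a minimal illustration under stated assumptions: the raw column names `device` and `sourceMedium`, the category labels `mobile` and `referral`, and the output keys `device_non_mobile` and `source_referral` are hypothetical stand-ins, and all four ratios divide by the current `sessions` count as described in the list.

```python
def engineer_features(row):
    """Add per-session ratios and two binary flags to one visitor record."""
    sessions = row["sessions"]
    if sessions <= 0:
        raise ValueError("sessions must be positive to form per-session ratios")
    out = dict(row)
    out["sd.s"] = row["sessionDuration"] / sessions
    out["sd.s_hist"] = row["sessionDuration_hist"] / sessions
    out["p.s"] = row["pageviews"] / sessions
    out["p.s_hist"] = row["pageviews_hist"] / sessions
    # Binary flags suggested by the CART splits (key names are hypothetical)
    out["device_non_mobile"] = int(row["device"] != "mobile")
    out["source_referral"] = int(row["sourceMedium"] == "referral")
    return out

# Toy visitor record (made-up values)
visitor = {
    "sessions": 4,
    "sessionDuration": 600,
    "sessionDuration_hist": 2400,
    "pageviews": 10,
    "pageviews_hist": 30,
    "device": "tablet",
    "sourceMedium": "google_cpc",
}
features = engineer_features(visitor)
print(features["sd.s"], features["p.s"], features["device_non_mobile"])
```

Records with zero sessions would need separate handling before the ratios are formed; the sketch simply raises an error for them.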
      Model Creation:
  Logistic Regression Model:
AUC of the model on test data:
Confusion Matrix
Logistic regression model accuracy is 59%.
Decision Tree Model
Confusion Matrix of Model
Accuracy of Decision Tree model is 91.3%.
AUC of Model
Random Forest Model
Confusion Matrix of Model
Accuracy of Random Forest Model is 99.6%
AUC of Model
Adaptive Boosting Model
Error Matrix
Accuracy of Adaptive Boosting Model is 75.3%
AUC of Model
Business Insights from various models
  1. Logistic Regression Model:
      - All the service variables, such as visiting the request page, the amc page and
        contactus, are positively correlated with conversion, indicating serious customers
        who buy the product.
      - People who spend time looking at other products are not serious buyers (negative
        correlation).
      - The variable sourceMedium.ereferral has the highest beta, showing that referral plays
        an important role in conversion.
  2. Decision Tree Model:
      - The model helped us formulate two of the engineered variables (the device type and
        referral source flags).
      - Requesting more demo sessions leads to higher conversion.
      - The tree also shows that all the engineered features are important, as they appear
        at the top of the tree.
  3. Random Forest Model:
      - The RF model shows the session variables to be the most important.
      - Pages per session is also important, whereas page views are not.
Usage of Model:
      We will use an ensemble method, that is, obtain results from multiple models (Logistic
       Regression, Random Forest and Adaptive Boosting) and take a vote weighted by
       accuracy. This is because each model relies on different important variables.
We will not use the Decision Tree model because of its lower accuracy.
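The accuracy-weighted vote can be sketched as below. This is an illustrative assumption about how the vote might be taken, not a description of the actual implementation: the weights are the accuracies reported in this section, and the 0.5 decision threshold and the dictionary keys are hypothetical.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine 0/1 predictions from several models using per-model weights."""
    total = sum(weights[name] for name in predictions)
    score = sum(weights[name] * predictions[name] for name in predictions)
    return int(score / total >= threshold)

# Accuracies reported for the three models kept in the ensemble
accuracy = {"logistic": 0.59, "random_forest": 0.996, "adaboost": 0.753}

# Two models flag the visitor as a lead: weighted share is above the threshold
lead = weighted_vote({"logistic": 1, "random_forest": 0, "adaboost": 1}, accuracy)
# Only the weakest model flags the visitor: weighted share is below the threshold
no_lead = weighted_vote({"logistic": 1, "random_forest": 0, "adaboost": 0}, accuracy)
print(lead, no_lead)
```

Since accuracy on imbalanced data can be dominated by the majority class, weighting by AUC instead is a possible variation on the same idea.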
Q.4 Based on the different model results, what would be your final recommendation to Kashif?
(4 points)
We built 4 different machine learning models to improve lead generation at Eureka Forbes.
We compared the key metrics of the models; the results are given below.
The company has to spend money to chase every potential lead, so the error of
misclassifying a non-lead as a lead matters more than the overall error. It is therefore
important to look at the error of misclassifying the y = 1 case.
   Model                   AUC     Overall Error    Error of misclassifying as lead consumer
   Logistic Regression     0.84    41%              0%
   Decision Tree           0.7     8.7%             70%
   Random Forest           0.92    0.4%             40%
   Adaptive Boosting       0.94    24.7%            0%
From the above table, it can be seen that the adaptive boosting model is best suited for the
unbalanced data. Its error of misclassifying as lead consumer is 0% and its AUC of 0.94 is the
highest of all the models. Logistic regression also has a 0% error of misclassifying as lead
consumer, but its AUC is lower at 0.84.
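The selection logic in this paragraph can be written as a small ranking rule: lowest error of misclassifying as lead consumer first, with ties broken by the highest AUC. The figures are copied from the comparison table; the rule itself is a sketch of the reasoning, and the function name is hypothetical.

```python
# (AUC, overall error, error of misclassifying as lead consumer) per model
results = {
    "Logistic Regression": (0.84, 0.41, 0.00),
    "Decision Tree": (0.70, 0.087, 0.70),
    "Random Forest": (0.92, 0.004, 0.40),
    "Adaptive Boosting": (0.94, 0.247, 0.00),
}

def pick_model(results):
    """Prefer the lowest lead-misclassification error, then the highest AUC."""
    return min(results, key=lambda name: (results[name][2], -results[name][0]))

best = pick_model(results)
print(best)
```

Ranking on overall error alone would pick Random Forest instead, which is why the choice of criterion matters on data this imbalanced.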
Recommendations based on model:
   1. Session duration per session is the most important variable for deciding lead generation.
      It is important to make sessions interactive for consumers. The company can add
      information about its products on the webpage to engage and inform consumers.
   2. As seen from the insights, consumers who spend more time on multiple product pages do
      not get converted to leads. So, the company should not pass on a lead for consumers who
      are browsing through multiple product webpages. A lead should be generated once the
      consumer starts spending more time on one product page than a threshold limit.
   3. Lead conversion is higher when the consumer has been referred to the website. The company
      should increase its engagement with referral leads at an early stage.
   4. Since consumers coming from referrals have high conversion, the company should
      incentivize referrals and take steps to increase its net promoter score (NPS).
   5. Consumers who use mobile to access the website have a higher probability of getting
      converted to leads. So, the company should improve its mobile website to make it more
      accessible. Eureka Forbes can also develop a mobile app to engage more consumers.
   6. Consumers who converted spent more time on the offer page. Eureka Forbes can partner
      with financial institutions to provide offers on various payment options. In the era of
      e-commerce, consumers look for payment offers, and platforms like Amazon and Flipkart
      give many offers.
Appendix:
 Variable                                      1st quartile   Median   3rd quartile   Mean    Maximum
 Bounces_Hist                                  0              0        1              1.55    210
 Help_me_buy_evt_count                         0              0        0              0.244   19
 Pageviews_hist                                4              9        26             27.6    580
 Paid_Hist                                     0              1        2              1.74    27
 Phone_clicks_evt_count_hist                   0              0        0              0.075   12
 SessionDuration_hist                          89             477      1904           2011    88846
 Sessions_hist                                 1              2        6              6.316   233
 Visited_air_purifier_page_hist                0              0        0              0.128   12
 Visited_checkout_page_hist                    0              0        0              0.494   15
 Visited_contactus_hist                        0              0        0              0.07    15
 Visited_customer_service_amc_login_hist       0              0        0              0.181   21
 Visited_customer_service_request_login_hist   0              0        0              0.053   20
 Visited_demo_page_hist                        0              0        1              0.769   18
 Visited_offer_page_hist                       0              0        0              0.397   16
 Visited_security_solutions_page_hist          0              0        0              0.019   3
 Visited_storelocator_hist                     0              0        0              0.032   8
 Visited_vacuum_cleaner_page_hist              0              0        0              0.717   20
 Visited_water_purifier_page_hist              0              0        1              1.62    26
Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                      -3.25118416 0.17768337 -18.298 < 2e-16 ***
DemoReqPg_CallClicks_evt_count             -17.85495372 864.28367732 -0.021 0.983518
air_purifier_page_top                0.00415941 0.00138159 3.011 0.002607 **
bounces                         -0.21492122 0.07259194 -2.961 0.003070 **
bounces_hist                     -0.09609688 0.02403169 -3.999 6.37e-05 ***
checkout_page_top                    -0.00288890 0.00187738 -1.539 0.123855
contactus_top                     0.00019618 0.00054971 0.357 0.721176
countryi                        0.68913787 0.17050408 4.042 5.30e-05 ***
customer_service_amc_login_top            -0.03430404 0.01002248 -3.423 0.000620 ***
customer_service_request_login_top         -0.00108166 0.00108689 -0.995 0.319645
demo_page_top                      -0.00185950 0.00018301 -10.160 < 2e-16 ***
device.enon_mobile                   -4.23925799 0.39718819 -10.673 < 2e-16 ***
dsls                          -0.00817475 0.00199611 -4.095 4.22e-05 ***
fired_DemoReqPg_CallClicks_evt             19.69396363 864.28370851 0.023 0.981821
fired_help_me_buy_evt                 -4.86378012 0.65251917 -7.454 9.07e-14 ***
fired_phone_clicks_evt                -0.71599563 0.34257451 -2.090 0.036614 *
goal4Completions                   -31.47158874 1403.69230354 -0.022 0.982112
help_me_buy_evt_count                   2.59499343 0.32768415 7.919 2.39e-15 ***
help_me_buy_evt_count_hist               -0.48905568 0.09438811 -5.181 2.20e-07 ***
offer_page_top                    -0.00168366 0.00129784 -1.297 0.194536
pageviews                        -0.01181204 0.02076766 -0.569 0.569512
pageviews_hist                     0.00784485 0.00516589 1.519 0.128866
paid                          1.50969242 0.14148625 10.670 < 2e-16 ***
paid_hist                      -0.52967004 0.04236657 -12.502 < 2e-16 ***
phone_clicks_evt_count                 0.67182962 0.24998605 2.687 0.007200 **
phone_clicks_evt_count_hist             0.61237235 0.10946508 5.594 2.22e-08 ***
security_solutions_page_top             0.00250247 1.13775733 0.002 0.998245
sessionDuration                    0.00125844 0.00021452 5.866 4.45e-09 ***
sessionDuration_hist                 -0.00021122 0.00004871 -4.337 1.45e-05 ***
sessions                        0.14267448 0.08047760 1.773 0.076254 .
sessions_hist                       0.07645416 0.02349908 3.253 0.001140 **
sd.s                         -0.00133716 0.00026940 -4.963 6.92e-07 ***
sd.s_his                          0.00077925 0.00014068 5.539 3.04e-08 ***
p.s                              0.11045646 0.02845922 3.881 0.000104 ***
p.s_hist                         -0.05508644 0.01899208 -2.900 0.003726 **
sourceMedium.ereferral                  10.32736287 0.60575702 17.049 < 2e-16 ***
storelocator_top                     0.00709582 0.54685063 0.013 0.989647
successbookdemo_top                      0.01392691 0.00344655 4.041 5.33e-05 ***
vacuum_cleaner_page_top                   -0.00017817 0.00047166 -0.378 0.705610
visited_air_purifier_page               -2.17435787 0.61699492 -3.524 0.000425 ***
visited_air_purifier_page_hist            0.22480431 0.13001590 1.729 0.083800 .
visited_checkout_page                   -1.31869560 0.41522733 -3.176 0.001494 **
visited_checkout_page_hist               -1.33594852 0.10683986 -12.504 < 2e-16 ***
visited_contactus                    1.38800448 0.23086836 6.012 1.83e-09 ***
visited_contactus_hist                 -0.19864674 0.15676092 -1.267 0.205085
visited_customer_service_amc_login            0.69374863 0.39940279 1.737 0.082393 .
visited_customer_service_amc_login_hist        -0.07999978 0.11542701 -0.693 0.488261
visited_customer_service_request_login        1.43899179 0.32433922 4.437 9.14e-06 ***
visited_customer_service_request_login_hist     0.60270382 0.16017007 3.763 0.000168 ***
visited_demo_page                      1.34634018 0.13747743 9.793 < 2e-16 ***
visited_demo_page_hist                   0.75000545 0.05087913 14.741 < 2e-16 ***
visited_offer_page                   -0.29541621 0.19158848 -1.542 0.123090
visited_offer_page_hist                -0.16724712 0.07775156 -2.151 0.031473 *
visited_security_solutions_page           -19.53636502 605.07425119 -0.032 0.974243
visited_security_solutions_page_hist       -15.08781082 341.57220452 -0.044 0.964768
visited_storelocator                 -19.97168993 428.58546344 -0.047 0.962833
visited_storelocator_hist              -20.32854172 335.10473085 -0.061 0.951627
visited_successbookdemo                  32.84048630 1403.69227439 0.023 0.981335
visited_vacuum_cleaner_page                -0.36860437 0.23711216 -1.555 0.120052
visited_vacuum_cleaner_page_hist              0.08376279 0.06511974 1.286 0.198342
visited_water_purifier_page               1.40577580 0.12570812 11.183 < 2e-16 ***
visited_water_purifier_page_hist            0.34336653 0.03838287 8.946 < 2e-16 ***
water_purifier_page_top                 -0.00098056 0.00021860 -4.486 7.27e-06 ***
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                     -3.2045382 0.1581746 -20.259 < 2e-16 ***
air_purifier_page_top                 0.0045749 0.0012324 3.712 0.000205 ***
bounces                         -0.2584955 0.0636975 -4.058 4.95e-05 ***
bounces_hist                       -0.0960539 0.0202121 -4.752 2.01e-06 ***
countryi                        0.6594093 0.1597345 4.128 3.66e-05 ***
customer_service_amc_login_top              -0.0505902 0.0103466 -4.890 1.01e-06 ***
demo_page_top                        -0.0014143 0.0001589 -8.901 < 2e-16 ***
device.enon_mobile                    -4.8346324 0.4276306 -11.306 < 2e-16 ***
fired_help_me_buy_evt                  -4.5730778 0.5651232 -8.092 5.86e-16 ***
fired_phone_clicks_evt                -1.0924387 0.3510129 -3.112 0.001857 **
help_me_buy_evt_count                    2.4556866 0.2833272 8.667 < 2e-16 ***
help_me_buy_evt_count_hist                -0.4191308 0.0778048 -5.387 7.17e-08 ***
paid                           1.4388370 0.1296427 11.098 < 2e-16 ***
paid_hist                       -0.4972631 0.0384225 -12.942 < 2e-16 ***
phone_clicks_evt_count                  0.8569251 0.2605416 3.289 0.001005 **
phone_clicks_evt_count_hist               0.5046040 0.1006385 5.014 5.33e-07 ***
sessionDuration                     0.0007238 0.0001717 4.215 2.50e-05 ***
sessionDuration_hist                 -0.0002177 0.0000375 -5.805 6.42e-09 ***
sessions                        0.1580555 0.0583974 2.707 0.006799 **
sessions_hist                      0.0848723 0.0171010 4.963 6.94e-07 ***
sd.s                          -0.0010719 0.0002263 -4.736 2.18e-06 ***
sd.s_his                         0.0005375 0.0001211 4.437 9.11e-06 ***
p.s                           0.0811029 0.0127542 6.359 2.03e-10 ***
p.s_hist                         -0.0258690 0.0134435 -1.924 0.054320 .
sourceMedium.ereferral                 10.7000621 0.5666758 18.882 < 2e-16 ***
successbookdemo_top                     0.0254676 0.0025929 9.822 < 2e-16 ***
visited_air_purifier_page              -2.1551119 0.5702502 -3.779 0.000157 ***
visited_air_purifier_page_hist           0.2637309 0.1171688 2.251 0.024394 *
visited_checkout_page                  -1.4778892 0.2984918 -4.951 7.38e-07 ***
visited_checkout_page_hist              -1.2101764 0.0942133 -12.845 < 2e-16 ***
visited_contactus                    1.4404839 0.1768549 8.145 3.79e-16 ***
visited_customer_service_amc_login           1.0285116 0.3760540 2.735 0.006238 **
visited_customer_service_request_login       1.3927490 0.2586894 5.384 7.29e-08 ***
visited_customer_service_request_login_hist 0.7436959 0.1181880 6.292 3.12e-10 ***
visited_demo_page                      1.4809747 0.1242353 11.921 < 2e-16 ***
visited_demo_page_hist                  0.7169562 0.0451215 15.889 < 2e-16 ***
visited_offer_page_hist               -0.3081055 0.0727708 -4.234 2.30e-05 ***
visited_water_purifier_page              1.1663008 0.1138703 10.242 < 2e-16 ***
visited_water_purifier_page_hist          0.4022705 0.0356589 11.281 < 2e-16 ***
water_purifier_page_top                -0.0003673 0.0001818 -2.020 0.043338 *