2007 IEEE/WIC/ACM International Conference on Web Intelligence
Personalized PageRank for Web Page Prediction Based on Access
                                                            Time-Length and Frequency
                                 Yong Zhen Guo, Kotagiri Ramamohanarao and Laurence A. F. Park
                                       Dept. of Computer Science and Software Engineering,
                                           The University of Melbourne 3010, Australia
                                            {yzguo, rao, lapark}@csse.unimelb.edu.au
                                     Abstract                                               their subsequent accesses (implying that the prefetching
                                                                                            method has predicted the users’ actions poorly), the lim-
       Web page prefetching techniques are used to address                                  ited network bandwidth and server resources will not be
   the access latency problem of the Internet. To perform                                   used efficiently, and hence worsen the access delay
   successful prefetching, we must be able to predict the next                              problem. Therefore, the success of the prefetching
   set of pages that will be accessed by users. The PageRank                                method relies mainly on the prediction accuracy.
   algorithm used by Google is able to compute the popu-                                       In this paper, we propose a novel PageRank-like al-
   larity of a set of Web pages based on their link structure.                              gorithm to conduct more accurate web page prediction.
   In this paper, a novel PageRank-like algorithm is pro-                                   PageRank [1] is a Web page popularity measure used in
   posed for conducting Web page prediction. Two biasing                                    the Google search engine. Amongst a set of Web pages,
   factors are adopted to personalize PageRank, so that it                                  popularity is equivalent to the likelihood of future access
   favors the pages that are more important to users. One                                   of the page, therefore the PageRank value of a Web page
   factor is the length of time spent on visiting a page and the                            can be used to measure the likelihood of the page being
   other is the frequency that a page was visited. The ex-                                  downloaded. In this paper, we use a novel approach to
   periments conducted show that using these two factors                                    conduct web page prediction by employing a personal-
   simultaneously to bias PageRank results in more accurate                                 ized PageRank algorithm. We use both the frequency
   Web page prediction than other methods that use only one                                 that a web page is accessed and the time-length spent on
   of these two factors.                                                                    visiting it as the biasing factors to personalize PageRank
                                                                                            so that it favors the pages that are more important to us-
   1. Introduction                                                                          ers.
      Over the past decade the World Wide Web has                                           2. Brief review of PageRank
   evolved from being an information source to a global
   hub for business and socializing. However, according to                                     The whole Web can be considered as a directed graph
   a recent study [2], many users connect to the Internet us-                               with N nodes representing the N Web pages and the edges
   ing low bandwidth connections (e.g. dial-up or wireless                                  representing the links between them. Let Fv be the set of
   connections). Usually many of these users need to spend                                  pages page v links to and PR(v) the ranking value of page
   a large amount of time waiting for the Web page con-                                     v, then a link v → u will contribute PR(v)/|Fv| units of
   tents to be downloaded through their narrow bandwidth                                    page v’s ranking value to page u, where |Fv| is the num-
   connections. Although the access latency is alleviated for                               ber of the pages in Fv. The PageRank of a Web page is
   broadband users today, it can also be decreased signifi-                                 the sum of the ranking values of its backlinks, and the
   cantly.                                                                                  computation of it is a recursive process. In order to limit
      Using a Web page prefetching technique, we can effi-                                  the effect of rank sinks and guarantee the recursive
   ciently tackle this access latency problem. However, if                                  computation of PageRank is able to converge to a certain
   most prefetched web pages are not visited by the users in                                value, a damping factor (1- α ) is used ( α is a small
0-7695-3026-5/07 $25.00 © 2007 IEEE                                                   687
DOI 10.1109/WI.2007.58
  Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
 number which is usually set to 0.15 [1]). Furthermore,                                   consecutively. If we consider all user sessions as
                                                                                           st
 for pages which have no outlinks we add a link to all                                    1 -order Markov Chains (in this case, the next page to be
 other pages in the Web graph, in this way, ranks which                                   visited by a user only depends on the page the user is
 are lost due to pages with no outlinks can be redistrib-                                 visiting currently), then tu is the sum of the weights of
 uted uniformly to all other pages. Let Bu represent the set                              edges that point to node u. Let Bu be the set of pages that
 of pages pointing to page u, we have the recursive defi-                                 point to page u, we have the equation:
 nition of PageRank:
                  α              PR (v)              (1)
                                                                                                          t =    u  t      ∑    ( w,u )
                                                                                                                                               (2)
     PR i +1 (u) = + (1 − α ) × ∑ i                                                                                    w∈Bu
                     N                v∈ B u   | Fv |
                                                                                              From the definition of t(v,u) we can see that if more pre-
                                                                                          vious users follow the path v → u and stay on page u for
 3. The proposed approach: TFPR                                                           a longer time, the value of t(v,u) will be larger, thus t(v,u)
                                                                                          covers both information of access time-length and access
    In this section we will illustrate our novel PageR-                                   frequency of a page u.
 ank-like approach in detail.                                                                 In order to include access frequency and accessing
                                                                                          time-length of a page to conduct the computation of our
 3.1. Access time-length and frequency-based                                              personalized PageRank algorithm TFPR, we adopt t(v,u) as
 PageRank (TFPR)                                                                          the biasing factor. When distributing its ranking value to
                                                                                          its outlinks, page v will now propagate:
     In our approach, the ranking system resides on the                                                                          t ( v ,u )
 server side and all the information that is required for the
 computation of our personalized PageRank algorithm                                                                            ∑t
                                                                                                                               w∈Fv
                                                                                                                                       ( v , w)
 can be derived from the website’s Web logs. We assume
 that these Web logs are preprocessed and all user ses-                                   units of its importance to page u in a non-uniform way,
 sions are identified. The access frequency of a page u,                                  where Fv is the set of pages that page v points to.
 and the number of times page v was visited right after
                                                                                             We also guarantee the website’s directed graph G is
 page u can be obtained simply by counting the number of
                                                                                          strongly connected so that the calculation of TFPR can
 times page u appears and the number of times pages u
                                                                                          converge to a certain value by including the damping
 and v appear consecutively in all user sessions respec-
                                                                                          factor (1- α ). Then we eliminate all dangling pages from
 tively. Moreover, if page u and v were visited consecu-
                                                                                          G by adding a link to all other pages in G for pages with
 tively in a user session, and ru and rv are the times the
                                                                                          no outlinks.
 user requested them respectively, then rv-ru is the ap-
                                                                                             In accordance with the definition of PageRank, we
 proximate time-length the user spent on visiting page u
                                                                                          have the definition of our access time-length and fre-
 in this session. In the case that page u is the last page of a
                                                                                          quency-based PageRank (TFPR) as follows:
 user session, for which the access duration cannot be
 calculated, we can compute the average time-length                                                                   tu                                  TFPRi (v) × t( v, u )
                                                                                              TFPR i +1 (u) = α ×           + (1 − α ) ×          ∑
 spent on page u from all user sessions as its access                                                                ∑ tv
                                                                                                                     v∈G
                                                                                                                                                  v∈ Bu         ∑t
                                                                                                                                                               w∈Fv
                                                                                                                                                                      ( v, w)
                                                                                                                                                                                  (3)
 time-length. Finally, if a user opened a Web page but did
 not read it immediately (for example, the user went out
                                                                                              Where Fv is the set of pages that page v links to and Bu
 after he opened the Web page), this will result in a sig-
                                                                                          is the set of pages that link to page u.
 nificantly long but inaccurate access time-length for this
 Web page. Therefore, when the access time-length spent
                                                                                          3.2. Usage context of TFPR
 on a Web page by a user exceeds the average time-length
 spent on this page by a large percentage, we use this av-
                                                                                             By using our approach, we bias the PageRank to give
 erage access time-length as the user’s access time-length
                                                                                          a higher ranking to the pages that are visited more fre-
 on this Web page.
                                                                                          quently and for a longer time by previous users. If we
     Let us consider a website as a directed graph G,
                                                                                          use TFPR on the whole directed graph G of a website,
 where the nodes represent the web pages and the edges
                                                                                          we can obtain the global ranking of all pages in this
 represent the links. Both nodes and edges carry weights,
                                                                                          website.
 the weight tu on node u is the total time-length all previ-
                                                                                             However, the calculation of the global ranking of all
 ous users spent on browsing page u, while the weight t(v,u)
                                                                                          pages will consume huge computation power and proc-
 on edge v → u represents the sum of the time–lengths
                                                                                          essing time. In order to implement Web page prediction
 spent on visiting page u when page v and u were visited
                                                                                          for a current user, we should not apply TFPR to the
                                                                                    688
Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
 whole directed graph G but only to a subgraph of it. In                                  the total time-length spent on visiting a page as the bias-
      st
 the 1 -order Markov Chain scenario, the construction of                                  ing factor. It is based on the following equation:
 the subgraph for a current user can be described as fol-
                                                                                                                         tu                       TPRi (v) × tu
 lows: we first extract the navigational path a user has fol-                                 TPR i +1 (u) = α ×              + (1 − α ) ×   ∑
 lowed, and expand it by including all the pages the last                                                           ∑ tv
                                                                                                                    v∈G
                                                                                                                                             v∈Bu    ∑ tw
                                                                                                                                                     w∈ Fv
                                                                                                                                                                  (4)
 page links to directly in G, and the weighted links be-
 tween them. The length of the path taken into considera-                                    Where tu is the total time-length spent on visiting page
 tion when expanding the subgraph depends on the order                                    u by all previous users, Bu is the set of pages that have
 of the Markov Model we use. We then continuously                                         links to page u, and Fv is the set of pages that page v
 conduct the same process for the pages newly included                                    links to.
 in the subgraph until we reach a predefined depth. Then                                     The second contrasting algorithm is UPR (Us-
 we apply our TFPR algorithm to this subgraph to con-                                     age-based PageRank) [5], which uses only the access
 duct a “local” ranking computation for the pages in the                                  frequency of a Web page as the biasing factor. UPR can
 subgraph and provide the current user with a Web page                                    be calculated by the formula below:
 prediction list in descending order based on these Web                                                                                 UPRi (v) × w( v,u )
                                                                                                                   wu
 pages’ ranking values. For more detailed information on                                       UPRi +1 (u) = α ×       + (1 − α ) × ∑
 the construction of a subgraph and its relevant theories,                                                         ∑ v
                                                                                                                    w
                                                                                                                   v∈G
                                                                                                                                   v∈Bu   ∑ w(v, k )
                                                                                                                                                   k ∈Fv
                                                                                                                                                                  (5)
 readers can refer to [5].
                                                                                             Where wu is the number of times page u was visited
 4. Experiment                                                                            and w(v,u) is the number of times page u was visited right
                                                                                          after page v by previous users.
    In this section we demonstrate our experiments. Two
 contrasting algorithms are used to compare with our                                      4.3. Evaluation methods
 TFPR algorithm by using three different evaluation
 measures.                                                                                   We selected the 10 most popular paths the previous
                                                                                          users followed from the training data set, and each path
 4.1. Experimental data set                                                               was expanded to construct the corresponding subgraph
                                                                                          according to the sessions in the training data set. We
    We conducted our experiment using the Web logs of                                     chose the most popular paths because using these most
 the Department of Computer Science and Software En-                                      accessed paths allowed us to provide a better representa-
 gineering website [4], at the University of Melbourne.                                   tion of the typical navigational behaviors of Web users
 We obtained the Web logs of a two-week period in Sep-                                    than those paths that are with low access frequencies.
 tember 2006 and used the Web logs from 01/Sep./2006                                      Then we applied the three algorithms (TPR, UPR and
 to 07/Sep./2006 as the training data set, and the Web logs                               TFPR) to each subgraph to rank the pages it included,
 from 08/Sep./2006 to 14/Sep./2006 as the testing data                                    and for each algorithm, we obtained a set of the top-n
 set. We filtered the records and only reserved the hits                                  ranked pages as the most probable “next” pages for each
 requesting Web pages (such as *.htm, *.html, *.shtml                                     path. Following this, we derived a set of the top-n most
 and *.asp). When identifying user sessions, we set the                                   frequent “next” pages for each of the same path from the
 session timeout to 30 minutes, with a maximum of 50                                      testing data set. Finally, for every most popular path, we
 pageviews per session. After filtering out the hits by                                   respectively compared the result sets calculated by these
 Web crawlers, the training data set contained 23,132 re-                                 3 algorithms with the result set obtained from the testing
 cords and 9,404 sessions, while the testing data set con-                                data set.
 tained 21,933 records and 9,723 sessions.                                                   We used 3 measures, OSim , KSim and MRR, when
                                                                                          comparing two sets r1 and r2, each of size n. The first
 4.2. Contrasting algorithms                                                              method is OSim(r1,r2), which represents the degree of
                                                                                          overlap between the top-n pages of r1 and r2:
    We introduced two ranking methods to compare with                                                                     | r1 ∩ r2 |          (6)
                                                                                                                   OSim(r1 , r2 ) =
 our proposed TFPR algorithm, which uses both the                                                                                             n
 length of time spent on visiting a page and the frequency                                   The second method we used is KSim(r1,r2) [3], which
 that a page was visited as the biasing factors to personal-                              focuses on indicating the degree to which the relative
 ize PageRank. The first is TPR (Time-length-based                                        ordering of the top-n pages of two sets, r1 and r2, are in
 PageRank), a PageRank-style algorithm which uses only                                    agreement. The values of KSim(r1,r2) that are closer to 1
                                                                                    689
Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
 indicate closer agreement. Let U be the union of the                                     tively. Therefore, we can confirm that adopting both the
 pages in r1 and r2. Let r1’ extend r1 by adding U-r1 at its                              frequency a Web page is accessed and the time-length
 end, and let r2’ be defined analogously. We can define                                   spent on visiting it as the biasing factors simultaneously
 KSim(r1,r2) as follows:                                                                  can improve the accuracy of Web page prediction.
              | (u, v) : r1 ' , r2 ' agree on order of (u, v), u ≠ v | (7)                                       Accuracies for top-5 prediction
 KSim(r1 , r2 ) =
                                    | U | ×(| U | -1)                                               0.7
                                                                                                    0.6
     Where the numerator denotes the number of pairwise                                             0.5
 agreements of elements between r1’ and r2’.                                                        0.4
                                                                                                                                                   TPR
                                                                                                                                                   UPR
     The last measure we used to evaluate the performance                                           0.3
                                                                                                                                                   TFPR
                                                                                                    0.2
 of these 3 algorithms is the mean reciprocal rank (MRR).
                                                                                                    0.1
 Given an ordered list of predicted pages (these pages are                                            0
 in a descending order according to their ranking values)                                                     OSim            KSim           MRR
 and the “actual page” a user accesses in his subsequent
 visit, the reciprocal rank is calculated as 1/p where p is                                        Figure 2. Average OSim, KSim and MRR
 the position of the “actual page” in this ordered list, and                                                values for top-5 prediction
 it is set to zero if the “actual page” is not in this list. The
 MRR value of an algorithm that is closer to 1 denotes a                                  5. Conclusion and future works
 better performance.
                                                                                               In this paper, we used PageRank algorithm in the
 4.4. Experimental results                                                                context of Web page prediction. Two factors (the
                                                                                          time-length spent on visiting a Web page and the fre-
    We used these 3 algorithms (TPR, UPR and TFPR) to                                     quency the Web page is accessed) were used to bias
 obtain the sets of the top-3 and top-5 most highly-ranked                                PageRank so that it favors the pages that were visited for
 “next” pages for the 10 most popular paths we chose                                      a longer time and more frequently than others. The posi-
 from the training data set. Then for each of the 10 most                                 tive experimental results show that our approach can
 popular path we compared these sets of the top-3 and                                     produce more accurate page prediction than the methods
 top-5 pages with the sets of the top-3 and top-5 most                                    that use only one of these two biasing factors.
 frequent “next” pages derived from the testing data set                                       However, in our experimental setup, we only used the
                                                                                            st
 respectively by using the aforementioned evaluation                                      1 -order Markov Chain model, which is “memoryless”,
 measures. Figure 1 and 2 show the average OSim, KSim                                     to calculate the weights of edges in the directed graph of
 and MRR values for the top-3 and top-5 predictions of                                    a website and to expand the subgraph for a user’s current
 these 3 different algorithms.                                                            navigational path. In our future work, we will take into
                        Accuracies for top-3 prediction                                   account using higher order Markov Chain models to im-
          0.6
                                                                                          prove the prediction accuracy and also applying the pro-
          0.5                                                                             posed approach to other data sets to evaluate its reliabil-
          0.4
                                                                   TPR
                                                                                          ity and performance.
          0.3                                                      UPR
                                                                   TFPR
          0.2                                                                             6. References
          0.1
             0                                                                            [1] Larry Page. PageRank: Bringing Order to the Web. Tech-
                     OSim            KSim               MRR
                                                                                          nical report, Stanford digital libraries, 1997.
          Figure 1. Average OSim, KSim and MRR                                            [2] The China Internet Network Information Center.
                                                                                          http://www.cnnic.com.cn/, April 2007.
              values for top-3 prediction
                                                                                          [3] T. Haveliwala. “Topic-Sensitive PageRank”. In Proceed-
    As depicted in Figure 1 and 2, TFPR outperforms                                       ings of WWW2002 Conference, Hawaii USA, May 2002.
 TPR and UPR significantly in all OSim, KSim and MRR                                      [4] The Department of Computer Science and Software Engi-
 values. In the top-3 prediction, TFPR outperforms TPR                                    neering, the University of Melbourne,               Australia.
 by 23.1%, 23.4% and 52.6% for OSim, KSim and MRR                                         http://www.cs.mu.oz.au, April 2007.
 respectively. As to UPR, TFPR outperforms it by 23.1%,                                   [5] M. Eirinaki and M. Vazirgiannis, “Usage-based PageRank
 24.4% and 93.3% for different evaluation methods. In                                                                                      th
                                                                                          for Web Personalization”. In Proceedings of the 5 IEEE In-
 the top-5 prediction, the corresponding percentages are                                  ternational Conference on Data Mining (ICDM’05), Lousiana,
 20.0%, 5.0%, 40.6% and 14.3%, 10.8%, 84.8% respec-                                       November 2005.
                                                                                    690
Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.