Abstract
The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers who find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. To maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance, and they maintain blacklists of spamming pages; spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
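To make the link-spamming mechanism concrete, the following is a minimal sketch of how a farm of artificial referring pages can inflate the page rank of a target page. It is not the authors' method, which concerns classifying spam pages. It only illustrates the effect described above, using a textbook power-iteration PageRank. The graph, the page names, the damping factor of 0.85, and the helper names `pagerank` and `add_link_farm` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


def add_link_farm(links, target, farm_size):
    """Return a copy of the graph with farm_size artificial pages linking to target."""
    spammed = {page: list(targets) for page, targets in links.items()}
    for i in range(farm_size):
        spammed[f"farm{i}"] = [target]
    return spammed


# Hypothetical four-page web; "shop" plays the role of the commercial target page.
web = {
    "news": ["blog", "wiki"],
    "blog": ["news", "wiki"],
    "wiki": ["news"],
    "shop": ["wiki"],
}

before = pagerank(web)
after = pagerank(add_link_farm(web, "shop", farm_size=20))

print(f"shop before link farm: {before['shop']:.4f}")
print(f"shop after link farm:  {after['shop']:.4f}")
```

In this toy graph the target page has no genuine in-links, so its score before spamming is only the baseline share every page receives. After 20 farm pages are added, each passes most of its own baseline share to the target, and the target's score rises even though the total rank is now spread over more pages.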
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Drost, I., Scheffer, T. (2005). Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_14
DOI: https://doi.org/10.1007/11564096_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science, Computer Science (R0)