Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Drost, Isabel; Scheffer, Tobias

doi:10.1007/11564096_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3720)

Included in the following conference series:

European Conference on Machine Learning

Abstract

The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
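To make the mechanism described in the abstract concrete, the following is a minimal sketch of how a link farm can inflate the page rank of a target page. It is not the authors' method or data: the toy graph, the page names, and the helpers `pagerank` and `add_link_farm` are hypothetical, and the rank computation is a plain power-iteration PageRank with an assumed damping factor of 0.85.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def add_link_farm(links, target, farm_size):
    """Return a copy of the graph with farm_size artificial pages linking to target."""
    farmed = {page: list(targets) for page, targets in links.items()}
    for i in range(farm_size):
        farmed[f"farm{i}"] = [target]
    return farmed


# Hypothetical toy web: "target" has no incoming links at all.
honest_web = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home", "news"],
    "target": ["shop"],
}

before = pagerank(honest_web)
after = pagerank(add_link_farm(honest_web, "target", farm_size=20))

print(f"target rank without farm: {before['target']:.4f}")
print(f"target rank with farm:    {after['target']:.4f}")
```

In this toy graph the target's score rises once the artificial referring pages are added, even though nothing about its content has changed. That decoupling of rank from relevance is the effect the abstract says search engine providers try to counter.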

Author information

Authors and Affiliations

Department of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
Isabel Drost & Tobias Scheffer

Editor information

Editors and Affiliations

Faculty of Economics of the University of Porto, Portugal
João Gama
Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal
Rui Camacho
LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal
Pavel B. Brazdil
LIACC/FEP, Universidade do Porto, Portugal
Alípio Mário Jorge
LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6., 4050-190, Porto, Portugal
Luís Torgo

About this paper

Cite this paper

Drost, I., Scheffer, T. (2005). Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_14

DOI: https://doi.org/10.1007/11564096_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science, Computer Science (R0)
