Article, DOI: 10.1145/1242572.1242604

Internet-scale collection of human-reviewed data

Published: 08 May 2007

Abstract

Enterprise and web data processing and content aggregation systems often require extensive use of human-reviewed data (e.g., for training and monitoring machine learning-based applications). Today these needs are often met by in-house efforts or outsourced offshore contracting. Emerging applications attempt to provide automated collection of human-reviewed data at Internet scale. We conduct extensive experiments to study the effectiveness of one such application. We also study the feasibility of using Yahoo! Answers, a general question-answering forum, for human-reviewed data collection.

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data collection
  2. human data
  3. manual review

Qualifiers

  • Article

Conference

WWW'07: 16th International World Wide Web Conference
May 8 - 12, 2007
Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2023) CredibleExpertRank: Leveraging Social Network Analysis and Opinion Mining to Facilitate Reliable Information Retrieval on Knowledge-Sharing Sites. IEEE Access 11 (54724-54749). DOI: 10.1109/ACCESS.2023.3281412. Online publication date: 2023
  • (2020) Investigating the Adoption of ERP Systems. Journal of Information Technology Research 13:1 (96-117). DOI: 10.4018/JITR.2020010107. Online publication date: 1-Jan-2020
  • (2020) Modeling and Aggregation of Complex Annotations via Annotation Distances. Proceedings of The Web Conference 2020 (1807-1818). DOI: 10.1145/3366423.3380250. Online publication date: 20-Apr-2020
  • (2019) DLTA: A Framework for Dynamic Crowdsourcing Classification Tasks. IEEE Transactions on Knowledge and Data Engineering 31:5 (867-879). DOI: 10.1109/TKDE.2018.2849385. Online publication date: 1-May-2019
  • (2018) Web Forum Retrieval and Text Analytics. Foundations and Trends in Information Retrieval 12:1 (1-163). DOI: 10.1561/1500000062. Online publication date: 3-Jan-2018
  • (2018) Analyzing Payment-Driven Targeted Q&A Systems. ACM Transactions on Social Computing 1:3 (1-21). DOI: 10.1145/3281449. Online publication date: 10-Dec-2018
  • (2018) Selective Cluster Presentation on the Search Results Page. ACM Transactions on Information Systems 36:3 (1-42). DOI: 10.1145/3158672. Online publication date: 28-Feb-2018
  • (2018) Explicit Diversification of Event Aspects for Temporal Summarization. ACM Transactions on Information Systems 36:3 (1-31). DOI: 10.1145/3158671. Online publication date: 2-Feb-2018
  • (2018) Pay-per-Question. Proceedings of the 2018 ACM International Conference on Supporting Group Work (1-11). DOI: 10.1145/3148330.3148332. Online publication date: 7-Jan-2018
  • (2018) Analysis of Knowledge Sharing Activities on a Social Network Incorporated Discussion Forum: A Case Study of DISboards. IEEE Transactions on Big Data 4:4 (432-446). DOI: 10.1109/TBDATA.2017.2749307. Online publication date: 1-Dec-2018
