Discover a unique dataset that bridges the gap between human creativity and artificial intelligence. This collection features texts across various genres—essays, stories, poetry, and Python code—crafted by both humans and advanced large language models (LLMs) like GPT-4 and BARD. It serves as a valuable resource for those studying the detection of AI-generated content.
- Diverse Text Collection: Explore human-written and AI-generated content across multiple genres, providing a rich resource for analyzing the subtle differences and detecting AI involvement.
Dive deeper into the research by accessing our paper:
- Hayawi, K., Shahriar, S., & Mathew, S. S. (2024). The imitation game: Detecting human and AI-generated texts in the era of ChatGPT and BARD. Journal of Information Science, 0(0). https://doi.org/10.1177/01655515241227531
The potential of AI-based large language models (LLMs) is vast, offering significant opportunities to transform education, research, and various practices. However, as AI-generated text becomes more prevalent, distinguishing it from human-authored content has become a critical challenge.
This article presents a comparative study built on a novel dataset of human-written and LLM-generated texts across genres, including essays, stories, poetry, and Python code. We train multiple machine learning models to classify these texts and show that they can differentiate human from AI-generated content. Despite the dataset's limited sample size, the models performed well, although GPT-generated text, particularly stories, proved harder to classify.
Our findings reveal that binary classification (distinguishing human-written text from the output of one specific LLM) is easier than multiclass classification, which must separate human text from the output of several LLMs at once. This research offers valuable insights for AI text detection and lays the groundwork for future studies in this rapidly evolving field.
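To make the two task setups concrete, here is a minimal sketch of how binary and multiclass label sets could be derived from the same records. The record fields (`text`, `genre`, `source`), the label names, the toy texts, and both helper functions are assumptions made for illustration. They are not the dataset's actual schema or the authors' code.

```python
from collections import Counter

# Hypothetical toy records; the real dataset's columns and label names may differ.
SAMPLES = [
    {"text": "The sea keeps its own slow counsel.", "genre": "poetry", "source": "human"},
    {"text": "Waves whisper softly to the shore.", "genre": "poetry", "source": "GPT"},
    {"text": "The ocean sings a gentle tune.", "genre": "poetry", "source": "BARD"},
    {"text": "def add(a, b): return a + b", "genre": "code", "source": "human"},
    {"text": "def add(x, y):\n    return x + y", "genre": "code", "source": "GPT"},
    {"text": "def total(a, b):\n    return a + b", "genre": "code", "source": "BARD"},
]


def binary_task(samples, llm):
    """Keep only human texts and one LLM's texts; label 0 = human, 1 = that LLM."""
    return [
        (s["text"], int(s["source"] == llm))
        for s in samples
        if s["source"] in ("human", llm)
    ]


def multiclass_task(samples):
    """Keep every text and assign one integer class per source."""
    classes = sorted({s["source"] for s in samples})
    index = {name: i for i, name in enumerate(classes)}
    return [(s["text"], index[s["source"]]) for s in samples], classes


human_vs_gpt = binary_task(SAMPLES, "GPT")
all_sources, class_names = multiclass_task(SAMPLES)

# Show how many examples fall under each label in both setups.
print(Counter(label for _, label in human_vs_gpt))
print(class_names, Counter(label for _, label in all_sources))
```

In the binary setup a classifier only has to separate human text from one model's output. In the multiclass setup it must also tell the LLMs apart from each other, which is one plausible reason that task is harder.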