📘 Project Summary: Zipf’s Law – Finding Hidden Patterns in Data
Team Name: Roomies❤️🌻
Team Members: Tanima Samanta, Koyna Arya, Aparajita K Singh, Riddhi Khera
In this project, we explored Zipf’s Law, an empirical regularity commonly found in natural language datasets. The law states that in a large collection of text, the frequency of a word is approximately inversely proportional to its rank in the frequency table, so the second most frequent word appears roughly half as often as the first. Our goal was to test this principle using real-world data and visualize the resulting patterns.
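As a rough sketch, the law is usually written in the following form. The symbols $f(r)$ (frequency of the word at rank $r$), $C$ (a corpus-dependent constant), and $s$ (the exponent, close to 1 in the classic statement) are notation assumed here for illustration, not taken from the project slides.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\log f(r) \approx \log C - s \log r
```

Taking logarithms turns the relationship into a straight line with slope $-s$, which is why a log-log plot of frequency against rank is the usual visual check.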
🔍 Dataset Used: We analyzed the lyrics of songs by the band Coldplay. The dataset was compiled to contain a representative sample of Coldplay’s discography, offering a rich and diverse text corpus for word frequency analysis.
🛠️ What We Did:
- Preprocessed the text data by tokenizing it and removing punctuation and stopwords.
- Calculated word frequencies and ranked words by their occurrence.
- Visualized Zipfian patterns using rank-frequency and log-log plots.
- Verified that the ranked word frequencies follow a Zipfian distribution.
- Collaboratively coded in Python using Google Colab and visualized results using Matplotlib.
- Documented and explained the study's theoretical foundation and practical findings.
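The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the team's actual notebook: the stopword list, the function names (`tokenize`, `rank_frequencies`, `zipf_slope`), and the least-squares slope check are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the list used in the project is not stated.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i", "you"}


def tokenize(text):
    """Lowercase the text and return word tokens, dropping punctuation.

    Only straight apostrophes inside a word (e.g. "it's") are kept.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def rank_frequencies(tokens, stopwords=STOPWORDS):
    """Count non-stopword tokens and return (rank, word, frequency) tuples, rank 1 first."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def zipf_slope(ranked):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is consistent with the classic form of Zipf's Law.
    Requires at least two ranked words.
    """
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(freq) for _, _, freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

To reproduce the plots, the rank and frequency columns returned by `rank_frequencies` could be passed to Matplotlib's `plt.loglog`. The fitted slope gives a single number to report alongside the figure.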
🎯 Each team member contributed equally, focusing on research, coding, visualization, documentation, and presentation.
The final results confirmed that Coldplay’s lyrics follow Zipf’s Law, showing that even artistic or musical text follows the statistically predictable patterns of natural language.