Characteristic Words Detection is a Streamlit application designed for corpus linguistics research. It identifies characteristic words (or n-grams) within categories of your dataset by comparing the frequency of words in each category with their global frequency in the corpus. The app applies statistical testing (hypergeometric tests with Benjamini–Hochberg correction) to determine significant over- or under-representation of terms.
The app processes text data in the following stages:
- Preprocessing: Tokenizes text, removes stopwords (customizable by language), and optionally applies stemming.
- N-gram Generation: Supports unigrams, bigrams, and trigrams.
- Word Grouping: Optionally groups words together under a common label.
- Statistical Analysis: Computes internal and global frequencies, applies hypergeometric tests, and corrects p-values using the Benjamini–Hochberg procedure.
- Visualization & Results: Displays summary statistics, characteristic words tables, and interactive bar charts (via Plotly). Results can be downloaded in CSV and/or Excel formats.
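The preprocessing and n-gram stages listed above can be pictured with a short standard-library sketch. The function names and the tiny stopword set are illustrative assumptions, not the app's actual code, and stemming is left out because it depends on the snowballstemmer package.

```python
import re

# Tiny illustrative stopword set (an assumption); the app loads
# language-specific lists instead.
STOPWORDS = {"the", "a", "of", "and", "is", "in"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """Return contiguous n-grams, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stopwords(tokenize("The analysis of the corpus is fast"))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `tokens` holds the unigrams, while `bigrams` and `trigrams` are built from the tokens that remain after stopword removal, so an n-gram can span a position where a stopword used to be. Whether the app builds n-grams before or after stopword removal is not stated in this README.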
The application is ideal for linguists, social scientists, and researchers analyzing large text corpora to identify thematic or stylistic markers.
- Customizable Text Preprocessing:
  - Tokenization and stopword removal (supports English, French, Italian, and Spanish).
  - Optional stemming using Snowball stemmers.
  - N-gram selection (unigrams, bigrams, trigrams).
- Word Grouping:
  - Replace sets of words with a group name for consistent treatment.
- Statistical Testing:
  - Uses hypergeometric distribution to assess term significance.
  - Applies Benjamini–Hochberg correction for multiple comparisons.
- Visualization:
  - Generates interactive bar charts showing overrepresented and underrepresented terms.
- Downloadable Results:
  - Export the results in CSV, Excel, or ZIP format.
- Session Persistence:
  - Utilizes Streamlit's session state to store analysis results and configuration.
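To make the statistical testing feature concrete, here is a minimal sketch of a hypergeometric tail test followed by a Benjamini–Hochberg adjustment. It uses `math.comb` from the standard library rather than SciPy so that it runs without dependencies; the function names, the one-sided tail definitions, and the toy numbers are assumptions for illustration and may differ from what the app does internally.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n tokens from N, of which K are the target word."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def over_representation_p(k, N, K, n):
    """Upper tail P(X >= k): the word is at least this frequent by chance."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


def under_representation_p(k, N, K, n):
    """Lower tail P(X <= k): the word is at most this frequent by chance."""
    lowest = max(0, n - (N - K))
    return sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy case: corpus of N=10 tokens, word occurs K=4 times overall,
# category has n=5 tokens, and the word occurs k=3 times in it.
p_over = over_representation_p(3, 10, 4, 5)  # 66/252, about 0.262
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A term would count as characteristic of a category when its adjusted p-value falls below the chosen significance level. Exact binomial coefficients become slow on large corpora, which is one reason a library implementation is preferable in practice.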
- Clone the Repository
    git clone https://github.com/yourusername/characteristic-words-detection.git
    cd characteristic-words-detection
- (Optional) Create a Virtual Environment
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install Dependencies
  The required packages are listed in the requirements.txt file. Install them using:
    pip install -r requirements.txt
- Run the Streamlit App
  Launch the application by running:
    streamlit run main.py
- Upload Your Data
  - Upload a CSV, Excel, TSV, or TXT file containing your corpus.
  - Use the sidebar to preview your data and select the text and category columns.
- Configure Analysis Settings
  - Word Grouping: Add word groups, giving each group a name and a comma-separated list of words.
  - Stopword Removal: Enable stopword removal and select a language. Optionally, add custom stopwords.
  - Stemming: Choose whether to apply stemming.
  - N-gram Selection: Select which n-grams to consider (unigrams, bigrams, trigrams).
  - Minimum Frequency & Significance Level: Set the minimum frequency threshold and significance level (alpha).
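As an illustration of the word grouping setting, the sketch below turns a group name and a comma-separated word list into a lookup table and applies it to a token list. The helper names `parse_group` and `apply_groups` are hypothetical, and lowercasing the group words is an assumption about how tokens are normalized.

```python
def parse_group(name, words_csv):
    """Map each word in a comma-separated string to the group name."""
    return {w.strip().lower(): name for w in words_csv.split(",") if w.strip()}


def apply_groups(tokens, mapping):
    """Replace grouped words with their group name; leave other tokens as-is."""
    return [mapping.get(t, t) for t in tokens]


mapping = parse_group("VEHICLE", "car, bus , train")
grouped = apply_groups(["the", "car", "and", "bus"], mapping)
```

After this step, "car" and "bus" are counted as the single term "VEHICLE", so their frequencies are pooled in the later analysis.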
- Run Analysis
  Click the "Run Analysis" button. A progress bar will update as the corpus is processed. The app displays:
  - Summary statistics (number of tokens, types, morphological complexity, etc.)
  - A table of significant characteristic words with their internal and global frequencies, test values, and p-values.
  - Interactive visualizations for each category.
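The token, type, and frequency figures in these outputs can be understood through a small counting sketch. The toy corpus and variable names are assumptions; "internal frequency" is taken here to mean a term's count within one category and "global frequency" its count across the whole corpus, which matches the wording of this README but is not confirmed against the app's code.

```python
from collections import Counter

# Toy corpus (hypothetical): one (category, tokens) pair per document.
corpus = [
    ("news", ["market", "rises", "market"]),
    ("sport", ["team", "wins", "market"]),
]

global_freq = Counter()   # term counts over the whole corpus
internal_freq = {}        # term counts per category
for category, doc_tokens in corpus:
    global_freq.update(doc_tokens)
    internal_freq.setdefault(category, Counter()).update(doc_tokens)

n_tokens = sum(global_freq.values())  # total running words
n_types = len(global_freq)            # distinct words

# One row per (category, term), ready to feed a significance test.
rows = [
    {"category": cat, "term": term, "internal": count, "global": global_freq[term]}
    for cat, counts in internal_freq.items()
    for term, count in counts.items()
]
```

In this toy corpus, "market" has an internal frequency of 2 in the "news" category against a global frequency of 3.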
- Download Results
  Once the analysis is complete, select your preferred download format(s) (CSV and/or Excel) and download the results.
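For readers curious how downloadable files can be prepared in memory, here is a standard-library sketch that serializes result rows to CSV bytes and bundles files into a ZIP archive. The helper names, column names, and sample row are hypothetical; the app itself most likely builds its exports with pandas and openpyxl, so treat this only as an outline of the idea.

```python
import csv
import io
import zipfile


def rows_to_csv_bytes(rows, fieldnames):
    """Serialize a list of dicts to UTF-8 encoded CSV bytes."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def bundle_zip(files):
    """Pack a {filename: bytes} mapping into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Hypothetical result row, for illustration only.
results = [{"term": "market", "category": "news", "p_value": 0.02}]
csv_bytes = rows_to_csv_bytes(results, ["term", "category", "p_value"])
zip_bytes = bundle_zip({"results.csv": csv_bytes})
```

In a Streamlit app, byte strings like `csv_bytes` or `zip_bytes` can be handed to `st.download_button` as its `data` argument.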
.
├── main.py # Main Streamlit application code for characteristic words detection
├── requirements.txt # Required Python packages and versions
└── README.md # This file
The app requires the following packages (with minimum versions):
- streamlit >= 1.40.2
- pandas >= 2.1.1
- numpy >= 1.25.3
- stop-words >= 0.2.5
- snowballstemmer >= 2.1.0
- scipy >= 1.10.1
- plotly >= 5.17.0
- openpyxl >= 3.1.2
Contributions are welcome! If you have suggestions, bug fixes, or improvements:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Commit your changes.
- Open a pull request with a detailed description of your modifications.
This project is open-source and available under the MIT License.
Gabriele Di Cicco, PhD in Social Psychology
GitHub | ORCID | LinkedIn
Happy Analyzing!