Characteristic Words Detection in Corpus Linguistics

Characteristic Words Detection is a Streamlit application designed for corpus linguistics research. It identifies characteristic words (or n-grams) within categories of your dataset by comparing the frequency of words in each category with their global frequency in the corpus. The app applies statistical testing (hypergeometric tests with Benjamini–Hochberg correction) to determine significant over- or under-representation of terms.

Overview

The app processes text data in five stages:

Preprocessing: Tokenizing text, removing stopwords (customizable by language), and applying stemming.
N-gram Generation: Supports unigrams, bigrams, and trigrams.
Word Grouping: Optionally groups words together under a common label.
Statistical Analysis: Computes internal and global frequencies, applies hypergeometric tests, and corrects p-values using the Benjamini–Hochberg procedure.
Visualization & Results: Displays summary statistics, characteristic words tables, and interactive bar charts (via Plotly). Results can be downloaded in CSV and/or Excel formats.
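The preprocessing and n-gram stages can be illustrated with a minimal, dependency-free sketch. It is not the code in main.py: the function names, the regular-expression tokenizer, and the tiny stopword set are all assumptions made for illustration, and stemming is left out to keep the example short.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "of", "and", "is"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the contiguous n-grams of `tokens`, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text: str, sizes=(1, 2), stopwords=STOPWORDS) -> list[str]:
    """Tokenize, remove stopwords, then collect n-grams of each requested size."""
    tokens = remove_stopwords(tokenize(text), stopwords)
    terms: list[str] = []
    for n in sizes:
        terms.extend(ngrams(tokens, n))
    return terms


print(preprocess("The analysis of the corpus is fast"))
# ['analysis', 'corpus', 'fast', 'analysis corpus', 'corpus fast']
```

In this sketch stopwords are removed before n-grams are built, so bigrams can span a removed word; whether the app does the same is not stated here.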

The application is ideal for linguists, social scientists, and researchers analyzing large text corpora to identify thematic or stylistic markers.

Features

Customizable Text Preprocessing:
- Tokenization and stopword removal (supports English, French, Italian, and Spanish).
- Optional stemming using Snowball stemmers.
- N-gram selection (unigrams, bigrams, trigrams).
Word Grouping:
- Replace sets of words with a group name for consistent treatment.
Statistical Testing:
- Uses hypergeometric distribution to assess term significance.
- Applies Benjamini–Hochberg correction for multiple comparisons.
Visualization:
- Generates interactive bar charts showing overrepresented and underrepresented terms.
Downloadable Results:
- Export the results in CSV, Excel, or ZIP format.
Session Persistence:
- Utilizes Streamlit's session state to store analysis results and configuration.
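The statistical testing feature can be sketched as follows. This is an illustrative version written with the standard library only, not the app's implementation: the function and parameter names are assumptions, and because it is not stated here whether the app tests one or both tails, the sketch returns both. With SciPy installed, `scipy.stats.hypergeom.sf(k - 1, M, n, N)` should give the same upper-tail probability.

```python
from math import comb


def hypergeom_tails(k: int, M: int, n: int, N: int) -> tuple[float, float]:
    """Return (P(X >= k), P(X <= k)) for a hypergeometric variable X.

    M: total number of tokens in the corpus
    n: global frequency of the term
    N: number of tokens in the category
    k: internal frequency of the term in the category
    """
    denom = comb(M, N)

    def pmf(i: int) -> float:
        return comb(n, i) * comb(M - n, N - i) / denom

    lo = max(0, N - (M - n))  # smallest possible count
    hi = min(n, N)            # largest possible count
    upper = sum(pmf(i) for i in range(max(k, lo), hi + 1))
    lower = sum(pmf(i) for i in range(lo, min(k, hi) + 1))
    return upper, lower


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


upper, lower = hypergeom_tails(k=3, M=10, n=4, N=5)
print(round(upper, 4), round(lower, 4))  # 0.2619 0.9762
```

A small upper-tail probability suggests over-representation in the category and a small lower-tail probability suggests under-representation; a term would be reported when its adjusted p-value falls below the chosen alpha.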

Installation

Clone the Repository

git clone https://github.com/yourusername/characteristic-words-detection.git
cd characteristic-words-detection

(Optional) Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies

The required packages are listed in the requirements.txt file. Install them using:
```
pip install -r requirements.txt
```

Usage

Run the Streamlit App

Launch the application by running:
```
streamlit run main.py
```

Upload Your Data

- Upload a CSV, Excel, TSV, or TXT file containing your corpus.
- Use the sidebar to preview your data and select the text and category columns.

Configure Analysis Settings

- Word Grouping: Add word groups and provide a name and comma-separated words.
- Stopword Removal: Enable stopword removal and select a language. Optionally, add custom stopwords.
- Stemming: Choose whether to apply stemming.
- N-gram Selection: Select which n-grams to consider (unigrams, bigrams, trigrams).
- Minimum Frequency & Significance Level: Set the minimum frequency threshold and significance level (alpha).
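As a rough illustration of the word grouping setting, the sketch below turns a group name and a comma-separated word list into a lookup table and applies it to a token list. The helper names `parse_group` and `apply_groups` are hypothetical, and lowercasing the words is an assumption rather than documented behavior.

```python
def parse_group(name: str, words_csv: str) -> dict[str, str]:
    """Map each comma-separated word to the group name, ignoring empty entries."""
    return {
        word.strip().lower(): name
        for word in words_csv.split(",")
        if word.strip()
    }


def apply_groups(tokens: list[str], mapping: dict[str, str]) -> list[str]:
    """Replace every token that belongs to a group with the group's name."""
    return [mapping.get(token, token) for token in tokens]


mapping = parse_group("VEHICLE", "car, bus , train")
print(apply_groups(["the", "car", "and", "bus"], mapping))
# ['the', 'VEHICLE', 'and', 'VEHICLE']
```

Grouped tokens are then counted under the single label, so their frequencies are pooled in the later statistics.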
Run Analysis

Click the "Run Analysis" button. A progress bar will update as the corpus is processed. The app displays:
- Summary statistics (number of tokens, types, morphological complexity, etc.)
- A table of significant characteristic words with their internal and global frequencies, test values, and p-values.
- Interactive visualizations for each category.
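To make the reported quantities concrete, here is a small sketch of how internal (per-category) and global frequencies, plus token and type counts, could be computed. It is an assumption-laden illustration, not the app's code: the function names are hypothetical, and the type-token ratio is included only as a familiar lexical measure, since the morphological complexity statistic is not defined here.

```python
from collections import Counter


def frequency_table(docs):
    """Count terms per category and over the whole corpus.

    docs: iterable of (category, tokens) pairs.
    Returns (internal, global_freq), where internal maps category -> Counter.
    """
    global_freq = Counter()
    internal = {}
    for category, tokens in docs:
        global_freq.update(tokens)
        internal.setdefault(category, Counter()).update(tokens)
    return internal, global_freq


def summary(global_freq: Counter) -> dict:
    """Return the number of tokens, the number of types, and their ratio."""
    tokens = sum(global_freq.values())
    types = len(global_freq)
    ratio = types / tokens if tokens else 0.0
    return {"tokens": tokens, "types": types, "type_token_ratio": ratio}


docs = [("A", ["x", "y", "x"]), ("B", ["y", "z"])]
internal, global_freq = frequency_table(docs)
print(internal["A"]["x"], global_freq["y"])  # 2 2
print(summary(global_freq))
```

The internal frequency of a term in a category and its global frequency are the two counts that a significance test for that term would compare.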
Download Results

Once the analysis is complete, select your preferred download format(s) (CSV and/or Excel) and download the results.

File Structure

.
├── main.py           # Main Streamlit application code for characteristic words detection
├── requirements.txt  # Required Python packages and versions
└── README.md         # This file

Requirements

The app requires the following packages (with minimum versions):

streamlit >= 1.40.2
pandas >= 2.1.1
numpy >= 1.25.3
stop-words >= 0.2.5
snowballstemmer >= 2.1.0
scipy >= 1.10.1
plotly >= 5.17.0
openpyxl >= 3.1.2

Contributing

Contributions are welcome! If you have suggestions, bug fixes, or improvements:

- Fork the repository.
- Create a new branch for your feature or bugfix.
- Commit your changes.
- Open a pull request with a detailed description of your modifications.

License

This project is open-source and available under the MIT License.

Contact

Gabriele Di Cicco, PhD in Social Psychology
GitHub | ORCID | LinkedIn

Happy Analyzing!
