This project implements a complete text‑mining pipeline over academic articles gathered from various sources.
Data Collection & Preprocessing
- Web scraping from the UPCH repository (https://repositorio.upch.edu.pe) via the notebook `01_Translation_y_scraping.ipynb` (a scraping sketch follows this list)
- Manual translation review of German passages using `02_find_no_translated.ipynb`
- Text cleaning (lowercasing, punctuation removal, stopword filtering, lemmatization) with `03_preprocesamiento_con_lematizacion.py` (a cleaning sketch follows this list)
- Results saved to `results/tables/`
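The notebook is the source of truth for the scraping step; as a rough illustration, here is a minimal sketch of pulling item titles from a DSpace-style listing page with `requests` and `BeautifulSoup`. The listing path and the CSS selector are assumptions about the repository's layout, not code taken from the notebook.

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://repositorio.upch.edu.pe"

def fetch_titles(listing_path: str) -> list[str]:
    """Fetch one listing page and return the item titles found on it."""
    resp = requests.get(BASE + listing_path, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # DSpace listings often mark item titles with an "artifact-title"
    # class (an assumption here; adjust to the real page structure).
    return [a.get_text(strip=True) for a in soup.select(".artifact-title a")]
```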
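Similarly, the cleaning steps correspond to standard NLP operations; a minimal sketch of the lowercasing/punctuation/stopword/lemmatization chain using spaCy (the function name and the choice of English model are illustrative assumptions, since the script's internals are not shown here):

```python
import spacy

# Assumes an installed English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_text(raw: str) -> str:
    """Lowercase, drop punctuation and stopwords, and lemmatize."""
    doc = nlp(raw.lower())
    return " ".join(
        tok.lemma_ for tok in doc
        if not (tok.is_punct or tok.is_stop or tok.is_space)
    )

# Illustrative example:
# clean_text("The cells were cultured and analyzed.") -> "cell culture analyze"
```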
Vectorization
- TF–IDF
- Word2Vec (pretrained)
- SBERT (pretrained)
The script `03_vectorize.py` outputs matrices to `results/vectorizers/`; a vectorization sketch follows.
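For orientation, a minimal sketch of producing the TF–IDF matrix and SBERT embeddings with scikit-learn and sentence-transformers; the variable names and the model path are illustrative and may differ from what `03_vectorize.py` actually does:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = ["first cleaned abstract", "second cleaned abstract"]  # placeholder corpus

# Sparse TF-IDF matrix: one row per document, one column per term.
tfidf = TfidfVectorizer(max_features=20_000)
X_tfidf = tfidf.fit_transform(docs)

# Dense SBERT embeddings from the locally downloaded model
# (path assumes the model was cloned into models/, see below).
sbert = SentenceTransformer("models/all-mpnet-base-v2")
X_sbert = sbert.encode(docs, show_progress_bar=True)
```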
Installation

- Clone the repository with `git clone https://github.com/YOURUSER/TextMineML.git` and change into the `TextMineML` directory
- Create the environment with `mamba env create -f environment.yml`, then `mamba activate prograv`; alternatively, install dependencies via `pip install -r requirements.txt`
- (Optional) Install external tools via `conda install -c bioconda mafft clustalo fasttree iqtree`
Project Structure

TextMineML/
├── models/ Pretrained embeddings and models
├── results/ Cleaned tables, vectors, clusters
├── scripts/ Pipeline scripts
├── figures/ Generated plots and visualizations
├── README.md Project overview and instructions
Pretrained Models

- Word2Vec: `wiki-news-300d-1M-subword.vec` (download from FastText and unzip into `models/`)
- SBERT: `all-mpnet-base-v2` (clone from Hugging Face into `models/` with Git LFS)
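Once the files are in place, loading them looks roughly like this; a minimal sketch with gensim and sentence-transformers, assuming the `models/` layout above:

```python
from gensim.models import KeyedVectors
from sentence_transformers import SentenceTransformer

# The FastText .vec file is in word2vec text format; loading the full
# 1M-word vocabulary takes a few minutes and several GB of RAM.
w2v = KeyedVectors.load_word2vec_format("models/wiki-news-300d-1M-subword.vec")
print(w2v.most_similar("science", topn=5))

sbert = SentenceTransformer("models/all-mpnet-base-v2")
```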
Usage

Run the preprocessing, vectorization, clustering, and visualization scripts in sequence; outputs appear in `results/` and `figures/`. A driver-script sketch follows.
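One convenient way to run the sequence is a small driver script; a minimal sketch, where the first two script names come from this README and the last two are hypothetical placeholders for the actual clustering and visualization scripts:

```python
import subprocess
import sys

STEPS = [
    "scripts/03_preprocesamiento_con_lematizacion.py",
    "scripts/03_vectorize.py",
    "scripts/04_cluster.py",     # hypothetical name
    "scripts/05_visualize.py",   # hypothetical name
]

for step in STEPS:
    print(f"Running {step} ...")
    # check=True aborts the pipeline if any step exits non-zero.
    subprocess.run([sys.executable, step], check=True)
```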
Generated figures include:

- UMAP projection of document embeddings
- Silhouette score comparison across different cluster counts
- Most frequent words per cluster
- Number of articles in each cluster
- Word cloud visualization for cluster 1
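The silhouette comparison and the UMAP projection can be reproduced along these lines; a minimal sketch with scikit-learn and umap-learn, where the embeddings file name is a hypothetical placeholder:

```python
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.load("results/vectorizers/sbert_embeddings.npy")  # hypothetical file name

# Compare silhouette scores across candidate cluster counts.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")

# 2-D UMAP projection of the document embeddings for plotting.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
```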
Troubleshooting

- Ensure pretrained model files exist in `models/`
- Increase system memory or swap for large corpora
- Recreate the environment to fix dependency issues