The purpose of this file is to:
- Briefly describe the code of the notebook and help the reader understand some of its more difficult parts.
- Describe the training datasets (augmentation.csv).
- Describe the training prompts (train_prompts.csv).
The project was built for Assignment A3, which is about LLM text detection. The description of the project is the following:
We will use the data of a competition that challenges participants to develop a model that can accurately detect whether an essay was written by a (middle/high school) student or by an LLM. Download the data from the challenge, which will comprise student-written essays and (well, toy) LLM-generated ones. The test data of the competition are expected to comprise essays generated by a variety of LLMs and which are hard to distinguish from the ones generated by students. In this assignment, however, you are expected to create your own evaluation data.
A. Data augmentation
- Prompt an LLM to generate essays, so that you balance the data (use both prompts provided by the challenge).
- Build text classifiers on the augmented data, using cross validation with appropriate classification evaluation metrics to assess them, and suggest the best performing one.
- Compute two scores per generated text, one reflecting the maximum and the other the average similarity of that text with student essays.
- Study the correlation between the similarity scores and the prediction probability of your best classifier for the generated texts; compute the prediction probability per text by training the selected classifier on all texts except that one, which is used as a test instance (a.k.a. the leave-one-out cross-validation setting).
- Based on your study so far, decide which generated texts should be discarded in order to improve the benchmark and yield a more robust classifier.
B. Learning curves
- Keep a test set apart and split the train data into portions (10%, …, 90%, 100%).
- Train your best performing algorithm on each portion.
- Assess each trained instance on the test set (the same across portions) and on the training data.
- Visualise the two curves (train, test), based on an appropriate evaluation measure, diagnosing weak and strong points of your classifier (a.k.a. the learning curves).
- Add a regressor to the plot, to estimate how many more texts you should generate to reach the "best" performance.
C. Clustering-based augmentation
- Use K-Means, based on an appropriate text representation and the (estimated) optimum K, to cluster the generated essays, and then the student essays.
- Compare the cluster balance (number of instances per cluster) between the two clusterings.
- Yield a title per cluster, reflecting the topic of the texts included.
- Study the similarities between the two clusterings, by finding clusters comprising similar texts.
- Generate more texts (as in A) in order to better balance your clusters.
- Re-train your best-performing classifier on the new data (or a careful selection of them) and analyze the benefits of using clustering to improve the classifier.
- Data augmentation
- We prompted a GPT LLM via API calls to generate more essays in order to balance the given dataset (see the first sketch after this list).
- For every generated essay we used the uuid library to create a unique id.
- Exploratory analysis of the dataset using plots (essay-length distributions, boxplots, similarity between words).
- We trained classifiers with 10-fold cross-validation and kept the best one based on the classification report AND the overall accuracy from the competition.
- We calculated the maximum and average similarity of each generated essay with the student essays. Stopwords are removed so they are not included in the count, and the common words are computed with set intersection and union (see the similarity sketch after this list).
- For every generated essay we trained our best classifier on all essays except that one and computed the probability that the held-out essay is LLM-generated (leave-one-out, sketched after this list). We plotted the similarities against these probabilities and dropped the generated essays that might degrade the classifier.
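A minimal sketch of the generation step, assuming the openai Python client (v1) and that the instructions come from train_prompts.csv; the model name and helper are illustrative, not the exact code of notebook cell [2]:

    import uuid
    from openai import OpenAI

    client = OpenAI()  # expects the OPENAI_API_KEY environment variable

    def generate_essay(instructions, prompt_id):
        # Ask the LLM to write an essay that follows the challenge prompt.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative model choice
            messages=[{"role": "user", "content": instructions}],
        )
        # uuid gives every generated essay a unique id, as in the original data.
        return {"id": uuid.uuid4().hex, "prompt_id": prompt_id,
                "text": resp.choices[0].message.content, "generated": 1}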
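A minimal sketch of the similarity scores, assuming the essays are plain strings; the stopword list comes from scikit-learn and the function names are illustrative:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def content_words(text):
        # Lowercase, keep alphabetic tokens, drop stopwords before counting.
        return {w for w in text.lower().split() if w.isalpha()} - ENGLISH_STOP_WORDS

    def jaccard(a, b):
        # Common words via intersection, all words via union.
        sa, sb = content_words(a), content_words(b)
        return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

    def similarity_scores(generated, students):
        # One (max, average) pair per generated essay.
        out = []
        for g in generated:
            sims = [jaccard(g, s) for s in students]
            out.append((max(sims), sum(sims) / len(sims)))
        return out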
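And a sketch of the leave-one-out probabilities, assuming 'best_clf' is a scikit-learn pipeline (e.g. TF-IDF + classifier) and 'df' is the training DataFrame; names are illustrative:

    from sklearn.base import clone

    def loo_probability(best_clf, df, idx):
        # Train on every essay except the one at 'idx' ...
        train = df.drop(index=idx)
        model = clone(best_clf).fit(train["text"], train["generated"])
        # ... then predict the probability that the held-out essay is LLM-generated.
        proba = model.predict_proba(df.loc[[idx], "text"])[0]
        return proba[list(model.classes_).index(1)]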
- Learning Curves
- We held out a test set and trained our best classifier on portions of the training set (10%, …, 90%, 100%). For every portion we ran the trained model on both the test set and the training portion, stored the accuracy and F1 score, and plotted them to obtain the learning curves. We reused our best classifier from the previous part and wrapped the procedure in a function so it can be called as many times as needed (see the sketch after this list).
- We used the least-squares method to fit a linear regression model to the accuracy and F1 score of the test set.
- We plotted the regression model together with the learning curves and estimated the number of essays needed to improve our classifier.
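A minimal sketch of the learning-curve loop and the least-squares regressor, assuming a held-out split (X_train, y_train, X_test, y_test) already exists, 'best_clf' is the selected scikit-learn pipeline, and the target score is purely illustrative:

    import numpy as np
    from sklearn.base import clone
    from sklearn.metrics import f1_score

    sizes, train_f1, test_f1 = [], [], []
    for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
        n = int(p * len(X_train))
        model = clone(best_clf).fit(X_train[:n], y_train[:n])
        sizes.append(n)
        train_f1.append(f1_score(y_train[:n], model.predict(X_train[:n])))
        test_f1.append(f1_score(y_test, model.predict(X_test)))

    # Least-squares line through the test scores; extrapolating it gives a
    # rough estimate of the training size needed to reach a target score.
    slope, intercept = np.polyfit(sizes, test_f1, deg=1)
    target = 0.95  # illustrative target
    print("estimated training size:", (target - intercept) / slope)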
- Clustering-based augmentation
- We ran the elbow and silhouette methods to estimate the optimum k for the k-means algorithm, for both the LLM and the student essays (first sketch after this list).
- We ran the k-means algorithm with the optimum k (k=2) and plotted the number of essays per cluster. This yields two clusterings, one for the LLM essays and one for the student essays, each with 2 clusters.
- For each cluster we found the most characteristic words. We vectorized the essays via TF-IDF and applied NMF, which assigns each word a coefficient (weight) reflecting its importance. We kept the 5 top words per cluster and, via the GPT API, turned them into a formal title (second sketch after this list).
- We compared every cluster with all the others (in the same or the other clustering) to find similarities between them (third sketch after this list).
- We plotted the percentage of similarity between the clusters in a bar plot.
- Via clustering we filled any imbalanced cluster in order to obtain a perfectly balanced training dataset.
- Based on the regressor plot we generated more essays, so that our best classifier could be trained with the estimated number of required essays. The generation was balanced, so the number of LLM essays equals the number of student essays.
- We trained the model on the new dataset and compared the competition results with those of the previous models (this training dataset gave the best results).
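A minimal sketch of the elbow/silhouette search for the optimum k, assuming 'X' holds the TF-IDF vectors of one group of essays (run once for the LLM essays and once for the student essays); the range of k is illustrative:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    for k in range(2, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # Elbow: look for the knee in the inertia curve.
        # Silhouette: higher values mean better-separated clusters.
        print(k, round(km.inertia_, 2), round(silhouette_score(X, km.labels_), 3))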
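A sketch of the per-cluster word extraction, assuming 'cluster_texts' holds the essays of one cluster; the GPT call that turns the top words into a formal title is omitted, and the names are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    def top_words(cluster_texts, n_words=5):
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(cluster_texts)
        # A single NMF component assigns every word a non-negative weight;
        # the heaviest words characterise the cluster's topic.
        nmf = NMF(n_components=1, random_state=0).fit(X)
        terms = vec.get_feature_names_out()
        return [terms[i] for i in nmf.components_[0].argsort()[::-1][:n_words]]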
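One plausible way to compare the clusters (the notebook's exact similarity measure may differ) is the cosine similarity between the TF-IDF centroids of each pair of clusters, assuming 'clusters' maps a cluster label to its list of essays:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def cluster_similarities(clusters):
        # One shared vocabulary over all essays, then one centroid per cluster.
        names = list(clusters)
        vec = TfidfVectorizer(stop_words="english").fit(
            [t for texts in clusters.values() for t in texts])
        centroids = np.vstack(
            [np.asarray(vec.transform(clusters[n]).mean(axis=0)) for n in names])
        sims = cosine_similarity(centroids)
        return names, sims  # sims[i, j] compares cluster i with cluster j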
- train_essays.csv is the first training dataset, which is used to train our classifiers. The columns of this file are:
Column1: id, Column2: prompt_id, Column3: text, Column4: generated.
- Column 'id' contains the unique id for each essay
- Column 'prompt_id' contains the number of the prompt (0 or 1) that was used for the generation/writing of the essay.
- Column 'text' contains the essay
- Column 'generated' is the ground truth of each essay: 0 if the essay was written by a student, 1 if it was generated by an LLM.
- best_train_essays.csv has the same structure as the file above, but the generated essays that might degrade the classifier (identified in part A) have been removed for a more robust classifier.
- augmentation.csv is the final training dataset. It has the same structure, with one additional column, 'cluster', which contains the cluster that each essay belongs to.
- train_prompts.csv contains the prompts/instructions for the construction of an essay. The columns of this file are:
Column1: prompt_id, Column2: prompt_name, Column3: instructions, Column4: source_text
- Column 'prompt_id' contains the number associated with each topic.
- Column 'prompt_name' contains the name of the topic related to the id.
- Column 'instructions' contains the instructions needed for the construction of an essay (either by an LLM or a student).
- Column 'source_text' contains more information about the topics.
- test_essays.csv is needed only for the competition; there is no exploratory analysis for it, it is only used in a few code blocks.
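A quick way to inspect the datasets, assuming the files live in the 'data' folder:

    import pandas as pd

    train = pd.read_csv("data/train_essays.csv")
    prompts = pd.read_csv("data/train_prompts.csv")
    print(train.columns.tolist())             # id, prompt_id, text, generated
    print(train["generated"].value_counts())  # 0 = student, 1 = LLM
    print(prompts["prompt_name"].tolist())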
- Please check the 'requirements.txt' for the necessary libraries.
- The CSV files required to run the code are inside the folder 'data'.
- Open the Jupyter Notebook 'Assignment_A3' to check the code, descriptions & plots.
- For a quick view open report.pdf to check the plots.
- Each cell of the notebook is executable except for cell [2], which requires a token for the GPT API and costs money to run (approximately 5.5 euros). The data generated via the GPT API were saved in a .csv file, so there is no need to run that block.
If a problem occurs with the code, please check that the files are correctly imported.
Student: Rafail Mpalis