A Transformer-based library for SocialNLP classification tasks.
Currently supports:
- Sentiment Analysis (Spanish, English)
- Emotion Analysis (Spanish, English)
- Hate Speech Detection (Spanish, English)
Install it with pip install pysentimiento and start using it:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")
analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})
analyzer.predict("jejeje no te creo mucho")
# returns SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
"""
Emotion Analysis in English
"""
emotion_analyzer = create_analyzer(task="emotion", lang="en")
emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})
Also, you can use the pretrained models directly with the transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("pysentimiento/robertuito-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("pysentimiento/robertuito-sentiment-analysis")
pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.
from pysentimiento.preprocessing import preprocess_tweet
# Replaces user handles and URLs with special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"
# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"
# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"
# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"
# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'
Check CLASSIFIERS.md for details on the reported performance of each model.
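To make the preprocessing steps shown earlier more concrete, here is a minimal sketch of how the first three of them (user handles and URLs, repeated characters, laughter) could be approximated with regular expressions. This is not the library's implementation: the function name sketch_preprocess, its parameters, and the patterns are assumptions made for illustration, and hashtag and emoji handling are left out. Use preprocess_tweet for real work.

```python
import re

# Hypothetical, simplified stand-in for illustration only.
# It is NOT pysentimiento's preprocess_tweet and covers fewer cases.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
# A word made only of "a"/"j", at least 4 characters long, containing a "j"
LAUGH_RE = re.compile(r"\b(?=[aj]*j)[aj]{4,}\b", re.IGNORECASE)


def sketch_preprocess(text, user_token="@usuario", url_token="url", shorten=2):
    """Approximate three of the normalization steps with regexes."""
    # Replace URLs first, since a URL may itself contain an "@"
    text = URL_RE.sub(url_token, text)
    text = USER_RE.sub(user_token, text)
    # Collapse any run of more than `shorten` equal characters to `shorten`
    repeat_re = re.compile(r"(.)\1{%d,}" % shorten)
    text = repeat_re.sub(lambda m: m.group(1) * shorten, text)
    # Map every laughter-like token to a canonical "jaja"
    text = LAUGH_RE.sub("jaja", text)
    return text


print(sketch_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
# @usuario debería cambiar esto url
print(sketch_preprocess("no entiendo naaaaaaaadaaaaaaaa"))
# no entiendo naadaa
print(sketch_preprocess("jajajajaajjajaajajaja no lo puedo creer ajajaj"))
# jaja no lo puedo creer jaja
```

The sketch runs the URL replacement before the handle replacement so that an "@" inside a link is not mistaken for a mention. The laughter pattern only covers the Spanish "jaja" family.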
- Clone and install
git clone https://github.com/pysentimiento/pysentimiento
pip install poetry
poetry shell
poetry install
- Download the TASS 2020 data to data/tass2020 (registration is required to download the dataset)
Labels must be placed under data/tass2020/test1.1/labels
Open an issue or email us if you are not able to get the data.
- Run the script to train the models
Check TRAIN.md for further information on how to train your models.
- Upload models to Hugging Face's Model Hub
Check the "Model sharing and upload" instructions in the Hugging Face docs.
pysentimiento is an open-source library. However, please be aware that the models are trained with third-party datasets and are subject to their respective licenses, many of which are for non-commercial use only.
- TASS Dataset license (Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
- SemEval 2017 Dataset license (Sentiment Analysis in English)
If you use pysentimiento in your work, please cite this paper:
@misc{perez2021pysentimiento,
title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
year={2021},
eprint={2106.09462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).