This repo aggregates sources for the Greek language to tackle various Natural Language Processing/Understanding/Generation needs.
Contents
Greek is spoken by the majority of the population in two countries.
| X | Country - ISO Language code |
|---|---|
| CY | |
| GR | |
Morphological and syntactic annotations of a Greek corpus. This Greek UD source is used by many other pretrained open-source components.
Manually annotated: lemmas, dependencies, POS, features.
Genres: news, wiki, spoken
Sources: public domain, Wikinews articles, European Parliament session texts.
Corpus size: 2,521 sentences / 61,673 tokens.
https://universaldependencies.org/treebanks/el_gdt/index.html
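UD treebanks are distributed in the CoNLL-U format: one word per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. As a minimal sketch of reading such a file with the standard library only (the function name `read_conllu` and the field names in `CONLLU_FIELDS` are chosen here for illustration and are not part of any library):

```python
from typing import Dict, Iterable, List

# Column names follow the CoNLL-U column order; the lowercase keys are our own choice.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U lines into sentences, each a list of word dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            # Skip malformed rows, multiword ranges ("1-2") and empty nodes ("1.1").
            continue
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        # Handle a file that does not end with a blank line.
        sentences.append(current)
    return sentences
```

Passing an open file object (opened with `encoding="utf-8"`) works because files iterate line by line. The number of rows kept this way counts syntactic words, so it may differ from the token count quoted above.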
Greek text requires accent and diacritic removal. Some newer tokenizers include this step, but earlier versions do not. https://legacy.cltk.org/en/latest/greek.html
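For pipelines that do not strip accents themselves, one common approach is Unicode decomposition followed by dropping the combining marks. The sketch below uses only Python's standard `unicodedata` module; the function name `strip_accents` is hypothetical, and this is not necessarily how any of the listed tools implement the step:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents and other diacritics from monotonic or polytonic Greek."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers tonos, breathings, dialytika, etc.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("Καλημέρα"))  # Καλημερα
```

Note that this also removes the dialytika (for example in "προϊόν"), which may or may not be desirable depending on the downstream task.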
Spacy lemmatizer (trainable lemmatizer)
Depending on the situation, we might need different corpus tokenization. The sources below include general tokenizers for word, sentence, and paragraph tokenization.
Spacy Tokenizer. A Greek pipeline component, senter, is also available for sentence segmentation.
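To illustrate what word and sentence tokenization involve for Greek, here is a naive rule-based sketch using only the standard library. It is not the Spacy tokenizer or senter; the names `split_sentences` and `tokenize_words` are hypothetical, and the rules are a simplifying assumption (for instance, they will wrongly split after abbreviations that end in a full stop). The one Greek-specific detail is that the question mark looks like a semicolon.

```python
import re
import unicodedata

# Sentence-final punctuation: full stop, exclamation mark, and the Greek
# question mark, written as ";" (U+003B) or U+037E.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!;\u037e])\s+")
# A token is either a run of word characters or a single punctuation character.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> list:
    """Split text on whitespace that follows sentence-final punctuation."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize_words(sentence: str) -> list:
    """Split a sentence into word and punctuation tokens."""
    # NFC keeps accented letters as single code points so \w+ does not break them.
    return _TOKEN.findall(unicodedata.normalize("NFC", sentence))


for sentence in split_sentences("Τι κάνεις; Είμαι καλά."):
    print(tokenize_words(sentence))
```

A trained component is usually preferable on real text; a sketch like this is mainly useful as a baseline or for quick inspection of a corpus.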
Spacy offers other helpful components: morphologizer, dependency parser, attribute ruler.
| Source | Supported labels | Link |
|---|---|---|
| Spacy | EVENT, GPE, LOC, ORG, PERSON, PRODUCT | Spacy models |
| Spark NLP | ||
| Stanza | ||
| AUEB | LOC, ORG, PERSON | gr-nlp-toolkit transformer-based |
| Package | Details | Link |
|---|---|---|
| Spark NLP | Multilingual (wrapped from Hugging Face) | |
| Transformers | Multilingual | |
Cross-lingual QA dataset: XQuAD
BERT model pretrained only on a Greek corpus.
bert-base-greek-uncased-v1
List of 144,000 Classical Greek proper nouns