NLP for Greek language

This repo aggregates resources for the Greek language that address various Natural Language Processing, Understanding, and Generation needs.

Contents

The language and countries

Greek is spoken by the majority of the population in two countries.

Country	ISO country code
Cyprus	CY
Greece	GR

Greek Tree Bank

Morphological and syntactic annotations of a Greek corpus. This Greek UD treebank is used by many other pretrained open-source components.

Manually annotated: lemmas, dependencies, POS, features.

Genres: news, wiki, spoken

Sources: public domain, Wikinews articles, European Parliament session texts.

Corpus size: 2,521 sentences / 61,673 tokens.

https://universaldependencies.org/treebanks/el_gdt/index.html
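UD treebanks are distributed in the CoNLL-U format, so figures like the corpus size above can be checked with a few lines of code. The following is a minimal sketch using only the standard library. The function name `count_conllu` and the two sample sentences are made up for illustration and are not taken from the treebank. It counts lines with plain integer IDs, which may differ from how the official statistics count tokens.

```python
def count_conllu(text):
    """Count sentences and integer-ID word lines in CoNLL-U formatted text."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        token_id = line.split("\t")[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not token_id.isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


def _block(rows):
    # Turn space-separated rows into tab-separated CoNLL-U lines.
    return "\n".join("\t".join(row.split()) for row in rows)


# Hand-written sample for illustration only (columns: ID FORM LEMMA UPOS
# XPOS FEATS HEAD DEPREL DEPS MISC).
SAMPLE = (
    "# text = Η γάτα κοιμάται.\n"
    + _block([
        "1 Η ο DET _ _ 2 det _ _",
        "2 γάτα γάτα NOUN _ _ 3 nsubj _ _",
        "3 κοιμάται κοιμάμαι VERB _ _ 0 root _ _",
        "4 . . PUNCT _ _ 3 punct _ _",
    ])
    + "\n\n# text = Καλημέρα!\n"
    + _block([
        "1 Καλημέρα καλημέρα INTJ _ _ 0 root _ _",
        "2 ! ! PUNCT _ _ 1 punct _ _",
    ])
    + "\n"
)

if __name__ == "__main__":
    print(count_conllu(SAMPLE))
```

To run it against the real treebank, read one of the downloaded `.conllu` files into a string and pass it to `count_conllu`.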

Pipeline Components

Accentuation and diacritics

Greek text often needs its accents and diacritics removed before further processing. Some newer tokenizers include this step, but earlier versions do not. https://legacy.cltk.org/en/latest/greek.html
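When a tokenizer does not handle this step, accents can be stripped with Python's standard `unicodedata` module. This is a minimal sketch, not the CLTK implementation. The function name `strip_accents` is an assumption made for this example.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accents and diacritics from Greek (or any) text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn", nonspacing mark).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_accents("Καλημέρα"))
    print(strip_accents("ἄνθρωπος"))
```

The same function handles both monotonic (modern) and polytonic (ancient) text. Stripping accents loses information: for example, "πότε" (when) and "ποτέ" (never) become identical, so apply it only when the downstream component expects unaccented input.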

Lemmatization

spaCy lemmatizer (trainable lemmatizer)

JohnSnowLabs Greek lemmatizer

CLTK Greek lemmatizer

Tokenization

Depending on the situation, we might need different kinds of corpus tokenization. The sources below include general tokenizers for word, sentence, and paragraph tokenization.
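To make the three levels concrete, here is a rough regex-based sketch using only the standard library. It is not a substitute for the NLTK or spaCy tokenizers listed below, and the function names are made up for this example. It assumes NFC-normalized input and does not handle abbreviations. One Greek-specific detail is worth showing: the Greek question mark is written ";", so it has to be treated as a sentence-final mark.

```python
import re

# Sentence-final marks: full stop, exclamation mark, and the Greek question
# mark, written either as ";" (U+003B) or as U+037E.
_SENTENCE_END = re.compile(r"(?<=[.!;\u037e])\s+")
# A word is a run of letters/digits; any other non-space character is
# emitted as its own punctuation token.
_WORD = re.compile(r"\w+|[^\w\s]")
# Paragraphs are separated by one or more blank lines.
_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def split_paragraphs(text):
    return [p.strip() for p in _PARAGRAPH_BREAK.split(text) if p.strip()]


def split_sentences(paragraph):
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def split_words(sentence):
    return _WORD.findall(sentence)


if __name__ == "__main__":
    text = "Τι κάνεις; Είμαι καλά.\n\nΕυχαριστώ!"
    for paragraph in split_paragraphs(text):
        for sentence in split_sentences(paragraph):
            print(split_words(sentence))
```

A rule-based splitter like this will break on abbreviations and on a ";" that appears inside quoted text, which is one reason to prefer the trained tokenizers from the libraries below for real corpora.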

NLTK

NLTK tokenizer module

spaCy

spaCy Tokenizer. A pipeline component for Greek, senter, is also available for sentence segmentation.

Other

spaCy offers other helpful components: morphologizer, dependency parser, attribute ruler.

NLP tasks

Named Entity Recognition

Source	Supported labels	Link
spaCy	EVENT, GPE, LOC, ORG, PERSON, PRODUCT	spaCy models
Spark NLP
Stanza
AUEB	LOC, ORG, PERSON	gr-nlp-toolkit (transformer-based)

Translation

Package	Details	Link
Spark NLP	Multilingual (wrapped from Hugging Face)
Transformers	Multilingual

Question Answering

Cross-lingual QA dataset: XQuAD

Transformers model

BERT model pretrained on a Greek-only corpus.

bert-base-greek-uncased-v1

Greek BERT

Other

Proper nouns

List of 144,000 Classical Greek proper nouns

Ancient Greek

Some handy stuff for Ancient Greek
