This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.
You can easily install the CATT models as a package using pip:

```bash
pip install catt-tashkeel
```

Then you can import the classes and use them directly. Here is an example:

```python
from catt_tashkeel import CATTEncoderDecoder, CATTEncoderOnly

eo = CATTEncoderOnly()
ed = CATTEncoderDecoder()

text = 'وقالت مجلة نيوزويك الأمريكية التحديث الجديد ل إنستجرام يمكن أن يساهم في إيقاف وكشف الحسابات المزورة بسهولة شديدة'

print(eo.do_tashkeel_batch([text], verbose=False))
print(ed.do_tashkeel_batch([text], verbose=False))
print(ed.do_tashkeel(text, verbose=False))
print(eo.do_tashkeel(text, verbose=False))
```

You first need to download the models, which you can find in the Releases section of this repo.
The best checkpoint for Encoder-Decoder (ED) model is best_ed_mlm_ns_epoch_178.pt.
For the Encoder-Only (EO) model, the best checkpoint is best_eo_mlm_ns_epoch_193.pt.
Use the following bash commands to download the models:

```bash
mkdir models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
```

You can use the inference code examples: predict_ed.py for ED models and predict_eo.py for EO models.
Both examples are provided with batch inference support. Read the source code to gain a better understanding.
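Batch inference boils down to passing a list of sentences and processing it in fixed-size chunks. As a generic sketch of the idea (illustrative only, not the scripts' actual code):

```python
# Generic batching sketch -- illustrative only, not the code used by
# predict_ed.py / predict_eo.py.
def chunks(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

sentences = ['الجملة الأولى', 'الجملة الثانية', 'الجملة الثالثة']
for batch in chunks(sentences, 2):
    print(batch)  # two batches: sizes 2 and 1
```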
```bash
python predict_ed.py
python predict_eo.py
```

EO models are recommended for faster inference.
ED models are recommended for better accuracy of the predicted diacritics.
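Diacritization accuracy is commonly reported as a Diacritic Error Rate (DER). As a minimal, hypothetical sketch (not this repo's DER scripts), assuming the reference and prediction share the same undiacritized base text:

```python
# Minimal Diacritic Error Rate (DER) sketch -- illustrative only.
# Covers the basic harakat range U+064B..U+0652 (tanween, fatha, damma,
# kasra, shadda, sukun).
HARAKAT = set('\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652')

def split_diacritics(text):
    """Return a list of (base_char, attached_diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in HARAKAT and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ''))
    return pairs

def der(reference, prediction):
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    assert [b for b, _ in ref] == [b for b, _ in hyp], 'base text must match'
    errors = sum(1 for (_, r), (_, h) in zip(ref, hyp) if r != h)
    return errors / max(len(ref), 1)

# One of the three letters carries a wrong diacritic -> DER = 1/3.
print(der('كَتَبَ', 'كَتِبَ'))
```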
To convert your trained PyTorch models to ONNX format, use the export script:
```bash
python export_to_onnx.py
```

This script will:
- Load your trained PyTorch model checkpoints
- Export separate ONNX models for encoder and decoder components
- Validate the exported models for correctness
- Save the ONNX models in the onnx_models/ directory
Output files:
- encoder.onnx: the encoder component
- decoder.onnx: the decoder component (or a linear layer for encoder-only models)
To test and run inference with the exported ONNX models:
```bash
python run_onnx.py
```

This script will:
- Load the exported ONNX models
- Run inference using ONNX Runtime
For more details on the export process, check the export_to_onnx.py script configuration.
To start training, you need to download the dataset from the Releases section of this repo.

```bash
wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
unzip dataset.zip
```

Then, edit the script train_catt.py and adjust the default values:
```python
# Model's Configs
model_type = 'ed'  # 'eo' for Encoder-Only OR 'ed' for Encoder-Decoder
dl_num_workers = 32
batch_size = 32
max_seq_len = 1024
threshold = 0.6

# Pretrained Char-Based BERT
pretrained_mlm_pt = None  # Use None to initialize weights randomly, OR the path to the char-based BERT
#pretrained_mlm_pt = 'char_bert_model_pretrained.pt'
```

Finally, run the training script:

```bash
python train_catt.py
```

- This code is mainly adapted from this repo.
- Older versions of some Arabic scripts available in the pyarabic library were also used.
- inference script
- upload our pretrained models
- upload CATT dataset
- upload DER scripts
- training script
This repository has updated its license from Creative Commons Attribution-NonCommercial (CC-BY-NC) to Apache 2.0 License. This change removes commercial use restrictions and adopts an industry-standard open-source license, enabling broader adoption and collaboration. This transition supports the project's growth while maintaining our commitment to open-source principles.