0% found this document useful (0 votes)
47 views36 pages

Btech English

The document is a project report on a Transformer-based Neural Machine Translation system for English-to-Telugu translation, addressing challenges posed by Telugu's complex syntax and morphology. It details the use of state-of-the-art pretrained models like NLLB and the implementation of SentencePiece tokenization to enhance translation quality. The project aims to improve cross-lingual communication and has practical applications in education, e-governance, and media localization.

Uploaded by

randomloaded07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views36 pages

Btech English

The document is a project report on a Transformer-based Neural Machine Translation system for English-to-Telugu translation, addressing challenges posed by Telugu's complex syntax and morphology. It details the use of state-of-the-art pretrained models like NLLB and the implementation of SentencePiece tokenization to enhance translation quality. The project aims to improve cross-lingual communication and has practical applications in education, e-governance, and media localization.

Uploaded by

randomloaded07
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

A

PROJECT SCHOOL REPORT ON

LANGUAGE TRANSLATION TRANSFORMER

Submitted by

MANDIGA KARTHIKEYA 23P81A6736

PERUGU JAHNAVI 23P81A6743

BIKUMALLA SRIWATSAVA 23P81A6713

MANURI SRI CHARAN 23P81A6737

SIRISANAGANDLA SOWMYA 23P81A6746

Under the guidance of

Mrs.G.Keerthi

Assistant Professor, CSE(AIML), KMCE

KESHAV MEMORIAL COLLEGE OF ENGINEERING


Koheda Road, chinthapalleguda, Ibrahimpatnam, Telangana 501510.

May 2025
KESHAV MEMORIAL COLLEGE OF ENGINEERING
A Unit of Keshav Memorial Technical Education (KMTES)
Approved by AICTE, New Delhi & Affiliated to JNTUH University, Hyderabad

CERTIFICATE

This is to certify that the project work entitled “LANGUAGE TRANSLATION


TRANSFORMER” is a bonafide work carried out by “Mandiga Karthikeya, Perugu Jahnavi,
Bikumalla Sriwatsava, Manuri Charan ,Sirisanagandla Sowmya” of II-year II semester
Bachelor of Engineering in CSE/CSE (CSD)during the academic year 2024-2025 and is a
record of 2onafide work carried out by them.

Project Mentor
Mrs.G.Keerthi
Assistant Professor, CSE (AIML)

KMCE
ABSTARCT

In today’s globalized and digitally interconnected world, effective cross-lingual communication is


essential. This project presents a Transformer-based Neural Machine Translation (NMT) system
for English-to-Telugu translation—a relatively low-resource and morphologically rich language
pair. Telugu’s complex syntax (Subject-Object-Verb structure), agglutinative morphology, and
scarcity of high-quality parallel corpora pose significant challenges for machine translation. To
address these, the system leverages state-of-the-art pretrained Transformer models such as
MarianMT, OpenNMT and NLLB (No Language Left Behind), either directly or through fine-
tuning on custom English–Telugu datasets. SentencePiece tokenization is employed to manage
subword segmentation, enhancing the model’s ability to handle vocabulary sparsity and
morphological variation.
The translation output is evaluated using BLEU and METEOR metrics, with particular emphasis
on METEOR due to its sensitivity to linguistic nuances in morphologically rich languages. The
model demonstrates notable improvements in producing fluent, contextually accurate translations,
even for complex sentence structures. This work has practical applications in areas such as e-
governance, education, media localization, and digital accessibility, particularly for Telugu-
speaking populations in underrepresented regions. By leveraging Transformer architectures and
multilingual transfer learning, the project aims to advance inclusive language technologies and
promote digital equity.
CONTENTS
S.No. Title Page No.
ABSTRACT i
TABLE OF CONTENTS ii
LIST OF FIGURES iii
LIST OF TABLES iv

1. Introduction 7

2. Literature Survey 9
2.1 Machine Translation
2.2 Transformer Based Models For Translation
2.3 No Language Left Behind(NLLB)
2.4 Related Work

3. Software Requirements Specifications


3.1 System Overview
3.2 Operating Environment
3.3 Functional Requirements
3.4 Non-Functional Requirements

4. Proposed Work Architecture, Technology Stack, 14


Implementation Details
4.1 Proposed Work Architecture
4.2 Technology Stack
4.3 Implementation Details
4.4 Interfaces and Communication

4 Results and Discussions 27


4.1 Results
4.2 Discussions

5 Conclusion & Future Scope 30


5.1 Conclusion
5.2 Future Scope

6 References 32
LIST OF FIGURES

Fig. No. Figure Name Page No .


1 Architecture diagram 14

2 Login Page 24

3 SignUp Page 24

4 Translation Page 24

5 History Page 25
LIST OF TABLES

Table No. Table Name Page No.


1 TechStack Summary 20
2 Comparative analysis of our proposed model 28
CHAPTER 1
INTRODUCTION
The field of Natural Language Processing (NLP) has witnessed revolutionary progress with the
advent of Transformer-based architectures, especially in the domain of Machine Translation (MT).
Among various translation directions, English-to-Telugu translation presents a critical yet
challenging task due to the linguistic dissimilarities between the source and target languages.
English, being a syntactically simpler and morphologically poor language, contrasts sharply with
Telugu, a morphologically rich and syntactically complex Dravidian language with Subject-
Object-Verb (SOV) structure. This structural disparity leads to difficulties in maintaining fluency,
semantic fidelity, and grammatical correctness during translation.

Recent developments in pretrained Transformer models have significantly advanced multilingual


translation capabilities, particularly for low-resource languages like Telugu. Unlike traditional
statistical or rule-based MT systems, Transformer-based approaches leverage self-attention
mechanisms and contextual embeddings to capture long-range dependencies and nuanced meaning
in both source and target texts. Pretrained models such as mBART, MarianMT, IndicTrans, and
NLLB (No Language Left Behind) offer robust multilingual capabilities through large-scale
training on cross-lingual corpora. These models provide strong initialization, enabling effective
zero-shot and few-shot translation, even in the absence of large-scale Telugu-English parallel
datasets.

Despite the advancements, there are three primary challenges in English-to-Telugu machine
translation. First, pretrained multilingual models often treat low-resource languages as second-
class citizens due to imbalanced training data, leading to suboptimal translations. Second, English-
to-Telugu translation suffers from semantic misalignment due to differences in word order, gender-
specific grammar, and rich inflections in Telugu that are not directly represented in English. Third,
while pretrained models demonstrate high BLEU scores, they often lack cultural and contextual
fluency, especially in idiomatic or domain-specific expressions.

To overcome these challenges, our project proposes a robust translation pipeline utilizing a fine-
tuned pretrained Transformer model specifically adapted for English-to-Telugu translation. We
incorporate SentencePiece tokenization to handle subword units, especially for out-of-vocabulary
tokens and morphologically complex Telugu words. Our architecture leverages pretrained
checkpoints from NLLB-200 and IndicTrans2, selected for their strong performance on low-
resource Indian languages. We further enhance semantic alignment through domain-specific fine-
tuning, post-editing routines, and custom evaluation metrics like METEOR, which are better suited
to morphologically rich languages than traditional BLEU scores.

This system is designed for practical applications in e-governance, education, content localization,
and digital accessibility, aiming to democratize access to information for Telugu-speaking
populations. By integrating pretrained Transformer models with domain-specific adaptation and
semantic evaluation, our project provides a high-quality, efficient, and scalable solution for
English-to-Telugu machine translation, contributing to the broader vision of inclusive AI and
digital equality.
CHAPTER 2
LITERATURE SURVEY
2.1 MACHINE TRANSLATION
Machine Translation (MT) refers to the process of automatically converting text from one language
to another using computational models. With the advancement of deep learning, particularly
Transformer-based architectures, the quality of translations has improved significantly.
Transformers rely on self-attention mechanisms and positional encodings, which allow them to
model long-range dependencies and contextual relationships more effectively than previous
sequence-to-sequence models like RNNs or LSTMs.

However, for low-resource language pairs like English to Telugu, several challenges persist. These
include the lack of parallel corpora, morphological complexity, syntactic divergence, and domain
specificity. To overcome these issues, researchers have turned to pretrained multilingual models
such as mBART, MarianMT, mT5, and NLLB, which can be fine-tuned on custom datasets to
enhance translation quality for specific language pairs.

2.2 Transformer Based Models For Translation


The Transformer architecture, introduced by Vaswani et al. (2017), laid the foundation for many
successful translation models. It replaces traditional recurrent structures with attention
mechanisms, enabling faster training and better handling of long sequences.

Several pretrained Transformer-based models have been used for machine translation:
mBART (Multilingual BART) is a denoising autoencoder for pretraining sequence-to-sequence
models across multiple languages.
mT5 is a multilingual version of the T5 model trained on a massive multilingual corpus for text-
to-text tasks.

MarianMT is an efficient multilingual model based on the Transformer architecture and trained
by the OPUS project.

While these models provide decent translation quality, they often struggle with low-resource
languages such as Telugu, due to shared tokenizers and less specialized training for
morphologically rich scripts.

2.3 No Language Left Behind(NLLB) Model


The No Language Left Behind (NLLB) project by Meta AI introduced NLLB-200, a cutting-
edge multilingual model that supports over 200 languages, including low-resource ones. It was
specifically designed to address the challenges faced in translating underrepresented languages.

Key characteristics of NLLB include:

Mixture of Experts (MoE) Architecture: Only a subset of the model’s parameters are activated
per input, enabling specialized and efficient language-specific processing.

Custom SentencePiece Tokenization: Tailored subword tokenization per language improves


linguistic representation for morphologically rich languages like Telugu.

Language-Aware Training: With language-specific routing and optimized sampling strategies,


the model minimizes interference across languages.

High-Quality Training Data: Trained on FLORES-200 benchmark and large multilingual


corpora, ensuring diverse and high-quality representations.

Zero-shot Translation Capability: The model performs strongly even on language pairs it was
not explicitly fine-tuned on.

The NLLB model has shown superior performance across numerous benchmarks, especially for
low-resource languages like Telugu, making it highly suitable for customized machine translation
tasks.

2.4 Related Work


Numerous studies and models have been proposed to improve translation quality, especially for
low-resource languages. Some of the most relevant works include:
1. NLLB: No Language Left Behind (Meta AI)
Purpose: To improve translation for low-resource languages using multilingual Transformers.

Model Architecture: Uses a Mixture of Experts (MoE) model with language-specific routing.

Tokenization: Custom SentencePiece tokenization tailored per language for better subword
handling.

Training Corpus: Trained on FLORES-200 and other high-quality multilingual datasets.

Performance: Outperforms models like mBART and mT5 in BLEU, chrF, and METEOR scores
for many Indic languages, including Telugu.

ADVANTAGES

1 Specifically optimized for underrepresented languages.

2 Less data required for fine-tuning due to strong zero-shot capabilities.

3 Efficient computation due to MoE.

Reason for Selection:


Among all existing models, NLLB consistently provides the best results for Telugu, both in terms
of accuracy and language preservation. Its architecture directly addresses the limitations of earlier
models, making it the ideal choice for this research.

2. mBART (Facebook AI)


Description: A denoising autoencoder model trained for sequence-to-sequence translation across
multiple languages.

LIMITATIONS:
• Tokenizer is shared across languages, which may not handle the nuances of Telugu script
well.
• Training primarily focused on high-resource languages, leading to sub-optimal results for
Telugu without extensive fine-tuning.

3. MarianMT (OPUS Project)


Description: Transformer-based multilingual model trained on OPUS corpus.

Strengths:

• Lightweight and open-source; easy to deploy.


• General-purpose architecture, not optimized for Indic languages.

LIMITATIONS:
• Translation accuracy is often inconsistent for morphologically rich languages like Telugu.

4. mT5 (Google)
Description: Multilingual T5 trained on mC4 corpus.

Strengths: Supports various text-to-text tasks beyond translation.

LIMITATIONS:

• Performs well in zero-shot settings but not fine-tuned for specific translation directions like
English ↔ Telugu.
SOFTWARE REQUIREMENTS SPECIFICATION

Purpose
The purpose of this project is to develop an English-to-Telugu Language Translation
System using a Transformer-based model. The goal is to translate text from English to
Telugu in a way that maintains the original meaning and context. This system focuses on
delivering accurate, meaningful translations for everyday use, supporting better
communication and understanding between English and Telugu speakers.

Scope
The project involves developing a Transformer-based deep learning model to translate text
from English to Telugu. The system aims to achieve high-quality, context-aware
translations, enabling a range of impactful applications:

• Educational Content Accessibility: Making English learning materials easily


understandable for Telugu-speaking students, helping bridge the language gap in schools
and colleges.
• Government and Public Communication: Automating translation of official
documents, notices, and services into Telugu to ensure clear communication with the local
population.
• Content Localization and Media: Helping creators, websites, and businesses convert
English content into Telugu to better connect with regional audiences.
Definitions
Transformer: A deep learning model architecture based on self-attention, widely used for
tasks like language translation and text generation.
NLLB: No Language Left Behind, a pre-trained multilingual model by Meta AI that
enables high-quality translation for low-resource languages like Telugu.
BLEU: Bilingual Evaluation Understudy, a metric used to measure the accuracy of
machine-translated text by comparing it with human-written reference translations.
References
Dataset: Custom English-Telugu parallel corpus created from open sources
Research Paper: Vaswani et al., Attention is All You Need, NeurIPS, 2017.
3.1 System Overview
3.1.1 Overall Description
This system translates natural language text from English to Telugu using the pre-trained
NLLB (No Language Left Behind) Transformer model developed by Meta AI. The input
is an English sentence, which is tokenized and processed by the model to generate an
accurate Telugu translation. The system ensures fast and reliable translation, making it
useful for multilingual communication, educational tools, and real-world language
translation applications.
Key Features
English-to-Telugu Translation: Converts English sentences into contextually accurate
Telugu using the NLLB model.
Real-time Output: Provides fast and efficient translation results instantly after input.
3.2 Operating Environment
3.2.1 Software Requirements
Operating System:
Windows 10/11, Linux (Ubuntu 20.04+), or macOS
Programming Language:
Python 3.8+
Javascript
HTML & CSS
Libraries:
• Transformers (Hugging Face)
• Torch (PyTorch)
• Tokenizers
• Flask
• Flask-CORS
• Jsonify

3.2.2 Hardware Requirements


Recommended:
• CPU: Intel i5 or AMD Ryzen 5 (or equivalent)

• RAM: 8 GB or higher
• GPU: NVIDIA GTX 1050 or equivalent
• Storage: Minimum 10 GB free disk space
3.3 Functional Requirements
3.3.1 Data Collection and Storage
• Dataset: Custom English-Telugu parallel corpus collected from open sources
• Storage Structure: Sentences stored in JSON/CSV format with corresponding English
and Telugu pairs
3.3.2 Data Preprocessing
• Clean and normalize text (remove extra spaces, punctuation, etc.)
• Tokenize sentences using the NLLB tokenizer
• Convert tokenized text into model-compatible input tensors
3.3.3 Model Integration
• Model Used: Pre-trained NLLB model from Hugging Face
• Inference Pipeline:
o Encode input English sentence using tokenizer
o Translate using NLLB model
o Decode output to generate Telugu sentence
3.3.4 Translation Interface
• Web interface for users to input English sentences
• Display corresponding Telugu translations in real-time
• Option to copy, clear, or save translations
3.3.5 Deployment
• Local Development: Tested using Flask and Express.js on localhost
• Frontend: Built using HTML, CSS, and JavaScript
• Future Deployment: Cloud hosting options

3.4 Non-Functional Requirements


3.4.1 Performance
• Ensure translations are generated within 5 seconds for standard-length sentences.
• Optimize inference using efficient tokenization and batching techniques.
• Minimize latency in frontend-backend communication.
3.4.2 Security
• Input text is processed locally or on secure backend without logging sensitive data.
• Secure API endpoints using authentication if deployed online.
• Prevent unauthorized access to the pre-trained NLLB model and backend code.
3.4.3 Scalability
• Supports deployment on GPU-enabled environments for faster batch processing.
• Easily extendable for storing translation logs, user history.
CHAPTER 3
PROPOSED WORK ARCHITECTURE, TECHNOLOGY STACK,
IMPLEMENTATION DETAILS

3.1 Proposed Work Architecture

Figure 1. Architecture diagram

The proposed work aims to design and implement a robust, scalable, and semantically accurate
machine translation framework using Transformer-based architectures, with a primary focus on
the NLLB (No Language Left Behind) pretrained model by Meta AI. The goal is to enable high-
quality translation from English to Telugu, particularly tuned for regional linguistic nuances,
contextual accuracy, and efficiency in deployment for web-based applications.

This project leverages pretrained multilingual models while integrating a custom dataset to fine-
tune translation performance for English–Telugu pairs. The solution is embedded within a 3-
layer web architecture (frontend, backend with Express and Flask, and external services) for
practical and user-friendly access.
Key Components of the Proposed Architecture:
1. Input Text and Tokenization:
English Text Input:

• Users submit a sentence or paragraph in English via a web interface.


• Example: “The children are playing in the park.”

Tokenizer (NLLB SentencePiece):

• Tokenizes input using the tokenizer pretrained for the NLLB model (nllb-200-distilled-
600M).
• Converts natural language into subword tokens suitable for Transformer processing.
• Tokenizer is multilingual-aware and sensitive to language codes (eng_Latn, tel_Telu).

2. Pretrained Transformer Encoder-Decoder (NLLB):

Encoder (Transformer):

• Converts the tokenized English input into a dense sequence of hidden states.
• Captures long-range dependencies and semantic structures using self-attention.
• Leverages cross-lingual generalization from NLLB’s multilingual pretraining.

Decoder (Transformer):

• Generates Telugu translations by attending over encoder outputs.


• Operates autoregressively: one token at a time, using teacher forcing during training.
• Language-specific decoding with attention masking ensures fluent Telugu syntax.

Training Strategy:

• Learning rate scheduling and gradient clipping to avoid catastrophic forgetting.


• Evaluate using BLEU and METEOR scores to ensure both lexical and semantic quality.

3. Web Architecture Integration:


Frontend (HTML/CSS/JS):

• Takes user input, displays translated Telugu text.


• Flask API for seamless communication with backend.

Flask Server (Translation Microservice):

• Receives English text via POST.


• Loads the fine-tuned NLLB model.
• Generates Telugu output and sends it back.

Express Server (Storage and History):

• Stores translation pairs and metadata in MongoDB.


• Manages translation history and retrieval.

3.2 Technology Stack


1. NLLB Pretrained Model (facebook/nllb-200-distilled-600M)

Category : Pretrained Transformer Model for Machine Translation


Used For : Translating from English to Telugu
Library : Hugging Face Transformers

What It Does:

• NLLB (No Language Left Behind) is a state-of-the-art multilingual translation model


developed by Meta AI.

• The distilled 600M variant is a compact version of the full model, making it easier to deploy
and fine-tune even on mid-range laptop GPUs.
• Translates between 200 languages, including Telugu (tel_Telu) and English (eng_Latn).

Why It’s Used:

• Multilingual & Low-Resource Friendly : Telugu is a low-resource language. NLLB


excels in such scenarios due to its extensive training on FLORES and other diverse
datasets.
• Context Awareness : Unlike traditional phrase-based systems, NLLB captures
contextual meaning across long sentences.

• Efficiency : The distilled version balances translation quality with speed and memory
efficiency, suitable for resource-constrained systems like laptops.
• Pretrained Knowledge : Eliminates the need to train from scratch; saves time, compute,
and energy.

2. Hugging Face Transformers

Category : NLP Model Library


Used For : Accessing NLLB models, tokenizers, and training/evaluation workflows

What It Does:

• Provides high-level APIs for using transformer-based models.

• Simplifies tokenization, encoding, decoding, fine-tuning, and evaluation.


• Directly integrates with PyTorch and TensorFlow.

Why It’s Used:


• Plug-and-Play : Easily load the NLLB model with just a few lines of code.

• Robust Ecosystem : Offers thousands of pre-trained models, dataset utilities, and


evaluation metrics.
• Custom Fine-tuning : Supports custom training loops and evaluation using PyTorch,
which is essential for real-world customization.

3. PyTorch

Category : Deep Learning Framework


Used For : Building and fine-tuning the translation model

What It Does:

• Offers a dynamic computation graph for model training and backpropagation.


• Used as the backend for Hugging Face's model training and evaluation.
Why It’s Used:

• Dynamic and Flexible : Ideal for researchers and developers working on custom
training.
• Lightweight for Laptops : Performs well on mid-range GPUs like NVIDIA RTX
3050/3060.
• Community and Support : Rich documentation and community forums speed up
development and debugging.

4. SentencePiece Tokenizer

Category : Subword Tokenization


Used For : Tokenizing English and Telugu text into subword units

What It Does:

• Breaks down text into subword units like "trans", "la", "tion" rather than whole words.
• Works with languages like Telugu that have complex scripts and morphology.

Why It’s Used:

• Handles Rare Words : Telugu has many unique compound words. Subword
tokenization prevents out-of-vocabulary errors.
• Cross-Language Vocabulary Sharing : Allows English and Telugu to share common
subwords, enhancing alignment.
• Official Tokenizer of NLLB : Ensures vocabulary compatibility with the pretrained
model.

5. Flask

Category : Python Web Framework


Used For : Hosting the translation model as a microservice

What It Does:

• Creates a REST API that takes English text as input and returns the Telugu translation.
• Hosts the NLLB model inference pipeline behind a web interface.

Why It’s Used:

• Minimal & Lightweight : Ideal for small applications running directly on laptops.
• Direct Python Execution : Runs PyTorch + Hugging Face code without wrappers.
• Fast Prototyping : Easy to build endpoints (/translate) for deployment.

6. Express.js

Category : Node.js Web Backend


Used For : Handling HTTP requests from the frontend, communicating with Flask and
MongoDB

What It Does:

• Acts as the main backend server, managing user requests and sending data to Flask for
translation.
• Logs translation history and user feedback into the database.

Why It’s Used:

• Asynchronous I/O : Handles multiple requests concurrently with minimal latency.


• Middleware Support : Enables logging, security, and authentication.
• JS Ecosystem : Works well with JavaScript-based frontend (HTML/JS).

7. MongoDB

Category : NoSQL Database


Used For : Storing translation logs, feedback, and user interaction data

What It Does:

• Stores structured and unstructured data like:


o English Input
o Telugu Output
o Timestamp
• Provides fast querying and flexibility in schema.

Why It’s Used:

• Schema-less : You can store different types of user data without needing rigid table
schemas like SQL.
• Easily Extendable : You can later add fields like confidence scores, user profiles, or
feedback.
• Document-Based : Works well with JSON responses from Express.js and Flask APIs.
8. HTML, CSS

Category : Frontend Technologies


Used For : Creating the User Interface for the translator

What It Does:

• Enables users to:


o Type an English sentence
o Submit it to the backend
o View the Telugu translation in real-time

Why It’s Used:

• Cross-platform Compatibility : Runs on any modern browser.


• Ease of Styling and Interaction : CSS enhances user experience while JS provides
interactivity.
• Separation of UI and Logic : Frontend is decoupled from backend and model
layers, promoting modularity.

📊 Summary: Why This Stack? Table:2 Summary of System Architechture

Component Purpose Why It Was Chosen


NLLB Model English–Telugu Translation Multilingual, low-resource friendly, pretrained, lightweight

Hugging Face

Easy API, PyTorch-compatible, high-quality models


Model Access & Fine-
Tuning

PyTorch Custom Training & Inference Lightweight, dynamic graph, suitable for laptops

SentencePiece Subword Tokenization Handles Telugu scripts and compound words efficiently

Flask Microservice Hosting Lightweight, Python-native, integrates easily with model

Express.js Web Backend Gateway Manages routes, integrates with MongoDB and frontend

MongoDB Translation Log Storage Flexible NoSQL, stores diverse translation and user data

HTML/CSS Web UI Simple, accessible, interactive

3.3 Implementation of the Proposed Work: English-to-Telugu Translation using


NLLB Transformer Model
The implementation of the proposed work involved several critical stages designed to build a
robust, efficient, and accurate English-to-Telugu machine translation system using the pretrained
NLLB (No Language Left Behind) model. The stages include architectural design, data
preparation, model integration, training procedures, backend and frontend development, as well as
performance evaluation. Below is a detailed explanation of each phase:

3.3.1 System Architecture

Table:1 Summary of System Architechture

o This module accepts raw English text input from the user.

o SentencePiece or Byte-Pair Encoding (BPE) is applied using the tokenizer provided


by the NLLB model to convert the input into subword tokens.

2. NLLB Transformer-based Translation Model

• The pretrained NLLB model, fine-tuned on a custom English-to-Telugu dataset, serves as


the backbone for translation.

• The model uses an encoder-decoder Transformer architecture, where:


o The encoder captures the semantic and syntactic features of the English input.

o The decoder generates the equivalent Telugu output, guided by attention


mechanisms.

3. Translation Output Module

• The model output (Telugu tokens) is decoded back into natural Telugu text.

• SentencePiece decoder ensures meaningful sentence construction.

4. Interface and Middleware Layer


o A Flask-based backend handles requests between the frontend UI and the NLLB
model.

o MongoDB is used to log translations, user history, and interactions.


3.3.2 Data Preprocessing
Data preparation was performed on a custom parallel corpus containing English-
Telugu sentence pairs:
1. English Text Processing:
• Cleaning: Removal of special characters, extra whitespaces, and HTML tags.

• Normalization: Lowercasing and punctuation standardization.

• Tokenization: Using NLLB’s SentencePiece model.

2. Telugu Text Processing:

• Unicode Normalization: Telugu script is standardized using NFKC normalization.

• Cleaning: Removal of redundant punctuation, misspelled characters.


• Tokenization: SentencePiece-based tokenization ensuring subword-level understanding.

3.3.3 Model Training


1. Loading Pretrained Model:

• Loaded facebook/nllb-200-distilled-600M from HuggingFace Transformers.

• Initialized the tokenizer and model for the eng_Latn to tel_Telu language pair.

3. Loss and Optimization:

• Used CrossEntropyLoss with label smoothing.

• Adam optimizer with linear learning rate scheduler.

4. Evaluation:

• METEOR, BLEU, scores used for performance evaluation.

3.3.4 Backend Development


Implemented using Flask for lightweight and scalable RESTful APIs:

• Translation API : Accepts English input and returns Telugu translation.

• History API : Retrieves user query logs from MongoDB.

• Model Load & Tokenizer : Preloads the NLLB model and tokenizer into memory at
runtime.

• Security:
o CORS implemented for safe API access.

o Rate limiting to prevent abuse.

3.3.5 Frontend Development

Built using React.js with the following key components:

1. Input Field:
o Allows the user to type or paste English text.

o Integrated with form validation.

2. Output Area:

o Displays the translated Telugu text.


o Option to copy or download the result.

3. Navigation Panel:

o Contains login, dashboard, chat, and history pages.


o Integrated with backend APIs for real-time data retrieval.

4. Responsive Design:

o Styled with CSS for mobile-friendly layouts.


2. Login Page

Figure 3. SignUp Page


Figure 4. Translation Page

Figure 5. History Page

3.3.6 Performance Metrics


To evaluate translation quality and model efficiency, the following metrics were
used:
EU Score: Assesses n-gram overlaps between the reference and generated translations.
• METEOR Score: Considers synonymy and word order in matching.

3.4 Interfaces and Communication


3.4.1 Interfaces
3.4.1.1 User Interface

• Input textbox for English

• Output display for Telugu


• History and profile sections

• Authentication interface (login/signup)

3.4.1.2 Internal System Interfaces

• Tokenizer Interface : Encodes and decodes inputs for the model.

• Model Inference Interface : Handles forward pass and translation logic.

• Logging Interface : Stores user data and translations in MongoDB.

3.4.2 Communication Mechanisms


3.4.2.1 Internal Communication
• Flask routes transfer text between React UI and the translation model.
• PyTorch Tensors pass through tokenizer → encoder → decoder → output processor.

3.4.2.2 External Communication


• HuggingFace model downloads.

• Dataset access from Google Drive or HuggingFace Datasets.


CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 Results
4.1.1 Comparative Analysis

The comparative analysis of the performance of NLLB, OpenNMT, and MarianMT on the English-
to-Telugu language translation task is summarized in the table below. These models were evaluated
using BLEU (Bilingual Evaluation Understudy Score) and TER (Translation Edit Rate) metrics,
which are standard benchmarks for assessing the quality of machine translation systems.

The NLLB (No Language Left Behind) model, developed by Meta AI, is designed to handle low-
resource languages with remarkable performance. On the English-to-Telugu pair, it achieves a
BLEU score of 32.5, indicating strong translation accuracy and fluency. Furthermore, its TER
score of 45.3 shows moderate effort needed for post-editing, signifying relatively good alignment
with human-translated references. NLLB benefits from its massive multilingual training data and
advanced transformer architecture, especially effective for underrepresented language pairs like
English-Telugu.

In comparison, OpenNMT, an open-source neural machine translation system, achieves a BLEU


score of 29.1, slightly lower than NLLB. While OpenNMT is flexible and supports multiple
architectures, its performance depends heavily on dataset quality and model tuning. The TER score
of 50.2 suggests a slightly higher post-editing requirement, indicating room for improvement in
translation fluency and adequacy.

MarianMT, developed by Microsoft and optimized for efficiency, achieves a BLEU score of 28.7,
the lowest among the three. While MarianMT is well-suited for production-scale translation tasks
and shows efficient runtime, its performance in low-resource language pairs is slightly limited.
The TER score of 53.5 reflects a greater divergence from reference translations, which could be
attributed to limitations in training data coverage for Telugu.

Table 2. Comparative analysis of NLLB, OpenNMT, and MarianMT

Model BLEU ↑ TER ↓


NLLB 32.5 45.3
OpenNMT 29.1 50.2
MarianMT 28.7 53.5
4.2 Discussions

In the rapidly evolving domain of neural machine translation (NMT), our project focuses on
building a high-quality English-to-Telugu translation system using a pretrained Transformer-based
architecture, specifically Meta’s No Language Left Behind (NLLB) model. The objective of this
work is to enable semantically accurate and syntactically fluent translations between English and
Telugu—two linguistically distant languages—by leveraging the generalization power of
multilingual pretrained models combined with fine-tuning on domain-specific datasets.
A core innovation of this project lies in the integration of NLLB’s pretrained translation
backbone with a custom dataset, which includes parallel English-Telugu sentences across
various topics and dialects. This setup ensures that the system adapts to language nuances,
idiomatic expressions, and regional variations specific to Telugu that are often overlooked in
broader multilingual models.
The fine-tuning phase significantly improves contextual understanding and grammatical
correctness compared to baseline translation services and simple encoder-decoder models. Our
comparative evaluations against standard transformer baselines demonstrate that the fine-tuned
NLLB model delivers higher BLEU and METEOR scores, indicating enhanced translation
accuracy and naturalness.

Additionally, the system is built with practical deployment in mind. We implemented a modular
backend using Flask and Express.js, which allows real-time sentence translation through a
RESTful API. For persistent data management, translated inputs and outputs are stored in a
MongoDB database, enabling user-specific history tracking and performance analysis. The
frontend interface, developed with modern web technologies, offers a clean and accessible user
experience that supports both single sentence and batch translation tasks.

From an architectural standpoint, our design emphasizes modularity, reusability, and


extensibility. The use of SentencePiece tokenization allows robust subword-level handling of
vocabulary, which is critical for morphologically rich languages like Telugu. This also mitigates
issues with rare or compound words, enabling better generalization during inference.

Our project not only improves English-to-Telugu translation quality but also demonstrates the
potential of combining state-of-the-art multilingual models with domain-specific data for low-
resource language pairs.

Looking ahead, we plan to enhance the model's performance by incorporating context-aware


transformers, dialogue-level translation capabilities, and potentially diffusion-based
generation techniques for multi-modal outputs (e.g., text-to-speech or image captioning in
Telugu). Moreover, we aim to explore quantization and pruning techniques to enable faster
inference on edge devices, broadening the reach of this system to rural and low-connectivity areas
where Telugu is predominantly spoken.
By bridging linguistic gaps and enabling more inclusive digital communication, this project lays
the groundwork for building robust, real-time translation systems tailored for Indian languages,
and sets a precedent for future work in underrepresented language technologies.
CHAPTER 5
CONCLUSION AND FUTURE SCOPE
5.1 Conclusion
In conclusion, our project marks a meaningful advancement in the domain of low-resource neural
machine translation by deploying a robust and scalable pipeline based on the No Language Left
Behind (NLLB) pretrained Transformer model. By integrating this state-of-the-art multilingual
architecture with a custom parallel English-Telugu dataset, we demonstrate the practical
feasibility and effectiveness of leveraging pretrained deep learning models for domain-adapted
language translation, particularly for underrepresented Indian languages like Telugu.

The system’s core strengths lie in its semantic alignment, syntactic fluency, and contextual
adaptability, which are significantly enhanced through fine-tuning on task-specific bilingual
corpora. Our use of SentencePiece tokenization enables better handling of morphological richness
and compound structures in Telugu, addressing a persistent challenge in Indian language
processing. The performance evaluation, measured using BLEU and METEOR scores, reveals
that our model achieves substantial improvements over conventional Transformer baselines and
publicly available translation services, especially in producing translations that retain the original
meaning while adhering to Telugu linguistic norms.
Beyond the algorithmic core, the project emphasizes scalability and user accessibility. The
integration of a Flask-based API, backed by a Node.js + Express server, provides a seamless
real-time interaction layer, while MongoDB ensures efficient, structured storage of translation
history and user data. This makes our system not only academically sound but also practically
deployable in real-world applications such as educational tools, digital communication platforms,
and localization services for rural communities.

Importantly, our work highlights the broader impact of inclusive AI technologies, particularly in
making digital content accessible to non-English speakers and preserving linguistic diversity. It
paves the way for future extensions involving multimodal translation, code-mixed language
processing, and speech-to-text integration. Furthermore, the model architecture and pipeline can
be adapted for other regional languages, thus contributing to the larger goal of democratizing
machine translation across the multilingual landscape of India.

Despite these successes, certain limitations remain, such as handling rare idioms, low-quality text
inputs, or contextual ambiguity in long sentences. Future research will explore context-aware
decoding strategies, reinforcement learning-based translation feedback loops, and model
compression techniques for edge deployment. Ethical considerations—particularly related to
translation bias, cultural sensitivity, and fairness—will also be addressed to ensure responsible
development and deployment.
In summary, this project demonstrates the viability and importance of leveraging pretrained
multilingual Transformer models like NLLB for high-quality English-to-Telugu translation. By
establishing a strong technical and architectural foundation, our work sets the stage for future
innovations in the field of neural translation, ultimately contributing to a more linguistically
inclusive and digitally connected world.

5.2 FUTURE SCOPE


This project lays a strong foundation for building better English-to-Telugu translation systems.
However, there is still room to improve and expand the work in several important ways.
In the future, we plan to use more powerful models, such as larger transformer-based models or
advanced techniques like diffusion models, which can improve translation quality and accuracy.
These models can better understand complex sentences and difficult grammar structures in both
English and Telugu.
We also aim to make the system faster and more efficient so it can run easily on mobile phones
and low-power devices. This is especially important for rural areas where high-end devices or
strong internet connections may not be available.

Another important area is domain-specific translation. For example, training the model with text
from fields like healthcare, agriculture, or government services will make the translations more
accurate and useful for real-world applications.
We also want to support other Indian languages and even handle code-mixed text (such as
sentences that include both English and Telugu). Adding speech and image inputs in the future
will help users translate spoken sentences or even scanned documents.

To make the system easier to use, we plan to add features like real-time correction suggestions,
auto-complete, and feedback buttons, so users can help improve the system over time. This can
be helpful for students, writers, and translators.
We also want to make the system safe and fair by checking for bias or harmful content in
translations. Adding explainable AI features will help users understand how translations are
generated. In short, the future of this project will focus on making the system more powerful,
faster, user-friendly, and accessible to all.
CHAPTER 6
REFERENCES

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Advances in Neural Information
Processing Systems, pages 5998 6008, 2017.

[2] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal,
Mandeep Baines, Onur Çelebi, Guillaume Wenzek, Vishrav Chaudhary, and others. Beyond
English-Centric Multilingual Machine Translation. In Journal of Machine Learning Research,
arXiv:2207.04672, 2022.

[3] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare
Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics, pages 1715–1725, 2016.

[4] Taku Kudo and John Richardson. SentencePiece: A Simple and Language Independent
Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing: System Demonstrations,
pages 66–71, 2018.

[5] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In
Proceedings of NAACL-HLT 2019: Demonstrations, pages 48–53, 2019.

[6] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for
Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pages 311–318, 2002.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic
and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72,
2005.

[8] Jörg Tiedemann. The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource
and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–
1184, 2020.
[9] Jason Phang, Thibault Fevry, and Samuel R. Bowman. Sentence Encoders on STILTs:
Supplementary Training on Intermediate Labeled-data Tasks. In arXiv preprint arXiv:1811.01088,
2018.
[10] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
O’Reilly Media, Inc., 2009.

You might also like