B.Tech English
Submitted by
Mrs. G. Keerthi
May 2025
KESHAV MEMORIAL COLLEGE OF ENGINEERING
A Unit of Keshav Memorial Technical Education (KMTES)
Approved by AICTE, New Delhi & Affiliated to JNTUH University, Hyderabad
CERTIFICATE
Project Mentor
Mrs. G. Keerthi
Assistant Professor, CSE (AIML)
KMCE
ABSTRACT
1. Introduction
2. Literature Survey
2.1 Machine Translation
2.2 Transformer Based Models For Translation
2.3 No Language Left Behind (NLLB)
2.4 Related Work
6. References
LIST OF FIGURES
2 Login Page
3 SignUp Page
4 Translation Page
5 History Page
LIST OF TABLES
Despite the advancements, there are three primary challenges in English-to-Telugu machine
translation. First, pretrained multilingual models often treat low-resource languages as second-
class citizens due to imbalanced training data, leading to suboptimal translations. Second, English-
to-Telugu translation suffers from semantic misalignment due to differences in word order, gender-
specific grammar, and rich inflections in Telugu that are not directly represented in English. Third,
while pretrained models demonstrate high BLEU scores, they often lack cultural and contextual
fluency, especially in idiomatic or domain-specific expressions.
To overcome these challenges, our project proposes a robust translation pipeline utilizing a fine-
tuned pretrained Transformer model specifically adapted for English-to-Telugu translation. We
incorporate SentencePiece tokenization to handle subword units, especially for out-of-vocabulary
tokens and morphologically complex Telugu words. Our architecture leverages pretrained
checkpoints from NLLB-200 and IndicTrans2, selected for their strong performance on low-
resource Indian languages. We further enhance semantic alignment through domain-specific fine-
tuning, post-editing routines, and custom evaluation metrics like METEOR, which are better suited
to morphologically rich languages than traditional BLEU scores.
This system is designed for practical applications in e-governance, education, content localization,
and digital accessibility, aiming to democratize access to information for Telugu-speaking
populations. By integrating pretrained Transformer models with domain-specific adaptation and
semantic evaluation, our project provides a high-quality, efficient, and scalable solution for
English-to-Telugu machine translation, contributing to the broader vision of inclusive AI and
digital equality.
CHAPTER 2
LITERATURE SURVEY
2.1 MACHINE TRANSLATION
Machine Translation (MT) refers to the process of automatically converting text from one language
to another using computational models. With the advancement of deep learning, particularly
Transformer-based architectures, the quality of translations has improved significantly.
Transformers rely on self-attention mechanisms and positional encodings, which allow them to
model long-range dependencies and contextual relationships more effectively than previous
sequence-to-sequence models like RNNs or LSTMs.
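At the core of the Transformer is scaled dot-product attention [1], which, for query, key, and value matrices Q, K, V with key dimension d_k, computes

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

Each output position is therefore a weighted mixture of all input positions, which is what allows the model to relate distant words in a sentence without recurrence.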
However, for low-resource language pairs like English to Telugu, several challenges persist. These
include the lack of parallel corpora, morphological complexity, syntactic divergence, and domain
specificity. To overcome these issues, researchers have turned to pretrained multilingual models
such as mBART, MarianMT, mT5, and NLLB, which can be fine-tuned on custom datasets to
enhance translation quality for specific language pairs.
2.2 TRANSFORMER BASED MODELS FOR TRANSLATION
Several pretrained Transformer-based models have been used for machine translation:
mBART (Multilingual BART) is a denoising autoencoder for pretraining sequence-to-sequence
models across multiple languages.
mT5 is a multilingual version of the T5 model trained on a massive multilingual corpus for text-
to-text tasks.
MarianMT is an efficient multilingual model based on the Transformer architecture and trained
by the OPUS project.
While these models provide decent translation quality, they often struggle with low-resource
languages such as Telugu, due to shared tokenizers and less specialized training for
morphologically rich scripts.
2.3 NO LANGUAGE LEFT BEHIND (NLLB)
Mixture of Experts (MoE) Architecture: Only a subset of the model’s parameters are activated
per input, enabling specialized and efficient language-specific processing.
Zero-shot Translation Capability: The model performs strongly even on language pairs it was
not explicitly fine-tuned on.
The NLLB model has shown superior performance across numerous benchmarks, especially for
low-resource languages like Telugu, making it highly suitable for customized machine translation
tasks.
Model Architecture: Uses a Mixture of Experts (MoE) model with language-specific routing.
Tokenization: Custom SentencePiece tokenization tailored per language for better subword
handling.
Performance: Outperforms models like mBART and mT5 in BLEU, chrF, and METEOR scores
for many Indic languages, including Telugu.
ADVANTAGES
LIMITATIONS:
• Tokenizer is shared across languages, which may not handle the nuances of Telugu script
well.
• Training primarily focused on high-resource languages, leading to sub-optimal results for
Telugu without extensive fine-tuning.
Strengths:
LIMITATIONS:
• Translation accuracy is often inconsistent for morphologically rich languages like Telugu.
4. mT5 (Google)
Description: Multilingual T5 trained on mC4 corpus.
LIMITATIONS:
• Performs well in zero-shot settings but not fine-tuned for specific translation directions like
English ↔ Telugu.
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION
Purpose
The purpose of this project is to develop an English-to-Telugu Language Translation
System using a Transformer-based model. The goal is to translate text from English to
Telugu in a way that maintains the original meaning and context. This system focuses on
delivering accurate, meaningful translations for everyday use, supporting better
communication and understanding between English and Telugu speakers.
Scope
The project involves developing a Transformer-based deep learning model to translate text
from English to Telugu. The system aims to achieve high-quality, context-aware
translations, enabling a range of impactful applications in e-governance, education, content localization, and digital accessibility.
Hardware Requirements
• RAM: 8 GB or higher
• GPU: NVIDIA GTX 1050 or equivalent
• Storage: Minimum 10 GB free disk space
3.3 Functional Requirements
3.3.1 Data Collection and Storage
• Dataset: Custom English-Telugu parallel corpus collected from open sources
• Storage Structure: Sentences stored in JSON/CSV format with corresponding English
and Telugu pairs
3.3.2 Data Preprocessing
• Clean and normalize text (remove extra spaces, punctuation, etc.; a small sketch follows this list)
• Tokenize sentences using the NLLB tokenizer
• Convert tokenized text into model-compatible input tensors
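As a minimal illustration of the cleaning step (the field names en/te below are illustrative, not the actual corpus schema):

    import re

    def normalize(text: str) -> str:
        # Collapse repeated whitespace and strip leading/trailing spaces.
        return re.sub(r"\s+", " ", text).strip()

    # One illustrative English-Telugu pair after cleaning.
    pair = {"en": normalize("  How are   you ?"), "te": normalize(" మీరు ఎలా ఉన్నారు ? ")}
    print(pair)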
3.3.3 Model Integration
• Model Used: Pre-trained NLLB model from Hugging Face
• Inference Pipeline (a minimal code sketch follows this subsection):
o Encode input English sentence using tokenizer
o Translate using NLLB model
o Decode output to generate Telugu sentence
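A minimal sketch of this pipeline with the Hugging Face transformers library, assuming the nllb-200-distilled-600M checkpoint and a recent library version (argument names may vary slightly across versions):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    MODEL_NAME = "facebook/nllb-200-distilled-600M"

    # Load the pretrained NLLB tokenizer and model; src_lang marks the input as English.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    def translate_en_to_te(text: str) -> str:
        # Encode the English sentence into subword ids.
        inputs = tokenizer(text, return_tensors="pt")
        # Force the decoder to start generating in Telugu (tel_Telu).
        generated = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("tel_Telu"),
            max_length=128,
        )
        # Decode the Telugu token ids back to plain text.
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    print(translate_en_to_te("Education is the key to progress."))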
3.3.4 Translation Interface
• Web interface for users to input English sentences
• Display corresponding Telugu translations in real-time
• Option to copy, clear, or save translations
3.3.5 Deployment
• Local Development: Tested using Flask and Express.js on localhost
• Frontend: Built using HTML, CSS, and JavaScript
• Future Deployment: Cloud hosting options
The proposed work aims to design and implement a robust, scalable, and semantically accurate
machine translation framework using Transformer-based architectures, with a primary focus on
the NLLB (No Language Left Behind) pretrained model by Meta AI. The goal is to enable high-
quality translation from English to Telugu, particularly tuned for regional linguistic nuances,
contextual accuracy, and efficiency in deployment for web-based applications.
This project leverages pretrained multilingual models while integrating a custom dataset to fine-
tune translation performance for English–Telugu pairs. The solution is embedded within a 3-
layer web architecture (frontend, backend with Express and Flask, and external services) for
practical and user-friendly access.
Key Components of the Proposed Architecture:
1. Input Text and Tokenization:
English Text Input:
• Tokenizes input using the tokenizer pretrained for the NLLB model (nllb-200-distilled-
600M).
• Converts natural language into subword tokens suitable for Transformer processing.
• Tokenizer is multilingual-aware and sensitive to language codes (eng_Latn, tel_Telu).
Encoder (Transformer):
• Converts the tokenized English input into a dense sequence of hidden states.
• Captures long-range dependencies and semantic structures using self-attention.
• Leverages cross-lingual generalization from NLLB’s multilingual pretraining.
Decoder (Transformer):
Training Strategy:
What It Does:
• The distilled 600M variant is a compact version of the full model, making it easier to deploy
and fine-tune even on mid-range laptop GPUs.
• Translates between 200 languages, including Telugu (tel_Telu) and English (eng_Latn).
• Efficiency: The distilled version balances translation quality with speed and memory efficiency, suitable for resource-constrained systems like laptops.
• Pretrained Knowledge: Eliminates the need to train from scratch; saves time, compute, and energy.
What It Does:
3. PyTorch
What It Does:
• Dynamic and Flexible: Ideal for researchers and developers working on custom training (a single fine-tuning step is sketched after this list).
• Lightweight for Laptops: Performs well on mid-range GPUs like NVIDIA RTX 3050/3060.
• Community and Support: Rich documentation and community forums speed up development and debugging.
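For illustration only, a single fine-tuning step on one English-Telugu pair might look like the sketch below (the real training loop would batch the full parallel corpus, shuffle, and validate; the sentence pair here is a placeholder):

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    MODEL_NAME = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="eng_Latn", tgt_lang="tel_Telu")
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # One (English, Telugu) pair from the parallel corpus; text_target tokenizes the Telugu side.
    batch = tokenizer(["Good morning"], text_target=["శుభోదయం"], return_tensors="pt", padding=True)

    model.train()
    loss = model(**batch).loss      # cross-entropy over the Telugu target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(float(loss))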
4. SentencePiece Tokenizer
What It Does:
• Breaks down text into subword units like "trans", "la", "tion" rather than whole words.
• Works with languages like Telugu that have complex scripts and morphology.
• Handles Rare Words: Telugu has many unique compound words. Subword tokenization prevents out-of-vocabulary errors.
• Cross-Language Vocabulary Sharing: Allows English and Telugu to share common subwords, enhancing alignment.
• Official Tokenizer of NLLB: Ensures vocabulary compatibility with the pretrained model (a short tokenization check follows below).
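A quick way to see this subword behaviour, assuming the tokenizer bundled with the NLLB checkpoint (the exact pieces depend on the learned vocabulary):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")

    # English and Telugu words are both split into shared subword pieces.
    print(tok.tokenize("translation"))   # e.g. ['▁trans', 'lation']; actual split depends on the vocabulary
    print(tok.tokenize("అనువాదం"))        # the Telugu word for "translation", also split into subwords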
5. Flask
What It Does:
• Creates a REST API that takes English text as input and returns the Telugu translation.
• Hosts the NLLB model inference pipeline behind a web interface.
• Minimal & Lightweight: Ideal for small applications running directly on laptops.
• Direct Python Execution: Runs PyTorch + Hugging Face code without wrappers.
• Fast Prototyping: Easy to build endpoints (/translate) for deployment, as sketched below.
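A minimal Flask sketch of such an endpoint, assuming the translate_en_to_te() helper from the inference sketch above and the flask-cors package for cross-origin access:

    from flask import Flask, request, jsonify
    from flask_cors import CORS

    app = Flask(__name__)
    CORS(app)  # allow the Express.js backend / browser frontend to call this API

    @app.route("/translate", methods=["POST"])
    def translate():
        payload = request.get_json(force=True)
        english_text = payload.get("text", "")
        # translate_en_to_te() wraps the NLLB tokenizer + model (see the inference sketch above).
        telugu_text = translate_en_to_te(english_text)
        return jsonify({"input": english_text, "translation": telugu_text})

    if __name__ == "__main__":
        app.run(port=5000)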
6. Express.js
What It Does:
• Acts as the main backend server, managing user requests and sending data to Flask for
translation.
• Logs translation history and user feedback into the database.
7. MongoDB
What It Does:
• Schema-less: You can store different types of user data without needing rigid table schemas like SQL.
• Easily Extendable: You can later add fields like confidence scores, user profiles, or feedback.
• Document-Based: Works well with JSON responses from Express.js and Flask APIs (an illustrative record is shown below).
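In the running system Express.js writes these records, but the document shape can be illustrated with an equivalent pymongo sketch (the database, collection, and field names here are illustrative, not the actual schema):

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["translator"]

    # Store one translation record; being schema-less, extra fields can be added later.
    db["history"].insert_one({
        "source_text": "Education is the key to progress.",
        "translated_text": "...",          # Telugu output returned by the /translate API
        "direction": "eng_Latn->tel_Telu",
        "created_at": datetime.now(timezone.utc),
    })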
8. HTML, CSS
What It Does:
Technology | Role in the System | Key Benefit
Hugging Face Transformers | Pretrained Model & Tokenizer | Provides the NLLB-200 checkpoint and its tokenizer
PyTorch | Custom Training & Inference | Lightweight, dynamic graph, suitable for laptops
SentencePiece | Subword Tokenization | Handles Telugu scripts and compound words efficiently
Express.js | Web Backend Gateway | Manages routes, integrates with MongoDB and frontend
MongoDB | Translation Log Storage | Flexible NoSQL, stores diverse translation and user data
o This module accepts raw English text input from the user.
• The model output (Telugu tokens) is decoded back into natural Telugu text.
• Initializes the tokenizer and model for the eng_Latn to tel_Telu language pair.
4. Evaluation:
• Model Load & Tokenizer: Preloads the NLLB model and tokenizer into memory at runtime.
• Security:
o CORS implemented for safe API access.
1. Input Field:
o Allows the user to type or paste English text.
2. Output Area:
3. Navigation Panel:
4. Responsive Design:
The comparative analysis of the performance of NLLB, OpenNMT, and MarianMT on the English-to-Telugu translation task is summarized below. These models were evaluated using BLEU (Bilingual Evaluation Understudy) and TER (Translation Edit Rate) metrics, which are standard benchmarks for assessing the quality of machine translation systems.
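Both metrics can be computed with the sacrebleu package; the sketch below shows the call pattern only, with placeholder sentences rather than the actual evaluation set:

    import sacrebleu
    from sacrebleu.metrics import TER

    # System outputs and one reference translation per sentence (placeholders).
    hypotheses = ["మీరు ఎలా ఉన్నారు?"]
    references = [["మీరు ఎలా ఉన్నారు?"]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    ter = TER().corpus_score(hypotheses, references)
    print(f"BLEU = {bleu.score:.1f}, TER = {ter.score:.1f}")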
The NLLB (No Language Left Behind) model, developed by Meta AI, is designed to handle low-
resource languages with remarkable performance. On the English-to-Telugu pair, it achieves a
BLEU score of 32.5, indicating strong translation accuracy and fluency. Furthermore, its TER
score of 45.3 shows moderate effort needed for post-editing, signifying relatively good alignment
with human-translated references. NLLB benefits from its massive multilingual training data and
advanced transformer architecture, especially effective for underrepresented language pairs like
English-Telugu.
MarianMT, developed by Microsoft and optimized for efficiency, achieves a BLEU score of 28.7,
the lowest among the three. While MarianMT is well-suited for production-scale translation tasks
and shows efficient runtime, its performance in low-resource language pairs is slightly limited.
The TER score of 53.5 reflects a greater divergence from reference translations, which could be
attributed to limitations in training data coverage for Telugu.
In the rapidly evolving domain of neural machine translation (NMT), our project focuses on
building a high-quality English-to-Telugu translation system using a pretrained Transformer-based
architecture, specifically Meta’s No Language Left Behind (NLLB) model. The objective of this
work is to enable semantically accurate and syntactically fluent translations between English and
Telugu—two linguistically distant languages—by leveraging the generalization power of
multilingual pretrained models combined with fine-tuning on domain-specific datasets.
A core innovation of this project lies in the integration of NLLB’s pretrained translation
backbone with a custom dataset, which includes parallel English-Telugu sentences across
various topics and dialects. This setup ensures that the system adapts to language nuances,
idiomatic expressions, and regional variations specific to Telugu that are often overlooked in
broader multilingual models.
The fine-tuning phase significantly improves contextual understanding and grammatical
correctness compared to baseline translation services and simple encoder-decoder models. Our
comparative evaluations against standard transformer baselines demonstrate that the fine-tuned
NLLB model delivers higher BLEU and METEOR scores, indicating enhanced translation
accuracy and naturalness.
Additionally, the system is built with practical deployment in mind. We implemented a modular
backend using Flask and Express.js, which allows real-time sentence translation through a
RESTful API. For persistent data management, translated inputs and outputs are stored in a
MongoDB database, enabling user-specific history tracking and performance analysis. The
frontend interface, developed with modern web technologies, offers a clean and accessible user
experience that supports both single sentence and batch translation tasks.
Our project not only improves English-to-Telugu translation quality but also demonstrates the
potential of combining state-of-the-art multilingual models with domain-specific data for low-
resource language pairs.
The system’s core strengths lie in its semantic alignment, syntactic fluency, and contextual
adaptability, which are significantly enhanced through fine-tuning on task-specific bilingual
corpora. Our use of SentencePiece tokenization enables better handling of morphological richness
and compound structures in Telugu, addressing a persistent challenge in Indian language
processing. The performance evaluation, measured using BLEU and METEOR scores, reveals
that our model achieves substantial improvements over conventional Transformer baselines and
publicly available translation services, especially in producing translations that retain the original
meaning while adhering to Telugu linguistic norms.
Beyond the algorithmic core, the project emphasizes scalability and user accessibility. The
integration of a Flask-based API, backed by a Node.js + Express server, provides a seamless
real-time interaction layer, while MongoDB ensures efficient, structured storage of translation
history and user data. This makes our system not only academically sound but also practically
deployable in real-world applications such as educational tools, digital communication platforms,
and localization services for rural communities.
Importantly, our work highlights the broader impact of inclusive AI technologies, particularly in
making digital content accessible to non-English speakers and preserving linguistic diversity. It
paves the way for future extensions involving multimodal translation, code-mixed language
processing, and speech-to-text integration. Furthermore, the model architecture and pipeline can
be adapted for other regional languages, thus contributing to the larger goal of democratizing
machine translation across the multilingual landscape of India.
Despite these successes, certain limitations remain, such as handling rare idioms, low-quality text
inputs, or contextual ambiguity in long sentences. Future research will explore context-aware
decoding strategies, reinforcement learning-based translation feedback loops, and model
compression techniques for edge deployment. Ethical considerations—particularly related to
translation bias, cultural sensitivity, and fairness—will also be addressed to ensure responsible
development and deployment.
In summary, this project demonstrates the viability and importance of leveraging pretrained
multilingual Transformer models like NLLB for high-quality English-to-Telugu translation. By
establishing a strong technical and architectural foundation, our work sets the stage for future
innovations in the field of neural translation, ultimately contributing to a more linguistically
inclusive and digitally connected world.
Another important area is domain-specific translation. For example, training the model with text
from fields like healthcare, agriculture, or government services will make the translations more
accurate and useful for real-world applications.
We also want to support other Indian languages and even handle code-mixed text (such as
sentences that include both English and Telugu). Adding speech and image inputs in the future
will help users translate spoken sentences or even scanned documents.
To make the system easier to use, we plan to add features like real-time correction suggestions,
auto-complete, and feedback buttons, so users can help improve the system over time. This can
be helpful for students, writers, and translators.
We also want to make the system safe and fair by checking for bias or harmful content in
translations. Adding explainable AI features will help users understand how translations are
generated. In short, the future of this project will focus on making the system more powerful,
faster, user-friendly, and accessible to all.
CHAPTER 6
REFERENCES
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Advances in Neural Information
Processing Systems, pages 5998–6008, 2017.
[2] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal,
Mandeep Baines, Onur Çelebi, Guillaume Wenzek, Vishrav Chaudhary, et al. Beyond
English-Centric Multilingual Machine Translation. Journal of Machine Learning Research,
22(107):1–48, 2021.
[3] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare
Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics, pages 1715–1725, 2016.
[4] Taku Kudo and John Richardson. SentencePiece: A Simple and Language Independent
Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing: System Demonstrations,
pages 66–71, 2018.
[5] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In
Proceedings of NAACL-HLT 2019: Demonstrations, pages 48–53, 2019.
[6] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for
Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, pages 311–318, 2002.
[7] Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic
and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72,
2005.
[8] Jörg Tiedemann. The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource
and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–
1184, 2020.
[9] Jason Phang, Thibault Fevry, and Samuel R. Bowman. Sentence Encoders on STILTs:
Supplementary Training on Intermediate Labeled-data Tasks. In arXiv preprint arXiv:1811.01088,
2018.
[10] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python.
O’Reilly Media, Inc., 2009.