
THE LEXICAL ANALYSIS PHASE OF A COMPILER
(Approved by: AICTE & Affiliated to Maulana Abul Kalam Azad University of Technology)
Campus: Bishnupur, Dist.: Bankura (W.B.)

SUBJECT:- Compiler Design


SUBJECT CODE:- PCC-CS501
STUDENT NAME:- AKASH MANNA
DEPARTMENT:- COMPUTER SCIENCE ENGINEERING
UNIVERSITY ROLL NO:- 15800124097
UNIVERSITY REGISTRATION NO :- 241580120172
YEAR:- 3RD SEMESTER:- 5th
ACADEMIC YEAR :- 2025-2026
EXAM:- CA1
INDEX
 1. WHAT IS LEXICAL ANALYSIS?
 PURPOSE
 EXAMPLE
 IMPORTANCE
 ADVANTAGES
 DISADVANTAGES

 2. TOKENS, LEXEMES, ATTRIBUTES, LOCATION INFORMATION

 3. HOW IT WORKS
 4. CONCLUSION
What is Lexical Analysis?
Lexical analysis is the first phase of a compiler. It reads the raw source code and breaks it down into individual units called tokens, which the later phases of compilation can process more easily. This section explores key terms related to lexical analysis, the steps it performs, how it works in practice, and its advantages and limitations.
PURPOSE
 Reads Code Character by Character :- The first step is to read the source code one character at a time from beginning to end.
 Groups Characters into "Lexemes" :- It identifies sequences of characters that belong together based on the language's rules (e.g., "if" forms the keyword if, "123" forms the number 123). These sequences are called lexemes.
 Classifies Lexemes into "Tokens" :- Each lexeme is then categorized into a specific type of meaningful unit called a token. A token has a type (e.g., "KEYWORD", "IDENTIFIER", "OPERATOR") and often a value (e.g., the keyword if, the identifier myVariable).
 Discards Irrelevant Information :- Whitespace (spaces, tabs, newlines) and comments are removed because they are not essential for the program's execution.
 Detects Basic Errors :- It can spot invalid characters or character sequences that don't form a valid token in the language.
EXAMPLE
 INPUT :-
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

 OUTPUT :- 'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'
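
A scanner following the steps listed under PURPOSE can reproduce this output. The sketch below is a minimal, illustrative tokenizer in Python; the token categories, the regular expressions, and the tokenize helper are assumptions made for this example, not the specification of any real compiler.

import re

# Illustrative token specification: (category, pattern).
# Order matters: keywords must be tried before general identifiers.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),              # discarded, like whitespace
    ("KEYWORD",    r"\b(?:int|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"="),
    ("PUNCT",      r"[(){},;]"),
    ("SKIP",       r"[ \t\n]+"),              # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (category, lexeme) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:  # no pattern matches: a basic lexical error
            raise SyntaxError(f"invalid character {source[pos]!r} at position {pos}")
        if m.lastgroup not in ("SKIP", "COMMENT"):
            yield (m.lastgroup, m.group())
        pos = m.end()

code = "int main()\n{\n// 2 variables\nint a, b;\na = 10;\nreturn 0;\n}"
print(" ".join(f"'{lexeme}'" for _, lexeme in tokenize(code)))
# 'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'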
IMPORTANCE OF Lexical Analysis
➢ First Line of Defense: It acts as the initial filter for source code, catching invalid
characters or malformed tokens early, thus preventing errors from affecting later
stages.
➢ Tokenization Simplifies Parsing: By converting a stream of characters into
easily identifiable tokens, lexical analysis simplifies the complex process of syntax
analysis (parsing) that follows. Syntax analyzers can then work with tokens
instead of raw characters, making their job more straightforward and efficient.
➢ Removes Unnecessary Elements: Whitespace and comments, which do not
contribute to execution, are removed here. This makes the input to subsequent
compiler stages more concise, streamlining the compilation process.
➢ Speeds Up Compilation: Preprocessing and cleaning up the code at this stage
helps speed up later compiler phases and reduces the risk of ambiguous
interpretations.
➢ Enables Accurate Symbol Table Construction: The lexical analyzer often
records identifier names, keywords, and literals. This information is essential for
populating and managing the symbol table, a key data structure used throughout
compilation.
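
As a rough sketch of that last point, the lexer can install each identifier it recognizes into the symbol table. The flat dictionary and the install_identifier helper below are assumptions for illustration; real compilers use richer structures that track scopes, types, and more.

# Assumed minimal symbol table: a dict mapping each identifier
# to a small record. Real tables also store scope and type info.
symbol_table = {}

def install_identifier(name, line):
    """Record an identifier the first time the lexer sees it."""
    if name not in symbol_table:
        symbol_table[name] = {"first_seen_line": line}
    return symbol_table[name]

install_identifier("a", line=4)
install_identifier("b", line=4)
install_identifier("a", line=5)   # already present; not added twice
print(symbol_table)               # {'a': {'first_seen_line': 4}, 'b': {'first_seen_line': 4}}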
Advantages
 Simplifies Parsing :- Breaking down the source code into tokens makes it
easier for computers to understand and work with the code. This helps
programs like compilers or interpreters figure out what the code is supposed
to do. It's like breaking down a big puzzle into smaller pieces, which makes it
easier to put together and solve.
 Error Detection :- Lexical analysis detects lexical errors, such as invalid
characters or malformed tokens, early in the compilation process. This
improves the overall efficiency of the compiler or interpreter by identifying
errors sooner rather than later.
 Efficiency :- Once the source code is converted into tokens, subsequent
phases of compilation or interpretation can operate more efficiently. Parsing
and semantic analysis become faster and more streamlined when working with
tokenized input.
DISADVANTAGES
 Limited Context :- Lexical analysis operates on individual tokens
and does not consider the overall context of the code. This can sometimes lead
to ambiguity or misinterpretation of the code's intended meaning, especially in
languages with complex syntax or semantics.
 Overhead :- Although lexical analysis is necessary for the compilation or
interpretation process, it adds an extra layer of overhead. Tokenizing the source
code requires additional computational resources, which can impact the overall
performance of the compiler or interpreter.
 Debugging Challenges :- Lexical errors detected during the analysis
phase may not always provide clear indications of their origins in the original
source code. Debugging such errors can be challenging, especially if they result
from subtle mistakes in the lexical analysis process.
Tokens, Lexemes, Attributes, Location Information
The output of lexical analysis (also known as scanning or tokenization) is a structured
sequence of tokens representing the source program’s text. Each token captures the
essential details required for parsing and further compilation steps, abstracting away the
raw character stream.
Tokens :-
Each recognized sequence is categorized into a specific token type (e.g., IDENTIFIER,
NUMBER, KEYWORD, OPERATOR, etc.).
Lexemes :-
For each token, the lexeme—the actual substring from the source code matching a
pattern—is recorded.
Attributes :-
Some tokens carry extra information (attributes), such as the value for a numeric literal or
an identifier's name.
Location Information (optional) :-
Each token may also record its line number, character position, or source file, which helps with error tracing.
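
One way to bundle these four pieces of information is a small record per token. The layout below is an assumed one, written in Python for illustration; the field names are not prescribed by any standard.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    type: str                    # token category, e.g. "NUMBER" or "IDENTIFIER"
    lexeme: str                  # the exact substring matched in the source
    attribute: Optional[object]  # extra information, e.g. a numeric value
    line: int                    # location information for error tracing
    column: int

# The lexeme "10" from the earlier example might become:
tok = Token(type="NUMBER", lexeme="10", attribute=10, line=5, column=5)
print(tok)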
How it Works
❑ 1. Specifying Patterns with Regular Expressions :-
Regular expressions are formal notations for describing patterns within text.
For each type of token (like identifiers, numbers, keywords) in a programming language, a
regular expression describes what form those sequences take (e.g., [a-zA-Z][a-zA-Z0-9]* for
identifiers, [0-9]+ for integers).
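
These two patterns can be tried out directly. The short sketch below uses Python's re module; the candidate strings are made up for illustration.

import re

# The example patterns from the text, anchored so they must match
# the entire candidate string.
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*\Z")
INTEGER = re.compile(r"[0-9]+\Z")

for candidate in ["myVariable", "x9", "123", "9lives"]:
    if IDENT.match(candidate):
        kind = "identifier"
    elif INTEGER.match(candidate):
        kind = "integer"
    else:
        kind = "no match"
    print(candidate, "->", kind)
# myVariable -> identifier, x9 -> identifier, 123 -> integer, 9lives -> no match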
❑ 2. Translating Regular Expressions into Finite Automata :-
The system systematically converts each regular expression into a finite automaton:
Typically, the regex is first turned into a nondeterministic finite automaton (NFA). This
process can be done using algorithms like Thompson’s construction.
The resulting NFA can be further transformed into an equivalent deterministic finite
automaton (DFA), which is easier and faster for computers to process.
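
The NFA-to-DFA step (subset construction) fits in a few lines. The sketch below runs it on a tiny hand-written NFA that recognizes "ab" or "ac"; the NFA itself is an assumption for illustration, and the epsilon transitions that Thompson's construction normally produces are omitted for brevity.

from collections import deque

# Hand-written NFA (assumed for illustration): from state 0, reading
# 'a' can go to state 1 or state 2, which is the nondeterminism.
NFA = {(0, "a"): {1, 2}, (1, "b"): {3}, (2, "c"): {3}}
NFA_ACCEPT = {3}
ALPHABET = "abc"

def subset_construction(start):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa, accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        S = queue.popleft()
        if S & NFA_ACCEPT:                  # any NFA accept state inside?
            accepting.add(S)
        for ch in ALPHABET:
            T = frozenset(t for s in S for t in NFA.get((s, ch), ()))
            if T:
                dfa[(S, ch)] = T
                if T not in seen:
                    seen.add(T)
                    queue.append(T)
    return start_set, dfa, accepting

start, dfa, accepting = subset_construction(0)
for (S, ch), T in dfa.items():
    print(sorted(S), "--" + ch + "-->", sorted(T))
# [0] --a--> [1, 2]; [1, 2] --b--> [3]; [1, 2] --c--> [3]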
❑ 3. Lexical Analysis Using Automata :-
The automaton (usually a DFA for efficiency) reads the source code one character at a
time.
As it reads, it transitions between states according to its rules, which encode the
structure of the pattern from the original regex.
If the automaton ends in a matching (accepting) state after reading a sequence, that
sequence is recognized as a valid token.
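
A table-driven simulation of this loop might look like the sketch below. It assumes a two-state DFA for integer literals ([0-9]+) and applies the longest-match rule (maximal munch) by remembering the last accepting position.

# Assumed transition table for a DFA recognizing [0-9]+ .
# States: 0 = start, 1 = accepting ("one or more digits seen").
def char_class(ch):
    return "digit" if ch.isdigit() else "other"

TRANSITIONS = {(0, "digit"): 1, (1, "digit"): 1}
ACCEPTING = {1}

def scan_integer(source, pos):
    """Simulate the DFA from pos, keeping the longest accepted prefix."""
    state, last_accept, i = 0, None, pos
    while i < len(source):
        state = TRANSITIONS.get((state, char_class(source[i])))
        if state is None:
            break                  # no transition: the DFA is stuck
        i += 1
        if state in ACCEPTING:
            last_accept = i        # longest valid match so far
    return source[pos:last_accept] if last_accept else None

print(scan_integer("123+4", 0))    # prints '123'; the DFA stops at '+'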

❑ 4. Efficiency and Simplicity :-


The use of a DFA makes the recognition process extremely fast and deterministic: each
character causes exactly one state transition.
This process enables the lexical analyzer to efficiently break down raw code into a
stream of tokens for further syntactic analysis.
Conclusion
Lexical analysis is a fundamental first phase in the compiler design process, transforming
raw source code into a structured sequence of tokens. These tokens are units of meaning
defined by:
Lexemes: actual substrings in the source code,
Patterns: formal rules (often regular expressions) that describe valid lexemes,
Tokens: abstract categories assigned to lexemes based on the patterns.
Regular expressions provide a powerful and concise way to specify token patterns, while
finite automata (NFAs and DFAs) serve as efficient computational models to recognize
these patterns in input text. This synergy allows lexical analyzers to quickly and accurately
scan source code and output token streams for parsing.
Tools like Lex and Flex automate the generation of lexical analyzers from such pattern
definitions, saving developers from writing complex scanning code manually. Error
handling mechanisms during lexical analysis detect invalid sequences early, report
meaningful diagnostics, and help maintain resilient compilation by recovering from errors.
THANK YOU
