UNIT 4
CHAPTER 9
                                       SYMBOL TABLES
Symbol table
       Symbol table is a table that has entries for all the symbols in the program.
       A compiler has to collect all the names appearing in the source program and enter them in
        symbol table.
       Each entry in the symbol table is of the form name, information
       Each time when a name is encountered, Symbol table is searched to see whether the name
        is already there.
       If it is a new name, then it is inserted in the symbol table.
       Information about the name is entered in lexical and syntactic analysis.
9.1 Contents of Symbol table
Symbol table is a table with 2 fields (name and information)
Name – name of the symbol (label, identifier, constant, procedure name, array name)
Information – Attributes , Parameters, offset
Features:
We must be able to do the following things
   1.   determine whether a given name is present in the table
   2.   add new name in the table
   3.   Access the information associated with the name, if the name is present
   4.   Add new information for the given name
   5.   Delete a name or group of names from the table
There may be separate tables for variable names, labels, procedure names, constant and field
names etc.
Data structures used for the symbol table
      Linear list
                - It is very slow
                - But, Simple and Easy to implement
      Hash table
                - It is fast
                - But more complex
      Tree structure
                - Intermediate in performance
Information field in the symbol table should include :
   1. String denoting the name. If the same name is used in more than one block or procedure,
      then an identification of which block or procedure the name belongs to.
   2. Attributes of the name. (type , scope ) and information
   3. Parameters ( number of dimension, upper limit, lower limit)
   4. Offset describing the position in storage.
Name and symbol table records
Linear array of records can be used to implement a symbol table.
Direct method:
    If there is a upper limit on the length of the identifier, then direct scheme can be used.
    FORTRAN uses direct scheme
                               Record
                                    .                     .
                                    .                     .
                                    .                     .
                               Identifier             DIMPLE
                              Information         Integer, variable
                               Identifier             AMAX
                              Information          Integer, const
                                   .                      .
                                   .                      .
                                   .                      .
Indirect method
    If there is no limit on the length of the identifier, then indirect scheme can be used.
    ALGOL uses indirect scheme
                              .
                              .
                   Length of the identifier(6)
               .
                   Information (attributes)
                   Length of the identifier(9)
               .
                   Information (attributes)
                              .
                              .
 . . . R F D I M P L E Q U . . .
 Separate string array
Reusing symbol table space :
      A compiler can be designed to run in lesser space, if the space occupied by the identifiers
       can be reused.
      If there are no further references to the identifier, then the space can be reused in subsequent
       passes.
      Eg. The space occupied by the string table(indirect scheme) can be reused in subsequent
       passes.
      In the direct scheme also, if the identifier and information are stored as 2 separate arrays,
       then the space occupied by the identifiers can be used in subsequent passes.
Array names :
    If the identifier is array name , then all the subscript information should be entered in the
      symbol table.
    FORTRAN limits to 3 dimensions
    In all the case, the lower limit is 1
The information regarding array subscript will look like:
                 UL1     UL2
                 UL3     B
A(3)(2)(10)
UL1     3
UL2     2
UL3     10
9.2 Data structures used for symbol tables :
       3 data structures can be used to represent symbol table
                 Linear list
                 Search trees
                 Hash tables
Linear lists:
                           Name 1
                           Information 1
                           Name 2
                           Information 2
                           .
                           .
                           .
                           Name n
                           Information n
  Available
      Simplest scheme
      Single array or several arrays can be used to store names and associated information.
      New names are added to the list, in the order in which they occur.
      To get information for a name, searching is done from the beginning of the array upto the
       position marked by the pointer called AVAILABLE.
      AVAILABLE pointer points to the beginning of the empty portion of the array.
      When the name is found, the information is read from the next location.
      If AVAILABLE is reached without finding names, then fault occurs.(undefined name)
      To insert a new name, scan down the list from the beginning(to ensure it is not already
       there)
      If it is available, fault occurs (multiply defined name)
      If it is not available, then the new name is stored immediately following the AVAILABLE
       and then pointer is incremented.
       Time:
      If a symbol table contain n names, then to insert new name, the time needed is proportional
       to n
      To find a data about name, on an average n/2 names have to be searched.
       Cost of inquiry is proportional to n
      To insert n names and m inquiries, the total cost is cn(n+m), where c is a constant
    Merits:
          1. Very easy to implement
          2. Minimum space is taken
    Demerits:
         1. Very slow in retrieving data
         2. Useful only for small jobs
         3. Performance is poor when n and m becomes large.
Self organizing lists:
      LINK field is added to each record.
      List is searched in the order pointed by the LINK field.
      Every time when a name is referred, it is moved to the front of the list.
      When a set of names are often referred, they will come to the front.
The figure a) shows the e.g of 4 name symbol table
      FIRST gives the position of first record in the list
      LINK field indicates the next record in the list
The order is NAME3 , NAME 1 , NAME 4 , NAME 2
The figure b) shows the effect of moving NAME4 to the front of the list.
      While searching for NAMEi, remember previous name on the list NAMEp
      Temporarily remove entry i from the list
      Link of Previous name(Linkp) should point to the link, where link of i (Linki) points.
      Make Linki point where FIRST points. Then we move NAMEi to the front of the list by
       making first point to NAMEi
firstN3N1N4N2
Eg: search for N4.
   Previous name: N1
   Temporarily remove N4
   Link of name1 should point to link2, where link 4 points (ie name2)N1N2
   Link of name4 should point to the place where first points.(ie name3) ie N4N3
   Move name4 to the front or FIRSTN4
   first N4N3N1N2
Merits:
         Frequently accessed records are moved to the front. Hence time is saved
Demerits:
           If the references are Random, it takes more time to locate the record.
Binary search trees
      Add 2 link fields to each record. Left, Right
      These fields are used to link the records into a binary search tree
       Search tree property:
      All the names Namej, accessible from Namei following the left link Lefti will precede
       Namei in alphabetical order. Namej < Namei
      All the names Namek, accessible from Namei following the Right link Righti will
       follow Namei in alphabetical order. Namei < Namek
      During search,
             if Name < Namei, follow left link
             if Namei < Name, follow right link
             if Name = Namei, item found
  Algorithm for finding a NAME in binary search tree
    While p≠ null do
    If NAME = NAME(P) then
            …../* Name found, take action on access */
    else If NAME < NAME(P) then
            P= LEFT(P)    /* visit the left child */
    else
          /* NAME(P) < NAME */
           then P= RIGHT(P)     /* visit the right child */
   /* if we fall through the loop, we have failed to find NAME*/
 Time:
    If names are encountered in random order, average length of the path is proportional to log n
    To insert n names and m inquiries, time is proportional to (n+m) log n
Merit:
   Binary search gives better performance than linear search
   Search is narrowed quickly
Demerit
   Implementation is difficult than linear list
Hash Tables:
    Two tables (hash table and storage table) are used.
    Hash table contain k words numbered from 0,1,2,…. K-1.
    These words are pointers to the storage table to the heads of k separate linked lists
    Each record in the symbol table appears exactly in any one of the lists.
To enter a                                                     NAME in symbol table :
    Create a record for it, in the available space in storage field
    Link that record to the beginning of that particular list
To determine whether the name is in the symbol table
    Apply a hash function h to NAME.
    h(NAME ) is an integer between 0 to k-1
    follow the link corresponding to that link.
To inquire about a name :
     Calculate h(name)
     Follow that list alone
We have to design a hash function h such that,
      h will distribute names uniformly among the k lists
      h is easy to compute for names consisting of string of characters
Cost :
To insert n names and m inquiries, time is proportional to n (n+m) / k
Advantage:
    This method is superior to linear search and binary search.
    Provides best Performance
Disadvantage:
   more programming effort is needed
   Also it needs more space
                                                UNIT 5
                                            CHAPTER 10
                          RUNTIME STORAGE ADMINISTRATION
10.1 Implementation of stack allocation scheme
      Local variables can be accessed only by the procedure in which it is declared.
      Global variables can be accessed and are available to all procedures.
   -   A Runtime stack starts from the high numbered memory location
   -   Low numbered memory locations contain code for the procedures and space for global data.
   -   High numbered memory locations(ie bottom of the stack ) have a runtime stack
   -   A procedure P (main program ), calls procedure Q which in turn calls procedure R
Figure shows the organization of memory
                      Static area for
                       programs and global data
           TOP
                      Extra storage for R
                      Activation record for R
               SP
                      Extra storage for Q
                      Activation record for Q
                      Extra storage for P
                      Activation record for P
              Direction of growth
                      Memory organization for C program
A stack has 2 pointers - 2 pointers are 2 registers
        Stack pointer SP points to the particular position in the activation record of the currently
         active procedure
        Top pointer points to the top of the stack
Organization of a C Activation record:
5 items are available in the activation record
        Value of actual parameters
        Count of number of arguments
        Return addresses
        Return value
        Value of SP for the activation record below
In addition to the local data, the activation record contain all 5 items
               Local data
SP             Old SP
               Return value
               Return address
               Arg count
               Actual parameters
Fig                                  Activation record for a C prodcedure
     -   SP points to the position below the local data
     -   Hence local data can be accessed by a negative offset from SP and the parameters can be
         accessed by +ve offset from SP
     -   In a c language, all the local variables including arrays are of fixed size.The size of the
         activation record can be accessed by the compiler
The local name can be accessed by X[SP] where x stands for offset x
Procedure calls in C
       Param T1
       Param T2
       .
       .
       Call P,n
       Each Param, T instruction is converted to PUSH (P)
Push(p)
Top= top-1
 *top=x
These instructions push T onto the runtime stack.
Translation of Call P,n
 Store the argument count n, return address, old stack pointer and then jump to the procedure
called.
    PUSH(n)         store the argument count
    PUSH(l1)        l1 – label of return address
    PUSH()          leave space for return value
    PUSH(SP)        store old stack pointer
    Goto l2         l2 is the first statement in the called procedure
    The 1st statement of the called procedure must be a special 3 address statement
     procbegin
    This statement sets the stack pointer to the place holding oldsp and sets top to the top of
     activation into 3 address stmt record
    SP=TOP
    Top= SP +sp (sp is the size of P, no. of words taken by the local data for proc P)
    Return (expn) in c is translated into return T
    The end of procedure is marked by a 3 address stmt. procend
10.2 Implementation of block structured languages
          In block structured languages, procedures and blocks may define their own data. Hence the
           activation records or portion of the activation records must be reserved for blocks.
          Block structured language permits arrays of variable length
      Data referencing environment of procedure / block includes , all procedures and blocks
       surrounding it in the program
Displays :
    Display is a method that will have direct access to nonlocal data
    It consists of an array of pointers to the currently accessible activation record
Format of activation record:
    The activation record contain space for local data, return address, return value, argument
     count, actual parameter and old value of Sp.
    It also contain pointers for the display
    If adjustable arrays are permitted, their size cannot be calculated, until the procedure is
     called.
    Hence, only pointers to the array and their data descriptors are placed in the activation
     record
    The array and the descriptors are placed above the activation record between top and SP.
     The descriptor can appear in the activation record itself.
EG:
       Q calls R(X,Y) and R has the following declaration
              integer i;;
              real array A[ 0:n-1, 1:m];
              real array B[2:10];
        The fig shows the activation record for R
Activation record
         Data descriptor and
              elements of B
            Data descriptor and
              elements of A
         Display pointer for MAIN
               Pointer to A
               Pointer to B
                                        Local data
                 Value of i
                  Old sp
SP                                  
              Return address
               Arg count=2
            Actual parameter y
            Actual parameter x
Procedure calls
Suppose that we are currently executing a procedure Q and in that procedure we call R(x,y)
The level of Q [ number of procedures and blocks in static environment] is 3 - (main, P, Q)
So when Q is in execution, the display has 3 pointers to the top of activation records of main, P and
Q
Procedure R is at a level of 2 (main and itself)
Hence P and Q pointers must be replaced by a pointer to the record R
Procedure Main();
          Procedure P(a);
              Procedure Q(b);
                     R(x,y);
              endQ;
              Q(Z);
         endP;
        Procedure R(c,d);
        endR;
       ….
       …..
End Main;
Main();
Parameter passing
The operand of Param 3 address statement will be treated as a pointer to the value of the actual
parameter or as the value itself
Eg:
 Suppose we call the procedure R (A+B*C, D) where A, B,C and D are integers
The translation to 3 address code is:
      T1:= B*C
      T2:=A+T1
      Param T2
      Param D
      Call R, 2
The translation into object code of the statement:
            Param X is PUSH(addr(X)) if call by reference is used,
                       PUSH(X) if call by value is used.
 The translation of an array declaration such as
       Real array A[0:n-1,1:m]
   Enter information about A into symbol table, and generate code to calculate A’s size and data
descriptor at runtime.
Returns
When a return statement is encountered in Procedure Q (which in turn return to Procedure R), the
procedure Q must do :
    Calculate any return value and put the value in activation record
    Obtain the return address from Qs activation record
    Set TOP to the value it had before P called Q
    Set SP to the value of the old SP found in Q’s activation record
    Jump to the return address to complete the return
Blocks
Blocks are “Parameter less procedures”
Blocks may be given display pointers similar to procedures.
Procedure P(X,Y);
       Integer I;
       Real array A[40];
       Begin
               Integer I, J;
               Integer array B[50];
               .
               .
               Begin
                       Integer K;               B1
                       .
                       .                      B2
                       Go to L;
               End;
       End;
       .
       .
L:
       Begin
               Real array C[60];      B3
               .
               .
       End;
       .
       .
End P
The fig shows the activation record when B2 and B3 are active
            B2 is active                      B3 is active
                                              UNIT 5
                                          CHAPTER 11
                           ERROR DETECTION AND RECOVERY
11.1 Errors
Properties of good error diagnostic system
     Error messages should pinpoint the errors
     Messages should be understandable by the user
           Missing right parenthesis instead of OH 17
     Messages should be specific and they should localize the problem
           ZAP not declared in procedure B1 rather than missing declaration
     Messages should not be redundant
           If a variables A is not declared, it should be said once and not every time A appears in
           the program
Sources of errors:
       Errors can occur at every stage of the process
    1. The design specification of the program may be wrong (design time errors)
    2. The algorithm may be incorrect (algorithmic errors)
    3. Programmer may introduce errors in implementing the algorithm (coding errors)
    4. When the program is keyed on to cards, errors may occur (keypunching or transcription
       errors)
    5. Program may exceed machine limit or compiler limit (more number of dimensions or array
       size larger) - program error
    6. Compiler can insert errors while compiling (compiler errors)
Different types of transcription errors
    1. Insertion of extra character or tokens (Insertion error)
   2. Deletion of character or token (deletion error)
   3. Replacement of a character by incorrect character (replacement error)
   4. Transposition of 2 adjacent characters (transposition errors)
From the view of compiler, errors can be classified as 2 categories
    Syntactic errors
    Semantic errors
Syntactic errors: Errors in the Syntax
Errors detected by the lexical phase or syntactic phase of the compiler
Semantic errors: Errors in the semantics
Errors detected by the compiler both at compile time and run time
Syntactic errors:
    Missing right parenthesis (deletion errors) MIN(A,2*(3+B)
    Extra comma (insertion errors)              DO 10, I=1,100
    Colon in place of ; (replacement errors)    I=1: J=2;
    Misspelled keyword (transposition errors) PORCEDURE SUM
    Extra blank (insertion errors) /* COMMENT * /
In some cases it is difficult to say, where the errors have occurred.
We could not find the place of errors.
       A=B-C*D+E)
Errors may not be noted for a long time
      IfA = B then ….. (the detection of the error occurs after a long distance)
 No place between IF and A
Minimum distance correction of syntactic errors
Way of defining errors and their location is called minimum hamming distance method
We define a collection of error transformations
We say that a program P has k errors, if the shortest sequence of error transformations that will map
any valid program into P has length k.
Eg: IFA = B THEN
          SUM=SUM + A
    ELSE
         SUM=SUM – A;
In this example, insert one blank to get correct code. Hence the distance is 1
This method is very expensive. Hence rarely used
Insertion of one character or deletion of one character makes the code error free.
Minimum distance mean the least number of insertions, deletions, and symbol modifications
necessary to transform one string to another
Semantic errors:
Semantic Errors can be detected both at compile time and runtime
Compile time errors:
Errors of declaration ad scope can be detected at compile time
     Undeclared identifiers
     Multiply defined identifiers
     Type incompatibilities between operator and operands
     Type incompatibilities between formal and actual parameters
Type checking differs from one language to other:
     Some languages have only one data type. So no type misagreements [bliss]
     Some languages have many types. But the type of every name and expression should be
        known at compile time. Strongly typed language [pascal]
     Some language do automatic type conversion between operators and operands of different
        types.[PL/1]
Dynamic errors:
    In some languages, Some errors can be detected only at runtime. Some languages (APL,
     SNOBOL) have several data types and type of a name can change at runtime. Hence type
     checking cannot be done at compile time. They should be postponed till runtime
   Range checking can be done only at runtime. Range for checking array subscripts and case
     stmt selection can be done only at runtime. [ subscript out of range, Arbitrary value in case
     stmt]
Errors seen by each phase
Fig shows the plan of error detector/corrector portion of the compiler.
It contains routines to recover from lexical error and syntactic error
Also it contain routine to detect semantic errors and routines to print the diagnostics.
The diagnostic routine communicate with the symbol table, to avoid printing redundant messages.
Errors can be classified based on the phase in which the compiler detects them
Lexical Phase errors
Syntactic phase errors
Semantic phase errors
11.2 Lexical phase errors
[ function of lexical analyzer : break the input not stream of tokens]
If after some processing, the lexical analyzer discover that no prefix of the remaining input, fits
into any token classs, it calls error recovery routine. If the input does not match into any token
class, it calls error recovery routine
The problem of recovering the lexical phase errors may sometimes cause problems
Eg . IF ( long condition .OR. X .LT. Y ) goto 30
 Assume that after X, the line split’s to next line with continuation mark.
Suppose if the continuation mark is wrongly entered in 8th column, , after reading X it finds the end
of statement
Statement does not match with any syntax. Here syntax error occurs, Parser must do recovery
Reduce the statement with IF to a nonterminal statement
Then the lexical analyser must process the continuation
1.LT. Y ) goto 30
1. real constant ( variable)
LT.  variable
Insert = sign in between (id=id)
.Y (delete . and try to find next token.)
Y (idn)        --- it tries to add + id +id
Then )     cause syntax error.
At some point , the syntax error corrector will construct a completely erroneous statement and
proceed.
Deleting many characters from the source program can cause severe difficulties for the syntax
analyzer and next phases of the compiler
Minimum distance matching:
Minimum distance matching is used to correct the spelling of the token
Suppose we know that the next token is keyword
But the input contains X, which is not the keyword, then we can determine which keyword is
closest to X
If we suspect that an identifier X is misspelled because no declaration for x is in symbol table
We can search for an identifier in the symbol table that is closest to X
For finding the correct keyword or correct identifier from symbol table, minimum error distance
matching algorithm is used.
Easier to implement
Runs faster
Very expensive
11.3 Syntactic phase errors
 Error detection and recovery is centered around syntax analysis.
 In syntactic phase, the parser detects an error, when it has no legal move from its current position
 To recover from error, the parser should locate the position of the error, correct the error, and
  revise it’s configuration and then continue with parsing
Time of detection:
LR parsers have the valid prefix property
ie. they will announce the error as soon as they find a prefix of the input with invalid continuation
LR parsers announce the errors at the earliest
Advantage of a parser with valid prefix property
      Reports the error at the earliest
      Limits the amount of erroneous output passed to subsequent passes.
Panic mode :
In the panic mode of the recovery, the parser discard the input symbols until a synchronizing token
is encountered. (delimeter such as ; or end)
The parser then deletes stack entries until it finds an entry such that it can continue parsing
Advantage:
Simple to implement
It never get into infinite loop
Error recovery in operator precedence parsing;
There are 2 cases in which the operator precedence Parsing discovers syntactic errors
     If no precedence relation holds between the terminal on the top of the stack and current
        input symbol
     If a handle is found, but there is no production with this handle as the right side
These 2 errors will be noted and recovered
Handling errors during reduction
    1. Normally the checker checks and displays the following errors
      If +or * is reduced, it checks whether non terminals appear on both sides. If not, error
      message will be displayed.
      MISSING EXPRESSION
   2. If id is reduced, it checks whether there is no nonterminal to the right or left.
      If so, 2 expressions not connected by operator message will be displayed
   3. If ( ) is reduced, it checks whether there is a nonterminal between the parenthesis
      If not, null expression between parenthesis will be displayed
   4. Nonterminal should not appear on the sides of ( and ) Parenthesis
      If so, 2 expressions not connected by operator message will be displayed
Handling shift reduce errors
In the precedence matrix, the blank entries will be filled with the names of error handling routines.
e1: ($ - e1 is called when the expression ends with left parenthesis
action : pop ( from stack
msg: Illegal left parenthesis
e2: id or ) is followed by id or (
[ id id, id(, )id, )( ]
action : insert + on the input
msg: missing operator
e3: $) [when the expression begins with right )     ]
[ id id, id(,, )id, )( ]
action : delete ) from the input
msg: illegal ) parenthesis
e4: $$ (when the expression is null)
[ id id, id(,, )id, )( ]
action : insert id on to the input
msg: missing expression
Error recovery in LR parsing
LR parser will detect an error when it consults the parsing action table and finds an error entry
Errors never result from GOTO table
Examine each error entry in the parsing table and decide on the possibility of getting those errors
An appropriate recovery procedure will be constructed
Automatic error recovery in YACC
In YACC, the user identifies the major nonterminals for programs , blocks or statements
The user then adds error productions of the form A  error α to the grammar
When the parser encounters an error, it finds the top most state on the stack. Which is of the form
Aerror α
Then it shifts the error, as if a token is found
The parser then discards the input symbols until it finds a symbol with which the parsing can
proceed
The semantic routine will not manipulate it and it calls the diagnostic message
Adhoc error recovery for LR parsers
In the LR parsing table, fill each blank entry in the ACTION field with a pointer to the error
routine.
e1 is called , when an operator + or * or $ sign was found instead of id or ( in the beginning of an
expression
missing operand
e2 is called , when there is extraneous ) parenthesis in the input
unbalanced right paranthesis
e3 is called, when instead of operator, id or ) is found
missing operator
e4 is called when the end of i/p is found instead of operator or right )
missing right paranthesis
Error recovery in LL parsing
The error correction strategies for LL parser is similar to LR parser
There is a panic mode of correction for LL parser
On finding the error, we skip on the input to the next synchronizing token such as ;
Pop the stack until a symbol X appears on the top of stack
If X is not the synchronizing token, then X have to be expanded.
Symbols in the stack should be popped in such a way that synchronizing token appears on the stack
top
11.4 Semantic errors
Primary sources of semantic errors:
     Undefined names
     Type incompatibility
We have to recover from such errors and we should suppress the duplicate error messages
When the first time, we encounter an undeclared name, we make an entry in the symbol table for
that name with necessary attributes
In addition to this, a flag is set to indicate that the entry was made due to semantic error and is not a
declaration
When the name appear next time, a check is made to determine whether a previous instance of
error msg is noted in the symbol table. If so, no new error message is printed
If not, error message is printed, and new erroneous usage is now added to the symbol table
Hence duplicate error messages are avoided.