Text Representation and Compression
Module 2
              Dr Pushpavathi.K.P, BMSCE
                          Types of text
• Unformatted text – Plain text.
• Formatted text – Rich text.
• Hypertext.
                     Unformatted text
• ASCII character set.
• Printable characters- alphabetic, numeric, punctuation.
• Control characters:
1. Format control characters – BS, SP, DEL, ESC.
2. Information separators – FS, RS.
3. Transmission control characters – SOH, STX, ETX, ACK, NAK, SYN.
• Mosaic characters – a supplementary version of the character set, used to
    create relatively simple graphical images.
   Character set to produce unformatted text
ASCII
character set
  Character set to produce unformatted text (cntd.)
Mosaic characters
                         Formatted text
• Publishing sector.
• Characters of different styles and variable size.
• Structure a document into chapters, paragraph.
• Graphics and pictures inserted at appropriate points.
• Formatting is achieved by entering specific commands.
• Each command begins with a reserved format control character.
• The control character is followed by numeric or alphabetic characters.
                   Formatted text (cntd.)
a) formatted text string b) printed version of string
                         Hyper Text
• Pages
• Defined linkage points-hyperlinks
• Electronic version of the documents.
• Browser.
• Home page
• URL
• HTML- presentation of the document.
• Directives – page-formatting commands sandwiched between a pair of tags,
  e.g. <p>, <B>text</B>
Electronic document written using hypertext
Digital document
                Digital document (cntd.)
• Vertical resolution – 3.85 or 7.7 lines/mm (100 or 200 lines/inch).
• Scanner output – horizontal resolution of approx. 8 pels/mm.
• Bitonal images – black and white only.
• One binary digit represents each pel:
                      0 for a white pel, 1 for a black pel.
• Typical image – a stream of about 2 million bits, uncompressed.
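As a rough sanity check on the "2 million bits" figure, a sketch assuming an A4 page (210 mm x 297 mm) scanned at about 8 pels/mm horizontally and 3.85 lines/mm vertically, one bit per pel:

```python
# Uncompressed size of a scanned A4 page at the Group 3 base resolution.
# Page dimensions and resolutions are the standard values quoted above.
pels_per_line = 210 * 8             # ~8 pels/mm across a 210 mm line -> 1680
lines_per_page = int(297 * 3.85)    # 3.85 lines/mm down a 297 mm page -> 1143
total_bits = pels_per_line * lines_per_page  # 1 bit per pel (bitonal image)
print(total_bits)  # ~1.9 million bits, consistent with "2 million bits"
```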
              Facsimile conversion codes
• ITU-T Group 3 and Group 4 standards.
a) Termination codes – run lengths from 0 to 63 pels.
b) Make-up codes – run lengths that are multiples of 64 pels.
• A table of code words was produced based on the relative frequency of
  occurrence of runs of white and black pels in scanned lines.
• The code words are fixed (static).
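The splitting of a run into a make-up part and a termination part can be sketched as follows (the actual ITU-T code-word tables are omitted; only the arithmetic split is shown, and the function name is illustrative):

```python
def split_run(run_length):
    """Split a pel run into a make-up part (multiple of 64, encoded with a
    make-up code word if non-zero) and a termination part (0-63, always
    encoded with a termination code word)."""
    makeup = (run_length // 64) * 64   # largest multiple of 64 in the run
    termination = run_length % 64      # remaining 0-63 pels
    return makeup, termination

print(split_run(130))  # a run of 130 pels -> make-up 128 + termination 2
```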
Facsimile conversion codes (cntd.)
                 Compression principles
• Source encoders and destination decoders.
• Lossless and lossy compression.
• Entropy encoding.
• Source encoding.
Source Encoders and Destination Decoders
          Lossless and lossy compression
• Lossless compression algorithm – reversible.
• Example – text file.
• Lossy compression algorithm – reproduces a version of the original, not an
  exact copy.
• The higher the level of compression, the more approximate the received
  version becomes.
• Example – images, audio and video streams.
                     Entropy encoding:
• Lossless and independent.
• Information representation.
• Two examples:
1. Run length encoding.
2. Statistical encoding.
                 Entropy encoding (cntd.)
• Entropy of the source – the minimum average number of bits required to
  transmit a particular source stream.
• Qualitatively, entropy is the amount of useful information in the stream.
• Ebits(C) = -log2(P_C), where Ebits(C) is the information content of
  character C in bits, and P_C is the probability of occurrence of C in the
  given data.
• If the stream contains n distinct characters or symbols:
  H = -sum(i=1..n) P_i log2(P_i) = sum(i=1..n) P_i log2(1/P_i)
• The higher the probability of occurrence of a symbol, the lower its
  information content.
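A minimal sketch of the entropy formula above (the function name is illustrative):

```python
import math

def entropy(probs):
    """Average information per symbol: H = -sum(P_i * log2(P_i)) bits."""
    return -sum(p * math.log2(p) for p in probs)

# Example: four symbols with unequal probabilities.
h = entropy([0.5, 0.25, 0.125, 0.125])
print(h)  # 1.75 bits/symbol, below the 2 bits a fixed-length code would need
```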
                   Run length encoding:
• Repetitive Character encoding
• Used when the source information comprises long substrings of the same
  character or bit.
• The source is transmitted as a different set of code words.
• Each code word indicates the repeated bit or character and the number of
  repetitions in the substring.
• Destination knows the set of code words being used.
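A minimal run-length encoder/decoder over a bit string, assuming (bit, count) pairs as the transmitted code words:

```python
def rle_encode(bits):
    """Encode a binary string as a list of (bit, run-length) pairs."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1                       # extend the current run
        runs.append((bits[i], j - i))
        i = j
    return runs

def rle_decode(runs):
    """Rebuild the original string from (bit, run-length) pairs."""
    return "".join(bit * count for bit, count in runs)

encoded = rle_encode("0000001111100")
print(encoded)  # [('0', 6), ('1', 5), ('0', 2)]
assert rle_decode(encoded) == "0000001111100"
```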
                   Statistical encoding:
• Variable length code word.
• Prefix property.
• Huffman encoding algorithm.
• Entropy.
• Efficiency of encoding scheme =
  entropy of source / average number of bits per codeword,
  where average number of bits per codeword = sum(i=1..n) N_i P_i
  (N_i = length of codeword i, P_i = its probability of occurrence).
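A worked example of the efficiency formula, using hypothetical probabilities and codeword lengths N_i:

```python
import math

# Hypothetical source: four symbols with the given probabilities, encoded
# with variable-length codewords of 1, 2, 3 and 3 bits respectively.
probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

entropy = -sum(p * math.log2(p) for p in probs)       # 1.75 bits/symbol
avg_len = sum(n * p for n, p in zip(lengths, probs))  # sum(N_i * P_i) = 1.75
print(entropy / avg_len)  # efficiency = 1.0: this code meets the entropy bound
```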
                        Source encoding:
• Differential encoding.
• Transform encoding.
                   Differential encoding:
• Amplitude of signal covers larger range, but difference in amplitude
  between successive values is relatively small.
• Smaller set of codewords.
• Indicates difference in amplitude between current signal and
  immediately preceding value.
• Can be lossless or lossy, depending on the number of bits used to encode
  the difference values.
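The idea can be sketched as follows (lossless variant: the exact differences are kept; function names are illustrative):

```python
def delta_encode(samples):
    """Send the first sample in full, then only successive differences."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the signal by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

signal = [1000, 1003, 1001, 1004, 1006]
diffs = delta_encode(signal)
print(diffs)  # [1000, 3, -2, 3, 2] - small differences need fewer bits
assert delta_decode(diffs) == signal
```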
                    Transform encoding
• Transforming the source information.
• No loss of information.
• Digitization of an image produces a two-dimensional matrix of pixel values.
• Rate of change in magnitude across the image – spatial frequency.
• An image has both horizontal and vertical frequency components.
• The human eye is less sensitive to higher spatial frequencies.
• DCT.
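A one-dimensional DCT sketch (type-II, unnormalized) showing how a block with no spatial variation concentrates all its energy in the first (DC) coefficient; real codecs apply a normalized 2-D version to 8x8 blocks:

```python
import math

def dct_1d(x):
    """Type-II DCT: expresses samples as a sum of cosines of increasing
    spatial frequency; coefficient 0 is the DC (average) term."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]

# A constant block (zero spatial frequency): all energy lands in the DC
# coefficient, and the higher-frequency coefficients are ~0.
coeffs = dct_1d([5.0, 5.0, 5.0, 5.0])
print(coeffs)  # approximately [20.0, 0, 0, 0]
```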
Transform encoding: pixel pattern
Transform encoding : DCT principle
                    Text Compression
• Compression algorithm must be lossless.
• Entropy encoding – Statistical encoding.
• Two types of Statistical encoding:
   1. Optimum set of codewords – Huffman coding, arithmetic coding.
   2. Variable-length strings of characters – Lempel-Ziv (LZ) algorithm.
• Two types of coding :
1. Static coding.
2. Dynamic or adaptive coding.
                  Static Huffman Coding
• The character string to be transmitted is first analyzed.
• The character types and their relative frequencies are determined.
• Coding operation involves creating an unbalanced tree.
• Huffman code tree – Binary tree.
• Branches are assigned the value 0 or 1.
• Root node (top), branch node , leaf node (termination point).
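A sketch of static Huffman code construction: a min-heap repeatedly merges the two least-frequent nodes, and each leaf's codeword is the path of 0/1 branch labels from the root (tie-breaking and the function name are illustrative):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a static Huffman code for the characters of `text`.
    Each heap entry is (frequency, tie-breaker, {char: partial codeword})."""
    heap = [(freq, i, {ch: ""})
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # least frequent -> 0 branch
        f2, _, c2 = heapq.heappop(heap)  # next least     -> 1 branch
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes("AAAABBCD")
print(codes)  # A (most frequent) gets a 1-bit code; C and D get 3-bit codes
```

The prefix property holds by construction: a character's codeword ends at a leaf, so it can never be the start of another codeword.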
                   Huffman code tree
Example - character string : AAAABBCD (Final tree with codes)
                 Decoding Algorithm
• Assume the table of code words is available at the receiver.
• Received bit stream – held in the variable BIT-STREAM.
• Bits of the code word being matched – held in the variable CODEWORD.
• Decoded ASCII characters – written to the variable RECEIVE-BUFFER.
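The decoding loop described above can be sketched as follows (variable names mirror BIT-STREAM, CODEWORD and RECEIVE-BUFFER; the codeword table shown is a hypothetical one for the AAAABBCD example, with A='0', B='10', C='110', D='111'):

```python
def huffman_decode(bit_stream, codewords):
    """Walk the received bits, emitting a character whenever the accumulated
    bits match a codeword; the prefix property guarantees a unique match."""
    lookup = {code: ch for ch, code in codewords.items()}
    receive_buffer, codeword = [], ""    # CODEWORD starts empty
    for bit in bit_stream:               # consume BIT-STREAM bit by bit
        codeword += bit
        if codeword in lookup:           # match: move character to buffer
            receive_buffer.append(lookup[codeword])
            codeword = ""
    return "".join(receive_buffer)

table = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(huffman_decode("00001010110111", table))  # -> AAAABBCD
```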
Decoding (cntd.)
• The set of code words depends on the data being compressed.
• This can be handled in two ways: static or adaptive coding.
• Adaptive compression builds the code word set as the data is processed.
• Disadvantage of static coding – a new set of code words must be sent for
  each new data set – overheads.
       Dynamic (adaptive) Huffman Coding
• Both transmitter and receiver build the Huffman tree –dynamically.
• If the character to be transmitted is present in tree – its codeword is
  determined and sent in normal way.
• If not present (first occurrence) – character is transmitted in
  uncompressed form.
• Tree is updated by – 1. incrementing the frequency of occurrence
              or           2. introducing the new character to the tree.
                     Adaptive Huffman tree
Ref: Multimedia Communication: Components, Techniques, Standards – Krishna Kumar D N
Incrementing the count : if next symbol is ‘A’
Sibling property - swapping
One more increment in ‘A’
Swapping of internal nodes
Receiving two more ‘A’ s
                     Arithmetic Coding
• More complicated than Huffman coding.
• Assigns one code for each encoded string of characters – unlike
  Huffman that use separate codeword for each character.
• Coding start with a certain interval – read input symbol by symbol –
  use the probability of each symbol to narrow the interval
• High probability symbols contribute fewer bits to output
• Output is interpreted as a number in the range [0, 1).
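A sketch of the interval-narrowing loop, using the probabilities from the slides' example (e = 0.3, n = 0.3, t = 0.2, w = 0.1, . = 0.1); the order in which symbols are assigned their segments of [0, 1), and the lowercase message "went." with its final period, are assumptions:

```python
def arithmetic_encode(message, probs):
    """Narrow the interval [low, high) once per symbol; any number inside
    the final interval identifies the whole message."""
    # Cumulative ranges: each symbol owns a sub-segment of [0, 1).
    ranges, cum = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low                 # high-probability symbols shrink
        lo_frac, hi_frac = ranges[sym]     # the interval less -> fewer bits
        low, high = low + width * lo_frac, low + width * hi_frac
    return low, high  # transmit any value in [low, high)

probs = {"e": 0.3, "n": 0.3, "t": 0.2, "w": 0.1, ".": 0.1}
low, high = arithmetic_encode("went.", probs)
print(low, high)
```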
               Arithmetic Coding (cntd.)
Example – encoding the string ‘WENT’ with symbol probabilities e = 0.3, n = 0.3, t = 0.2, w = 0.1, . = 0.1
             Arithmetic Coding (decoding)
• Decoder knows:
   • Set of characters present in the received encoded messages
   • Segment to which each character has been assigned
   • Its related range.
• Follow the same procedure as in encoder.
• The number of decimal digits increases linearly with the number of
  characters in the string.
• Complex messages –first fragmented into multiple smaller strings –
  encoded separately
• Resulting set of code words – sent as blocks of floating point numbers
  (binary) in known formats.
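The decoder's segment-search-and-rescale procedure can be sketched as follows (segment order is assumed, matching the encoder sketch; 0.8161 is a value chosen inside the interval that the same probabilities produce for the message "went."):

```python
def arithmetic_decode(value, probs, n_symbols):
    """Reverse the encoder: find which segment the value falls in, emit
    that symbol, then rescale the value into that segment and repeat."""
    ranges, cum = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    out = []
    for _ in range(n_symbols):
        for sym, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)  # rescale into [0, 1)
                break
    return "".join(out)

probs = {"e": 0.3, "n": 0.3, "t": 0.2, "w": 0.1, ".": 0.1}
print(arithmetic_decode(0.8161, probs, 5))  # -> went.
```

The decoder must be told how many symbols to extract (or a terminating symbol such as "." must be reserved), since many messages share a prefix of the same interval.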
Lempel-Ziv Coding
• Uses strings of characters for the coding instead of single characters.
• Table holds character string of the text (all possible words).
• Encoder sends only the index of the word.
• Decoder use this index to access the word.
• Table is used as dictionary.
• Dictionary based compression algorithm.
• Word-processing packages have dictionaries of about 25,000 words, requiring
  a 15-bit index.
• Shorter words give a lower compression ratio; longer words give a higher one.
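A toy illustration of word-dictionary coding and the index-size arithmetic (the dictionary contents here are made up):

```python
import math

# Word-based dictionary coding: send a word's index, not its characters.
dictionary = ["a", "an", "the", "multimedia", "compression"]  # hypothetical
bits_per_index = math.ceil(math.log2(len(dictionary)))        # 3 bits here

def encode(words):
    """Replace each word with its dictionary index."""
    return [dictionary.index(w) for w in words]

def decode(idxs):
    """Look each index back up in the shared dictionary."""
    return [dictionary[i] for i in idxs]

msg = ["the", "multimedia", "compression"]
assert decode(encode(msg)) == msg
# A 25,000-word dictionary needs ceil(log2(25000)) bits per index:
print(math.ceil(math.log2(25000)))  # 15
```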
Lempel-Ziv-Welch Coding
• The contents of the dictionary are built dynamically by both encoder and
  decoder.
• Initially the dictionary contains only the basic character set, e.g. ASCII.
• The compression level depends on the number of entries in the dictionary,
  which determines the number of bits needed for an index.
• Ex : this is simple as it is
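A minimal LZW encoder for the example above, simplified to seed the dictionary with only the characters actually used (a real implementation starts from the full ASCII set):

```python
def lzw_encode(text):
    """LZW: start the dictionary with single characters, extend it with each
    new string seen, and transmit only dictionary indices."""
    # Initial dictionary: the character set (here, just the characters used).
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    out, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                    # keep extending the match
        else:
            out.append(dictionary[current])  # emit index of longest match
            dictionary[current + ch] = len(dictionary)  # add new entry
            current = ch
    out.append(dictionary[current])
    return out, dictionary

indices, d = lzw_encode("this is simple as it is")
print(indices)
print(len(d))  # the dictionary has grown beyond the initial character set
```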
Lempel-Ziv-Welch (LZW) compression algorithm
Basic operation
Reference
• Multimedia Communications: Applications, Networks, Protocols and
  Standards – Fred Halsall.