19IT422T – INFORMATION EXTRACTION AND RETRIEVAL TECHNIQUES
UNIT I - INTRODUCTION INFORMATION EXTRACTION
Introduction – Origins – Text, Audio, Image, Video Extraction – Visual Object Feature Localization –
Entropy-based Image Analysis – 3D Shape Extraction Techniques – Semantic Multimedia Extraction using
Audio & Video – Multimedia Web Documents.
Introduction to Information Extraction
Information extraction (IE) is the automated retrieval of specific information related to a selected topic
from a body or bodies of text.
Information extraction tools make it possible to pull information from text documents, databases,
websites, or multiple sources. IE may extract information from unstructured, semi-structured, or
structured, machine-readable text. Most often, however, IE is used in natural language processing (NLP)
to extract structured information from unstructured text.
Information extraction depends on named entity recognition (NER), a subtask used to find targeted
information to extract. NER first assigns entities to one of several categories, such as location (LOC),
person (PER), or organization (ORG). Once the category is recognized, an information extraction utility
extracts the named entity's related information and constructs a machine-readable document from it,
which algorithms can process further to extract meaning. IE derives meaning through other subtasks as
well, including coreference resolution, relationship extraction, language and vocabulary analysis, and
sometimes audio extraction.
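As a toy illustration of the recognition step, the sketch below tags entities by dictionary (gazetteer) lookup. The names and categories here are invented for the example; real NER systems use trained statistical or neural models rather than fixed lists.

```python
import re

# Tiny hand-made gazetteers standing in for a trained NER model.
GAZETTEER = {
    "PER": {"Alice Smith", "Bob Jones"},
    "ORG": {"Reuters", "DARPA"},
    "LOC": {"London", "Paris"},
}

def tag_entities(text):
    """Return sorted (entity, category) pairs found via dictionary lookup."""
    found = []
    for category, names in GAZETTEER.items():
        for name in names:
            # \b word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.append((name, category))
    return sorted(found)

sentence = "Alice Smith of Reuters filed a report from London."
print(tag_entities(sentence))
```

Downstream extraction would then attach relations to these tagged spans, e.g. linking the PER entity to the ORG entity via a "works for" relation.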
IE dates back to the late 1970s, in the early days of natural language processing. JASPER, a system
built for Reuters by the Carnegie Group, is an early example. Current efforts in multimedia document
processing include automatic annotation and content recognition and extraction from images and video,
which can be seen as IE as well.
Because of the complexity of natural language, high-quality IE is a challenging task for artificial
intelligence (AI) systems.
Origin
Information extraction dates back to the late 1970s, in the early days of NLP. An early commercial
system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group Inc. with the aim of
providing real-time financial news to financial traders.
Beginning in 1987, IE was spurred by a series of Message Understanding Conferences. MUC is a
competition-based conference that focused on the following domains:
MUC-1 (1987), MUC-2 (1989): Naval operations messages.
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
MUC-5 (1993): Joint ventures and microelectronics domain.
MUC-6 (1995): News articles on management changes.
MUC-7 (1998): Satellite launch reports.
Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which
wished to automate mundane tasks performed by government analysts, such as scanning newspapers
for possible links to terrorism.
Origins of Information Extraction
The origins of information extraction can be traced back to the field of natural language processing (NLP)
and information retrieval. Here is a brief overview of the origins of information extraction:
Information Retrieval:
The field of information retrieval emerged in the mid-20th century with the goal of developing
techniques to search, retrieve, and organize large volumes of textual information. Early work focused on
keyword-based indexing and retrieval systems, where users could search for documents based on
specific terms or queries.
Named Entity Recognition:
In the 1990s, research on named entity recognition (NER) began to gain traction. NER aimed to
automatically identify and classify named entities, such as names of people, organizations, locations, or
other specific types of entities, within text documents. NER paved the way for more advanced
information extraction techniques.
Information Extraction:
The concept of information extraction as a field within NLP emerged in the late 1990s and early 2000s.
Information extraction aimed to go beyond simple keyword-based retrieval and focused on
automatically extracting structured information from unstructured or semi-structured text sources.
Techniques were developed to identify and extract specific types of information, such as relationships
between entities, events, or attributes from textual data.
Rule-Based and Template-Based Approaches:
Early information extraction systems often relied on rule-based or template-based approaches. These
approaches involved manually defining extraction rules or templates to identify and extract specific
information based on patterns or regular expressions. Although effective for specific domains or
applications, these approaches were limited in their scalability and adaptability to different data
sources.
Machine Learning Approaches:
The field of information extraction saw significant advancements with the adoption of machine learning
techniques, especially with the rise of statistical and probabilistic models. Machine learning approaches
allowed for the development of more flexible and data-driven extraction methods. Supervised learning
algorithms, such as Support Vector Machines (SVM) and Conditional Random Fields (CRF), were applied
to train models for information extraction tasks.
Relation Extraction and Event Extraction:
Within information extraction, specific subtasks gained attention, such as relation extraction and event
extraction. Relation extraction aimed to identify and classify relationships between entities, such as
"born in," "works for," or "married to." Event extraction focused on identifying and extracting specific
events or actions described in text documents.
Deep Learning and Neural Networks:
In recent years, deep learning and neural network-based approaches have revolutionized information
extraction. Techniques such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks
(CNNs), and Transformer models, like BERT and GPT, have shown remarkable performance in various
information extraction tasks by capturing complex patterns and contextual dependencies within text
data.
Today, information extraction continues to advance, incorporating a combination of rule-based systems,
machine learning techniques, and deep learning models. The field has expanded beyond text to
encompass multimedia data, such as images, audio, and video, leveraging advanced techniques from
computer vision and signal processing. The ongoing developments in natural language understanding
and multimodal analysis promise further advancements in the extraction of meaningful information
from diverse sources.
Text, Audio, Image, Video Extraction
Text, audio, image, and video extraction refer to the process of extracting relevant information from
these different media formats. Here's an overview of how extraction can be performed for each of these
formats:
Text Extraction:
Text extraction involves analyzing textual data to extract meaningful information. Techniques for text
extraction include natural language processing (NLP) tasks such as named entity recognition (NER),
entity linking, key phrase extraction, sentiment analysis, topic modeling, and text summarization. These
techniques enable the extraction of entities, relationships, sentiments, and other valuable information
from text documents.
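Of the techniques listed above, key phrase extraction is simple enough to sketch directly. The minimal version below ranks words by raw term frequency after removing stopwords; the stopword list and sample document are illustrative, and practical systems use richer scoring such as TF-IDF.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "from"}

def key_phrases(text, k=3):
    """Rank single-word key phrases by raw term frequency, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

doc = ("Information extraction pulls structured information from text. "
       "Extraction techniques analyse text to find information.")
print(key_phrases(doc))
```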
Audio Extraction:
Audio extraction involves analyzing audio data to extract information. Techniques for audio extraction
include speech recognition, which converts spoken words into text, allowing for transcription and
analysis of audio content. Audio event detection and classification can identify and extract specific
sounds or events within audio recordings. Emotion recognition can be used to detect and analyze
emotional states conveyed in audio. These techniques enable the extraction of spoken content, sound
events, and emotional information from audio sources.
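A minimal sketch of audio event detection is a short-time energy threshold: frames whose mean squared amplitude exceeds a threshold are flagged as candidate events. The sample values, frame size, and threshold below are invented for illustration; real systems read audio files and use spectral features on top of energy.

```python
def detect_events(samples, frame_size=4, threshold=0.5):
    """Return indices of frames whose mean squared amplitude exceeds threshold."""
    events = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / frame_size  # short-time energy
        if energy > threshold:
            events.append(i // frame_size)
    return events

quiet = [0.01, -0.02, 0.01, 0.0]
loud = [0.9, -0.8, 0.85, -0.9]
print(detect_events(quiet + loud + quiet))  # only the middle frame is loud
```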
Image Extraction:
Image extraction involves analyzing images to extract information. Techniques for image extraction
include object detection, which can identify and locate specific objects or regions of interest within
images. Image classification can classify images into different categories or labels. Text recognition using
optical character recognition (OCR) can extract text from images. Image segmentation can be used to
separate different regions or objects within an image. These techniques enable the extraction of objects,
text, and other visual information from images.
Video Extraction:
Video extraction involves analyzing video data to extract information. Techniques for video extraction
include object tracking, which can track and analyze the movement of objects across video frames.
Action recognition can identify and classify specific actions or events occurring in a video. Speech-to-text
transcription can extract spoken content from video recordings. Facial recognition can detect and
identify individuals in the video. These techniques enable the extraction of object movements, actions,
spoken content, and facial information from videos.
Overall, text, audio, image, and video extraction techniques leverage various methodologies and
algorithms to extract meaningful information from these different media formats. The extracted
information can be utilized for applications such as data analysis, content indexing, search,
recommendation systems, and more.
Visual Object Feature Localization
Visual object feature localization refers to the process of identifying and localizing specific objects or
features within an image or a video. It involves using computer vision techniques to detect and precisely
locate objects or regions of interest in visual data.
Here's an overview of how visual object feature localization can be achieved:
Object Detection:
Object detection algorithms are used to locate and identify multiple objects within an image or a video
frame. These algorithms analyze the visual data and output bounding boxes that enclose the detected
objects, along with their corresponding class labels. Popular object detection algorithms include Faster
R-CNN, YOLO (You Only Look Once), and SSD (Single Shot MultiBox Detector).
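A quantity these detectors all rely on is intersection-over-union (IoU), which scores how well a predicted bounding box overlaps a reference box; it is used both to evaluate detections and to suppress duplicates. A straightforward implementation for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # overlap 25, union 175, ≈ 0.143
```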
Object Recognition and Classification:
Once objects are detected, additional analysis can be performed to recognize and classify them into
specific categories. This involves assigning a label or a class to each detected object. Convolutional
Neural Networks (CNNs) are commonly used for object recognition and classification tasks.
Semantic Segmentation:
Semantic segmentation goes beyond object detection and aims to segment an image into different
regions corresponding to specific object classes or features. This technique assigns a label to each pixel
in the image, allowing for precise localization of objects and their boundaries. Semantic segmentation is
commonly achieved using CNN-based architectures such as U-Net, SegNet, or Mask R-CNN.
Key Point Localization:
Key point localization involves identifying and localizing specific points of interest or landmarks within an
image. These points could represent facial landmarks (e.g., eyes, nose, mouth), keypoints on objects
(e.g., corners, edges), or any other distinguishable features. Keypoint detection algorithms, such as SIFT
(Scale-Invariant Feature Transform) or Harris Corner Detection, are often employed for this purpose.
Image Captioning:
Image captioning techniques aim to generate textual descriptions that accurately describe the content
of an image. This process involves both object detection and natural language processing. By localizing
objects or regions within the image, relevant information can be extracted and used to generate
descriptive captions.
Visual object feature localization techniques find applications in various domains, including image and
video analysis, autonomous driving, robotics, surveillance, augmented reality, and many others. They
enable machines to understand and interact with visual data by identifying and precisely localizing
objects or features of interest.
Entropy-based Image Analysis
Entropy-based image analysis can be used in information extraction from images to identify and extract
relevant information. Here are a few ways entropy-based analysis can be applied:
Text Extraction:
Entropy analysis can help identify regions in an image that contain text. Text regions often exhibit higher
entropy due to the presence of varying pixel values representing characters. By analyzing the entropy
distribution across the image, text regions can be localized and extracted for further processing, such as
optical character recognition (OCR) to convert the text into machine-readable format.
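The measure underlying all of these uses is Shannon entropy over a region's grayscale histogram, H = -Σ p_i log2 p_i, where p_i is the fraction of pixels with value i. A flat region scores near zero while a busy (e.g. text-bearing) region scores high. A minimal sketch on flat pixel lists:

```python
import math
from collections import Counter

def patch_entropy(pixels):
    """Shannon entropy in bits of a flat list of grayscale pixel values."""
    counts = Counter(pixels)
    n = len(pixels)
    # H = -sum(p * log2(p)) over the value histogram.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

flat_patch = [128] * 16       # uniform region: 0 bits
busy_patch = list(range(16))  # 16 distinct values: 4 bits
print(patch_entropy(flat_patch), patch_entropy(busy_patch))
```

Sliding this measure over image windows yields an entropy map; thresholding that map localizes the high-entropy regions the paragraphs above describe.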
Object Detection and Segmentation:
High entropy regions in an image can indicate the presence of objects or regions with complex patterns
or structures. By segmenting the image based on entropy levels, objects or regions of interest can be
separated from the background. This approach can assist in tasks like object detection, where the high
entropy regions are likely to correspond to objects that need to be identified and extracted.
Image Classification:
Entropy analysis can be utilized as a feature in image classification tasks. The entropy of different image
patches or regions can provide information about their complexity or diversity. By incorporating entropy
as a feature, machine learning algorithms can learn to differentiate between different classes or
categories of images based on the entropy characteristics of the regions within them.
Image Forensics:
In the context of image forensics, entropy analysis can be used to detect tampering or manipulation.
Manipulated regions in an image often exhibit changes in entropy compared to the surrounding
unaltered regions. By analyzing the entropy distribution across the image, inconsistencies or anomalies
can be identified, aiding in the detection of forged or manipulated regions.
Salient Region Extraction:
Entropy-based analysis can help identify salient or visually important regions within an image. Higher
entropy regions often correspond to regions that contain visually distinct or unique information. By
analyzing the entropy map of an image, salient regions can be identified and extracted, which can be
useful in applications such as content-based image retrieval or attention-based image processing tasks.
By leveraging entropy-based image analysis techniques, it becomes possible to extract valuable
information from images. Whether it is text extraction, object detection, image classification, image
forensics, or salient region extraction, entropy analysis provides a useful measure to identify and extract
relevant information from images.
3D Shape Extraction Techniques
Extracting 3D shape information from various sources is essential for applications like computer vision,
virtual reality, augmented reality, robotics, and more. Several techniques are employed to extract 3D
shape information from different types of data. Here are some common techniques for 3D shape
extraction:
Stereoscopic Vision:
Stereoscopic vision involves capturing images or videos of a scene using two or more cameras
positioned from slightly different viewpoints. By analyzing the disparities or differences between
corresponding pixels in the images, depth information can be calculated using techniques like
triangulation. This depth information can be used to reconstruct the 3D shape of the scene or objects.
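For a rectified stereo pair, the triangulation above reduces to the standard relation Z = f·B/d, where f is the focal length in pixels, B the baseline between cameras, and d the disparity in pixels. The rig parameters below are hypothetical values chosen for illustration:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulated depth Z = f*B/d for a rectified stereo pair.

    focal_px: focal length in pixels; baseline_m: camera separation in metres;
    disparity_px: horizontal pixel offset of the same point in the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, 12 cm baseline, 40 px disparity.
print(depth_from_disparity(700, 0.12, 40))  # ≈ 2.1 metres
```

Note the inverse relationship: nearby objects produce large disparities, and depth resolution degrades as disparity shrinks toward zero.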
Structured Light:
Structured light techniques project a known pattern of light, such as grids or stripes, onto the object or
scene. The deformation of the pattern on the object's surface is captured by one or more cameras, and
by analyzing the distortions, the 3D shape can be reconstructed. This approach is commonly used in
depth-sensing cameras like Microsoft Kinect.
Time-of-Flight (ToF):
Time-of-Flight cameras emit a modulated light signal, usually in the infrared spectrum, and measure the
time it takes for the signal to travel to the object and back. By measuring the time delay, the distance to
the object can be calculated, providing depth information. Multiple depth measurements can then be
used to reconstruct the 3D shape.
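The ToF geometry is a one-line formula: the light travels to the target and back, so distance d = c·t/2 for a measured round-trip time t. A sketch with an illustrative 20 ns measurement:

```python
C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_s):
    """Distance to target from a time-of-flight round trip: d = c*t/2."""
    return C * round_trip_s / 2.0

# A 20 ns round trip corresponds to roughly 3 metres.
print(tof_distance(20e-9))
```

The tiny timescales involved (nanoseconds per metre) are why practical ToF cameras measure the phase shift of a modulated signal rather than timing individual pulses directly.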
LiDAR (Light Detection and Ranging):
LiDAR uses laser light to measure distances to objects and generate 3D point clouds. It emits laser pulses
and measures the time it takes for the light to reflect back. By scanning the laser across the scene or
using a rotating scanner, a dense point cloud is generated, representing the 3D shape of the
environment.
Photometric Stereo:
Photometric stereo relies on capturing multiple images of an object under different lighting conditions.
By analyzing the variations in the pixel intensities across the images, the surface normals of the object
can be estimated. From these normals, the 3D shape of the object can be reconstructed.
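A minimal photometric-stereo sketch, under the classic assumptions (Lambertian surface, known distant lights, no shadows): each intensity is the dot product of the light direction and the scaled normal. Choosing three orthogonal unit lights along the x, y, and z axes makes the light matrix the identity, so the intensity vector equals albedo times the normal and normalizing it recovers the normal. The intensities below are invented for the example; general light configurations require solving a 3x3 linear system instead.

```python
import math

def normal_from_intensities(i_x, i_y, i_z):
    """Unit surface normal of a Lambertian point lit in turn by three
    orthogonal unit lights along the x, y, and z axes. With that light
    matrix equal to the identity, I = albedo * n, so normalizing I gives n."""
    norm = math.sqrt(i_x ** 2 + i_y ** 2 + i_z ** 2)  # equals the albedo
    return (i_x / norm, i_y / norm, i_z / norm)

# A point with true normal (0, 0.6, 0.8) and albedo 0.5 would yield
# intensities (0, 0.3, 0.4) under the three lights.
print(normal_from_intensities(0.0, 0.3, 0.4))
```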
Depth from Focus:
Depth from Focus techniques exploit the varying focus of an imaging system. By capturing multiple
images of the same scene with different focus settings, the depth information can be estimated by
analyzing the variations in sharpness or focus across the images.
Structure from Motion (SfM):
Structure from Motion techniques utilize a sequence of images captured from different viewpoints. By
tracking the movement of image features across the sequence, the 3D structure of the scene can be
reconstructed by estimating camera poses and triangulating feature correspondences.
These techniques offer different approaches to extract 3D shape information from various sources, such
as stereo images, depth sensors, LiDAR, and sequential image data. Depending on the available data and
the requirements of the application, the appropriate technique can be selected to obtain accurate and
detailed 3D shape representations.
Semantic Multimedia Extraction using Audio & Video
Semantic multimedia extraction using audio and video involves extracting meaningful information,
such as objects, events, actions, emotions, or concepts, from audio and video data. It aims to
understand and interpret the content of multimedia sources at a higher semantic level. Here are
some techniques commonly used for semantic multimedia extraction:
Speech Recognition and Transcription:
Automatic Speech Recognition (ASR) techniques are used to convert spoken words in audio into
text. By transcribing the audio, the extracted textual information can be further processed and
analyzed for various applications, including indexing, search, and summarization.
Audio Event Detection:
Audio event detection focuses on recognizing and classifying specific sound events within the audio.
This involves training models to detect and identify sounds such as applause, laughter, sirens, or
musical instruments. This information can be used to understand the context or events occurring in
the audio.
Speaker Diarization:
Speaker diarization techniques aim to identify and distinguish different speakers in an audio or video
recording. By segmenting the audio into speaker-specific segments, it becomes possible to associate
speech with individual speakers, enabling speaker-related analysis or identification.
Visual Object Detection and Recognition:
Computer vision techniques can be applied to analyze the visual content of video frames. Object
detection algorithms can identify and locate specific objects or regions of interest within the video
frames. Object recognition goes a step further by classifying the detected objects into specific
categories.
Action and Event Recognition:
Action and event recognition techniques focus on identifying and categorizing specific actions or
events occurring in a video. By analyzing the motion patterns and temporal relationships between
objects in video frames, algorithms can recognize activities such as walking, running, or sports
events.
Emotion Recognition:
Emotion recognition aims to detect and understand the emotional states or expressions of
individuals in a video or audio. By analyzing facial expressions, body language, or voice
characteristics, algorithms can identify emotions such as happiness, sadness, anger, or surprise.
Concept and Scene Understanding:
Techniques such as image and video captioning, scene classification, or semantic segmentation can
be applied to extract higher-level concepts and understand the overall context of the multimedia
data. This involves associating textual descriptions or semantic labels with different elements or
scenes within the audio and video.
These techniques can be combined and integrated to create comprehensive systems for semantic
multimedia extraction from audio and video data. By analyzing and interpreting the content at a
higher semantic level, it becomes possible to extract valuable information, enable advanced search
and retrieval, support content recommendation systems, or enhance multimedia understanding for
various applications.
Multimedia Web Documents
Multimedia web documents refer to web pages or documents that incorporate various forms of media,
including text, images, audio, video, and interactive elements. Extracting information from multimedia
web documents involves the process of analyzing and extracting meaningful data from these different
media formats. Here are some techniques used in information extraction from multimedia web
documents:
Text Extraction:
Text extraction techniques focus on extracting textual content from web documents. This can involve
parsing HTML or other document formats to identify and extract text elements, such as headings,
paragraphs, captions, or metadata. Natural language processing (NLP) techniques can be applied to
further analyze and extract specific information from the extracted text.
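The HTML-parsing step can be sketched with the standard library alone: the parser below walks the document's tags and collects visible text while skipping script and style content. The sample page is invented for the example; production pipelines typically use more robust parsers for malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth_skipped = 0  # >0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth_skipped += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth_skipped:
            self.depth_skipped -= 1

    def handle_data(self, data):
        if not self.depth_skipped and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Body text.</p></body></html>")
print(extract_text(page))  # → Title Body text.
```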
Image Analysis:
Image analysis techniques can be used to extract information from images embedded within web
documents. This can involve tasks such as object detection, image classification, or optical character
recognition (OCR) to recognize text within images. By analyzing the visual content, relevant information
can be extracted from images and associated with the web document.
Video Processing:
Video processing techniques are employed to extract information from videos embedded in web
documents. This can involve video summarization to extract key frames or representative segments,
object tracking and recognition within the video, or speech-to-text transcription to extract spoken
content. These techniques enable the extraction of valuable information from video elements in web
documents.
Audio Analysis:
Audio analysis techniques are utilized to extract information from audio elements within web
documents. This can involve speech recognition to transcribe spoken content, audio event detection to
identify specific sounds or events, or emotion recognition to determine the emotional states conveyed
in the audio. By analyzing the audio, relevant information can be extracted and associated with the web
document.
Multimedia Fusion:
Extracting information from multimedia web documents often requires fusing information from
different media formats. By combining the extracted information from text, images, audio, and video, a
more comprehensive understanding of the web document can be achieved. This can involve techniques
such as cross-media analysis, where information from one media format is used to enhance the
extraction and understanding of information from other formats.
Metadata Extraction:
Extracting metadata from multimedia web documents is also important. This involves analyzing the
document structure, HTML tags, or metadata attributes associated with different media elements.
Extracted metadata can provide valuable information such as authorship, publication dates, geolocation,
or licensing information, which enhances the understanding and categorization of the web document.
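A minimal sketch of the metadata step, again with the standard library: the parser below gathers name/content pairs from meta tags. The author and date values are invented for the example, and real documents may also carry metadata in other attributes (e.g. property for Open Graph tags) not handled here.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Gather <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

def extract_metadata(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta

doc = ('<head><meta name="author" content="J. Doe">'
       '<meta name="date" content="2021-05-01"></head>')
print(extract_metadata(doc))
```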
These techniques can be combined and applied in an integrated information extraction pipeline to
extract relevant information from multimedia web documents. The extracted information can be used
for tasks such as search and retrieval, content analysis, recommendation systems, or knowledge base
creation, enabling a deeper understanding of multimedia-rich web content.