


Indexing in Big Data Mining and Analytics

Ali Usman Abdullahi¹ [0000-0001-9645-3642], Rohiza Ahmad², and Nordin M. Zakaria²

¹ Computer Science Education Department,
Federal College of Education (Tech), Gombe, Nigeria
usmanali@fcetgombe.edu.ng
http://www.fcetgombe.edu.ng
² Computer and Information Sciences Department,
Universiti Teknologi PETRONAS, Seri Iskandar, Perak, Malaysia
rohiza_ahmad@utp.edu.my, nordinzakaria@utp.edu.my

Abstract. Big data analytics is one of the best ways of extracting value and benefit from the hugely accumulated data. The rate at which global data is accumulating, and the rapid and continuous interconnection of people and devices, is overwhelming. This poses the additional challenge of finding even faster techniques for analyzing and mining big data, despite the emergence of dedicated big data tools. Indexing and indexing data structures have played an important role in providing faster and improved ways of achieving data processing, mining and retrieval in relational database management systems (RDBMS). In doing so, indexes have aided data mining by taking less time to process and retrieve data. Indexing techniques and data structures have the potential of bringing the same benefits to big data analytics if properly integrated into big data analytical platforms. A great deal of research has been conducted in that direction, and this chapter attempts to bring forward how indexing techniques have been used to benefit big data mining and analytics. Hence, it can bring the impact that indexing has on RDBMS to the fold of big data mining and analytics.

Keywords: Big data · Big data analytics · Data mining · Indexing techniques · Index.

1 Introduction
The exponentially growing nature of global data was collated and presented by Statista in their report titled "Volume of data/information created worldwide from 2010 to 2025". The report indicated that the world's overall volume of both created and copied data would be 50.5 ZB as of 2020. This volume is also expected to rise roughly three-fold within five years, to an estimated 175 ZB by 2025 [40]. This fact gives a sense of the hugeness of the data that is termed Big Data [10], [9]. This global data is accumulated in various forms, comprising structured, semi-structured and unstructured masses of data that dearly need analysis. Since the data are collated and stored in datasets, those enormous datasets are also referred to as Big Data/Datasets [36], [5], [46].
According to Chen et al. [10] and Chen et al. [9], the increase in Volume indicates how big the generation, collection and scaling of data masses have become, while the increase in Velocity indicates the need for timely and rapid collection and analysis of big data in order to maximally utilize its commercial value. The increase in Variety indicates that big data comprises various forms, which may include unstructured and semi-structured data besides the usual structured data [4].
The IDC, on its part, presented a different opinion on big data. In its 2011 report, IDC viewed big data from a technological and architectural angle, as data designed for the extraction of economic value. The economic value comes from the very large volume and widely varied data, which allow high-velocity discovery, capture, and analysis [46].
The IDC definition added another 'V' to make it a 4Vs model, with the fourth 'V' being Value. Therefore, a broader definition of big data can be derived from all the earlier ones as the term used to describe enormous data sets whose contents are characterized by the four Vs: i. Volume - very large amounts of data; ii. Variety - different forms of data, i.e. structured, semi-structured, and unstructured, gathered from different sources including images, documents and complex records; iii. Velocity - the data has constantly changing contents, which come from complementary data collection, archived and streamed data [10] [6] [47]; iv. Value - very huge value and very low density [10].
In addition, recent literature includes Veracity as the fifth 'V' characteristic of big data, on the conviction that none of the earlier described characteristics of big data covers it. Veracity means that there are uncertainties and effects of accuracy on the quality of collected data [5] [4] [44] [46].
The value in big data is extracted by proper and efficient retrieval of the data from the big datasets. Thus, fast processing of the retrieved data is a determinant of faster and timely analysis and access to the required data. Indexing is one of the most useful techniques for faster data retrieval during processing and accessing, and the data-driven processes of Industry 4.0 are in need of such faster data retrieval. Once the retrieval and access processes involving big data are made faster, a lot of benefits are achieved. These benefits include saving energy/power, improving hardware durability and reducing heat generation.

1.1 Objective of the Chapter

The main objective of this chapter is to present the indexing techniques used for searching and effective retrieval of data during data mining. The strategy of all the indexing techniques is to restrict the amount of the input data to be processed during the mining of any dataset. The chapter highlights those indexing techniques that have the potential of working better with big data.
1.2 Taxonomy of the Chapter

The focus of the chapter is to include all relevant literature searched and downloaded from the major publication databases, including IEEE Xplore, DBLP, Scopus, Springer and others. Then, a review of the titles, abstracts, introductions and conclusions of papers that covered the application of indexing in big data mining and analytics was conducted. This was done using selection criteria in order to pick the papers that matched and highlighted the structure of the various indexing approaches used for big data mining and analytics. In addition, the chapter identifies the indexing approaches that have better potential for big data mining and those that have less potential. Fig. 1 depicts a diagrammatic sketch of the taxonomy used for the chapter.

Fig. 1: A Sketch of the Taxonomy used for the Chapter

The chapter's perspective is to center the discussion of the literature on the indexing approaches, their data structures, their modes of application, their pros and cons, and their prospects for big data mining. In addition, the chapter attempts to cover indexing from its application in RDBMS up to its use in big data mining and analytics. This may enable specialists and practitioners in big data mining and analytics to have a wider view of the indexing approaches and of when, how and where to apply each of them for better results.
The remaining part of the chapter is organized as follows. Section 2 presents the basic types of indexing; Section 3 discusses online indexing, which is an improvement of the basic indexing; Section 4 enumerates the indexes built into the MapReduce process, which is a dedicated process for big data processing; then, Section 5 presents the user-defined indexes, and finally, Section 6 concludes the chapter.

2 Index and Indexing


Indexing, as an information retrieval technique, is the process of generating all the suitable data structures that allow for efficient retrieval of stored information [37], while the term index refers to the suitable data structure needed to allow for efficient information retrieval [28]. Usually, the data structures used for indexing do not store the information itself. Rather, they use other data structures and pointers to locate where the data is actually stored. A meta index is also used to store additional information about the stored data, such as the external name of a document and its length, which can be used to rank the output of the index.
In RDBMSs, the index is the data structure that speeds up the operation of data retrieval from database tables. The index comes with the additional cost of writing/reading it into/from the index table. This leads to more storage usage, needed to maintain the extra copy of data created by the index. Indexes are used to locate data quickly by going straight to the data location, without having to search all the rows in the database table every time that table is accessed. Indexes are the basis for providing rapid random lookups and efficient access to ordered records.
In big data, if an index were implemented the same way as it works with RDBMS, there is no doubt it would be so big that the intended fast record retrieval might not be achieved. Therefore, there is a need for a closer look at the indexing technique in a way that keeps it reasonably small while maintaining the targeted goal of faster record retrieval.

2.1 Index Architecture and Indexing Types

The index has two architectures: the Non-Clustered and the Clustered index, as shown in Figure 2. The Non-Clustered architecture presents data in an arbitrary order, but maintains a logical ordering in which rows may be spread out in a file/table without considering the indexed column expression. This architecture uses tree-based indexing that has sorted index keys and pointers to records at its leaf-node level. The Non-Clustered index architecture is characterized by the order of the physical rows of the indexed data differing from the order in the index [34]. The Non-Clustered index sketch is displayed in Figure 3.

Fig. 2: A Sketch of Index Types


Fig. 3: A Sketch of Non-Clustered Index

On the other hand, the Clustered index architecture arranges the blocks of data into a certain distinct order to match the index. This results in the ordering of the row data. It can greatly increase the overall speed of retrieval for sequentially accessed data, for the reverse order of the index, or among a selected range of items. The major characteristic of the Clustered index architecture is that the ordering of the physical data rows is in accordance with that of the index blocks that point to them [31]. The Clustered index sketch is displayed in Figure 4.

Fig. 4: A Sketch of clustered Index

There are many types of indexing in existence; below is an overview of some of them, together with some studies that involved each of them and their applicability in big data situations.

2.2 Bitmap index

The Bitmap index uses bit arrays called bitmaps to store the bulk of its data. Bitmap is a special type of index that uses bitwise logical operations on the bitmaps to answer most of the queries run against it. The Bitmap index basically works in situations where index values are repeated very frequently, unlike the other commonly used index types, which are most efficient when indexed values are not repeated at all or are repeated a small number of times. The gender field of a database table is a good example for a bitmap index: no matter how many tuples there are in the table, the field will only have two possible values, male or female. A typical bitmap index data structure is shown in Figure 5.

Fig. 5: A Bitmap Index formed from a Gender Table
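To make the idea concrete, below is a minimal, in-memory Python sketch (with a hypothetical table and column names) of how a bitmap index over a low-cardinality field such as gender can be built and used to answer queries:

```python
# Minimal bitmap index sketch over a low-cardinality column (hypothetical data).
rows = [
    {"id": 1, "gender": "male"},
    {"id": 2, "gender": "female"},
    {"id": 3, "gender": "female"},
    {"id": 4, "gender": "male"},
]

def build_bitmap_index(rows, column):
    """Build one bitmap (a list of 0/1 flags, one per row) per distinct value."""
    bitmaps = {}
    for position, row in enumerate(rows):
        value = row[column]
        bitmaps.setdefault(value, [0] * len(rows))
        bitmaps[value][position] = 1
    return bitmaps

gender_index = build_bitmap_index(rows, "gender")
# {'male': [1, 0, 0, 1], 'female': [0, 1, 1, 0]}

# A query such as "gender = 'female'" is answered by reading one bitmap;
# combined predicates would AND/OR the bitmaps element-wise.
female_bitmap = gender_index["female"]
matching_rows = [rows[i] for i, bit in enumerate(female_bitmap) if bit]
print(matching_rows)
```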

The bitmap index has been used in a wide variety of areas, including big data analytics. For instance, the bitmap index has been applied in the area of index compression, where it helps compression work better with high-cardinality attribute data. Wu et al. [41] present a study and analysis of some of the compression techniques that use bitmap indexing, namely Byte-aligned Bitmap Compression (BBC) and Word-Aligned Hybrid (WAH). These techniques were able to reduce compressed data sizes and improve performance. The authors' motivation was the fact that most of the empirical research does not include comparative analysis among these different techniques; the result of their own work showed that compressed bitmap indexes appear to be smaller in size than the B+-Tree, with WAH occupying half of the space of the B+-Tree and BBC occupying half the space of WAH.
Another work was done by Fusco et al. [13], using a bitmap index as a compression approach to minimize CPU workload and the consumption rate of disk. The platform used for the work was streaming network data, which requires real-time indexing. The authors introduced what they called COMPAX, a variant of the compressed bitmap index that supersedes the Word-Aligned Hybrid (WAH) in terms of indexing throughput, shorter retrieval time and higher compression rate. The NETwork Flow Index (NET-FLI), which highly optimizes real-time indexing as well as data retrieval from large-scale network repositories, was used. NET-FLI synergizes COMPAX and Locality Sensitive Hashing (LSH), used for stream reordering in an online setup, to achieve the target of the research. This combination results in higher insertion rates of up to 1 million flows per second, many folds over what is obtainable in typical commercial network flows. The solution also allows administrators to perform complex analysis jobs.

In addition, an effort was made by [29] to automate the indexing process as well as to resolve the Index Selection Problem (ISP) using bitmaps. They came up with a technique that merges the features of Index Selection Techniques (IST) with those of linear programming for optimization to minimize cost. The result was a new method that solves the ISP externally and uses the optimizer for choosing the set of indexes to be used. The clustering data mining technique was used, and when benchmarked it outperformed the Microsoft SQL Server Index Selection Tool (IST) in terms of speed of selection and suggestion of indexes. Even though the bitmap index was designed for RDBMS, it has also been used with some high-level query languages (HLQLs) on big data to improve information retrieval. For example, a group of students incorporated a bitmap index into Hive to improve the retrieval of Facebook users' information using the gender field [24].

The Bitmap technique has been tried as a candidate for indexing in big data, and in some specific situations it gives improved results. However, the Bitmap index generates a large volume of data as its data structure, alongside the volume of the big data itself. Hence, it cannot be used for general data retrieval in big data, because big data contains a wide variety of data values.

2.3 Dense Index

The Dense indexing approach uses a file/table with a pair of keys and pointers for each record in the data file/table, which is sorted. Every key in the dense index is associated with a specific pointer to one of those records. If the underlying architecture of the index is a clustered one with duplicate keys, the dense index just points to the first record with the given key [15]. Figure 6 displays a sample of the Dense index. In the dense index, each index entry consists of a search-key and a pointer. The search-key holds the value to be searched, while the pointer stores the identifier of the disk location containing the corresponding record and the offset that identifies the point where the record starts within the block.
Fig. 6: A Dense Index on Employee Table
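As an illustration, the following minimal Python sketch (hypothetical employee data; the block size is arbitrary) models a dense index as one (search-key, pointer) entry per record, where the pointer is a block identifier plus an offset:

```python
# Dense index sketch: one (search-key, pointer) entry per record (hypothetical data).
# A pointer here is modelled as a (block_id, offset) pair into a sorted data file.
employees = [  # sorted by the search key (employee id)
    (10, "Aisha"), (20, "Bello"), (30, "Chen"), (40, "Dada"),
]
BLOCK_SIZE = 2  # records per block, chosen only for illustration

dense_index = {
    key: (position // BLOCK_SIZE, position % BLOCK_SIZE)  # (block id, offset)
    for position, (key, _) in enumerate(employees)
}

def lookup(key):
    """Go straight to the record location instead of scanning every row."""
    block_id, offset = dense_index[key]          # raises KeyError if absent
    return employees[block_id * BLOCK_SIZE + offset]

print(lookup(30))   # (30, 'Chen')
```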

The dense index, as an ordered index, either stores an index entry for all records when the approach it is using is Non-Clustered, or just stores the index entry for the first search-key when using the Clustered index approach [37]. The dense index uses a file to store its index data structure, there is a dedicated pointer to each record, and the data has to be ordered. These two facts suggest that the index is going to grow so big that it will be close in size to that of the stored data. Hence, there are no studies using the dense index in big data retrieval.

2.4 Sparse Index

The sparse indexing method also uses a file/table that contains pairs of search-keys and pointers. The pointers point to blocks instead of individual records in the data file/table, which is sorted in the order of the search-keys. In a sparse index, index entries appear for only some of the search-key values. An index entry is associated with a specific pointer to one of those blocks. If the underlying architecture of the index is a clustered one with duplicate keys, then the index just points to the lowest search-key in each block [15]. To locate any search-key's record, the index entry with the largest value that is less than or equal to the given search-key value is found. Then, the search starts at the record pointed to by that index entry and moves down until the target is found [34] and [37]. Figure 7 depicts the explanation given above for the sparse index.
Fig. 7: A Sparse Index on Employee Table
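For contrast, the sketch below (again with hypothetical data) keeps only one entry per block, keyed by the lowest search-key in the block, and follows the lookup procedure just described: find the largest indexed key that is less than or equal to the target, then scan within that block:

```python
import bisect

# Sparse index sketch: one entry per block, keyed by the lowest search-key
# in that block (hypothetical employee data, sorted by key).
employees = [(10, "Aisha"), (20, "Bello"), (30, "Chen"), (40, "Dada"),
             (50, "Efe"), (60, "Fatima")]
BLOCK_SIZE = 2
blocks = [employees[i:i + BLOCK_SIZE] for i in range(0, len(employees), BLOCK_SIZE)]

# Index entries: the lowest key in each block, in block order.
sparse_keys = [block[0][0] for block in blocks]

def lookup(key):
    """Find the entry with the largest key <= target, then scan that block."""
    block_no = bisect.bisect_right(sparse_keys, key) - 1
    if block_no < 0:
        return None
    for record_key, name in blocks[block_no]:
        if record_key == key:
            return (record_key, name)
    return None

print(lookup(50))   # (50, 'Efe')
```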

The dense and sparse indexes are the most common types of ordered index that most relational databases use for generating query execution plans. However, for both the dense and sparse index types, the use of files/tables to keep the pairs of search-keys and pointers may make them very unsuitable for big data indexing, for two reasons: 1. The records and blocks of a data file for big data will be distributed over different clusters, which will make it very difficult to maintain such files. 2. The volume of the big data will make the size of such index files unnecessarily large, which may lead to unreasonable costs of space and maintenance time.

3 Online Indexes

3.1 Online Indexing


Since most of the indexing techniques have one drawback or the other, big data indexing requires the study of other, modified indexing techniques. Some of these modified indexes include the work of Chaudhuri and Narasayya [8], which introduced Online Indexing. The Online Index is an extension of the use of external tuning tools to optimize the physical design of a database through the analysis of a representative workload set.
This technique works by monitoring the steps of an actual workload, without knowing it up front, to create an index automatically when one of the queries' execution plans is generated. The index is created in the background of a running workload in one complete go. Online indexing has lessened the work on the side of the database administrator and on the side of the system by avoiding the need to create the index externally.
However, online indexing usage has not covered multi and partial indexes, which are necessary and important approaches for query processing in both OLTP and OLAP. The results of a comparison between Online indexing and conventional indexing showed that the Online Index performed better than the conventional one for the following reasons: knowledge of the workload is not required before creating the index, and since the index is created as a side effect of query execution, its entries cover only what is specified in the query's predicates. Hence the better performance. The Online Index can use any of the basic indexing data structures for its implementation, be it a B-Tree or an ordered-table based index [8], [16] and [22].
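The following is an illustrative Python sketch of this "index as a side effect of query execution" idea, not the implementation of [8]: the first query that filters on a given value performs a scan and, as a side effect, records a partial index entry covering only that predicate; repeated queries on the same predicate are then answered from the index:

```python
# Illustrative sketch (not the implementation of [8]): index entries are built
# as a side effect of executing queries, covering only the queried predicates.
table = [
    {"id": 1, "dept": "sales"}, {"id": 2, "dept": "hr"},
    {"id": 3, "dept": "sales"}, {"id": 4, "dept": "it"},
]
online_indexes = {}   # column name -> {predicate value: [row positions]}

def query(column, value):
    col_index = online_indexes.setdefault(column, {})
    if value not in col_index:
        # First time this predicate is seen: answer it with a scan and, as a
        # side effect, record the matching positions (a partial index entry).
        col_index[value] = [pos for pos, row in enumerate(table)
                            if row[column] == value]
    return [table[pos] for pos in col_index[value]]

print(query("dept", "sales"))   # scans once, records a partial index entry
print(query("dept", "sales"))   # answered from the index, no scan
```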
Amir et al. [2] used online indexing to solve the problem of managing unbounded-length keys, which are found in XML paths, IP addresses, multi-dimensional points, multi-key data and multi-precision numbers. This type of data is categorized as big data. These types of string-based keys are usually atomic and indivisible, and hence require a customized comparison data structure. Their proposed online indexing works with any data structure, with a worst-case complexity of O(log n), which they reported to be the best based on their experiment. A suffix tree was cited as one of the application areas of the proposed online indexing [2].

3.2 Database Cracking

Another effort in the direction of 'on-the-fly' indexing was made by Idreos et al. [22]. The authors introduced a combination of automatic index selection and partial indexes, called Database Cracking. This approach uses the side effects of running the current query and future ones to refine the structure of its index. Database Cracking substituted the normal scanning of stored data for index creation and query processing with pivoted partitioning of the data using the predicates from the query.
By doing that, one of the partitions will contain only the tuples that answer the query. The underlying data structure used by the cracker index could be a quicksort-generated B-Tree or a hash, whose partitions are expected to be beneficial to arriving queries.
In database cracking no external index or external tuning tools are required, just the monitoring of queries. The present and arriving queries are used to build and optimize the index. Database cracking proves to be advantageous compared to Online indexing, and by extension to ordinary query processing. However, because the partitioning is carried out with a small fan-out for every predicate (a fan-out of 2-3), it takes a lot of queries to get the index fully optimized [18], [17] and [43].
Figure 8 displays an example of database cracking. Database cracking was implemented in MonetDB as a successful alternative to index scanning [7] [30].
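The following minimal, in-memory Python sketch (hypothetical values, not the MonetDB implementation) illustrates the cracking idea: each range query physically partitions the column around its predicate bounds, and the cracker index remembers the resulting boundaries so that the answer becomes a contiguous slice:

```python
import bisect

column = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]
# cracks: sorted list of (value, position) meaning everything before
# `position` is < value and everything from `position` on is >= value.
cracks = []

def crack_at(value):
    """Partition the piece containing `value` and record the crack position."""
    keys = [v for v, _ in cracks]
    i = bisect.bisect_left(keys, value)
    if i < len(keys) and keys[i] == value:
        return cracks[i][1]                                # already cracked here
    lo = cracks[i - 1][1] if i > 0 else 0                  # piece boundaries from
    hi = cracks[i][1] if i < len(cracks) else len(column)  # neighbouring cracks
    piece = column[lo:hi]
    smaller = [x for x in piece if x < value]
    column[lo:hi] = smaller + [x for x in piece if x >= value]
    pos = lo + len(smaller)
    cracks.insert(i, (value, pos))
    return pos

def range_query(low, high):
    """WHERE low <= v < high: after cracking, the answer is one contiguous slice."""
    start, end = crack_at(low), crack_at(high)
    return column[start:end]

print(range_query(7, 13))   # values v with 7 <= v < 13, e.g. [9, 12, 7, 11, 8]
```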
However, database cracking was observed to be CPU intensive rather than I/O bound. Pirk et al. [30] proposed an enhancement that takes database cracking from being CPU bound to being I/O bound; I/O-bound operation is better suited to big data processing. The authors used approaches such as predication, vectorization, data parallelism and CPU multi-threading on SIMD instructions to achieve the method. The results of their work showed that it is 25 times faster than the original database cracking.

Fig. 8: An Example of Database Cracking

3.3 Adaptive Merge


Graefe and Kuno [18] introduced a modification of database cracking called Adaptive Merge. Adaptive Merge also creates the index by using the predicates specified in a query and uses the partial index to answer that given query. The process also increments and optimizes the index with the help of the underlying data structure, the partitioned B-Tree. Adaptive Merge uses merge sort for the index optimization and works based on a "monitor queries, then build index" approach.
Unlike database cracking, which uses partitioning of its data structure and works only on in-memory databases, adaptive merge works on external block storage such as flash memory and disk, as well as in-memory. In terms of performance, adaptive merge indexing optimizes with far fewer queries than database cracking. This is due to the fact that the merge process's fan-in is unlimited, as against the limited fan-out of partitioning [19] [23]. Idreos et al. [23] then made a further attempt to bring the two approaches into one by replacing the initial index-creation stage of Adaptive Merge. The authors suggested a data structure that does not require many resources to build, such as an array, but optimizes by merging. This attempt was believed to work well with big datasets. Figure 9 presents the partitioned B-Tree generated by the Adaptive Merge index.

Fig. 9: An Adaptive Merge Index based on a Partitioned B-Tree

The last three indexing methods, also referred to as oblivious indexes, work by creating the index on the fly and automatically. The indexes use the concept of monitoring any issued query and then building the index as a side effect of the query's execution. However, all the above-discussed indexing techniques work with RDBMS and in an OLTP approach. OLTP usually has short queries that are frequently posed to the database as the main operational system, so there are a number of arriving queries with which to increment and/or optimize an index if the need arises. On the contrary, the situation is different in the case of OLAP, especially in batch-oriented situations, which are the most common kind of analysis when it comes to big data. Also, the mentioned cases are mostly different when MapReduce is to be used for such analysis.
The drawbacks of RDBMS in processing big data imply that their corresponding index types have the same drawbacks. This prompted big data researchers to customize, and in some cases develop, different indexing strategies for big data analysis, as mentioned earlier. These indexing strategies developed for big data information retrieval systems are being used by big data analytics.

3.4 Big Data Analytics Platforms


Upon realization of the drawbacks of traditional database systems, research efforts continued with more focus in the direction of fast processing of big data, especially with OLAP queries (big data analytics). The analytical challenge of big data indicates that it has outgrown the capacity of traditional DBMS resources. This further prompted the development of new technologies to solve the challenge. As a result, Dean and Ghemawat [11] presented the notable new technology of the Google File System (GFS), which is a distributed file system used for data storage across clusters. The authors also introduced the Google MapReduce programming model, which is used for data analysis and indexing [10,9,14,25].
MapReduce is a programming paradigm that uses a divide-and-conquer approach to problem solving. MapReduce uses a functional programming approach with only two functions: Map and Reduce. The functions are deployed to the chunks of data that are stored on the distributed file system. The tasks to be carried out by the functions are defined by the user, and they are executed in parallel on the various blocks of data. MapReduce works in a key/value filtering and aggregation manner to solve the problem in question [20,26,38,45,25].
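The following single-process Python sketch (not the Hadoop API) illustrates the paradigm: a user-defined map emits key/value pairs, the framework groups them by key, and a user-defined reduce aggregates each group. The word-count style job and the input chunks are hypothetical:

```python
from collections import defaultdict

def map_fn(record):
    """User-defined map: emit one (word, 1) pair per word in the record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """User-defined reduce: aggregate all values emitted for one key."""
    return (key, sum(values))

def run_mapreduce(chunks, map_fn, reduce_fn):
    groups = defaultdict(list)
    for chunk in chunks:                 # chunks would live on a distributed FS
        for record in chunk:
            for key, value in map_fn(record):
                groups[key].append(value)            # shuffle/grouping phase
    return [reduce_fn(key, values) for key, values in groups.items()]

chunks = [["big data needs fast indexing"], ["indexing speeds up big data mining"]]
print(run_mapreduce(chunks, map_fn, reduce_fn))
```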
MapReduce remains the most powerful and the most accepted approach for big data mining and analytics [39,26]. One of the major reasons for its popularity among researchers and industry experts is its flexibility: users can intuitively write code to solve virtually any problem. Besides that, MapReduce also supports very highly parallel programming, even though at a low level. MapReduce was designed to eradicate the input/output (I/O) problem associated with Extract, Load and Transform (ELT) [25].
The design migrates the computation to the various computing units instead of moving data into memory for computation. Furthermore, the Hadoop MapReduce implementation remains one of the fastest reported in handling very large data analysis. In addition, a large percentage of the research experiments involving big data analytics that are found in the literature are done using Hadoop MapReduce, or are related to it or its associated tools. Lastly, reports and releases coming from the big IT companies indicate that most of them use tools that are supported by Hadoop/MapReduce in performing their big data analysis [39,26,33], [35]. Table 1 highlights some features of three attempted solutions for big data analytics.

Table 1: Comparison Between Big Data Analytics Approaches


S/No | Big Data Analytics Approach | Parallel Programming | Dynamic Flexibility | Schema-Less | SQL Support | Extract Load Transfer
1 | MapReduce | Yes | Yes | Yes | No (Not directly) | No
2 | Algebraic Workflow | Yes | No | Yes | Yes | -
3 | AsterixDB | Yes | No (SQL-like) | No (Semi-structured) | Yes | Yes
4 Inherent Indexes in MapReduce

A simple inverted index is said to be inherently and trivially implemented in MapReduce as one of its effective tasks for textual data retrieval. This is highlighted by Dean and Ghemawat [11] in their original paper on MapReduce and cited by Graefe and Kuno [19]. When building an inverted index with MapReduce, the map function parses the split covering a certain input and emits a sequence of <key, value> pairs. The reduce function accepts all pairs with the same key, sorts the corresponding values and emits a <key, list(value)> pair. The set of all output pairs forms a simple inverted index. McCreadie et al. [27] deduced that two interpretations of the above scenario can be a Per-Token indexing or a Per-Term indexing, in relation to the indexing of corpus datasets:
The Per-Token indexing strategy involves the map function emitting a <term, doc-ID> pair for each token in a document, while the reduce function aggregates each unique term with its corresponding doc-IDs to obtain the term frequencies (tf), after which the completed posting list for that term is written to disk. So, if a term appears tf times, it will be indicated as such. The advantage of this strategy is that it makes the map phase very simple. However, it has the potential of costing more memory and time due to the storage of large intermediate results, network traffic, and a prolonged sort phase.

Fig. 10: A Sample of Per Token index using MapReduce
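A single-process Python simulation of the Per-Token strategy described above (hypothetical documents, not an actual Hadoop job) might look as follows:

```python
from collections import Counter, defaultdict

# Per-token strategy: map emits one (term, doc-ID) pair per token occurrence;
# reduce aggregates the doc-IDs per term into a posting list with tfs.
documents = {"d1": "big data big analytics", "d2": "data indexing"}

def map_per_token(doc_id, text):
    for token in text.split():
        yield (token, doc_id)

def reduce_per_token(term, doc_ids):
    tf = Counter(doc_ids)                       # tf per document for this term
    return (term, sorted(tf.items()))           # posting list: [(doc-ID, tf), ...]

groups = defaultdict(list)
for doc_id, text in documents.items():
    for term, emitted_doc_id in map_per_token(doc_id, text):
        groups[term].append(emitted_doc_id)     # every occurrence is emitted

index = dict(reduce_per_token(t, ids) for t, ids in groups.items())
print(index)   # {'big': [('d1', 2)], 'data': [('d1', 1), ('d2', 1)], ...}
```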

On the other hand, the Per-Term indexing strategy uses the map function to emit tuples in the form <term, (doc-ID, tf)>. This reduces the number of emit operations, as only unique terms per document are emitted. The reduce function in this strategy only sorts the instances by document to obtain the final posting list, sorted by ascending doc-ID. The Ivory information retrieval system uses this approach. Also, a combiner function can be used to generate the tfs by performing a localized merge on each map task's output.

Fig. 11: A Sample of Per Term index using MapReduce
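For comparison, a corresponding simulation of the Per-Term strategy, with the map side performing the combiner-style local merge of term frequencies, could look as follows:

```python
from collections import Counter, defaultdict

# Per-term strategy: the map side aggregates term frequencies locally, so only
# one (term, (doc-ID, tf)) pair is emitted per unique term per document;
# reduce just sorts the postings by doc-ID.
documents = {"d1": "big data big analytics", "d2": "data indexing"}

def map_per_term(doc_id, text):
    local_tf = Counter(text.split())            # combiner-style local merge
    for term, tf in local_tf.items():
        yield (term, (doc_id, tf))

def reduce_per_term(term, postings):
    return (term, sorted(postings))             # final posting list by doc-ID

groups = defaultdict(list)
for doc_id, text in documents.items():
    for term, posting in map_per_term(doc_id, text):
        groups[term].append(posting)            # far fewer emits than per-token

index = dict(reduce_per_term(t, p) for t, p in groups.items())
print(index)   # {'big': [('d1', 2)], 'data': [('d1', 1), ('d2', 1)], ...}
```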

4.1 Per Document Indexing

This is the inverted indexing technique used by the Nutch platform on Hadoop to index documents for faster search. Nutch tokenizes the document during the map phase, and the map function emits tuples in the form <document, doc-ID>, while the reduce phase writes all index structures. Though this strategy emits less, the value of each emit tends to carry more data, and it has reduced intermediate results. Thus, it achieves higher levels of compression than single terms. Documents are indexed on the same reduce task easily, due to the sorting of document names [27].

Fig. 12: A Sample of Per Document index using MapReduce

4.2 Per-Posting list Indexing

This indexing technique is based on single-pass indexing, which splits data onto multiple map tasks, with each operating on its own data sub-set. The map task serves as the scanning phase of the single-pass indexing. As this process runs over the documents, compressed posting lists are built in memory for each term. This partial index is flushed from the map task when the memory runs low or when all the documents have been processed. The flushing is done by emitting a set of <term, posting list> pairs for all terms present in memory. Before the intermediate results are taken up by the reduce task, the flushed partial indexes are sorted and stored on disk, first by map number and then by flush number. In order to achieve a globally correct ordering of the posting list for each term, the posting lists are merged by map number and flush number. The term posting lists are merged together by the reduce function to form the standard index comprising full posting lists. The standard index is compressed using the Elias-Gamma technique, by storing only the gaps between doc-IDs [27].
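The flushing behaviour at the heart of this strategy can be sketched as follows (a single-process simulation with an artificially small memory threshold; Elias-Gamma compression is omitted):

```python
from collections import Counter, defaultdict

# Single-pass / per-posting-list sketch: a map task accumulates posting lists
# in memory and flushes them as (term -> partial posting list) runs when a
# memory limit is reached; reduce merges the runs into full posting lists.
MAX_POSTINGS_IN_MEMORY = 4   # artificially small, for illustration only

def map_task(docs):
    in_memory = defaultdict(list)   # term -> [(doc-ID, tf), ...]
    held = 0
    for doc_id, text in docs:
        for term, tf in Counter(text.split()).items():
            in_memory[term].append((doc_id, tf))
            held += 1
        if held >= MAX_POSTINGS_IN_MEMORY:      # memory runs low: flush a run
            yield dict(in_memory)
            in_memory.clear()
            held = 0
    if in_memory:                               # final flush
        yield dict(in_memory)

def reduce_merge(flushed_runs):
    full_index = defaultdict(list)
    for run in flushed_runs:                    # runs arrive in map/flush order
        for term, postings in run.items():
            full_index[term].extend(postings)
    return {term: sorted(p) for term, p in full_index.items()}

docs = [("d1", "big data big analytics"), ("d2", "data indexing"),
        ("d3", "big data mining")]
print(reduce_merge(list(map_task(docs))))
```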
All four strategies of inverted indexing in MapReduce discussed above are the main task focused on and carried out by the MapReduce job. This is in contrast to the primary function of indexing in RDBMS, which is to speed up access to stored data in order to improve the performance of other processes. Thus, the aim of indexing is not only to improve parallel processing of the document contents. Rather, the primary aim of indexing is to improve the performance of the parallel processor itself. This improvement is to be achieved in addition to the underlying parallel processing that MapReduce is programmed to accomplish. Thus, there is a need for an additional indexing scheme that works with the parallel processor and serves the same purpose as what an index does in RDBMS.

5 User defined Indexing in MapReduce

For user-defined indexing used in MapReduce and big data analytics, Yang and Parker [42] employed HDFS's file component as B-Tree nodes to achieve indexing. In their approach, each file contains data and pointers to lower files in the tree hierarchy, which are considered its children. During query processing, the tree is traversed to locate the required segment of data to be processed, using an improved Map-Reduce-Merge-Traverse version of MapReduce. After locating the data, the map, the reduce and the merge tasks are performed on it to return the record set that answers the given query.
Also, An, Wang and Wang [3] used blockIds from HDFS as the search keys of their B+-Tree based index. When a given query is to be processed, the B+-Tree based index is first searched to determine the start and the end of the contiguous blocks that form the index, and the result of the search forms the input data to be scanned. Then, only the blockIds that are returned from that search are used by MapReduce for the main query processing. Hence, through this process, a full scan of the input data is prevented. In addition, Richter et al. [32] used the replica copies stored by HDFS to index different data attributes, which are likely to be used as incoming queries' predicates. When a MapReduce query arrives, their library checks the fields contained in the query's predicates and uses the clustered index built on that field to return the blockIds of the data required to answer the given query.
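The common idea in these studies, i.e. consulting a small sorted structure over block identifiers to prune the input before MapReduce runs, can be sketched as follows (illustrative only, not the code of [3] or [32]; a plain sorted list stands in for the B+-Tree and the blocks are hypothetical):

```python
import bisect

# A sorted structure over block boundary keys stands in for the B+-Tree; a
# query first consults it to find the contiguous range of HDFS blocks worth
# scanning, and only those blocks are handed to the MapReduce job.
blocks = {                       # blockId -> records (hypothetical, key-sorted)
    "blk_0": [(5, "a"), (12, "b")],
    "blk_1": [(20, "c"), (31, "d")],
    "blk_2": [(44, "e"), (58, "f")],
}
block_ids = sorted(blocks)                          # contiguous, ordered blockIds
min_keys = [blocks[b][0][0] for b in block_ids]     # lowest record key per block

def blocks_for_range(low, high):
    """Return only the blockIds whose key range can intersect [low, high]."""
    start = max(bisect.bisect_right(min_keys, low) - 1, 0)
    end = bisect.bisect_right(min_keys, high)
    return block_ids[start:end]

def run_query(low, high):
    selected = blocks_for_range(low, high)           # index lookup, no full scan
    # The MapReduce job would then scan only the selected blocks.
    return [rec for b in selected for rec in blocks[b] if low <= rec[0] <= high]

print(run_query(12, 31))   # scans blk_0 and blk_1 only
```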
Furthermore, in all the mentioned studies, the authors used indexing data structures that scale logarithmically, thereby improving data processing and retrieval. This is done by preventing MapReduce from doing a full scan of the input data, by guiding the process to scan and process only the data that corresponds to the output of the indexes. Moreover, many other studies have been conducted on indexing using different types of big data; however, those studies are closely tied to that type of big data, as the indexing data structure and the index implementation are determined by the nature of the data itself [21]. Table 2 displays a summary of the index approaches, their memory requirements and their big data potential.

6 Conclusion

It can be deduced from the above reviews that user-defined indexes and content indexing as tools for optimizing the performance of information retrieval have been successful. It can also be added that there is high potential for improving big data analytics through the use of advanced indexing techniques and data structures. Particularly, if these techniques and data structures are hybridized, improved and customized to work with MapReduce, they will surely improve the performance of big data analytics.
It has been highlighted that MapReduce is one of the most popular tools for big data analytics. However, the low-level nature of its implementations has given rise to the development of HLQLs, which ease the programmer's task of handling the analysis. The review also highlighted the different types of index in RDBMS and those used in big data analytics on the different big data analytics platforms, including MapReduce and its index approaches. Improvement can be achieved by changing the indexing approach or by using a more efficient data structure.
Table 2: Summary of Indexing Approaches, Their Memory/Storage Requirements and Big Data Potential

S/No | Study | Indexing Approach | Data Structure | Memory/Storage | Potential for Big Data Mining
1 | Wu et al. [41], Fusco et al. [13] | Bitmap | Tabular | Requires large space of memory and storage | Has good potential for big data mining
2 | Garcia-Molina, Ullman and Widom [15], Silberschatz et al. [37] | Dense | Tabular | Requires large space of memory and storage | Not a good approach for big data mining
3 | Garcia-Molina, Ullman and Widom [15], Rys [34] and Silberschatz et al. [37] | Sparse | Tabular | Requires large space of memory and storage | Not a good approach for big data mining
4 | Chaudhuri and Narasayya [8] | Online Indexing | Vectors | Requires small space of memory and storage | A good approach for big data mining
5 | Idreos et al. [22] | Database Cracking | Arrays | Requires small space of memory and storage | A good approach for big data mining
6 | Graefe and Kuno [18] | Adaptive Merge | Tree based | Requires average space of memory and storage | A very good approach for big data mining
7 | Yang and Parker [42] | B-Tree | Tree based | Requires average space of memory and storage | A good approach for big data mining
8 | An, Wang and Wang [3] | B+-Tree | Tree based | Requires small space of memory and storage | A good approach for big data mining
9 | Richter et al. [32] | HAIL | File based | Requires large space of memory and storage | A good approach for big data mining

References

1. Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V., Bu, Y., Carey, M.,
Cetindil, I., Cheelangi, M., Faraaz, K., et al.: Asterixdb: A scalable, open source
bdms. Proceedings of the VLDB Endowment 7(14), 1905–1916 (2014)
2. Amir, A., Franceschini, G., Grossi, R., Kopelowitz, T., Lewenstein, M., Lewenstein,
N.: Managing unbounded-length keys in comparison-driven data structures with
applications to online indexing. SIAM Journal on Computing 43(4), 1396–1416
(2014)
3. An, M., Wang, Y., Wang, W.: Using index in the mapreduce framework. In:
Web Conference (APWEB), 2010 12th International Asia-Pacific. pp. 52–58. IEEE
(2010)
4. Bachlechner, D., Leimbach, T.: Big data challenges: Impact, potential responses
and research needs. In: Emerging Technologies and Innovative Business Practices
for the Transformation of Societies (EmergiTech), IEEE International Conference
on. pp. 257–264. IEEE (2016)
5. Bajaber, F., Elshawi, R., Batarfi, O., Altalhi, A., Barnawi, A., Sakr, S.: Big data
2.0 processing systems: Taxonomy and open challenges. Journal of Grid Computing
14(3), 379–405 (2016)
6. Berman, J.J.: Principles of big data: preparing, sharing, and analyzing complex
information. Newnes (2013)
7. Boncz, P.A., Zukowski, M., Nes, N.: Monetdb/x100: Hyper-pipelining query exe-
cution. In: Cidr. vol. 5, pp. 225–237 (2005)
8. Chaudhuri, S., Narasayya, V.: Self-tuning database systems: a decade of progress.
In: Proceedings of the 33rd international conference on Very large data bases. pp.
3–14. VLDB Endowment (2007)
9. Chen, C.P., Zhang, C.Y.: Data-intensive applications, challenges, techniques and
technologies: A survey on big data. Information Sciences 275, 314–347 (2014)
10. Chen, M., Mao, S., Liu, Y.: Big data: A survey. Mobile Networks and Applications
19(2), 171–209 (2014)
11. Dean, J., Ghemawat, S.: Mapreduce: a flexible data processing tool. Communica-
tions of the ACM 53(1), 72–77 (2010)
12. Dias, J., Ogasawara, E., de Oliveira, D., Porto, F., Valduriez, P., Mattoso, M.:
Algebraic dataflows for big data analysis. In: Big Data, 2013 IEEE International
Conference on. pp. 150–155. IEEE (2013)
13. Fusco, F., Vlachos, M., Stoecklin, M.P.: Real-time creation of bitmap indexes on
streaming network data. The VLDB Journal: The International Journal on Very
Large Data Bases 21(3), 287–307 (2012)
14. Gani, A., Siddiqa, A., Shamshirband, S., Hanum, F.: A survey on indexing tech-
niques for big data: taxonomy and performance evaluation. Knowledge and Infor-
mation Systems 46(2), 241–284 (2016)
15. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database system implementation,
vol. 654. Prentice Hall, Upper Saddle River, NJ, 2nd edn. (2014)
16. Glombiewski, N., Seeger, B., Graefe, G.: Waves of misery after index creation.
BTW 2019 (2019)
17. Graefe, G., Idreos, S., Kuno, H., Manegold, S.: Benchmarking adaptive indexing,
pp. 169–184. Springer (2011)
18. Graefe, G., Kuno, H.: Self-selecting, self-tuning, incrementally optimized indexes.
In: Proceedings of the 13th International Conference on Extending Database Tech-
nology. pp. 371–381. ACM (2010)
19. Graefe, G., Kuno, H.: Self-selecting, self-tuning, incrementally optimized indexes.
In: Proceedings of the 13th International Conference on Extending Database Tech-
nology. pp. 371–381. ACM (2010)
20. Hong, Z., Xiao-Ming, W., Jie, C., Yan-Hong, M., Yi-Rong, G., Min, W.: A opti-
mized model for mapreduce based on hadoop. TELKOMNIKA (Telecommunica-
tion Computing Electronics and Control) 14(4) (2016)
21. Ibrahim, H., Sani, N.F.M., Yaakob, R., et al.: Analyses of indexing techniques on
uncertain data with high dimensionality. IEEE Access 8, 74101–74117 (2020)
22. Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR. vol. 7, pp.
7–10 (2017)
23. Idreos, S., Manegold, S., Kuno, H., Graefe, G.: Merging what’s cracked, cracking
what’s merged: adaptive indexing in main-memory column-stores. Proceedings of
the VLDB Endowment 4(9), 586–597 (2011)
24. John, S.: Indexing in Apache Hive - Facebook (2011), https://www.facebook.com/note.php?note_id=1015016842773390
25. Khasawneh, T.N., AL-Sahlee, M.H., Safia, A.A.: Sql, newsql, and nosql databases:
A comparative survey. In: 2020 11th International Conference on Information and
Communication Systems (ICICS). pp. 013–021 (2020)
26. Lee, S., Jo, J.Y., Kim, Y.: Performance improvement of mapreduce process by
promoting deep data locality. In: Data Science and Advanced Analytics (DSAA),
2016 IEEE International Conference on. pp. 292–301. IEEE (2016)
27. McCreadie, R., Macdonald, C., Ounis, I.: On single-pass indexing with mapreduce.
In: Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval. pp. 742–743. ACM (2009)
28. McCreadie, R., Macdonald, C., Ounis, I.: Mapreduce indexing strategies: Studying
scalability and efficiency. Information Processing and Management 48(5), 873–888
(2012)
29. Nang, J., Park, J.: An efficient indexing structure for content based multimedia
retrieval with relevance feedback. In: Proceedings of the 2007 ACM symposium on
Applied computing. pp. 517–524. ACM (2007)
30. Pirk, H., Petraki, E., Idreos, S., Manegold, S., Kersten, M.: Database cracking:
fancy scan, not poor man’s sort! In: Proceedings of the Tenth International Work-
shop on Data Management on New Hardware. p. 4. ACM (2014)
31. Ramakrishnan, R., Gehrke, J., Gehrke, J.: Database management systems, vol. 3.
McGraw-Hill New York (2010)
32. Richter, S., Quiané-Ruiz, J.A., Schuh, S., Dittrich, J.: Towards zero-overhead static
and adaptive indexing in hadoop. The VLDB Journal 23(3), 469–494 (2014)
33. Roy, S., Mitra, R.: A survey of data structures and algorithms used in the context
of compression upon biological sequence. Sustainable Humanosphere 16(1), 1951–
1963 (2020)
34. Rys, M.: Xml and relational database management systems: inside microsoft sql
server 2005. In: Proceedings of the 2005 ACM SIGMOD international conference
on Management of data. pp. 958–962. ACM (2005)
35. Sevugan, P., Shankar, K.: Spatial data indexing and query processing in geocloud.
Journal of Testing and Evaluation 47(6) (2019)
36. Shireesha, R., Bhutada, S.: A study of tools, techniques, and trends for big data
analytics. IJACTA 4(1), 152–158 (2016)
37. Silberschatz, A., Korth, H.F., Sudarshan, S., et al.: Database system concepts,
vol. 4. McGraw-Hill New York (1997)
38. Silva, Y.N., Almeida, I., Queiroz, M.: Sql: From traditional databases to big data.
In: Proceedings of the 47th ACM Technical Symposium on Computing Science
Education. pp. 413–418. ACM (2016)
39. Sozykin, A., Epanchintsev, T.: Mipr-a framework for distributed image processing
using hadoop. In: Application of Information and Communication Technologies
(AICT), 2015 9th International Conference on. pp. 35–39. IEEE (2015)
40. Statista: Volume of data worldwide from 2010-2025.
https://www.statista.com/statistics/871513/worldwide-data-created/ (2020)
41. Wu, K., Otoo, E., Shoshani, A.: On the performance of bitmap indices for high
cardinality attributes. In: Proceedings of the Thirtieth international conference on
Very large data bases-Volume 30. pp. 24–35. VLDB Endowment (2004)
42. Yang, H.C., Parker, D.S.: Traverse: simplified indexing on large map-reduce-merge
clusters. In: International Conference on Database Systems for Advanced Applica-
tions. pp. 308–322. Springer (2009)
43. Ydraios, E., et al.: Database cracking: towards auto-tunning database kernels. SIKS
(2010)
44. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues in Information Systems
16(2), 81–90 (2015)
45. Zhang, Q., He, A., Liu, C., Lo, E.: Closest interval join using mapreduce. In: Data
Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on.
pp. 302–311. IEEE (2016)
46. Zhang, Y., Ren, J., Liu, J., Xu, C., Guo, H., Liu, Y.: A survey on emerging com-
puting paradigms for big data. Chinese Journal of Electronics 26(1) (2017)
47. Zikopoulos, P., Eaton, C.: Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media (2011)
