Tesis "Design and Implementation of a Scalable Method for Indexing Distributed Data Regardless of its Structure"
Alumno: Jaime Iván López Veyna
Asesor: Dr. Víctor Jesús Sosa Sosa
Sinodales: Dr. Hiram Galeana Zapién, Dr. José Guadalupe Rodríguez García, Dra. Sonia Mendoza Chapa, Dr. Edgar Iván Benítez Guerrero
Keyword Search has been recognized as a viable alternative for information search in semistructured and structured data sources. Examples of keyword-search techniques over relational databases are: Steiner Trees, Candidate Networks, Tuple Units and Virtual Documents. The acceptance of these techniques is limited due to the scalability issues that arise when they are implemented on large datasets. For instance, it is demonstrated that a Steiner Tree approach most of the time implies a NP-hard problem.
As another option, Candidate Networks usually require high performance equipment to evaluate all possible Candidates generated from a query. The Virtual Document (VD) approach produces efficient and accurate results, but with high storage requirements caused by the excessive redundancy generated when processing large datasets. The results of my research work whose main aim was the design and implementation of a novel method for indexing large datasets regardless of its structure (unstructured, semi-structured and structured) for a keyword-based search engine are presented.
The design and implementation of two keyword-based searching platforms (KESOSD and K-Searcher) that integrate a novel method for indexing large information sources comprising structured, semi-structured and unstructured datasets summarize the main contributions of this research work. KESOSD (Keyword Search Over Structured Data) works with structured and semi-structured information. It models the information as data graphs and proposes the use of the Virtual Document approach. An Inverted Index of Virtual Documents called VDII was implemented. The VDII index is created from a set of logical structures called Virtual Documents that capture the implicit structural relationships (semantics) depicted in the schemas of the structured and semi-structured data sources.
KESOSD produces fast and accuracy responses for the experiments with the IMDB and DBLP databases. The second Keyword Search platform (K-Searcher) takes advantage of the Virtual Documents technique, but alleviating some limitations detected in the KESOD approach. KSearcher includes unstructured data in its data model. It uses data graphs for modeling unstructured, semi-structured and structured data, taking all of the datasets as one integrated data source. KSearcher presents an algorithm to detect and create the Maximal Virtual Documents in order to reduce the redundancy detected in the creation of Virtual Documents. Also K-Searcher modifies its index structure with the design and implementation of a new Virtual Document Structure Aware Inverted Index (VDSAII).
This new index is processed over the Hadoop distributed platform and it is stored in an HBase database. We conducted some experiments with four real datasets. The results with the IMDB, DBLP, Mondial and Wikipedia datasets show how K-Searcher achieves high efficiency and accuracy responses. Our two approaches have demonstrated that including structural information in an index structured that is created from a Virtual Document approach produces important advantages in terms of efficiency and effectiveness. K-Searcher is prepared to scale out and manage datasets with millions of tuples without degrading its eectiveness and is capable to index data regardless of their structure. It was tested in a big-scale environment, using a single computer or a cluster of several computers, showing that with minimal changes in the configuration K-Searcher is ready to scale and work in these environments.