Thesis: "Method for Automatic Identification, Classification and Integration of Web Query Interfaces based on Domains of Interest"
Student: Heidy Marisol Marin Castro
Advisor: Dr. Víctor Jesús Sosa Sosa
Examination committee: Dr. Jesús A. González Bernal, Dr. José Guadalupe Rodríguez García, Dr. Hiram Galeana Zapién, Dr. Javier Rubio Loyola
The amount of information contained in specialized databases available on the Web (electronic commerce: hotels, airfares, car rentals; sciences: biology, mathematics, medicine, etc.) has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and is retrieved dynamically by querying these specialized databases through a special type of HTML form called a Web Query Interface (WQI). Accessing information in the Deep Web is a great challenge: the content is usually not indexed by conventional search engines (Google, Yahoo, Bing, etc.) because it is not explicitly available and can be reached only through specialized interfaces. WQIs are a means of accessing the Deep Web. Several efforts have been made to automatically identify and/or classify WQIs as information sources, and to build a single (unified) WQI that allows users to query and integrate information from different databases belonging to a specific domain. However, all these tasks remain a great challenge because of the constant growth and content heterogeneity of databases on the Web. The main objective of this thesis is to develop a method that allows the automatic identification, classification and integration of WQIs. Each task required a detailed study of the structural and semantic components contained in WQIs.
The main contributions of this thesis are: i) the development of a strategy for the automatic identification of WQIs based on a guided selection of components in HTML forms, the use of specific heuristic rules for filtering HTML forms, and machine learning techniques for classifying new forms; ii) the design and implementation of a strategy for the automatic classification of WQIs based on the construction of domain dictionaries, which addresses the problem of semantic ambiguity that appears among WQIs; iii) the development of a schema tree called Visual Reduced Tree (VR-Tree) for modeling the visual content of WQIs; and iv) the design and implementation of a strategy for the automatic integration of WQIs based on the homogenization and unification of VR-Trees. The results obtained show that the developed strategies are efficient and competitive in terms of precision and recall compared with previous works reported in the literature.