Entity Resolution for Large-Scale Databases PDF Download


Entity Resolution for Large-Scale Databases

Entity Resolution for Large-Scale Databases PDF Author: Kunho Kim
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Entity resolution is the problem of identifying, matching, and grouping records that refer to the same entity within a single data collection or across multiple collections. Real-world databases often comprise data from multiple sources; hence, this process is an essential preprocessing step for correctly answering queries about a particular entity. An example of entity resolution is finding a person's medical records across multiple hospitals' record systems. Two main problems commonly arise in entity resolution. One is disambiguation (or deduplication), which involves clustering records that correspond to the same entity within a database. The other is record linkage, which involves matching records between multiple databases. In this dissertation, we study several aspects of entity resolution on large-scale structured data such as CiteSeerX, PubMed, and the United States Patent and Trademark Office (USPTO) patent database. First, we review our proposed entity resolution framework and discuss how to apply it to two practical problems: inventor name disambiguation on the USPTO patent database and financial entity record linkage. Second, we investigate building a web service that makes entity resolution results easier to use in several scenarios. We define two types of queries, attribute-based and record-based, and discuss how we design the web service to handle them efficiently. We demonstrate that our algorithm accelerates record-based queries by a factor of 4.01 over a naive baseline. Third, we discuss improving entity resolution in two directions. One direction is improving the blocking method to reduce unnecessary comparisons and thus improve scalability on author name disambiguation problems. We show that our proposed conjunctive normal form (CNF) blocking, tested on the entire PubMed database of 80 million author mentions, efficiently removes 82.17% of all author record pairs.
The other direction is improving accuracy: we study enhancing pairwise classification, which estimates the probability that a pair of records refers to the same named entity. Our proposed hybrid method, using both structure-aware and global features, improves mean average precision by up to 7.45 percentage points. Finally, we discuss entity and attribute extraction. Entity extraction is important for improving the quality of the input data for entity resolution and can also be used to extract useful entities from external sources. In this dissertation, we study the problem of extracting entities for task-oriented spoken language understanding in human-to-human conversation scenarios. Our proposed bidirectional LSTM architecture, with supplemental knowledge extracted from web data, search engine query logs, prior sentences, and task transfer, demonstrates an improvement in F1-score of up to 2.92% over existing approaches.
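The blocking idea described in this abstract can be sketched in a few lines. The records and the simple last-name-plus-first-initial key below are invented for illustration; they are not the dissertation's CNF blocking scheme or its data:

```python
from itertools import combinations

# Toy author records; fields are illustrative, not from the dissertation's data.
records = [
    {"id": 1, "name": "J. Smith",   "affil": "Penn State"},
    {"id": 2, "name": "John Smith", "affil": "Penn State"},
    {"id": 3, "name": "J. Smith",   "affil": "MIT"},
    {"id": 4, "name": "A. Jones",   "affil": "MIT"},
]

def blocking_key(rec):
    """Block on last name + first initial, a common author-disambiguation key."""
    parts = rec["name"].replace(".", "").split()
    return (parts[-1].lower(), parts[0][0].lower())

# Group records into blocks; only records sharing a key are ever compared.
blocks = {}
for rec in records:
    blocks.setdefault(blocking_key(rec), []).append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]

all_pairs = len(records) * (len(records) - 1) // 2  # 6 pairs without blocking
print(candidate_pairs)  # → [(1, 2), (1, 3), (2, 3)]
print(f"removed {all_pairs - len(candidate_pairs)} of {all_pairs} pairs")
```

The pair-removal rate reported here for the toy key is the same quantity the dissertation measures for CNF blocking on PubMed (82.17% of all author record pairs removed), just at miniature scale.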

Entity Resolution in the Web of Data

Entity Resolution in the Web of Data PDF Author: Vassilis Christophides
Publisher: Springer Nature
ISBN: 3031794680
Category : Mathematics
Languages : en
Pages : 106

Book Description
In recent years, several knowledge bases have been built to enable large-scale knowledge sharing, as well as entity-centric Web search that mixes structured data and text querying. These knowledge bases offer machine-readable descriptions of real-world entities, e.g., persons and places, published on the Web as Linked Data. However, due to the different information extraction tools and curation policies employed by knowledge bases, multiple, complementary, and sometimes conflicting descriptions of the same real-world entities may be provided. Entity resolution aims to identify different descriptions that refer to the same entity, appearing either within or across knowledge bases. The objective of this book is to present the new entity resolution challenges stemming from the openness of the Web of data, in which entities are described by an unbounded number of knowledge bases; the semantic and structural diversity of the descriptions provided across domains, even for the same real-world entities; and the autonomy of knowledge bases in terms of the processes adopted for creating and curating entity descriptions. The scale, diversity, and graph structuring of entity descriptions in the Web of data fundamentally challenge how two descriptions can be effectively compared for similarity, and also how resolution algorithms can efficiently avoid examining all descriptions pairwise. The book covers a wide spectrum of entity resolution issues at Web scale, including basic concepts and data structures, main resolution tasks and workflows, and state-of-the-art algorithmic techniques and experimental trade-offs.

Entity Resolution and Information Quality

Entity Resolution and Information Quality PDF Author: John R. Talburt
Publisher: Elsevier
ISBN: 0123819733
Category : Computers
Languages : en
Pages : 254

Book Description
Entity Resolution and Information Quality presents topics and definitions, and clarifies confusing terminology regarding entity resolution and information quality. It takes a very wide view of IQ, including its six-domain framework and the skills defined by the International Association for Information and Data Quality (IAIDQ). The book includes chapters covering the principles of entity resolution and the principles of information quality, along with their concepts and terminology. It also discusses the Fellegi-Sunter theory of record linkage, the Stanford Entity Resolution Framework, and the Algebraic Model for Entity Resolution, the major theoretical models that support entity resolution. Relatedly, the book briefly discusses entity-based data integration (EBDI) and its model, which serves as an extension of the Algebraic Model for Entity Resolution. There is also an explanation of how three commercial ER systems operate and a description of the non-commercial open-source system known as OYSTER. The book concludes by discussing trends in entity resolution research and practice. Students taking IT courses and IT professionals will find this book invaluable. It is the first authoritative reference explaining entity resolution and how to use it effectively, provides practical system design advice to help you gain a competitive advantage, and includes a companion site with synthetic customer data for hands-on exercises and access to a Java-based entity resolution program.
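The Fellegi-Sunter theory mentioned above scores a candidate record pair by summing per-field log-likelihood weights, then compares the total to thresholds. A minimal sketch follows; the m- and u-probabilities and fields are assumed for illustration, not taken from the book:

```python
import math

# Illustrative probabilities (assumptions, not the book's values):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are a non-match)
fields = {
    "surname":    {"m": 0.95, "u": 0.01},
    "given_name": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.85, "u": 0.02},
}

def field_weight(m, u, agrees):
    """Fellegi-Sunter log-likelihood weight for one field comparison:
    positive when agreement is evidence for a match, negative otherwise."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

def match_score(agreement):
    """Sum of per-field weights; in practice compared against upper and
    lower thresholds to decide match / possible match / non-match."""
    return sum(field_weight(fields[f]["m"], fields[f]["u"], agrees)
               for f, agrees in agreement.items())

# Two records agreeing on surname and birth year but not given name:
score = match_score({"surname": True, "given_name": False, "birth_year": True})
print(round(score, 2))  # → 8.73
```

The thresholds themselves are chosen from acceptable false-match and false-non-match error rates, which is the core of the Fellegi-Sunter decision model.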

The Four Generations of Entity Resolution

The Four Generations of Entity Resolution PDF Author: George Papadakis
Publisher: Springer Nature
ISBN: 3031018788
Category : Computers
Languages : en
Pages : 152

Book Description
Entity Resolution (ER) lies at the core of data integration and cleaning, and thus a large body of research examines ways of improving its effectiveness and time efficiency. The initial ER methods primarily target Veracity in the context of structured (relational) data described by a schema of well-known quality and meaning. To achieve high effectiveness, they leverage schema, expert, and/or external knowledge. Some of these methods have been extended to address Volume, processing large datasets through multi-core or massively parallel approaches, such as the MapReduce paradigm. However, these early schema-based approaches are inapplicable to Web data, which abound in voluminous, noisy, semi-structured, and highly heterogeneous information. To address the additional challenge of Variety, recent works on ER adopt a novel, loosely schema-aware functionality that emphasizes scalability and robustness to noise. Another line of present research focuses on the additional challenge of Velocity, aiming to process data collections of a continuously increasing volume. The latest works, though, take advantage of the significant breakthroughs in Deep Learning and Crowdsourcing, incorporating external knowledge to enhance existing works to a significant extent. This synthesis lecture organizes ER methods into four generations based on the challenges posed by these four Vs. For each generation, we outline the corresponding ER workflow, discuss the state-of-the-art methods per workflow step, and present current research directions. The discussion of these methods takes a historical perspective, explaining the evolution of the methods over time along with their similarities and differences. The lecture also discusses the available ER tools and benchmark datasets that allow expert as well as novice users to make use of the available solutions.
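The loosely schema-aware functionality described for Web data is typified by token blocking, in which every token of every attribute value becomes a block key, so no schema knowledge is required. A minimal sketch with invented entity profiles (not an example from the lecture itself):

```python
# Schema-agnostic token blocking: records with heterogeneous attribute
# names still end up in the same block if any of their values share a token.
profiles = [
    (1, {"name": "Barack Obama", "job": "president"}),
    (2, {"fullName": "B. Obama", "role": "US president"}),
    (3, {"name": "Michelle Obama"}),
]

blocks = {}
for pid, attrs in profiles:
    # Tokenize every value regardless of which attribute it came from.
    tokens = {tok.lower().strip(".")
              for value in attrs.values()
              for tok in value.split()}
    for tok in tokens:
        blocks.setdefault(tok, set()).add(pid)

# Keep only blocks that actually generate comparisons.
blocks = {tok: ids for tok, ids in blocks.items() if len(ids) > 1}
print(blocks)  # → {'obama': {1, 2, 3}, 'president': {1, 2}}
```

The price of this robustness is redundancy (the same pair may co-occur in several blocks), which is why the literature pairs token blocking with block-purging and meta-blocking steps.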

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX PDF Author: Abdelkader Hameurlain
Publisher: Springer
ISBN: 3662540371
Category : Computers
Languages : en
Pages : 142

Book Description
The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing across different sites connected through networks has led data- and knowledge-management systems to evolve from centralized systems to decentralized systems that enable large-scale distributed applications with high scalability. Current decentralized systems still focus on data and knowledge as their main resource. The feasibility of these systems relies largely on P2P (peer-to-peer) techniques and on the support of agent systems with scaling and decentralized control. Synergy between grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments. This, the 29th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains four revised selected regular papers. Topics covered include optimization and cluster validation processes for entity matching, business intelligence systems, and data profiling in the Semantic Web.

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII PDF Author: Abdelkader Hameurlain
Publisher: Springer
ISBN: 3662583844
Category : Computers
Languages : en
Pages : 173

Book Description
This, the 38th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains extended and revised versions of six papers selected from the 68 contributions presented at the 27th International Conference on Database and Expert Systems Applications, DEXA 2016, held in Porto, Portugal, in September 2016. Topics covered include query personalization in databases, data anonymization, similarity search, computational methods for entity resolution, array-based computations in big data analysis, and pattern mining.

Advances in Databases and Information Systems

Advances in Databases and Information Systems PDF Author: Tatjana Welzer
Publisher: Springer Nature
ISBN: 3030287300
Category : Computers
Languages : en
Pages : 463

Book Description
This book constitutes the proceedings of the 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019, held in Bled, Slovenia, in September 2019. The 27 full papers presented were carefully reviewed and selected from 103 submissions. The papers cover a wide range of topics from different areas of research in database and information systems technologies and their advanced applications from theoretical foundations to optimizing index structures. They focus on data mining and machine learning, data warehouses and big data technologies, semantic data processing, and data modeling. They are organized in the following topical sections: data mining; machine learning; document and text databases; big data; novel applications; ontologies and knowledge management; process mining and stream processing; data quality; optimization; theoretical foundation and new requirements; and data warehouses.

Innovative Techniques and Applications of Entity Resolution

Innovative Techniques and Applications of Entity Resolution PDF Author: Hongzhi Wang
Publisher: Information Science Reference
ISBN: 9781466652019
Category : Data mining
Languages : en
Pages : 414

Book Description
"This book draws upon interdisciplinary research on tools, techniques, and applications of entity resolution and provides a detailed analysis of entity resolution applied to various types of data as well as appropriate techniques and applications"--

Advances in Databases and Information Systems

Advances in Databases and Information Systems PDF Author: Jaroslav Pokorný
Publisher: Springer
ISBN: 331944039X
Category : Computers
Languages : en
Pages : 358

Book Description
This book constitutes the thoroughly refereed proceedings of the 20th East European Conference on Advances in Databases and Information Systems, ADBIS 2016, held in Prague, Czech Republic, in August 2016. The 21 full papers presented together with two keynote papers and one keynote abstract were carefully selected and reviewed from 85 submissions. The papers are organized in topical sections such as data quality, mining, analysis and clustering; model-driven engineering, conceptual modeling; data warehouse and multidimensional modeling, recommender systems; spatial and temporal data processing; distributed and parallel data processing; internet of things and sensor networks.

Functional Future for Bibliographic Control

Functional Future for Bibliographic Control PDF Author: Shawne D. Miksa
Publisher: Routledge
ISBN: 1351566202
Category : Language Arts & Disciplines
Languages : en
Pages : 279

Book Description
The quest to evolve bibliographic control to an equal or greater standing within the current information environment is on-going. As information organizers we are working in a time when information and communication technology (ICT) has pushed our status quo to its limits and when innovation often needs do-or-die pressure in order to get started. The year 2010 was designated the Year of Cataloging Research, and we made progress on studying the challenges facing metadata and information organization practices. However, one year of research is merely a drop in the bucket, especially given the results of the Resource Description and Access (RDA) National Test and the Library of Congress' decision to investigate the possibility of transitioning away from the MARC21 format. This book addresses how information professionals can create a functional environment in which we move beyond just representing information resources and into an environment that both represents and connects at a deeper level. Most importantly, it offers insight on transitioning into new communities of practice and awareness by reassessing our purpose, re-charting our efforts, and reasserting our expertise in the areas that information organizers have traditionally claimed but are losing due to stagnation and lack of vision. This book was published as a double special issue of the Journal of Library Metadata.

Data Matching

Data Matching PDF Author: Peter Christen
Publisher: Springer Science & Business Media
ISBN: 3642311644
Category : Computers
Languages : en
Pages : 279

Book Description
Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains including applied statistics, health informatics, data mining, machine learning, artificial intelligence, database management, and digital libraries, significant advances have been achieved over the last decade in all aspects of the data matching process, especially on how to improve the accuracy of data matching, and its scalability to large databases. Peter Christen’s book is divided into three parts: Part I, “Overview”, introduces the subject by presenting several sample applications and their special challenges, as well as a general overview of a generic data matching process. Part II, “Steps of the Data Matching Process”, then details its main steps like pre-processing, indexing, field and record comparison, classification, and quality evaluation. Lastly, part III, “Further Topics”, deals with specific aspects like privacy, real-time matching, or matching unstructured data. Finally, it briefly describes the main features of many research and open source systems available today. By providing the reader with a broad range of data matching concepts and techniques and touching on all aspects of the data matching process, this book helps researchers as well as students specializing in data quality or data matching aspects to familiarize themselves with recent research advances and to identify open research challenges in the area of data matching. To this end, each chapter of the book includes a final section that provides pointers to further background and research material. Practitioners will better understand the current state of the art in data matching as well as the internal workings and limitations of current systems. 
In particular, they will learn that it is often not feasible to simply deploy an existing off-the-shelf data matching system without substantial adaptation and customization. Such practical considerations are discussed for each of the major steps in the data matching process.
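The main steps of the generic data matching process that Part II details (pre-processing, indexing, field comparison, classification) can be sketched end to end. The data, similarity measure, and threshold below are illustrative assumptions, not the book's prescriptions:

```python
from difflib import SequenceMatcher
from itertools import combinations

db = [
    {"id": "a1", "name": "Kathryn Smith", "city": "Canberra"},
    {"id": "a2", "name": "Kathrin Smith", "city": "Canberra"},
    {"id": "a3", "name": "Peter Jones",   "city": "Sydney"},
]

def preprocess(rec):
    # Step 1: pre-processing -- normalise case and whitespace.
    return {k: v.strip().lower() for k, v in rec.items()}

def block_key(rec):
    # Step 2: indexing -- only records in the same city are compared.
    return rec["city"]

def compare(r1, r2):
    # Step 3: field comparison -- approximate string similarity on names.
    return SequenceMatcher(None, r1["name"], r2["name"]).ratio()

def classify(sim, threshold=0.85):
    # Step 4: classification -- a simple threshold decides match/non-match.
    return sim >= threshold

clean = [preprocess(r) for r in db]
blocks = {}
for rec in clean:
    blocks.setdefault(block_key(rec), []).append(rec)

matches = [(a["id"], b["id"])
           for block in blocks.values()
           for a, b in combinations(block, 2)
           if classify(compare(a, b))]
print(matches)  # → [('a1', 'a2')]
```

Step 5 of the process, quality evaluation, would compare such output against ground-truth match labels, which is where the book's discussion of evaluation measures comes in.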