Using graph distances for named-entity linking

Blanco, Roi; Boldi, Paolo; Marino, Andrea

doi:10.1016/j.scico.2015.10.013

Entity linking is a natural-language-processing task that consists of identifying strings of text that refer to a particular item in some reference knowledge base. When the knowledge base is Wikipedia, the problem is also referred to as wikification (in this case, items are Wikipedia articles). Entity linking conceptually consists of several phases: identifying the portions of text that may refer to an entity (sometimes called “entity detection”), determining a set of concepts (candidates) from the knowledge base that may match each such portion, and choosing one candidate from each set; this last step, known as candidate selection, is the phase on which this paper focuses. One instance of candidate selection can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between the selected items. Inspired by this application, we define a new graph problem that is a natural variant of the Maximum Capacity Representative Set problem. We prove that our problem is NP-hard for general graphs, propose several heuristics that optimize similar but easier objective functions, and show experimentally how these approaches perform with respect to some baselines on a real-world dataset. Finally, in the appendix, we present an exact linear-time algorithm that works under more restrictive assumptions.
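The candidate-selection formulation in the abstract can be illustrated with a minimal sketch. It is not one of the paper's heuristics or its exact algorithm: it is a plain exhaustive search, and it assumes that the concept graph is undirected and unweighted, that distance means hop count, and that the average pairwise distance is to be minimized. The toy graph, the entity names, and the functions `build_graph`, `bfs_distances`, and `select_candidates` are hypothetical and exist only for this illustration.

```python
from collections import deque
from itertools import combinations, product


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def select_candidates(graph, candidate_sets):
    """Pick one candidate per set, minimizing the average pairwise distance.

    Brute force over every combination: exponential in the number of
    mentions, so only usable on tiny inputs. Unreachable pairs count as
    infinitely far apart (an assumption of this sketch).
    """
    nodes = {c for candidates in candidate_sets for c in candidates}
    dist = {n: bfs_distances(graph, n) for n in nodes}
    best_choice, best_score = None, float("inf")
    for choice in product(*candidate_sets):
        pairs = list(combinations(choice, 2))
        if pairs:
            total = sum(dist[u].get(v, float("inf")) for u, v in pairs)
            score = total / len(pairs)
        else:
            score = 0.0  # a single mention has no pairs to compare
        if best_choice is None or score < best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score


# Hypothetical toy concept graph.
graph = build_graph([
    ("Paris_(Texas)", "Texas"),
    ("Austin_(Texas)", "Texas"),
    ("Houston_(Texas)", "Texas"),
    ("Paris_(France)", "France"),
    ("Austin_Powers", "Film"),
    ("Texas", "United_States"),
    ("France", "United_States"),
    ("Film", "United_States"),
])

# One candidate set per detected mention: "Paris", "Austin", "Houston".
candidate_sets = [
    ["Paris_(France)", "Paris_(Texas)"],
    ["Austin_(Texas)", "Austin_Powers"],
    ["Houston_(Texas)"],
]

choice, score = select_candidates(graph, candidate_sets)
print(choice, score)
```

On this toy input the three Texas-related candidates sit two hops from one another through the shared `Texas` node, so they form the selection with the smallest average distance. The number of combinations grows as the product of the candidate-set sizes, which is why exhaustive search is only viable for very small instances.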

Using graph distances for named-entity linking / Blanco Roi, Boldi Paolo, Marino Andrea. - In: SCIENCE OF COMPUTER PROGRAMMING. - ISSN 0167-6423. - ELECTRONIC. - (2016), pp. 1-13. [10.1016/j.scico.2015.10.013]