Abstract

Many NLP applications require classifying texts by their semantic distance (how similar or different the texts are). For example, comparing the text of a new document to that of documents of known topics can help identify the topic of the new document. Typically, a distributional distance is used to capture the implicit semantic distance between two pieces of text. However, such approaches do not take into account the semantic relations between words. In this article, we introduce an alternative method of measuring the semantic distance between texts that integrates distributional information and ontological knowledge within a network flow formalism. We first represent each text as a collection of frequency-weighted concepts within an ontology. We then use a network flow method, which provides an efficient way of explicitly measuring the frequency-weighted ontological distance between the concepts across two texts. We evaluate our method on three NLP tasks and find that it performs well on two of them. We develop a new measure of semantic coherence that enables us to account for the performance difference across the three data sets, shedding light on the properties that make a data set well suited to our method.
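To make the idea of a frequency-weighted ontological distance concrete, the following is a minimal sketch and not the authors' formulation. It assumes a toy ontology given as weighted undirected edges and represents each text as a hypothetical dictionary of concept counts. Each profile is normalized to unit mass, and a generic minimum-cost flow (successive shortest paths) moves the mass of one profile onto the other along ontology edges. The function names `min_cost_flow` and `semantic_distance`, the concepts, and the unit edge weights are all illustrative assumptions.

```python
from collections import deque
from fractions import Fraction

def min_cost_flow(n, arcs, supply):
    """Minimum cost of routing integer supply (>0) to demand (<0) over directed arcs (u, v, cost)."""
    total = sum(s for s in supply if s > 0)
    src, snk = n, n + 1
    graph = [[] for _ in range(n + 2)]  # arc = [to, residual capacity, cost, index of reverse arc]

    def add_arc(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for u, v, c in arcs:
        add_arc(u, v, total, c)  # capacity `total` never binds, so arcs are effectively uncapacitated
    for node, s in enumerate(supply):
        if s > 0:
            add_arc(src, node, s, 0)
        elif s < 0:
            add_arc(node, snk, -s, 0)

    flow = cost = 0
    inf = float("inf")
    while flow < total:
        # Queue-based Bellman-Ford: residual arcs may carry negative cost.
        dist = [inf] * (n + 2)
        prev = [None] * (n + 2)
        queued = [False] * (n + 2)
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            queued[u] = False
            for i, (v, cap, c, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + c < dist[v]:
                    dist[v] = dist[u] + c
                    prev[v] = (u, i)
                    if not queued[v]:
                        queue.append(v)
                        queued[v] = True
        if dist[snk] == inf:
            raise ValueError("some demand is unreachable in the ontology")
        # Find the bottleneck on the cheapest path, then push that much flow along it.
        push, v = total - flow, snk
        while v != src:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = snk
        while v != src:
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        flow += push
        cost += push * dist[snk]
    return cost

def semantic_distance(ontology_edges, profile_a, profile_b):
    """Flow-based distance between two concept-count profiles, each normalized to unit mass."""
    concepts = sorted({c for u, v, _ in ontology_edges for c in (u, v)})
    index = {c: i for i, c in enumerate(concepts)}
    na, nb = sum(profile_a.values()), sum(profile_b.values())
    supply = [0] * len(concepts)
    for c, f in profile_a.items():
        supply[index[c]] += f * nb  # cross-scaling keeps both totals equal and integral
    for c, f in profile_b.items():
        supply[index[c]] -= f * na
    arcs = []
    for u, v, w in ontology_edges:
        arcs += [(index[u], index[v], w), (index[v], index[u], w)]
    return Fraction(min_cost_flow(len(concepts), arcs, supply), na * nb)

# Hypothetical toy ontology with unit-length edges.
ONTOLOGY = [("entity", "animal", 1), ("entity", "artifact", 1),
            ("animal", "dog", 1), ("animal", "cat", 1), ("artifact", "car", 1)]
text_a = {"dog": 2, "cat": 1}
text_b = {"cat": 1, "car": 2}
print(semantic_distance(ONTOLOGY, text_a, text_b))  # 8/3
```

In this toy case the shared "cat" mass cancels, and the remaining two thirds of the mass travels four edges from "dog" to "car", giving 8/3. A purely distributional comparison would treat "dog" and "car" as simply different, whereas the flow cost reflects how far apart they sit in the ontology.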

Author notes

* Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada. E-mail: vyctsang@cs.toronto.edu.

** Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada. E-mail: suzanne@cs.toronto.edu.