Computer Science Building Room 151
Entities and the relations among them are central to representing our knowledge about the world. This work concentrates on discovering relations from the information sources available to us, including unstructured text corpora and structured data. Our first approach is based on distant supervision, which aligns an existing knowledge base with a text corpus and thereby circumvents the need for labeled training data. We develop models that take into account the compatibilities between relation types and their argument types, known as selectional preferences. We also introduce models that address a key challenge: not every sentence mentioning an entity pair from the knowledge base expresses the relation that pair bears. Distant supervision relies on a pre-structured database that contains all the entity and relation types we care about. To discover arbitrary relation types, we explore unsupervised approaches. We develop generative models for discovering latent semantic relation clusters, and we also model the ambiguity of surface patterns. These techniques assume semantic equivalence among patterns that fall into the same cluster, an assumption that fails to represent the diversity and ambiguity of the patterns. I therefore present universal schema: the union of all relations seen among surface patterns and available structured KBs. This representation keeps the textual patterns as they are and allows us to explore their semantic meanings. We generalize these relations by learning implications among them using matrix factorization. Preliminary experimental results are encouraging. In this thesis, I propose to address the following challenges in universal schema: handling missing data, exploring tensor representations and coupled factorization models, incorporating local context information for pattern sense disambiguation, and applying the framework to knowledge representation.
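As a rough illustration of two ideas in the abstract, the sketch below aligns a toy knowledge base with toy text mentions (distant supervision) and then forms the union of surface patterns and KB relations for each entity pair (the rows of a universal schema). Everything in it is an assumption made for illustration: the entities, relation names, and patterns are hypothetical, the function names `distant_labels` and `universal_schema` are invented here, and entity recognition and linking are assumed to have been done upstream. The matrix factorization step is not shown.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (subject, relation, object) triples.
KB = [
    ("alice", "employed_by", "acme"),
    ("bob", "educated_at", "state_university"),
]

# Hypothetical corpus, already reduced to (subject, surface pattern, object).
MENTIONS = [
    ("alice", "X works for Y", "acme"),
    ("alice", "X sued Y", "acme"),         # same pair, different meaning
    ("bob", "X graduated from Y", "state_university"),
    ("carol", "X works for Y", "globex"),  # pair absent from the KB
]


def distant_labels(kb, mentions):
    """Label each mention with every KB relation that holds for its entity pair."""
    relations = defaultdict(set)
    for subj, rel, obj in kb:
        relations[(subj, obj)].add(rel)
    labeled = []
    for subj, pattern, obj in mentions:
        # Mentions whose pair is not in the KB receive no label at all.
        for rel in sorted(relations.get((subj, obj), ())):
            labeled.append((subj, pattern, obj, rel))
    return labeled


def universal_schema(kb, mentions):
    """Map each entity pair to the union of its surface patterns and KB relations."""
    rows = defaultdict(set)
    for subj, pattern, obj in mentions:
        rows[(subj, obj)].add("pattern:" + pattern)
    for subj, rel, obj in kb:
        rows[(subj, obj)].add("kb:" + rel)
    return dict(rows)


labeled = distant_labels(KB, MENTIONS)
schema = universal_schema(KB, MENTIONS)

for example in labeled:
    print(example)
for pair in sorted(schema):
    print(pair, sorted(schema[pair]))
```

In this toy data the "X sued Y" mention receives the employed_by label although the sentence does not express that relation, which is the kind of noise the abstract refers to. The pair (carol, globex) has a surface pattern but no KB relation, so its row contains a cell that a factorization model would have to infer.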
Advisor: Andrew McCallum