Abstract
Although the concept of a knowledge graph has been discussed since at least 1972 [93], it was not until
Google [96] unveiled its version in 2012 that the idea truly took off. Since then, several companies have
started working on their own knowledge graphs, including Google, Amazon [54], eBay [82], Twitter,
IBM [21], LinkedIn [37], Microsoft [95], and Uber. The number of academic publications devoted to
knowledge graphs has also risen in recent years, reflecting the growing interest in the idea. Several
books [79, 85, 52] and papers [84, 24] describe knowledge graphs, novel techniques for generating and
analysing them, and assessments of various aspects of knowledge graphs [103].
Apart from these enterprise knowledge graphs, which are private and not accessible to the public, a
number of public knowledge graphs have been published whose content is accessible to users of the web.
Such publicly available knowledge graphs are called open KGs. DBpedia [58], YAGO [40], Freebase [101],
and Wikidata [8] are among the best-known examples of such open KGs, which are multi-domain and are
built using data from Wikipedia.
Automatic extraction of information from text and its transformation into a structured format is an
important goal in both Semantic Web research and computational linguistics. Knowledge Graphs (KGs)
serve as an intuitive way to provide structure to unstructured text. A fact in a KG is expressed in
the form of a triple, which captures entities and their interrelationships (predicates). Multiple triples
extracted from text can be semantically identical yet use different vocabulary, which can lead to an
explosion in the number of redundant triples. Hence, to bridge this vocabulary gap, triples need to be
mapped to a homogeneous namespace. In this work, we present an end-to-end KG construction system,
which identifies and extracts entities and relationships from text and maps them to the homogeneous
DBpedia namespace. For predicate mapping, we propose a deep learning architecture to model semantic
similarity. This mapping step is computationally expensive, owing to the large number of triples in
DBpedia; we identify and prune unnecessary comparisons to make it scalable. Our experiments show
that the proposed approach constructs a richer KG at a significantly lower computational cost than
previous work.
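
To make the vocabulary gap concrete, the sketch below (in Python, with hypothetical names throughout; a simple token-overlap score stands in for our deep semantic-similarity model so that the example is self-contained and runnable) maps surface-form predicates to a DBpedia predicate:

    # Hedged sketch of predicate mapping. DBPEDIA_PREDICATES is a hypothetical
    # subset of DBpedia predicate labels; token overlap is only a stand-in for
    # the learned semantic-similarity model proposed in this work.

    def token_overlap(a: str, b: str) -> float:
        """Jaccard overlap between the token sets of two phrases."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    DBPEDIA_PREDICATES = ["birth place", "death place", "spouse", "employer"]

    def map_predicate(surface: str, threshold: float = 0.5):
        """Map a surface-form predicate to its closest DBpedia predicate."""
        best = max(DBPEDIA_PREDICATES, key=lambda p: token_overlap(surface, p))
        return best if token_overlap(surface, best) >= threshold else None

    # Two semantically identical triples separated by a vocabulary gap:
    #   ("Barack Obama", "place of birth", "Honolulu")
    #   ("Barack Obama", "was born in",    "Honolulu")
    print(map_predicate("place of birth"))  # "birth place": lexical overlap suffices
    print(map_predicate("was born in"))     # None: no shared tokens, the exact gap
                                            #       a learned similarity model closes

A naive mapper performs one such comparison against every candidate predicate in DBpedia; pruning unnecessary comparisons, as described above, is what keeps this step scalable.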
In recent years, document similarity has become the foundation of various natural language processing
tasks that are crucial to information retrieval, automatic question answering, machine translation,
dialogue systems, and document matching. For document or topic similarity, focusing on the semantics
within the text has been the most common and actively pursued direction of effort. Our novel KG-based
similarity classifier addresses the limitations of previous approaches and also provides the reasoning
behind its classification. Our results show that we are able to score the similarity between Wikipedia
documents accurately. Furthermore, the accuracy of our approach is robust to noise and paraphrasing.
We also show that our classifier can be used for category outlier detection in DBpedia.
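
As a purely illustrative sketch (our actual classifier is more involved; the scoring function and names below are hypothetical), a KG-based similarity score can both quantify the overlap between two documents' mapped triple sets and return the shared triples as its reasoning:

    # Hedged sketch: Jaccard overlap between two documents' DBpedia-mapped
    # triple sets, returning the shared triples as the explanation.

    def kg_similarity(triples_a, triples_b):
        a, b = set(triples_a), set(triples_b)
        score = len(a & b) / len(a | b) if a | b else 0.0
        return score, a & b  # the shared facts justify the score

    doc1 = {("Honolulu", "country", "United States"),
            ("Honolulu", "leaderTitle", "Mayor")}
    doc2 = {("Honolulu", "country", "United States"),
            ("Hawaii", "capital", "Honolulu")}

    score, evidence = kg_similarity(doc1, doc2)
    print(round(score, 2))  # 0.33: one shared triple out of three distinct ones
    print(evidence)         # {('Honolulu', 'country', 'United States')}

Because the verdict comes with explicit shared triples, a document whose triples barely overlap with those of its peers can be flagged, which is the intuition behind the category outlier detection mentioned above.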
In this thesis, we focus on knowledge graph construction from unstructured text and on document
similarity. Our knowledge graph construction approach performs more than 25% better than cosine-
and rule-based approaches, and is also computationally cheaper than the state-of-the-art T2KG [53]
by a factor of at least 10^6. We use these constructed knowledge graphs to classify the similarity
between two documents, which in turn is used to detect category outliers in DBpedia and to find
highly similar documents.