At Identrics we tackled one of the most challenging problems in Natural Language Processing (NLP): extracting semantic relationships between entities in text. More specifically, we became interested in discovering the relations between event spans and their dependent attribute entities: subjects, objects and dates.
The most common use cases we deal with in our R&D projects and client solutions require extracting some key semantic information conveyed by a document or a piece of text. This information can often be thought of as the answer to a general binary or categorical question, such as “Is this document relevant to topic X or not?”, “What is the overall sentiment expressed in the text towards its main subject?” or “Which domain does this article belong to?”.
For these kinds of problems the Force is strong with NLP: there already exists a myriad of text classification solutions, based both on statistical learning theory, such as Support Vector Machines (SVMs), and on piecewise objective-function optimisation techniques, better known as Artificial Neural Networks (ANNs). Which method performs best mostly depends on the size and properties of the dataset in question. One can usually find a good-enough fit for the classification task at hand by trying out several approaches and choosing the best trade-off between performance and complexity, based on the service requirements and the available computational resources.
However, when it comes to answering more specific semantic questions, this is not quite the case anymore. For example, one might be interested in asking “Who are the key figures in the story told by this article?”, “What sentiment is expressed towards them?” and “How do they relate to each other and to other mentioned entities?”, or “What are the main events and phenomena discussed in this document?”, “What are their relationships?”, “Which is the subject causing them and which is the object of their impact?”.
Answering any of these correctly seems to require very particular insight into the contents and meaning of the target text. Learning the language features that appear most relevant for sorting a piece of text into some category is simply not enough, yet this is exactly what text classifiers mostly rely on to answer yes-no or multiple-choice questions.
Recent advances in machine learning (ML) for NLP have leveraged the descriptive power of ANNs to learn key features automatically, directly from the underlying patterns in text data. Deep language models and word-embedding algorithms are steadily becoming the go-to tools for representing natural language. The dense vectors obtained with these methods offer computation-friendly representations of text that are sensitive to both semantics and structure. As a consequence, cutting-edge performance has been achieved on some text classification tasks, albeit at significant computational and data-resource costs. At the same time, answering some of the more elaborate questions about what is expressed in human language is becoming ever more tractable.
The problem described here was not a trivial one even with the state-of-the-art tools of deep learning, as it combines questions from several advanced NLP topics. We constructed a compound knowledge extraction pipeline that addresses each of them with some care. There were many choices and trade-offs to consider and years of experience in the data intelligence industry to take into account.
Pre-processing the texts:
In order to perform any computation on our aggregated and annotated text data, we first transform it into numeric form. Here we chose to experiment with a variety of dense text representations, learned by some of the best embedding algorithms and language models out there. We further looked for the optimal combination of these for each NLP task performed, as they can easily be concatenated into a stack for each represented text element or token.
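As an illustration of the stacking idea, here is a minimal sketch with toy stand-ins for the real embedding models, showing how per-token vectors from different sources can be concatenated into one representation (the models and dimensions are invented for the example):

```python
import numpy as np

def stack_embeddings(token, sources):
    """Concatenate the vectors produced for `token` by each embedding source."""
    return np.concatenate([source(token) for source in sources])

# Two toy "embedding models" with different dimensionalities, standing in
# for e.g. a word-level and a character-level model.
word_level = lambda tok: np.full(4, float(len(tok)))
char_level = lambda tok: np.full(3, float(ord(tok[0])))

vec = stack_embeddings("event", [word_level, char_level])
print(vec.shape)  # the stacked representation has dimension 4 + 3
```

Because the stack is a plain concatenation, any subset of representations can be swapped in or out per task without changing the downstream model interface.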
Named Entity Recognition:
Named Entity Recognition (NER) is the first technique to apply to data after the pre-processing step in our pipeline. NER is the information extraction task that seeks to locate named entity mentions in unstructured text and label them with predefined categories such as person, organization, location, time expression, quantity etc.
The purpose of extracting entities at this initial stage is to filter the candidate entities, between which we are looking to detect relationships — more on this further on.
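For illustration, here is a minimal sketch of how per-token NER tags could be decoded into entity spans that then serve as relation candidates, assuming a typical BIO tagging scheme (an assumption for the example; the exact scheme used may differ):

```python
def bio_to_spans(tags):
    """Decode BIO tags into (start, end, label) entity spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close the previous entity
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                       # entity continues
        else:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:                  # entity running to sentence end
        spans.append((start, len(tags), label))
    return spans

tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(tags))  # [(0, 2, 'PER'), (3, 4, 'ORG'), (5, 7, 'DATE')]
```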
As we were no strangers to the NER task, we went directly to our current best per-language solutions. These are mostly based on Recurrent Neural Networks (RNNs) and Conditional Random Fields (CRFs). We used publicly available NER corpora for the more common languages, such as English, Spanish, German, Russian and others. For other languages, like Chinese and Bulgarian, we had to prepare our own corpora, annotated with the entities of interest.
Relation extraction:
Predicting the relationship between two entities in text was the core problem we had to address. For the purposes of this task we researched a variety of current state-of-the-art solutions for both relation extraction (RE) and dependency parsing (DP).
We first investigated the methods for extracting part-of-speech (POS) tags and parsing their interdependencies (DP), as this was used as a baseline for RE back when NLP was dominated by computational-linguistic and statistical approaches to text processing. The idea was to obtain detailed parse trees of the constituent language elements and their underlying grammar in order to extract candidate entities for relation classification. We also experimented with chunkers for extracting noun phrases as potential entities or events.
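To illustrate the parse-tree idea, the following toy sketch extracts subject-verb-object triples from a hand-written head-indexed parse; the sentence, head indices and dependency labels are invented for the example and do not come from a real parser:

```python
def verb_arguments(tokens, heads, deps):
    """Return (subject, verb, object) triples from a head-indexed parse."""
    triples = []
    for v, tok in enumerate(tokens):
        # Find the tokens attached to position v as nominal subject and object.
        subj = next((tokens[i] for i, (h, d) in enumerate(zip(heads, deps))
                     if h == v and d == "nsubj"), None)
        obj = next((tokens[i] for i, (h, d) in enumerate(zip(heads, deps))
                    if h == v and d == "obj"), None)
        if subj and obj:
            triples.append((subj, tok, obj))
    return triples

# Toy parse: both arguments attach to the verb at index 1.
tokens = ["Identrics", "acquired", "DataCorp"]   # "DataCorp" is a made-up name
heads  = [1, 1, 1]
deps   = ["nsubj", "root", "obj"]
print(verb_arguments(tokens, heads, deps))
# [('Identrics', 'acquired', 'DataCorp')]
```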
We then followed a slightly different path, using the now-dominant deep learning (DL) techniques for text representation and named entity recognition (NER) to pick our candidates. To classify the candidates’ relationships, we trained a Convolutional Neural Network (CNN) on a dataset of sentences containing pairs of entities with annotated relation types. Still, the joint POS + DP approaches served as an inspiration for devising our semantic pipeline.
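One common ingredient of CNN-based relation classifiers is a pair of position features per token: its relative offsets to the two candidate entities, fed to the network alongside the word embeddings. The sketch below illustrates that general idea, as an assumption about this family of architectures rather than a description of our exact model:

```python
def relative_positions(n_tokens, e1_idx, e2_idx):
    """Per-token (offset to entity 1, offset to entity 2) pairs."""
    return [(i - e1_idx, i - e2_idx) for i in range(n_tokens)]

# A 5-token sentence with the candidate entities at positions 0 and 4:
print(relative_positions(5, 0, 4))
# [(0, -4), (1, -3), (2, -2), (3, -1), (4, 0)]
```

In a full model each offset would index a learned position-embedding table, letting the convolution filters weigh words by their proximity to the entity pair.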
Classification of event-entities’ subjects / objects / dates:
For this ultimate goal of the project we had to engineer a way of classifying entities in text as the corresponding subjects, objects or dates of given event entities. In order to do so, we employed the NER models from the second stage of our processing pipeline.
On the one hand, we used the deep sequence-tagging NER architectures to extract person and organisation entities as candidates for the subjects or objects of events, as well as date entities as candidates for events’ dates. This was analogous to what we did when extracting candidate entities for relation classification. On the other hand, for identifying candidate event entities, we re-trained very similar sequence-tagging models on event-annotated corpora.
Having completed all entity extraction to a reasonable degree of precision, we could then treat the task of classifying events’ contextual attributes as subjects, objects or dates very similarly to that of extracting different types of relationships between events and other entities. Only this time one entity would be fixed as the event, and we would look to classify its relationship to another candidate entity into one of three types: event-subject, event-object or event-date. Hence we used a slight modification of our relation-extraction solution to tackle this task.
The Venn diagram above depicts the native distribution of annotated event-relations in sentences at the initial stage of the data aggregation process. The three ellipses represent event-subject, event-object and event-date relationships in dark, lighter and lightest grey respectively. Their overlaps mark the cases when multiple relationships (attributes) are present for the same event-entity example.
A somewhat similar distribution persisted as the amounts of data increased, so we had to handle the classification problem by devising disjoint sets of sentences with a single type of relationship in them.
To maximize our class samples, we considered only the event-object relations from sentences containing both event-subject and event-object annotations. Similarly, we considered only the event-date relations from some of the annotated sentences featuring them in combination with event-subject and event-object. By focusing the model’s attention only on the event-attribute entities we care about in a given sentence, we could train the machine to recognise each of them independently. We would thereafter query it to classify every potential event-attribute candidate pair of interest separately.
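The filtering policy described above can be sketched roughly as follows; the selection rules and data layout are illustrative simplifications, not our actual preparation code:

```python
def build_disjoint_sets(sentences):
    """Split annotated sentences into disjoint single-relation training sets."""
    subj_set, obj_set, date_set = [], [], []
    for sent in sentences:
        rels = set(sent["relations"])
        # Prefer event-object when both subject and object are annotated.
        if "event-object" in rels:
            obj_set.append(sent)
        elif "event-subject" in rels:
            subj_set.append(sent)
        # Take event-date only from multi-relation sentences.
        if "event-date" in rels and len(rels) > 1:
            date_set.append(sent)
    return subj_set, obj_set, date_set

sentences = [
    {"id": 1, "relations": ["event-subject", "event-object"]},
    {"id": 2, "relations": ["event-subject"]},
    {"id": 3, "relations": ["event-subject", "event-date"]},
]
subj, obj, date = build_disjoint_sets(sentences)
print([s["id"] for s in subj], [s["id"] for s in obj], [s["id"] for s in date])
# [2, 3] [1] [3]
```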
We then focused on adapting a binary relation classifier so that we could use a larger number of examples to distinguish between the two more similar types of relationships: event-subject and event-object. We trained and optimised the model on samples from both classes, obtaining some very promising results.
We extended this model to classify event-subject and event-object relation candidates in a three-way manner, so that we could predict whether there is a relationship between events and potential subjects/objects, or none at all. Finally, we combined this classifier with a binary one that determines whether there is a relationship between event and date candidates.
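The resulting cascade can be sketched as below, with simple stubs standing in for the trained networks; all names and the toy decision rules are hypothetical:

```python
def classify_attribute(event, candidate, three_way, date_binary):
    """Route a candidate to the binary date model or the three-way model."""
    if candidate["type"] == "DATE":
        return "event-date" if date_binary(event, candidate) else "none"
    return three_way(event, candidate)  # 'event-subject', 'event-object' or 'none'

# Stub models standing in for the trained classifiers.
three_way = lambda e, c: "event-subject" if c["text"] == "Identrics" else "none"
date_binary = lambda e, c: True

event = {"text": "acquisition"}
print(classify_attribute(event, {"type": "ORG", "text": "Identrics"}, three_way, date_binary))
# event-subject
print(classify_attribute(event, {"type": "DATE", "text": "2020"}, three_way, date_binary))
# event-date
```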
Conclusion and future work:
Our pipeline is designed to ultimately extract a variety of relationships between named entities in text, as well as to predict event entities’ contextual subjects, objects and dates. At this stage we have researched, integrated and validated a set of advanced NLP methods as tools for solving the underlying problems that constitute the pipeline’s tasks. In the remaining phases of the project we will focus on refining these methods and putting them all to work together, applying the semantic processing pipeline to large amounts of unstructured text data.