Our core services centre on extracting knowledge from unstructured text. To do that, we build machine learning models that make inferences about a document: what its topic is, which named entities the text mentions, what the sentiment tonality of the author is, etc.
However, the information gathered by these models is not, in itself, sufficient to explain what causes the machine to make one prediction over another, i.e. what the provenance of the prediction is. Typically, models provide inferences of differing quality, and the labels they return may vary depending on the project and the specific task they're solving.
To give an example: imagine wanting to extract named entities from a text that contains the word “China”. What you would often see is the word “China” labeled, for instance, as “LOCATION” in one passage and as “ORGANIZATION” in another within the same document. At first glance, you might think one of the two labels is wrong, but depending on the schema of the underlying model (if it assumes that states are a specialisation of the more abstract term “ORGANIZATION”), both labelings may in fact be true. The initial assumption that the model made a mistake may still be valid, but either way, you can't deduce which interpretation is true without sufficient information about the context in which the word “China” is mentioned. Without context, you simply have two alternative records in the database for the document being examined:
1. “China” is used as a geographical term
2. “China” comes across as some kind of organization
In order to build the context surrounding these terms, we've implemented a layer of annotation objects. Each annotation holds information about the span of text classified under a specific label from the model, the identifier of the model itself, the character positions of the span, and so on. We use this annotation model both for machine predictions and for human annotators' input when it is used to train machine learning models.
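In Python terms, such an annotation object can be sketched as a simple dataclass. The field names and the model identifier below are illustrative only, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One span-level prediction (or human label) over a document.

    Field names are illustrative, not our production schema.
    """
    doc_id: str    # identifier of the document the span belongs to
    text: str      # surface form of the span, e.g. "China"
    label: str     # label from the model's schema, e.g. "LOCATION"
    start: int     # character offset where the span starts
    end: int       # character offset where the span ends (exclusive)
    model_id: str  # identifier of the model (or human annotator)

# A hypothetical mention of "China" recognised as a location:
ann = Annotation(doc_id="doc-42", text="China", label="LOCATION",
                 start=117, end=122, model_id="ner-en-v1")
```

Keeping the model identifier on every annotation is what later lets us answer the provenance question: which model said what, and under which labeling schema.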
To encode that context, we started using semantic networks to track the provenance of the predictions that machine learning models generate for the documents we analyse. Real-world NLP applications use multiple models, typically trained on different tasks and for multiple languages, which naturally escalates the problem by introducing heterogeneous annotations. Additionally, these models are commonly trained on custom domain corpora and labeling systems, so their predictions need to be enriched with enough metadata to tell which model produced them.
Because annotations have to be modeled in an appropriate format for storage and external data exchange, we use the Resource Description Framework (RDF) as a standard for expressing them. We opted for RDF because its semantic features offer possibilities for knowledge modeling and representation over annotations: they can easily link the content (or portions of it) to many existing terminologies and semantic conceptualisation standards, like the FOAF, GeoNames and Dublin Core ontologies, as well as to publicly available knowledge bases like DBpedia and Wikidata.
In adopting this approach, we've integrated a middle layer that stores metadata about the provenance of annotations, i.e. a layer that records which model generated a particular prediction, and so tells us why “China” (to return to the earlier example) is an organisation rather than, say, a state, in a particular text we've processed. This also allows us to filter annotations per model, as relevant to a specific business case, and establishes a two-way communication standard between humans and machines.
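Filtering per model then becomes a straightforward query over the provenance metadata. A minimal sketch, using hypothetical in-memory records rather than our actual RDF store:

```python
# Hypothetical annotation records; in practice these would come from
# the RDF store, and the model identifiers are illustrative.
annotations = [
    {"text": "China", "label": "LOCATION",     "model_id": "ner-geo-v2"},
    {"text": "China", "label": "ORGANIZATION", "model_id": "ner-org-v1"},
]

def filter_by_model(annotations, model_id):
    """Keep only the predictions produced by one specific model."""
    return [a for a in annotations if a["model_id"] == model_id]

# A business case that only cares about the geographical reading:
geo_view = filter_by_model(annotations, "ner-geo-v2")
```

The same idea applies in the other direction: a human annotator's corrections are just another "model" whose identifier can be selected for when building training data.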
How it works:
To illustrate the use of annotations as a semantic representation, let's look at the output of our named entity recogniser (NER) on this article:
Here, our NER returns a lot of mentions of people, organisations and miscellaneous things: “George Washington” and “Charles Willson Peale” as “person”, and “Continental Army” as “organisation”, among others. The underlying model is a CRF (conditional random field), a distance similarity-based model for English with 89-percent recognition accuracy against traditional media texts.
So, assuming that we generate unique identifiers for every mention, the mention of “George Washington” can be modeled as an annotation in the RDF model, like so:
[Figure: blue denotes real-world instances; orange denotes terminology conceptualisation]
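The triples behind such a mention can be sketched in plain Python as (subject, predicate, object) tuples. The Web Annotation (oa) and DBpedia (dbr) namespaces are real vocabularies, but the `ex:` namespace, the `producedBy` predicate and the identifiers are placeholders, not our production schema:

```python
# RDF triples for one mention, as plain (subject, predicate, object) tuples.
# oa: and dbr: are real vocabularies; "ex:" and "producedBy" are placeholders.
OA  = "http://www.w3.org/ns/oa#"
DBR = "http://dbpedia.org/resource/"
EX  = "http://example.org/annotations/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

mention = EX + "mention-gw-1"  # unique identifier generated for this mention

triples = [
    (mention, RDF_TYPE, OA + "Annotation"),
    # Body: the real-world instance the span was linked to
    (mention, OA + "hasBody", DBR + "George_Washington"),
    # Target: the analysed document the span occurs in
    (mention, OA + "hasTarget", EX + "doc-42"),
    # Provenance: which model produced the prediction
    (mention, EX + "producedBy", EX + "ner-crf-en-v1"),
]
```

In a real pipeline these triples would be written to the RDF store (for example via a library such as rdflib), where the `hasBody` link to DBpedia is exactly what connects the annotation to public knowledge bases, and the provenance triple is what makes the prediction filterable per model.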