The humans behind
Data Scientist Iva Marinova and her colleague, Yolina Petrova, were the bold AI warriors who told the story of the breakthrough technology designed to improve the everyday working lives of countless communication professionals around the world. Iva’s experience covers deep learning and artificial neural networks, statistical data analysis and Named Entity Recognition (NER). She speaks four languages and holds the unlikely combination of a Master’s degree in Computer and Information Systems Security and a Bachelor’s in Theatre – a match made in heaven when it comes to uniting digital models with human behaviours.
Where it all started
The story begins a long time ago, in a galaxy not so far, far away… back in 2015, when Identrics was created with the main goal of helping journalists and media analysts not only acquire and process the huge amounts of news data available on the web, but also understand it better. Since then, the company has expanded its expertise in all Natural Language Processing (NLP)-related tasks, such as parsing texts in more than 40 languages to obtain information and extract knowledge from them. Automatic summarisation was then identified as crucial in minimising the work effort of journalists and ensuring faster and cheaper delivery of analytical solutions for media monitoring.
The task ahead of Identrics was to create and design a system that could process automatically acquired news data in different languages and generate abstractive summaries in English so that these summaries could be used by journalists as a basis in retelling international news. Within this project, the Identrics team acquired data from the real working process of a company specialising in media monitoring. After filtering out everything that was too long or too short, or too noisy or repetitive, our colleagues ended up with around 200,000 examples to fine-tune and test the model solution. The process was intended to work in three consecutive steps. First, all texts were automatically translated using open-source neural machine translation with state-of-the-art quality. The texts were then processed to meet specific requirements. Fine-tuning came last.
The challenges along the way
As expected, problems occurred along the way. For example, in most solutions, the models’ attention focused mostly on the beginning of a source text, which meant that the information appearing at the end of the text was either truncated or simply ignored. Sometimes, there were missing facts or additional content. The so-called “hallucination” of the machine led to some mistakes and misunderstandings that were, at the very least, fun to read. Grammatical inconsistencies caused some trouble. The similarity to extractive summaries was, at times, a little too much. A particularly troublesome issue was the reproduction of existing biases, which sometimes even generated toxic language.
The way out of summarisation chaos
Through validation, fact-preserving and fact-checking, Identrics managed to handle all these issues in the best way possible. The team began performing topic, sense and length validation and, most importantly, included a “bad words” filter to screen out toxic language. When the validation steps were tested with specially engineered prompts, every one of those prompts was caught by the validation procedures. User feedback confirmed that the validated content was now correct and 100% inoffensive.
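To make the idea of validation more concrete, here is a minimal sketch of a length check, a crude topic check and a “bad words” filter. It is not the Identrics implementation: the function name `validate_summary`, the thresholds and the tiny `BAD_WORDS` list are all assumptions made for illustration, and sense validation is left out because it would need a language model.

```python
import re

# Hypothetical block list; a real system would use a curated lexicon.
BAD_WORDS = {"idiot", "stupid"}


def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def validate_summary(source, summary, min_words=5, max_ratio=0.8, min_overlap=0.5):
    """Return the names of the failed checks; an empty list means 'passed'."""
    problems = []
    src, summ = tokens(source), tokens(summary)

    # Length validation: not trivially short, and shorter than the source.
    if len(summ) < min_words:
        problems.append("too_short")
    if src and len(summ) > max_ratio * len(src):
        problems.append("too_long")

    # Topic validation (crude proxy): share of the summary's longer words
    # that also occur in the source text.
    content = [w for w in summ if len(w) > 3]
    if content:
        overlap = sum(w in set(src) for w in content) / len(content)
        if overlap < min_overlap:
            problems.append("off_topic")

    # "Bad words" filter: any token found in the block list fails the summary.
    if BAD_WORDS & set(summ):
        problems.append("toxic")

    return problems
```

Called with a source article and a candidate summary, the function returns an empty list when all three checks pass, so a summary can be released only when nothing is reported.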
To preserve facts, Identrics used FastText, NetworkX, and PageRank. This approach optimised 12.1% of the evaluated articles. Missing facts are a complex problem, and the solution needs to be embedded in both the model and the specific data.
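The article does not say how these three tools were combined, so the following is only one plausible reading: rank the source sentences by centrality, TextRank-style, so that the most important ones can be compared against the summary. To stay dependency-free, this sketch swaps FastText embeddings for plain word counts and NetworkX's PageRank for a hand-written power iteration; `rank_sentences` and the other names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Word-count vector; a stand-in for a FastText sentence embedding."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an adjacency matrix."""
    n = len(weights)
    ranks = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            # Rank flowing into node j, split in proportion to edge weight.
            # Nodes with no outgoing weight simply pass nothing on.
            incoming = sum(ranks[i] * weights[i][j] / out_sums[i]
                           for i in range(n) if out_sums[i] > 0)
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def rank_sentences(sentences):
    """Return (score, sentence) pairs, most central sentence first."""
    vectors = [bag_of_words(s) for s in sentences]
    n = len(sentences)
    weights = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    return sorted(zip(pagerank(weights), sentences), reverse=True)
```

Under this reading, a summary that mentions none of the top-ranked source sentences would be a candidate for having dropped key facts.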
Fact-checking was the biggest challenge. Recent research shows that nearly 30% of the summaries generated by abstractive summarisation models contain fake facts. To address this problem, Identrics proposed a fact-checking algorithm with two sources of inspiration. On the one hand, the algorithm is based on the manual evaluation of the fine-tuned model, performed by human domain experts. On the other hand, it is based on the hypothesis that factual consistency is directly connected to machine reasoning based on natural language. Our suggestion was therefore to build the algorithm around two specific tasks: named entity recognition and textual entailment.
The algorithm works in several consecutive steps. It first checks whether the generated summary contains any day(s) of the week or month(s) which do not appear in the source text. Thereafter, it checks whether the generated summary contains named entities such as people or organisations that do not appear in the source text. Using BERTScore, it then aligns each sentence of the generated summary with the most similar sentence in the source text. Each pair of sentences is tested for textual entailment to determine whether the sentences are logically connected. Finally, spelling and grammar checks are performed.
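The first two steps can be sketched without any trained model. The example below is an illustration under strong assumptions, not the algorithm itself: capitalised words that are not sentence-initial stand in for real named entity recognition, and the function name `fact_check` is made up. The BERTScore alignment, entailment and grammar steps are omitted because they require pretrained models.

```python
import re

DATE_TERMS = {
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def words(text):
    """All alphabetic tokens, with their original capitalisation."""
    return re.findall(r"[A-Za-z]+", text)


def capitalised_words(text):
    """Naive NER stand-in: capitalised words that do not start a sentence."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found.update(w for w in words(sentence)[1:] if w[0].isupper())
    return found


def fact_check(source, summary):
    """Report days/months and entity-like words absent from the source."""
    source_words = set(words(source))
    summary_words = set(words(summary))

    # Step 1: days of the week or months that the source never mentions.
    # Note: a sentence-initial "May" or "March" can be a false positive.
    new_dates = (summary_words & DATE_TERMS) - source_words

    # Step 2: entity-like words that the source never mentions.
    new_entities = capitalised_words(summary) - DATE_TERMS - source_words

    return {"dates": sorted(new_dates), "entities": sorted(new_entities)}
```

For a source that says "Jane Smith" and "Monday", a summary that says "John Smith" and "Tuesday" would be flagged on both counts, while a faithful paraphrase would come back with two empty lists.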
The comparison between human evaluation and the fact-checking algorithm showed that the fact-checking algorithm covered more than half of the factual errors detected by our human experts and detected around 13% further factual errors originally missed by them.
Our ethical focus
“With our work, we want to direct more attention not only to the end products of the widely used Transformers but also to their pre-training phase to ensure and validate the quality of a dataset. High data quality in the pre-training phase of the Transformers is very essential for their performance in order to ensure that their fine-tuned inheritants and the consecutively deployed systems not only perform safely and as intended, but also do not become a source of discrimination or misinformation,” Iva Marinova emphasised as a conclusion at the event.
The next step forward
Ethical frameworks and regulations for systems using AI are already being developed globally. One such example is the Artificial Intelligence Act, a regulation proposed by the European Commission to the European Parliament and the Council of the European Union. Techniques for monitoring and regulating deployed models, like the ones described in our paper, are about to become an integral part of the AI production environment. The proposed methods are model-agnostic and can be applied to any neural abstractive summarisation model.
This is why Identrics’ future priorities are to pay more attention to the biases and text characteristics in the datasets used in the pre-training phase; develop model-agnostic procedures to validate the data used in pre-training; and, of course, keep focusing on the safe and productive deployment of AI.
And finally… the recognition
Our efforts and achievements in the field of abstractive summarisation have not gone unnoticed. Recently, Identrics was recognised with a Gold Stevie trophy in the 18th Annual International Business Awards® (IBAs) in the Content Management Solution category. As proud as we are of this award, it is, above all, confirmation that we are moving in the right direction. And we’ll keep moving.
Try the demo version of Identrics’ abstractive summarisation solution here: