Pilots

The pilots page features our in-house-developed solutions and tools that have gone through the research and development phases and been deployed for operational testing. New solutions are regularly introduced to the page — we’re constantly working to expand our service line-up and offer you even more. Explore the latest developments below.

Projects

  • Data saturation is everywhere. We’re generally inclined to believe that the more, the better, but that isn’t necessarily true with data. The rapid rise in our ability to collect data hasn’t been matched by our ability to support, filter and manage these volumes of aggregated data.

    When we come across large texts containing vast amounts of contextual information (such as web queries, questions and forums), that contextual information needs to point to a single, explicit answer. Differentiating documents on the basis of their names has always been a complex issue and, considering how much data is out there, our need and ambition to group information points and related entities based on their attributes grows day by day.

    Existing disambiguation methods rely on named entity prominence, context similarity and entity-entity relatedness to tell ambiguous entities apart, and they work on document- or paragraph-level texts containing rich contextual information. We combine Lucene (TF-IDF) and Semantic Vectors (similarity search), two powerful Java libraries, with a variety of NER approaches exposed through our internal service-based framework, and apply them to free natural-language texts. We build on these technologies to represent words, entities and documents in terms of underlying ideas structured around entities. The outcome can be used for many concept-aware tasks such as similarity search, concept matching, clustering, entity disambiguation and knowledge representation. As Jim Rohn said, “reading is essential for those who seek to rise above the ordinary.”
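
    To make the idea of context similarity concrete, here is a minimal sketch in plain Python. It is not our production pipeline and uses neither Lucene nor Semantic Vectors: it assumes a whitespace tokenizer, one common smoothed TF-IDF weighting and cosine similarity, and all function names (tokenize, tfidf_vectors, cosine, rank) are hypothetical. Lucene's own scoring differs in detail.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Rank docs by similarity to the query; returns (index, score) pairs."""
    # Assumption: the query is weighted together with the documents.
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        "jaguar big cat rainforest predator",
        "jaguar car engine luxury vehicle",
        "stock market shares",
    ]
    # The ambiguous name "jaguar" is resolved by its surrounding context:
    # document 0 (the animal sense) ranks first.
    print(rank("the jaguar is a predator of the rainforest", candidates))
```

    In this toy example the shared context words, not the ambiguous name itself, decide which candidate wins. A semantic vector model would go one step further and also match related words that never co-occur literally.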

    The questions that arise here are:
    ● How much is enough?
    ● Is it enough?
    ● Is this the data I need?
    ● Did I cover a large variety of topics?
    ● Did I cover a large variety of entities?

    This approach increased the overall productivity and performance of our data analysts by between 10 and 50 percent, depending chiefly on the number of documents supplied: the more documents, the higher the performance. We conducted a comparative experiment evaluating the combination of supervised and unsupervised approaches on real-world web data, which demonstrated the effectiveness of combined semantic similarity methods. The volume of documents and information reaching our data analysts daily is vastly decreased, which of course benefits our clients: they get more knowledge quicker.

    *It must be noted that this methodology works best with data provided by search engines, full-text databases and big data storage solutions.

    References:
    https://github.com/semanticvectors/semanticvectors/wiki/SearchOptions
    https://lucene.apache.org
    https://nlp.stanford.edu/software/CRF-NER.html
    https://github.com/zalandoresearch/flair

  • In 2017, we embarked on SIP_finder, a risk & compliance (R&C) project that aims to automatically identify Special Interest Persons (SIP) involved in specific criminal activity from large sets of news articles. The project was launched in response to a request from a global news publisher that sought to streamline its internal intelligence-gathering processes. We were given a set of predefined criteria for SIP.

    We set about the task by trying to replicate, in an artificial neural network, the decision-making processes that due diligence analysts at our sister company A Data Pro follow when assessing media sources. Using a human-in-the-loop (HITL) approach, we applied the knowledge available in-house, drawn from A Data Pro’s R&C reporting practices and linguistic expertise, to train the model not only to recognise SIP within articles, as per the original client request, but also to mine and systematise the contextual information about SIP and their crimes, including their industry, related persons and organisations, the monetary value of the crime and the stage of any ensuing legal proceedings.

    This enhanced SIP_finder solution achieved an unparalleled 93-percent recall rate in its pilot phase. In isolating the articles that provided the fullest picture about SIP and their crimes, and classifying and interrelating the contextual elements within that picture, it automatically tackled a due diligence analyst’s most time-consuming task, that of sifting through large volumes of information, making this model a potential game-changer for both our client and A Data Pro’s analysts, and potentially for the industry as well.
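
    For readers less familiar with the metric, recall is the share of truly relevant items that a system manages to find. The sketch below shows the arithmetic, assuming recall is measured over articles (the unit used in the pilot is not stated here); the function name and the article IDs are hypothetical and are not pilot data.

```python
def recall(relevant, retrieved):
    """Fraction of the relevant items that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # Convention chosen for this sketch when nothing is relevant.
    return len(relevant & set(retrieved)) / len(relevant)


if __name__ == "__main__":
    # Illustrative IDs only: 3 of the 4 relevant articles were flagged.
    relevant_articles = {"a1", "a2", "a3", "a4"}
    flagged_articles = {"a1", "a2", "a3", "a9"}
    print(recall(relevant_articles, flagged_articles))  # 0.75
```

    By the same arithmetic, a 93-percent recall rate means that 93 of every 100 relevant items were found. Recall says nothing about how many irrelevant items were flagged along the way; that is measured separately by precision.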

    SIP_finder – still in its pilot phase while we’re further refining it – won us a Silver Stevie International Business Award and WealthBriefing Award in 2018, as well as plenty of invitations to speak at events.

    We’re currently building on this success and know-how in knowledge extraction within an R&C framework to develop an NLP algorithm that extracts context about organisations from corporate news. This undertaking – currently in its final development stage and intended for in-house use by A Data Pro’s MMA teams – will automatically assess organisations’ behaviour according to news sources, as related to stock exchange movements, structural and political changes, redundancy or appointment of employees, and new service announcements.