Data saturation is everywhere. We’re generally inclined to believe that the more, the better, but that isn’t necessarily true with data. The rapid rise in our ability to collect data hasn’t been matched by our ability to support, filter and manage these volumes of aggregated data.
When we deal with large bodies of text rich in contextual information, such as web queries, questions and forum posts, that context needs to be linked explicitly to its answers. Distinguishing documents by name alone has always been difficult, and given how much data is out there, our need and ambition to group information points and related entities by their attributes grows day by day.
Existing disambiguation methods have used named entity prominence, context similarity and entity-entity relatedness to discriminate ambiguous entities; these methods work on document- or paragraph-level texts containing rich contextual information. We combine two Java libraries, Lucene (TF-IDF) and Semantic Vectors (similarity search), with a variety of NER approaches exposed through our internal service-based framework, and apply them to free natural-language texts. Together, these technologies let us represent words, entities and documents in terms of the underlying ideas structured around entities. The outcome can be used for many concept-aware tasks such as similarity search, concept matching, clustering, entity disambiguation and knowledge representation. As Jim Rohn said, “reading is essential for those who seek to rise above the ordinary.”
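To make the context-similarity step concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Java. The class, corpus and smoothed IDF formula are illustrative assumptions, not Lucene's or Semantic Vectors' actual APIs: a query whose context mentions a capital and a country scores higher against the geographic sense of “Paris” than against the celebrity sense.

```java
import java.util.*;

// Illustrative sketch only: TF-IDF weighting plus cosine similarity over a
// tiny corpus. Names and formulas are assumptions, not the Lucene API.
public class TfIdfSketch {

    // Tokenize by lowercasing and splitting on non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Build a TF-IDF vector for one document against the whole corpus,
    // using a smoothed IDF so unseen terms do not divide by zero.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> vec = new HashMap<>();
        for (String term : doc) vec.merge(term, 1.0, Double::sum);
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
            e.setValue((e.getValue() / doc.size()) * idf);
        }
        return vec;
    }

    // Cosine similarity between two sparse term vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            tokenize("Paris is the capital of France"),
            tokenize("Paris Hilton attended the film premiere"),
            tokenize("France borders Spain and the capital is Paris"));
        Map<String, Double> query = tfidf(tokenize("capital city of France"), corpus);
        // Context similarity separates the two senses of "Paris".
        double geo = cosine(query, tfidf(corpus.get(0), corpus));
        double celeb = cosine(query, tfidf(corpus.get(1), corpus));
        System.out.println(geo > celeb); // prints "true"
    }
}
```

In production this weighting and search is handled by Lucene and Semantic Vectors at scale; the sketch only shows why shared weighted context terms pull an ambiguous mention toward the right entity.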
The questions that arise here are:
● How much is enough?
● Is it enough?
● Is this the data I need?
● Did I cover a large variety of topics?
● Did I cover a large variety of entities?
This approach increased the overall productivity and performance of our data analysts by 10 to 50 percent, the key factor being the number of documents provided: more documents yield higher performance. We conducted a comparative experiment applying a combination of supervised and unsupervised approaches to real-world web data, which demonstrated the effectiveness of combined semantic similarity methods. As a result, the volume of documents that reaches our data analysts daily is vastly reduced, to the benefit of our clients: they get more knowledge, faster.
*It must be noted that this methodology works best with data provided by search engines, full-text databases and big data storage solutions.