Speaking the right language

CAMBRIDGE, U.K.—Linguamatics, headquartered primarily in Cambridge, England, and secondarily in the Boston area in North America, has joined with Brandwatch and the University of Sussex to announce a joint project funded by the United Kingdom's Technology Strategy Board to address challenges faced by automated language-processing software in harnessing diverse data sources.

As the parties describe it, "The project forms part of a broader Technology Strategy Board initiative focusing on enabling technologies to harness 'big data' for economic growth." It is dubbed EVOKES, which stands for "Exploitation of Diverse Data via Automatic Adaptation of Knowledge Extraction Software," and the project is expected to run to the end of this year.

EVOKES seeks to improve automatic extraction of information from scientific papers, news or social media for applications in research and development, marketing and competitive intelligence. David Milward, chief technology officer at Linguamatics, added in the news release about the deal that "good-quality vocabularies are a key part of 'intelligent' text mining. This project will allow us to develop vocabularies much faster, and adapt them efficiently for new applications."

Although the effort is not specific to applications in the life sciences, that market is certainly an area where EVOKES could prove quite useful, and Linguamatics is well-versed in life-sciences needs. The company touts itself as a leader in deploying natural language processing-based text mining for complex, high-value problem solving and notes that its solutions are used by nine of the world's top 10 pharmaceutical companies, as well as "other prestigious commercial, academic and government organizations."

"Unstructured text contains key information requiredthroughout the drug development pipeline. This information can be accessed bysearch or, increasingly, by text mining. Both search and text mining requireknowledge of the different ways concepts can be expressed and use terminologiesfor this," Milward explains to ddn. "However, terminologies are oftenincomplete, with relatively common synonyms missing. Creating and editingterminologies by hand creates a bottleneck to the wider exploitation ofterminologies in search and mining. The EVOKES project aims to remove thisbottleneck through the use of automated or semi-automated methods."

"In contrast to other areas, life sciences has very complexand ambiguous terminology, but this is balanced by the wealth of existingterminology sources available, such as MeSH, UMLS and Entrez," he adds."Adaptation and expansion of terminologies is therefore very useful for lifesciences, whereas in other domains the focus may be more on creatingterminologies from scratch."

Looking to his company's partners in EVOKES, Milward says the University of Sussex is "an expert in the use of distributional methods, and is providing configurable software which analyzes a corpus of documents to discover distributional similarities," while Linguamatics and Brandwatch are configuring the software and building applications to improve terminologies and the identification of concepts across diverse data sources.
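
For readers unfamiliar with distributional methods, the general idea is that words appearing in similar contexts tend to have similar meanings, so a term whose neighbors closely match those of a known concept is a candidate synonym. The Python sketch below illustrates that idea only; it is not the University of Sussex software, and the toy corpus, the function names (context_vectors, cosine, suggest_synonyms) and the window size are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Count, for each token, the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_synonyms(term, vectors, top_n=3):
    """Rank other tokens by how similar their contexts are to those of `term`."""
    target = vectors.get(term, Counter())
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy corpus; a real system would use a large document collection.
corpus = [
    "the tumor was removed by surgery",
    "the neoplasm was removed by surgery",
    "the patient received aspirin daily",
    "the tumor responded to treatment",
    "the neoplasm responded to treatment",
]

vectors = context_vectors(corpus)
for candidate, score in suggest_synonyms("tumor", vectors):
    print(f"{candidate}\t{score:.3f}")
```

In this toy corpus, "tumor" and "neoplasm" occur in identical contexts, so "neoplasm" ranks first as a suggested synonym. A production system would presumably need much more, such as multiword terms, weighting schemes and human review of the suggestions, which fits the "automated or semi-automated" framing Milward gives.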

The current generation of language processing has had considerable success in extracting useful information from unstructured text, whether this is research literature or social media, the project partners note. However, adapting to a new domain is often a laborious process, with respect both to the type of data (such as newswire vs. patent literature) and to the terminology used in a given domain (for example, in medical practice vs. pharmaceutical research). Humans can perform these tasks on small data sets, but experience a huge challenge in the face of massively increasing amounts of electronic text.

"We have done a lot of work on improving terminologies inthe past," Milward notes. "However, this is currently a very labor-intensiveprocess. EVOKES should make the process much faster. We also expect to be ableto improve end-user tools, such as for suggesting likely synonyms for concepts.The project is particularly timely given the wide range of data that customersare now interested in," he adds, and bringing it back to life-scienceapplications notes that "The emphasis is no longer just on Medline, but alsoclinical trials, patents, electronic health records, news and even Twitter. Newsources typically stretch existing terminologies that were developed with onedata source in mind."