Speaking the right language
CAMBRIDGE, U.K.—Linguamatics, headquartered primarily in Cambridge, England, and secondarily in the Boston area in North America, has joined with Brandwatch and the University of Sussex to announce a joint project funded by the United Kingdom's Technology Strategy Board to address challenges faced by automated language-processing software in harnessing diverse data sources.
As the parties describe it, "The project forms part of a broader Technology Strategy Board initiative focusing on enabling technologies to harness 'big data' for economic growth." It is dubbed EVOKES, which stands for "Exploitation of Diverse Data via Automatic Adaptation of Knowledge Extraction Software," and the project is expected to run to the end of this year.
EVOKES seeks to improve automatic extraction of information from scientific papers, news or social media for applications in research and development, marketing and competitive intelligence. David Milward, chief technology officer at Linguamatics, added in the news release about the deal that "good-quality vocabularies are a key part of 'intelligent' text mining. This project will allow us to develop vocabularies much faster, and adapt them efficiently for new applications."
Although the effort is not specific to the life sciences, that market is certainly one where EVOKES could prove quite useful, and Linguamatics is well-versed in life-sciences needs. The company touts itself as a leader in deploying natural language processing-based text mining for complex, high-value problem solving and notes that its solutions are used by nine of the world's top 10 pharmaceutical companies, as well as "other prestigious commercial, academic and government organizations."
"Unstructured text contains key information required throughout the drug development pipeline. This information can be accessed by search or, increasingly, by text mining. Both search and text mining require knowledge of the different ways concepts can be expressed and use terminologies for this," Milward explains to ddn. "However, terminologies are often incomplete, with relatively common synonyms missing. Creating and editing terminologies by hand creates a bottleneck to the wider exploitation of terminologies in search and mining. The EVOKES project aims to remove this bottleneck through the use of automated or semi-automated methods."
"In contrast to other areas, life sciences has very complex and ambiguous terminology, but this is balanced by the wealth of existing terminology sources available, such as MeSH, UMLS and Entrez," he adds. "Adaptation and expansion of terminologies is therefore very useful for life sciences, whereas in other domains the focus may be more on creating terminologies from scratch."
Looking to his company's partners in EVOKES, Milward says the University of Sussex is "an expert in the use of distributional methods, and is providing configurable software which analyzes a corpus of documents to discover distributional similarities," while Linguamatics and Brandwatch are configuring the software and building applications to improve terminologies and the identification of concepts across diverse data sources.
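To make the idea of distributional similarity concrete, here is a minimal Python sketch. It is not the University of Sussex software or any Linguamatics or Brandwatch tool; the function names (`context_vectors`, `cosine`, `suggest_synonyms`), the two-word context window and the three-sentence toy corpus are all assumptions made for illustration. It counts the words that appear near each term and then ranks other terms by how closely their context counts match, which is one simple way likely synonyms might be proposed.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_synonyms(term, vectors, top_n=3):
    """Rank every other token by the similarity of its contexts to those of `term`."""
    target = vectors.get(term, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy corpus; a real terminology project would use far more text.
corpus = [
    "patients received aspirin for pain relief",
    "patients received ibuprofen for pain relief",
    "the trial enrolled patients in march",
]
vectors = context_vectors(corpus)
suggestions = suggest_synonyms("aspirin", vectors)
print(suggestions)
```

On this toy corpus, "ibuprofen" ranks first for "aspirin" because the two words occur in identical contexts. The same result also shows why a human reviewer or a semi-automated step remains useful: shared contexts indicate related terms, not necessarily true synonyms.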
The current generation of language-processing software has had considerable success in extracting useful information from unstructured text, whether this is research literature or social media, the project partners note. However, adapting to a new domain is often a laborious process, with respect both to the type of data (such as newswire vs. patent literature) and to the terminology used in a given domain (for example, in medical practice vs. pharmaceutical research). Humans can perform these tasks on small data sets, but face a huge challenge as the amount of electronic text increases massively.
"We have done a lot of work on improving terminologies in the past," Milward notes. "However, this is currently a very labor-intensive process. EVOKES should make the process much faster. We also expect to be able to improve end-user tools, such as for suggesting likely synonyms for concepts. The project is particularly timely given the wide range of data that customers are now interested in," he adds. Bringing it back to life-science applications, he notes that "the emphasis is no longer just on Medline, but also clinical trials, patents, electronic health records, news and even Twitter. New sources typically stretch existing terminologies that were developed with one data source in mind."