Researchers and data scientists use text mining tools to extract and interpret facts, assertions, and relationships from vast amounts of published information. These AI and machine learning projects accelerate research, surface new discoveries, provide competitive intelligence, and help companies identify potential safety issues in the drug or product development pipeline.

However, despite the many benefits of text mining, researchers face a number of obstacles before they even get a chance to run queries against the body of scholarly literature. 

Here are the three primary challenges we hear about when companies build a collection of articles (or “corpus”) for their text mining projects, with tips to overcome them.

1. Incomplete Information in Article Abstracts

Many researchers build their corpus from scientific article abstracts because they are easily accessible via databases such as PubMed. While abstracts provide some value, they contain only a summary of the work. Mining the full text of the article, including detailed descriptions of methods and protocols and the complete study results, ensures that researchers don’t miss vital data, discoveries, and assertions. However, unlike abstracts, full text is often not readily available from publishers in a format suitable for text mining.

Tip: The more data from multiple publishers and sources, the better. Focus on full text to reduce the risk of missing out on key findings, and ideally unify the content into a single format to make ingestion into mining tools simpler.

2. Limited Access to XML-Formatted Content

When companies have journal subscriptions, the documents are often available as PDFs, a format not intended for use with text mining software. Researchers and data scientists must then spend time converting the PDFs to XML (Extensible Markup Language), the preferred format for text mining software. XML encodes documents in a widely used, machine-readable format so that computer programs can parse or display the content appropriately. Converting PDFs to XML requires additional software tools, which is not only inefficient but can also damage the document itself: data and tables can be lost, document sections can be conflated into a “blob of text,” and bad characters and non-words can be introduced, all of which raise the risk of missing data.

Tip: Rely on original source XML rather than converted PDFs, especially if it is normalized to a standard schema (like JATS). This typically provides better quality results. Remember: bad data in, bad data out!
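To illustrate why structured XML is easier to mine than a converted PDF, here is a minimal Python sketch that pulls section titles and paragraph text out of a JATS-style article. The sample document is hypothetical and heavily simplified (real JATS files carry far more metadata and may nest sections), and the `extract_sections` helper is our own illustrative name, not part of any product or library.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical, heavily simplified JATS-style article for illustration only.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Example Study</article-title>
      </title-group>
      <abstract><p>Short summary.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>Samples were incubated for <italic>24</italic> hours.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Response improved in the treated group.</p>
      <p>No adverse events were observed.</p>
    </sec>
  </body>
</article>"""


def extract_sections(jats_xml: str) -> Dict[str, str]:
    """Map each top-level body section title to its paragraph text."""
    root = ET.fromstring(jats_xml)
    sections = {}
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="(untitled)").strip()
        # itertext() keeps text inside inline markup such as <italic>.
        paragraphs = ["".join(p.itertext()).strip() for p in sec.findall("p")]
        sections[title] = " ".join(paragraphs)
    return sections


sections = extract_sections(SAMPLE_JATS)
for title, text in sections.items():
    print(f"{title}: {text}")
```

Because the section boundaries are explicit in the markup, the methods and results stay separate instead of collapsing into a single block of text, which is the problem PDF conversion tends to introduce.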

3. Inconsistent Licensing Terms and Fees

The size of a corpus of scientific literature varies by project: some require only a handful, dozens, or hundreds of articles, while others require processing hundreds of thousands or even millions. To get the best results, projects often depend on access to a broad base of content, so businesses must work directly with multiple rightsholders and publishers to use full-text XML articles. This typically results in varying fee structures, inconsistent terms of use, and ultimately reduced productivity. Without a common set of terms and conditions for the use of full-text content across publishers, researchers and information managers are left to negotiate one by one with individual rightsholders to obtain the content and rights they need for text mining.

Tip: Save time and effort by taking advantage of the collective licensing options available, and let someone else take on the negotiating for you. This is also an important time to involve the person or department within your company who manages subscriptions (typically a knowledge or information manager). They’ll have insights into what the company currently uses and may already have relationships with partners that can help streamline licensing.

Keep Learning:
How can CCC help?  

RightFind® XML enables researchers in R&D-intensive companies to make discoveries and connections that can only be found in article full-text. Learn more here.   


Author: Carl Robinson

Carl Robinson has been in publishing since 1995 and has worked for Pearson Education, Macmillan Education and Oxford University Press. At CCC, Carl’s focus is upon helping clients look at business vision, goals and strategies around their content and tooling to enable flexibility and readiness to meet the ever-changing demands of the digital market.