The Data Quality Imperative: The Critical Importance of Data Quality
https://www.copyright.com/blog/the-data-quality-imperative-the-critical-importance-of-data-quality/ | 15 December 2022

This is a summary of a pre-print article my team has written on this topic (with supporting data and references); it is accessible at Europe PMC.

In “Knowledge Graphs as Belief System Encapsulations,” I presented the view that “knowledge must necessarily be associated with a degree of confidence that expresses the strength of our conviction about the accuracy of the information.”

In this blog post, I will share our experience in building a knowledge graph of researchers who have published articles related to coronaviruses, and I will provide some details of the data quality issues that we encountered while working with the bibliographic metadata.

Data Quality Issues Encountered in Building the CCC COVID-19 Author Graph

To facilitate the discovery of experts working in the COVID-19 field, we selected a set of published scientific articles in virology, with special attention to coronaviruses. From the associated bibliographic metadata we extracted the authors, their affiliations, the titles of their articles, the journals in which they were published, the relevant Medical Subject Headings (MeSH) terms, and their citations.

The idea was to build a graph of the above-mentioned information that someone could explore through a visualization tool, since that is a great way to interact with the data and quickly identify qualified experts in any field of interest. We built such a visualization tool so that everyone could benefit from that knowledge: the tool does not require any special skills, and intuition suffices to explore the graph.

Knowledge Graphs, Knowledge Systems, and Data Quality

A knowledge graph is the product of a knowledge system. A knowledge system takes a set of data as its input and, through a series of data processing steps, extracts as much of the “actionable information” in the data as it is possible or desirable; the term “actionable information” is our working definition of “knowledge.”

Suppose that we could quantify the maximum amount of knowledge that one could produce from a given set of data (S); let us call that amount Kmax(S). A good measure of success for a knowledge system would be the ratio of the knowledge actually produced (let us denote that with K(S)) over the maximum amount of knowledge that is theoretically possible to produce, i.e. the ratio K(S) / Kmax(S). That ratio can be used as a criterion for determining whether the data is "good enough," since, above a threshold value, the effort expended for a given (percentage) gain in the ratio grows exponentially fast. This is a fancy way of saying that, for a given input, the amount of additional knowledge that can be obtained diminishes beyond a certain level of effort. Now, the catch is that we do not know that theoretical maximum value for any non-trivial input. Nevertheless, sensible (domain-specific) rules can be created to estimate it.
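
Expressed compactly (the notation below is my own shorthand, not taken from the pre-print), the criterion reads:

```latex
% Efficiency of a knowledge system on input data S
\[
  \eta(S) \;=\; \frac{K(S)}{K_{\max}(S)}, \qquad 0 \le \eta(S) \le 1 .
\]
% The data is "good enough" once \eta(S) exceeds a domain-specific threshold \tau,
% beyond which each additional percentage point of \eta(S) costs disproportionately more effort.
\[
  \text{accept } S \quad \text{whenever} \quad \eta(S) \ge \tau .
\]
```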

Another empirical fact is that the amount of work needed to produce a given amount of knowledge increases as the quality of the input data decreases; the better our data are, the greater the amount of knowledge that can be produced for a given effort. This matters when you select the data that you want to work with. Sometimes you have no choice, but at other times you may be choosing, from a list of vendors, the data that you will use. The higher the quality of your input data, the less work you will have to do for a given amount of knowledge that you want to produce.

With these general remarks in mind, let us consider the problem that we were trying to solve with the COVID-19 author graph. In that context, "knowledge" is the information that can support the decision of selecting a specific researcher as the most qualified candidate for conducting the peer review of a given manuscript; our prototype system did not address the issue of availability or any other matters of logistics. One can define what "most qualified" means however one pleases, and that point matters because it automatically implies that there is not one knowledge graph 'to rule them all.' What is most important is that, no matter which definition one selects, the confidence in supporting the decision can be quantified.

Data Quality Issues

When building the COVID Author Graph, we used data sourced from MEDLINE, a bibliographic citation repository for articles published in medicine and the life sciences, produced by the US National Library of Medicine. We initially selected 184,187 scientific articles from the MEDLINE corpus by searching for terms related to coronaviruses; if you are curious about the specifics of the query, we looked for the following terms: 'upper respiratory infection', 'COVID-19', 'novel coronavirus', 'SARS-CoV-2', 'SARS', 'MERS-CoV', and 'severe acute respiratory syndrome'.
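
For illustration only, here is a minimal sketch of how a similar term-based selection could be run against PubMed using Biopython's Entrez utilities. The actual selection for the COVID Author Graph was made against the MEDLINE data itself, so treat the query construction and the contact address below as assumptions.

```python
# Sketch: count coronavirus-related articles via PubMed E-utilities (illustrative only).
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address for E-utilities

terms = [
    "upper respiratory infection", "COVID-19", "novel coronavirus",
    "SARS-CoV-2", "SARS", "MERS-CoV", "severe acute respiratory syndrome",
]
query = " OR ".join(f'"{t}"' for t in terms)

handle = Entrez.esearch(db="pubmed", term=query, retmax=10)
result = Entrez.read(handle)
handle.close()

print("Total matching records:", result["Count"])
print("First PMIDs:", result["IdList"])
```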

MEDLINE is a widely used public source of data for over 30 million articles. However, there are several known data quality issues in MEDLINE, and they surface quickly when one attempts to create a knowledge graph from the data. Let us look at some of those issues.

Issues concerning standard identifiers

The use of standard identifiers is an excellent way to ensure the correct identification of entities in the data. For example, since every article in MEDLINE has a PMID (that is an identifier produced by the creators of MEDLINE), we have a reliable way to identify a journal article correctly and repeatedly in MEDLINE.

To begin with, other standard identifiers were sparse in the data. Out of over 1.1 million author instances in the COVID dataset, around 10% had an ORCID attributed to the author, and a number of these ORCIDs were invalid. In some cases, all authors of a paper were assigned the same ORCID, which clearly cannot be true for a "unique identifier." Moreover, it should be noted that the share of authors with an ORCID is below 3% across all of MEDLINE. ISNI and GRID identifiers were even less prevalent, with 2,884 links to ISNI identifiers and 3,276 links to GRID identifiers among the more than 1.1 million author instances in the COVID dataset. Only roughly 700 institutions were represented by ISNI and by GRID, respectively.
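
ORCID iDs carry a checksum (ISO 7064 MOD 11-2) in their final character, so many malformed identifiers can be caught mechanically. Below is a minimal sketch of such a validator; the duplicate-ORCID check at the end is a hypothetical illustration of the "same ORCID for every author" problem, not CCC's production logic.

```python
# Sketch: flag structurally invalid ORCIDs using the ISO 7064 MOD 11-2 check digit.
import re

def orcid_is_valid(orcid: str) -> bool:
    """Return True if the 16-character ORCID has the correct check digit."""
    digits = orcid.replace("-", "").upper()
    if not re.fullmatch(r"\d{15}[\dX]", digits):
        return False
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected

print(orcid_is_valid("0000-0002-1825-0097"))  # True: ORCID's published sample iD
print(orcid_is_valid("0000-0002-1825-0098"))  # False: wrong check digit

def suspicious_orcids(author_orcids: list[str]) -> bool:
    """Hypothetical check: every author of one article carries the same ORCID."""
    present = [o for o in author_orcids if o]
    return len(present) > 1 and len(set(present)) == 1
```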

Further issues concerning authors and affiliations

For an author graph, the author identity is an essential piece of information. We needed to extract the authors of the published articles from the bibliographic metadata reliably and with a measure of confidence as to whether the information about an author in the graph is accurate. Do the names “R S Baric,” “Ralph Baric,” “Ralph S Baric,” “Ralph A Baric,” and “Ralph Steven Baric” all refer to the same author?
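
A common first step (a sketch, not CCC's actual algorithm) is to reduce each name to a conservative blocking key that only groups candidates for disambiguation. Note that such a key cannot by itself tell "Ralph S Baric" and "Ralph A Baric" apart, which is exactly why additional signals such as affiliations, co-authors, and topics are needed.

```python
# Sketch: a conservative "blocking key" that groups name variants for later disambiguation.
import unicodedata

def blocking_key(last_name: str, fore_name: str) -> str:
    """Lower-cased, accent-stripped last name plus the first initial of the forename."""
    def norm(s: str) -> str:
        s = unicodedata.normalize("NFKD", s)
        return "".join(c for c in s if not unicodedata.combining(c)).lower().strip()
    return f"{norm(last_name)}|{norm(fore_name)[:1]}"

variants = [
    ("Baric", "R S"), ("Baric", "Ralph"), ("Baric", "Ralph S"),
    ("Baric", "Ralph A"), ("Baric", "Ralph Steven"),
]
print({blocking_key(ln, fn) for ln, fn in variants})  # all collapse to one key: {'baric|r'}
```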

Moreover, if we look just at the lengths of authors' last names in the 2020 portion of the COVID dataset, the number of names with a length of 1 character and the number with a length of more than 30 characters both exploded. In that single metric alone, we can clearly see the impact on data quality of rushing out so many COVID-19 publications in 2020.
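
Sanity checks like this are cheap to automate. Here is a sketch, with a made-up list standing in for the real author records and with thresholds that are my own assumptions:

```python
# Sketch: profile last-name lengths and flag implausible values.
from collections import Counter

last_names = ["Baric", "Li", "A", "Consortium for the Study of ..."]  # placeholder data
length_counts = Counter(len(name) for name in last_names)

flagged = [name for name in last_names if len(name) <= 1 or len(name) > 30]
print(sorted(length_counts.items()))
print(f"{len(flagged)} names with suspicious lengths: {flagged}")
```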

Another challenge concerns collective names and working groups: authors who are also members of working groups can end up being listed two or more times in the author list. For example, "Frédéric Choulet" appears 12 times in the author metadata for PMID 30115783.

Finally, data quality issues also appear with author affiliations. Affiliations may be concatenated and duplicated across all the authors of an article, rendering correct attribution impossible. And even when an author affiliation is present, it is not always useful. Believe it or not, the most common affiliation in MEDLINE is the institution-less 'Department of Psychology.', including the period! It appears over 8,000 times, and both '.' and ',.' make the top five.
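
A frequency profile of affiliation strings makes these pathologies jump out immediately. The sketch below uses placeholder records and a heuristic of my own; the production pipeline's rules are more elaborate.

```python
# Sketch: surface degenerate affiliation strings by frequency and simple content checks.
from collections import Counter

affiliations = [
    "Department of Psychology.", ".", ",.",
    "Department of Epidemiology, University of North Carolina at Chapel Hill, NC, USA.",
]  # placeholder data

def looks_degenerate(affiliation: str) -> bool:
    """Heuristic: punctuation-only strings, or short strings with no institution cue."""
    text = affiliation.strip(".,; ").lower()
    if not text:
        return True
    has_institution_cue = any(cue in text for cue in ("university", "institute", "hospital", "college"))
    return not has_institution_cue and len(text) < 30

counts = Counter(a.strip() for a in affiliations)
for affiliation, n in counts.most_common():
    marker = "SUSPECT " if looks_degenerate(affiliation) else ""
    print(f"{n:>6}  {marker}{affiliation!r}")
```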

Issues concerning publication dates

Another area where we observe data quality issues is publication dates. The oldest publication year in all of MEDLINE is the year 0001, which is clearly wrong, while the next oldest publication year is 1041, a date at which Gutenberg had not yet been born, let alone the subject matter of the paper. It turns out that, in the latter example, the number 1041 is actually the page number of the article, misplaced in the date field! Months are meant to be three-character strings in MEDLINE but can also appear as numbers, and this leads to other problems. These types of issues with publication dates are not uncommon in our experience, nor are they unique to MEDLINE. In other datasets that we work with, we have seen publication dates far out in the future (e.g. 2104) that are clearly character transpositions of the correct publication date (e.g. 2014).
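
Again, most of these anomalies are detectable with simple rules; here is a sketch, where the plausibility window is an assumption of mine rather than a MEDLINE rule:

```python
# Sketch: normalize MEDLINE-style publication months and flag implausible years.
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def normalize_month(raw: str):
    """Accept either 'Apr'-style strings or numeric strings; return 1-12 or None."""
    raw = raw.strip()
    if raw in MONTHS:
        return MONTHS[raw]
    if raw.isdigit() and 1 <= int(raw) <= 12:
        return int(raw)
    return None

def plausible_year(year: int, earliest: int = 1665) -> bool:
    """From 1665 (the first scholarly journals) through next year; everything else is suspect."""
    return earliest <= year <= date.today().year + 1

for y in (1, 1041, 2014, 2104):
    print(y, plausible_year(y))
```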

Schema-level issues

The team also identified issues at the schema level, which I am not going to get into here. (For additional details, see my LinkedIn post, which references problematic elements of the 2019 DTD.) Briefly, the schema also omits a key relationship, namely the relationship between an Investigator and a Collective, which is acknowledged in the documentation:

For records containing more than one collective/corporate group author, InvestigatorList does not indicate to which group author each personal name belongs.

It then explains the NLM’s reasoning for repeating names:

In this context, the names are entered in the order that they are published; the same name listed multiple times is repeated because NLM cannot make assumptions as to whether those names are the same person.

Summary

When my team set out to build the COVID Author Graph, we did not want to build a “quick-and-dirty” graph of authors from the COVID-related scientific literature. Our aim was to build a knowledge graph of authors, in the sense that we wanted to quantify our confidence in the underlying quality of the data and sort out the litany of known problems, many of which I mentioned above. We wanted to disambiguate authors and their affiliations as well as we possibly could. Producing and maintaining knowledge is an iterative process that is based on continuous improvement. At a high level, all knowledge systems implement some version of the Deming cycle (a.k.a. the PDCA cycle) because the nature of knowledge is dynamic, and the preservation of quality (for its data) requires feedback loop mechanisms. The outcome of such systematic data engineering can be a variety of data products, where the degree of confidence for every piece of information can be quantified and its provenance be reported. Such data can then lead to more knowledge through inference — the Bayesians amongst us can see where all this is going!

This post has emphasized that even the simplest data sets are fraught with problems that impede the production of knowledge. Thus, every knowledge system must define and measure data quality systematically. The better the quality of our data, the greater the amount of knowledge that can be produced for a given effort.

Behind the Scenes: Building CCC Expert View
https://www.copyright.com/blog/behind-the-scenes-building-ccc-expert-view/ | 21 June 2022

CCC's Expert View facilitates exploration of hundreds of thousands of authors, including the interconnections between them, their publications, and their areas of interest. Here's a behind-the-scenes look at how we built it.

Since May of 2020, we have seen over 2,000 articles per week published on COVID-19. That's a lot of new information to absorb and digest. And of course, with COVID, we have not only the problem of too much information arriving too quickly, but also the urgency, especially back in spring 2020, to find answers and experts now.

It was this context that provided CCC the impetus to take some experimental work we were doing in-house on data pipelines and build a COVID author graph. In April 2020 we released a prototype of a knowledge graph of authors who specialize in COVID and related fields of study. That prototype was the precursor to CCC Expert View.

Why did we do that? We believe that knowledge graphs, and their ability to quickly answer questions from large datasets of entities and relationships, are an appropriate tool for finding people and experts in a dataset like the COVID literature.

How We Built CCC Expert View

The knowledge graph comprises two key elements: a data pipeline that produces graph data from source data, and an application that allows the user to explore and interact with that data.

We start with article metadata and journal data, medical subject headings (MeSH) for our ontology, and institution data from Ringgold. The source data is standard XML and tabular data. As a source of information, it presents many of the challenges that we discuss in more detail here (no clearly delineated entities, large volume, few explicit relationships, unknown data quality).

Next, we take this data and run it through our data pipeline. This pipeline is a series of processing steps whose purpose is to extract the relevant entities and their relationships in the form of graph data. There are five types of entities, namely authors, articles, institutions, journals, and fields of study. And there are many different types of relationships between them, such as connections between authors and other authors, between authors and articles, and between authors and their affiliated institutions.
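
As a rough illustration only, the graph data could be modeled along the lines below. The names, fields, and confidence values are assumptions on my part, not CCC's actual schema.

```python
# Sketch: a simplified data model for the five entity types and a few relationship types.
from dataclasses import dataclass

@dataclass
class Author:
    author_id: str
    display_name: str
    orcid: str = ""

@dataclass
class Article:
    pmid: str
    title: str
    year: int = 0

@dataclass
class Institution:
    ringgold_id: str
    name: str

@dataclass
class Journal:
    nlm_id: str
    title: str
    issn: str = ""

@dataclass
class FieldOfStudy:
    mesh_id: str
    term: str

# Relationships as typed edges: (source_id, relation, target_id, confidence)
Edge = tuple[str, str, str, float]

edges: list[Edge] = [
    ("author:42", "AUTHORED", "article:PMID123", 0.99),
    ("author:42", "AFFILIATED_WITH", "institution:ringgold:1", 0.87),
    ("author:42", "CO_AUTHOR_OF", "author:77", 0.95),
]
```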

 What Happens at Each Stage of the Pipeline
Gather reference data

These are externally available reference frames. We bring in standard identifiers (NLMID, ISSN, MeSH, and Ringgold identifiers) and use these known identifiers to build our reference framework. This is the non-article data.

Select Content

Next, we bring in our article data and select which content we want to process based on certain customer criteria. This involves both selecting the appropriate metadata to use and filtering for the domain of interest.

Create Distinct Authors

Subsequently, we create the list of distinct authors. This is the heart of the process, where we determine which of the authors represented in the article source data are actually distinct individuals and which variations of a name correspond to the same physical person.
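
Conceptually, this step compares author mentions and merges the compatible ones into clusters. The sketch below is a toy version with made-up thresholds and placeholder records; the production pipeline uses richer signals and records a confidence for every merge.

```python
# Sketch: merge author mentions whose names and affiliations look compatible (toy example).
from difflib import SequenceMatcher
from itertools import combinations

mentions = {
    "m1": ("Ralph S Baric", "University of North Carolina"),
    "m2": ("R S Baric", "Univ. of North Carolina"),
    "m3": ("Ralph A Baric", "Some Other Institute"),
}

parent = {m: m for m in mentions}  # union-find over mention ids
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in combinations(mentions, 2):
    name_sim = similar(mentions[a][0], mentions[b][0])
    affil_sim = similar(mentions[a][1], mentions[b][1])
    if name_sim > 0.6 and affil_sim > 0.6:  # thresholds are assumptions
        union(a, b)

clusters = {}
for m in mentions:
    clusters.setdefault(find(m), []).append(m)
print(list(clusters.values()))  # m1 and m2 merge; m3 stays separate
```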

Conduct Statistical Analysis

Next, we conduct a statistical analysis, both for quality assurance purposes and to calculate our level of confidence, or degree of belief.
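
For example, and purely as an illustration with made-up numbers, one way to turn a hand-labeled QA sample into a degree of belief is to estimate the precision of the merge decisions:

```python
# Sketch: estimate merge precision from a hand-labeled sample of mention pairs.
# The labels below are made up; they stand in for a real QA sample.
labeled_sample = [
    # (pipeline_said_same_person, reviewer_said_same_person)
    (True, True), (True, True), (True, False), (False, False), (True, True),
]

true_positives = sum(1 for predicted, actual in labeled_sample if predicted and actual)
predicted_positives = sum(1 for predicted, _ in labeled_sample if predicted)

precision = true_positives / predicted_positives if predicted_positives else 0.0
print(f"Estimated merge precision (degree of belief): {precision:.2f}")  # 0.75
```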

Produce Final Datasets

Finally, we produce the final datasets.

The final graph that we produce is the product of a knowledge system, a term we use to indicate that iterative refinement is built into our processing of the data, with the goal of obtaining knowledge. Our learning architecture sets the foundation for improving the quality of the data in the graph over time by quantifying each assertion and providing benchmarks of quality.

To keep learning, check out:

  • "The Data Quality Imperative" from CTO Babis Marmanis. He discusses the impact data quality has on knowledge production, with examples from our experience working with raw bibliographic metadata for the CCC COVID Author Graph.

Interested in knowing more about how CCC Expert View can help your organization identify experts and key opinion leaders? Learn more. 

Using a Knowledge Graph to Identify Researchers & Key Opinion Leaders
https://www.copyright.com/blog/using-knowledge-graphs-to-identify-researchers-key-opinion-leaders/ | 14 June 2022

Whether working in business development, human resources, medical affairs, clinical affairs, or competitive intelligence teams, people need to identify researchers and key opinion leaders (KOLs) in different therapeutic areas. The following are some of the reasons for this need: 

  • to speak on the organization's behalf about its products at conferences and educational events
  • to collaborate and form research partnerships
  • to identify candidates for hiring
  • to endorse products 

One way to identify researchers is from large corpora of scientific literature. But without the right tools, this can be challenging.

Here are some of the challenges our clients come across when working with bibliographic data:

Challenge 1: Entity Resolution and Author Disambiguation

How can you easily know who is who and what is what? If you look at a dataset like LitCovid, or even more broadly, PubMed, the data does not provide clear ways to distinguish one author from another. By our own research, under 3% of author records in PubMed provide a standard author identifier, such as ORCID. Even fewer records (<1%) provide any sort of standardized institutional ID.

Challenge 2: No Clear Entry Point

Large datasets that focus on a single domain can be daunting. For example, the National Library of Medicine's LitCovid corpus has added around 2,000 articles per week since May 2020. A dataset of this volume and growth rate often offers no clear point of entry. How does one even go about answering one's own questions?

Challenge 3: Manual, Word-of-Mouth Processes

Getting answers from large datasets requires the ability to systematically explore the data and pursue lines of inquiry through various connections. One of the problems we hear from customers today is that their current effort to find researchers who fit certain research profiles is highly manual and often a matter of chance (word of mouth, people they know).

Challenge 4: Answers without Explanations

Advanced analytical techniques can help to provide answers, but sophisticated analytical solutions also risk presenting answers without explanations.

Where Knowledge Graphs Play a Role 

It was the challenges above that provided CCC the impetus to take some experimental work we were doing in-house on data pipelines and build a COVID author graph. In April 2020, we released a prototype of a knowledge graph of authors who specialize in COVID and related fields of study. Today, it is the basis for CCC Expert View.

Knowledge graphs allow for the statistical calculation of accuracy and precision, so you can know how good, or how bad, your answers are. Graphs allow for systematic, nimble, and nuanced exploration of relationships and facilitate entry into the data through specific questions, like the ones below (a small exploration sketch follows the list):

  • Who are the top authors in a given domain area?  
  • Who is working with whom?
  • Which topics are being researched at a specific institution? 
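
As a toy illustration (not the Expert View implementation), questions like the first two can be answered directly from a co-authorship graph:

```python
# Sketch: answer "top authors" and "who works with whom" from a tiny co-authorship graph.
# The graph below is made-up data; Expert View runs on a much larger knowledge graph.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Author A", "Author B"), ("Author A", "Author C"),
    ("Author B", "Author C"), ("Author D", "Author A"),
])

# "Top" authors by number of distinct collaborators (one of many possible rankings).
top = sorted(g.degree, key=lambda pair: pair[1], reverse=True)[:3]
print("Most connected authors:", top)

# Who is working with whom?
print("Collaborators of Author A:", list(g.neighbors("Author A")))
```
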
To keep learning, check out: 
  • "The Data Quality Imperative" from CTO Babis Marmanis. He discusses the impact data quality has on knowledge production, with examples from our experience working with raw bibliographic metadata for the CCC COVID Author Graph.

Using Knowledge Graphs to Drive Decision Making
https://www.copyright.com/blog/using-knowledge-graphs-to-drive-decision-making/ | 20 November 2020

These comments were originally presented at the Outsell Signature Event on 12 November 2020, where CCC President & CEO Tracey Armstrong participated in a panel discussion with CCC EVP & CTO Babis Marmanis and moderator David Worlock on “Using AI to Create Collaboration, Partnership, and New Business Opportunities: Launching the CCC Knowledge Graph.”

2020 has provided more than its share of disruption and, in many parts of the information and publishing industry, leaders have risen to the occasion and introduced powerful new information tools and services to meet market demand. Those organizations that began 2020 with their data house in order were able to pivot quickly, provide new value, and produce strong margins. Their leaders demonstrated the importance of viewing disruption as an opportunity for transformation, reinvention and growth.

Among our top priorities as information industry leaders, we must focus on:

  • Investing in knowledge engineering and infrastructure. Having moved up from “nice-to-have” to “must-have,” well-managed data acquisition and curation, and knowledge production are essential to internal collaboration, partnership and innovation.
  • Using data-derived knowledge to inform decisions. The emergence of knowledge graphs and their intersection with AI is starting to inform decisions at the boardroom level.
  • Partnering in new ways around data opportunities. Because proprietary data is not sufficient, organizations are identifying new data partnerships to augment, complement, and enrich the data that produce the required knowledge.

We’re all data scientists now.
Ten years ago, companies were racing to hire data science specialists, experts who would tell us all kinds of things from our data once it was cleansed, normalized and enriched. But these data scientists were subject matter experts in the science of manipulating data to derive outcomes – and not subject matter experts in publishing, finance, chemistry, agriculture, energy, education, biology or the myriad of other sectors we serve in the information industry. Business goals must drive data investments, which means we’re all data scientists now. Every SME we hire at CCC today is trained with a data science orientation in mind, and our business leaders across the company, from marketing to product management to sales and beyond, have expanded their data skill sets.

Seeing data in new ways with knowledge graphs
We can no longer think of data and data analysis as separate from the core business, and that’s not just in publishing and the information industry. All business leaders are embracing new ways of understanding data through the use of knowledge graphs, for example, to gain valuable insights that can serve their customers, and using data in AI to drive innovation and position themselves well ahead of their competition.

Given our collective licensing role, CCC is actively engaged in the intersection of copyright and AI, having recently submitted comments to both the World Intellectual Property Organization and the U.S. Patent and Trademark Office to bring the voice of rightsholders into the conversation. And as an organization that builds both packaged software solutions and creates custom or bespoke technology solutions, we’re collaborating with our publisher and licensing partners to innovate.

Using knowledge graphs lets you supercharge your understanding of a particular area and gain unexpected insights. Knowledge is buried in documents and other unstructured data, and it is very hard to surface it at the right time. Knowledge graphs capture relationships between entities and make it easier to index, process, and find “nuggets of knowledge”. They help us think in a reference frame that grounds our interpretation of data and enable us to search for knowledge by using “things” rather than “strings”.

Investment in knowledge graphs pays off because they are not single-purpose models; they inform multiple decisions, not just one. Knowledge graphs are like living organisms: they are constantly changing, and we can continuously benefit from our investments as a result. When we use knowledge graphs in combination with Artificial Intelligence, our data output moves up from simply being "stated" to being "informed." This gives us greater confidence in the inferences we make from the data and provides a better ROI from our investments in graphs.

A surge in pandemic-related research created a corresponding surge in the number of scholarly journal articles in need of peer review, putting pressure on the scholarly publishing ecosystem. CCC’s investment in knowledge engineering enabled us to rapidly develop a data-driven solution — a knowledge graph of authors derived from bibliographic citations to address this major market need. The CCC Author Graph enables exploration of a collection of authors and experts in coronavirus-related research and the analysis of the interconnections between them, their publications, and areas of interest. This knowledge graph can aid in the identification of peer reviewers and understanding of the research landscape for coronavirus-related publications and has uses ranging from publication ethics to marketing analysis.

Establishing unconventional data partnerships.

Those leaders actively investing in data have seen the power of expanding and integrating data sets. With the eruption of the COVID-19 pandemic this year, we see the emergence of new dynamics in collaboration as the world rushes to find therapies, vaccines, and ways to cope with the novel coronavirus and its effects. Recent news about vaccines highlights how organizations that partner to share resources, including staff, systems, and data, are making incredible strides, and previously unanticipated data partnerships are happening, too. For example, we're seeing the combination of consumer and B2B data sets with municipal and federal government data in interesting and informative new ways.

Industry leaders making investments in knowledge engineering, identifying new opportunities for data partnerships, and leveraging new visual data analysis tools can put the power of data to good use. As we’ve learned in 2020, leaders must take steps to be ready for the next disruption, and investments in data readiness will serve them well.
