FAIR data play a crucial role in AI, creating a foundation for trust and success.

On September 18th, we co-hosted a special event in Leiden (Netherlands) on “the evolving role of data in the AI era.” The focus of the meeting was FAIR data and their importance in a world where the adoption of machine learning and AI is becoming ubiquitous. Despite the dreary weather, this was a vibrant event with distinguished speakers in the birthplace of the FAIR principles. Key ideas were discussed, actionable advice was shared, and overall it was a very successful and productive event. I left Leiden more convinced than ever of the crucial role of FAIR data in establishing trust in AI and making AI deployments a success.

In the context of scientific research, FAIR data will enable AI systems to access these data automatically, and at scale, and to perform analyses that would be impossible otherwise. If an AI system needs certain data, it can explore the “feasibility space” (I use this term here as one would in formulating an optimization problem) automatically, without the need for human guidance or review. The interoperability aspect of FAIR data ensures the appropriate semantic mapping between the system of reference of the calling AI system and that of the supplier of the FAIR data. The vast number of explorations that a fully autonomous AI system can perform will almost certainly increase the efficiency and effectiveness of any research task.
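To make this concrete, here is a minimal sketch of how an autonomous agent might discover and retrieve a FAIR dataset programmatically. The catalog URL, the search endpoint, and the metadata field names are all hypothetical placeholders of mine; a real deployment would rely on standard machine-readable metadata (e.g., DCAT or schema.org) and resolvable persistent identifiers such as DOIs.

```python
import requests  # pip install requests

# Hypothetical FAIR data catalog endpoint -- a placeholder, not a real service.
CATALOG_URL = "https://example.org/fair-catalog/search"

def find_datasets(query: str) -> list[dict]:
    """Findable: search the catalog's machine-readable metadata index."""
    resp = requests.get(CATALOG_URL, params={"q": query, "format": "json"})
    resp.raise_for_status()
    return resp.json()["results"]

def fetch_dataset(record: dict) -> bytes:
    """Accessible: retrieve the data via its persistent identifier
    over a standard protocol (plain HTTPS here)."""
    resp = requests.get(record["persistent_id"])
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    # Interoperable/Reusable: each record carries an explicit vocabulary
    # and license, so the calling agent can map terms into its own schema
    # and verify that reuse is permitted -- with no human in the loop.
    for record in find_datasets("protein solubility measurements"):
        if record.get("license") == "CC-BY-4.0":
            data = fetch_dataset(record)
            print(record["title"], len(data), "bytes")
```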

However, the value of FAIR data goes well beyond their direct application in data science pipelines. The major thrust in this new renaissance of AI is, of course, the use of generative AI systems and, in particular, the creation and adoption of large language models (LLMs). The net effect of the latter is that many more people, not just data scientists, will be able to direct AI systems to assist them with their work. No need for long and complex SQL queries, or other technical mumbo jumbo, anymore!

How do FAIR data come into the picture?

It is true that LLMs have demonstrated remarkable performance on a number of tasks, such as summarization, cloze testing, Q&A, and much more. Despite their potentially transformative impact, these new capabilities are not yet fully understood. It is my firm belief that LLMs will play a key role in the intelligent systems of the future, but they will be just one component of a larger architecture, one that takes advantage of the linguistic strengths that LLMs offer and ameliorates their deficiencies by other means. It is in the latter aspect of these architectures that FAIR data will be essential to the reliability and performance of these systems.

For example, in the architecture known as “In-Context Retrieval-Augmented Language Modeling” (RALM) (Ram et al., 2023), the language model itself remains unchanged, and the system simply prepends grounding documents to the input. It has been shown that in-context RALM, built on off-the-shelf general-purpose retrievers, provides surprisingly large gains across model sizes and diverse corpora, and that specializing the document retrieval and ranking mechanism to the RALM setting boosts performance further. In-context RALM therefore has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification, or even only via API access. It is not hard to imagine FAIR scientific data being used in a similar manner to augment or improve the performance of LLMs on specific tasks.
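As a concrete illustration, the sketch below captures the essence of the in-context RALM recipe: retrieve a few relevant documents, prepend them verbatim to the prompt, and call an unmodified LM. The toy word-overlap retriever and the `llm_generate` stub are placeholders of mine, not the retrievers or models evaluated by Ram et al.; a real system would plug in BM25 or a dense retriever and an actual LM API.

```python
# Minimal in-context RALM sketch: the LM is untouched; grounding
# documents are simply prepended to the input.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    Stands in for BM25 or a dense retriever in a real system."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to any off-the-shelf LM (local or via API)."""
    raise NotImplementedError("plug in your LM of choice here")

def ralm_answer(query: str, corpus: list[str]) -> str:
    docs = retrieve(query, corpus)
    # The key step: prepend the grounding documents to the query,
    # leaving the pretrained LM itself unmodified.
    prompt = "\n\n".join(docs) + "\n\nQuestion: " + query + "\nAnswer:"
    return llm_generate(prompt)
```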

In order to establish trust in AI systems, their developers should provide provenance, license, security, and other useful information in a transparent manner, through an open standard, so that the consumers of these systems can ascertain what data have been used in building the model. An AI system may be trained on data stemming from many sources, so it is imperative that a method for establishing trust and security in AI systems be developed, and the FAIR-ification of the training data will be essential for that purpose.
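To illustrate what such transparency might look like in practice, here is a hypothetical machine-readable provenance record for one training-data source. The field names are illustrative assumptions of mine rather than an existing open standard, though efforts such as model cards and datasheets for datasets point in this general direction.

```python
# Hypothetical provenance record for one training-data source.
# Field names are illustrative, not an established open standard.
training_data_record = {
    "dataset_id": "doi:10.1234/example-dataset",   # persistent identifier
    "title": "Example measurement corpus",
    "license": "CC-BY-4.0",                        # reuse conditions
    "provenance": {
        "creator": "Example Research Institute",
        "collected": "2022-06-01/2023-05-31",
        "method": "laboratory measurements, peer reviewed",
    },
    "security": {
        "checksum_sha256": "<hash of the exact files used>",  # placeholder
        "contains_pii": False,
    },
}
```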

Author: Haralambos Marmanis

Dr. Haralambos Marmanis is CCC’s Executive Vice President & CTO, where he is responsible for driving the product and technology vision as well as the implementation of all software systems at CCC. Babis has over 30 years of experience in computing and leading software teams. Before CCC, he was the CTO at Emptoris (IBM), a leader in supply and contract management software solutions. He is a pioneer in the adoption of machine learning techniques in enterprise software. Babis is the author of the book "Algorithms of the Intelligent Web," which introduced machine learning to a wide audience of practitioners working on everyday software applications. He is also an expert in supply management, co-author of the first book on Spend Analysis, and author of several publications in peer-reviewed international scientific journals, conferences, and technical periodicals. Babis holds a Ph.D. in Applied Mathematics from Brown University, and an MSc from the University of Illinois at Urbana-Champaign. He was the recipient of the Sigma Xi innovation award and an NSF graduate fellow at Brown.