From Research Overload to Actionable Insights: AI in Biomedical Data Extraction

AI in Biomedical Data Extraction, by Ioannis Mollas, Ph.D., Data scientist

Each day more than 10,000 English biomedical articles are published. This staggering amount of articles overwhelms academic researchers and professionals interested in staying updated on emerging trends and industry outcomes who do not have the time to read them through. One potential solution could be  generating metadata for each article for more efficient research. However, annotating each document with tags, categories, and relations requires a significant amount of human resources. Automated systems are the ideal tools to carry out such tasks by utilizing artificial intelligence (AI), in particular machine learning (ML). 

Machine learning enables researchers to access critical insights for each article, search for  a specific entity, such as a chemical or a food, within the article or uncover latent  associations between those entities and a specific medical condition. It also aids researchers in synthesizing findings from numerous studies quickly, discovering potential new therapies by identifying previously unnoticed connections, and revealing breakthroughs regarding the treatments of comorbidities.

| “Given a text, we want to extract triplets in the form <subject, predicate, object>!”

Among the various AI tools that can perform the above tasks, relation extraction (RE) is the most suitable approach for recognizing and assigning relations between two entities within a specific context. RE identifies triplets <source, relation, target> in the text.

PIPA accelerates scientific breakthroughs and discovery of actionable insights by focusing on two main pillars: supervised relation extraction and open information extraction, covering discoveries in diverse fields, and providing enterprise applications to partners. 

Leveraging current achievements in the cross-domain of natural language processing (NLP) and large language models (LLMs), supervised relation extraction can identify relationships in biomedical data with great precision and accuracy. Let’s take a closer look at how this technology works, explore its key methods, and decipher how it powers our state-of-the-art AI products.

Understanding Supervised Biomedical Relation Extraction

To understand how supervised relation extraction (RE) works, we present a scenario to clarify the dependencies and establish the necessary assumptions.

Figure 1: Example of extracting triplets from text
Figure 1: Example of extracting triplets from text

Given a piece of text, such as, a sentence, and the text’s entities annotated in predefined categories (e.g., chemicals, plants, foods, diseases), RE identifies pairs of entities and assigns them a relation based on a specific context. Figure 1 shows a sentence with a plant entity and various disease entities annotated. Supervised RE will examine all possible pairs between the source (plant) and target (disease) pairs to determine the correct relation in a specific context (sentence), in this case treat.

However, there are restrictions between the different entity pairs regarding the assigned relations. For example, a chemical-disease pair can be linked to a treat relation, while a chemical-food pair cannot. These limits are either determined by domain experts (such as PIPA’s Bioinformatics and Data Science teams) or can be inferred from publicly available datasets. In the case of supervised RE, the latter is the ideal option. 

Supervised RE works as follows: 

a) Given a dataset, which should contain at least the following: a sentence that works as the context of the triplet, the name of the source entity, the name of the target entity, the positions of both entities in the text, and the relations between the two entities, 

b) a ML model is trained given examples of pairs in a specific context (text) to predict a relation among a few predefined relations from this dataset, and 

c) the trained model can classify new pairs in their context to one of the predefined relations with which it was trained.

An example of a dataset’s example could be:

  • Sentence: Apium graveolens L. is a traditional Chinese medicine prescribed as a treatment for hypertension, gout, and diabetes.
  • Source Entity Name: Apium graveolens L.
  • Source Entity Position: (0, 19)
  • Target Entity Name: Hypertension
  • Target Entity Position: (84, 96)
  • Relation: Treats

Identifying high-quality datasets is challenging. Apart from the publicly available datasets of a high standard, PIPA enhances its models through in-house methodologies led by its expert team, ensuring the creation of top-tier datasets for training supervised ML models.

Leveraged Technologies for Biomedical Relation Extraction 

To effectively train supervised Relation Extraction (RE) models in the biomedical domain, we leverage LLMs, specifically transformer models pre-trained on biomedical corpora that are available in known online repositories like HuggingFace. This allows us to tap into the vast domain-specific knowledge and linguistic patterns found in biological literature. To ensure optimal performance, we continuously evaluate cutting-edge biomedical models to identify the most effective one. Additionally, we explore various training methods by reviewing the latest research on supervised relation extraction (RE), and selecting the approaches best suited to our specific tasks and data.

LEAP™: Revolutionizing Evidence Synthesis and Discovery

Our AI app, LEAP™, is an AI co-pilot for accelerated evidence synthesis and bioactive discovery. It  allows researchers and product developers to discover known and novel bioactives and their health effects and collect supporting evidence and Gen-AI powered summaries to inform their R&D efforts from literature reviews and hypothesis generation through product messaging and regulatory compliance.

Challenging the Future

Biomedical RE is expected to undergo major transformations by AI-driven advancements in the upcoming years. Future methods might include gathering new, high-quality data for emerging category pairs and relationship types or supplementing supervised RE models with explainability capabilities for additional details at the evidence synthesis level.  As we move forward, more sophisticated AI models, such as generative models that parse and analyze complicated medical texts with greater nuance and accuracy, will emerge. Such models might be able to assign relations between entities and provide further insights like reasoning to be supplemented for each decision.

At PIPA, we believe the future starts now. We’re on a mission to accelerate discovery and innovation in the food, nutrition, and health sectors through the power of AI. Stay tuned as we’ll dive into a real-world example that showcases how our technology is shaping the future of biomedical research. 

Subscribe to our newsletter to be notified when Part II is available.


  1. Check HuggingFace in https://huggingface.co/
  2.  More details about Pipa’s LEAP™ in https://pipacorp.com/leap-ai-platform/ 
Share post:

Want to stay up-to-date with our updates?

Subscribe to our newsletter and be the first to learn the latest developments in predictive AI.

Subscribe to our newsletter