Learn how unparalleled data, domain expertise, and technologies enable AI-powered solutions that are purpose-built for healthcare.


In April, we introduced IQVIA Medical Reasoning 8B (Med-R1-8B)—IQVIA’s 8 billion-parameter large language model (LLM) specialized in biomedical reasoning. We built Med-R1 to go beyond simply answering the user’s question, equipping it with capabilities to reason across a wide range of sources including electronic health records (EHR), clinical trials, FDA records, and biomedical literature.
Today, we’re excited to introduce the Med-R1 Deep Research Agent—an agentic system that builds on the power of Med-R1 8B, and goes a step further: it digs into multiple sources, pieces the information together, and gives you a clear, detailed answer instead of a quick reply. When we evaluated Med-R1 Deep Research Agent on benchmarks designed to test multi-hop medical reasoning, the results mirrored what we and many across the field have observed: even the most powerful models — including frontier systems from OpenAI, Google, and others — face significant challenges in connecting evidence across diverse biomedical sources.
These results underscore a broader reality: multi-stage biomedical reasoning remains an unsolved frontier — one that requires connecting evidence across heterogeneous and often incomplete sources to ensure that insights are not only accurate, but also verifiable and safe to use, a challenge where most large models struggle.
Nevertheless, within that context, our Med-R1 Deep Research Agent represents a step forward, showing how domain-specific models combined with an agentic framework can improve evidence synthesis, traceability, and trust. Therefore, our intent in sharing these findings is to contribute constructively to the broader conversation — helping to define what responsible progress looks like for the next generation of biomedical AI.
Built on the 8-billion-parameter Med-R1, our deep research agent competes with much larger systems: Perplexity's Sonar (built on Meta's Llama 3.3, 70 billion parameters) and frontier reasoning models from OpenAI (o3) and Google (Gemini 2.5 Pro), whose exact sizes are undisclosed but are widely estimated to lie in the hundreds-of-billions to roughly trillion-parameter range. On some multi-hop medical reasoning tasks, Med-R1 Deep Research Agent surpasses them. Our current evaluation highlights that domain-specific models, when paired with agentic strategies, can deliver more reliable, real-world insights than larger general-purpose LLMs, at a fraction of the cost.
In real-world healthcare decision-making, answering a clinical or regulatory question often requires navigating multiple sources, synthesizing scattered facts, and drawing multiple interdependent logical inferences. This is known as multi-hop reasoning, and it lies at the core of clinical research, drug evaluation, and medical policy.
Search (retrieval-only): a single-pass approach that reads the top results and answers without explicit planning, backtracking, or cross-source synthesis. Best suited to straightforward, 1-hop lookups (a minimal sketch of this mode follows below).
Deep Research (agentic): a multi-step workflow that plans the path, iteratively searches and refines queries, cross-checks across sources (e.g., ClinicalTrials.gov, FDA, PubMed), and decides when to stop based on evidence sufficiency. Built for 2–5+ hop, real-world tasks requiring synthesis and verification.
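For readers who prefer code, the sketch below shows how narrow the retrieval-only mode is: one query, one read, one answer, no second pass. The `search_web` and `answer_from_context` stubs are assumptions for illustration, not part of Med-R1's actual tooling.

```python
from typing import List


def search_web(query: str, k: int = 5) -> List[str]:
    """Hypothetical search tool: return the text of the top-k results."""
    raise NotImplementedError("plug in a real search backend")


def answer_from_context(question: str, context: List[str]) -> str:
    """Hypothetical LLM call: answer the question from the retrieved text."""
    raise NotImplementedError("plug in a real model call")


def retrieval_only_answer(question: str) -> str:
    """'Search' mode: a single retrieval pass with no planning,
    no follow-up queries, and no cross-source verification."""
    top_results = search_web(question)
    return answer_from_context(question, top_results)
```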
But there’s a catch: as agents are asked to perform more steps (or “hops”), their performance rapidly declines. Even state-of-the-art systems like Gemini 2.5 Pro DeepResearch drop below 10% accuracy on 5-hop tasks (see Table 1).
Deep research agents take a different approach. Rather than relying on single-shot retrieval, they plan, search iteratively, reflect on their progress, and decide when to stop. Our goal was to test whether Med-R1, augmented with this framework, could succeed in such a demanding context.
Figure 1: Med-R1 Deep Research Agent Workflow: A step-by-step illustration of how the Med-R1 Deep Research Agent plans, searches, reflects, and synthesizes results to produce a verified answer with citations
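Distilled to code, the workflow in Figure 1 amounts to a bounded plan-search-reflect loop with an explicit stopping decision. The sketch below is a simplification under assumed helper names (they are not Med-R1's internal API); the stubs only mark where the agent's tools and model calls plug in.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchState:
    """Everything the agent has gathered so far for one run."""
    question: str
    notes: List[str] = field(default_factory=list)
    queries_tried: List[str] = field(default_factory=list)


# Hypothetical tool/model stubs -- placeholders only.
def plan_queries(state: ResearchState) -> List[str]:
    raise NotImplementedError


def search_and_take_notes(query: str) -> List[str]:
    raise NotImplementedError


def evidence_is_sufficient(state: ResearchState) -> bool:
    raise NotImplementedError


def synthesize_answer(state: ResearchState) -> str:
    raise NotImplementedError


def deep_research(question: str, max_rounds: int = 5) -> str:
    """Plan -> search -> reflect, repeated until the evidence looks
    sufficient or the round budget runs out."""
    state = ResearchState(question=question)
    for _ in range(max_rounds):
        for query in plan_queries(state):            # plan the next hops
            state.queries_tried.append(query)
            state.notes.extend(search_and_take_notes(query))
        if evidence_is_sufficient(state):            # reflect: stop or keep digging?
            break
    return synthesize_answer(state)                  # cited answer built from the notes
```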
In this section, we illustrate step by step how the Med-R1 Deep Research Agent answers an example real-world query. Each figure maps to a stage of the workflow: question/prompt, iterative research cycles, reflective refinement, and the final answer (see Figure 1).
As shown in Figure 2, the first step consists of the user entering the research question in the Med‑R1 UI: “How do heat shock proteins respond to different training loads, and what implications does this have for periodization strategies?”. In other words, the question is asking how the body’s stress-protection proteins react to different workout intensities and what that means for planning effective training cycles.
Figure 2: Ask the Question
The system then moves from input to action. Med-R1 accepts the query and launches a research run (Figure 3). “Processing…” means the agent is initializing the workflow: allocating the run, loading tools, and preparing the first set of search queries.
Figure 3: Agent Run Starts
The agent first decomposes the prompt and generates targeted search queries (Figure 4). In this example, it frames variants around heat‑shock proteins (HSPs), training loads, athletes, and exercise adaptations to maximize recall before narrowing.
Figure 4: Generate Search Queries and Web Research Rounds
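As a rough sketch of this decomposition step (the prompt text and the `llm` callable below are assumptions, not Med-R1's actual prompts), the agent can simply ask the model for query variants and parse one query per line:

```python
from typing import Callable, List

QUERY_PROMPT = (
    "Decompose the research question below into 3-5 focused web search "
    "queries, one per line, each covering a different facet of the topic.\n\n"
    "Question: {question}\n"
)


def generate_search_queries(question: str, llm: Callable[[str], str]) -> List[str]:
    """Expand the question into query variants and deduplicate them."""
    raw = llm(QUERY_PROMPT.format(question=question))
    queries = [line.strip("-* ").strip() for line in raw.splitlines() if line.strip()]
    seen, unique = set(), []
    for q in queries:
        if q.lower() not in seen:
            seen.add(q.lower())
            unique.append(q)
    return unique


def fake_llm(prompt: str) -> str:
    """Toy stand-in for the model, showing only the expected output shape."""
    return (
        "heat shock protein response to resistance training load\n"
        "HSP70 and HSP27 expression after high vs low intensity exercise\n"
        "heat shock proteins and periodization in athletes"
    )


print(generate_search_queries(
    "How do heat shock proteins respond to different training loads?", fake_llm))
```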
Med-R1 orchestrates a parallel retrieval pass, expanding the query set and collecting the first batch of sources (peer-reviewed papers, reviews, articles). In the current example (Figure 4), the results include resistance‑training load effects on HSPs (e.g., HSP27, HSP70, β‑crystallin) and foundational reviews on HSPs and exercise. The agent extracts structured notes from each source.
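The retrieval pass parallelizes naturally. The sketch below assumes hypothetical `fetch_source` and `extract_notes` helpers (again, not Med-R1 internals) and keeps every note keyed by its source so later claims remain traceable:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fetch_source(url: str) -> str:
    """Hypothetical fetcher: return the text of one source (paper, review, page)."""
    raise NotImplementedError("plug in a real HTTP or corpus client")


def extract_notes(question: str, text: str) -> List[str]:
    """Hypothetical extractor: pull question-relevant findings out of one source."""
    raise NotImplementedError("plug in a real model call")


def parallel_retrieval(question: str, urls: List[str],
                       max_workers: int = 8) -> Dict[str, List[str]]:
    """Fetch and summarize several sources concurrently, keyed by URL so that
    every extracted note stays traceable to the document it came from."""
    notes_by_source: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = pool.map(fetch_source, urls)   # concurrent fetches
        for url, text in zip(urls, texts):
            notes_by_source[url] = extract_notes(question, text)
    return notes_by_source
```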
After reading, the agent reflects on what’s known vs. unknown, identifies gaps, and proposes follow‑ups (Figure 4). This reflection step guides the next search wave (e.g., training‑load thresholds, sex differences, and heat + exercise interactions).
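One way to frame the reflection step, assuming a hypothetical `llm` callable and a simple JSON contract (neither is Med-R1's actual interface), is as a structured verdict: is the evidence sufficient, and if not, which follow-up queries would close the gaps?

```python
import json
from typing import Callable, List, Tuple

REFLECTION_PROMPT = (
    "You are auditing research notes for the question:\n{question}\n\n"
    "Notes so far:\n{notes}\n\n"
    'Reply with JSON: {{"sufficient": true/false, "gaps": [...], "follow_up_queries": [...]}}'
)


def reflect(question: str, notes: List[str],
            llm: Callable[[str], str]) -> Tuple[bool, List[str]]:
    """Decide whether to stop, and if not, what to search for next."""
    reply = llm(REFLECTION_PROMPT.format(question=question, notes="\n".join(notes)))
    verdict = json.loads(reply)
    return bool(verdict.get("sufficient", False)), list(verdict.get("follow_up_queries", []))


def fake_llm(prompt: str) -> str:
    """Toy model reply, illustrating the expected shape only."""
    return json.dumps({
        "sufficient": False,
        "gaps": ["training-load thresholds", "sex differences", "heat + exercise interactions"],
        "follow_up_queries": ["HSP70 response training load threshold",
                              "sex differences heat shock protein exercise"],
    })


print(reflect("How do HSPs respond to training load?",
              ["HSP70 rises after high-intensity resistance work."], fake_llm))
```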
As highlighted in Figure 5, another brief reflection integrates the new evidence and outlines the answer: summarize which HSPs are induced by which loads, note the role of heat stress, and translate findings into periodization guidance (e.g., selecting intensities/volumes to elicit desired HSP responses).
Figure 5: Reflection and Synthesis Plan
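A deliberately naive illustration of turning that plan into a draft structure is to bucket the collected notes under the outline sections they will support. The sections and keyword routing below are invented for this example; in the agent, the model itself does the grouping, but the output shape is the same.

```python
from typing import Dict, List

# Example outline sections proposed at the reflection stage.
OUTLINE = [
    "HSP responses by training load",
    "Role of heat stress",
    "Implications for periodization",
]


def plan_synthesis(notes: List[str],
                   section_keywords: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each note to every outline section whose keywords it mentions."""
    plan: Dict[str, List[str]] = {section: [] for section in section_keywords}
    for note in notes:
        lowered = note.lower()
        for section, keywords in section_keywords.items():
            if any(k in lowered for k in keywords):
                plan[section].append(note)
    return plan


example_plan = plan_synthesis(
    notes=[
        "HSP70 increases markedly after high-load resistance sessions.",
        "Added heat stress amplifies the HSP response to endurance work.",
        "Blocks of higher intensity may be scheduled to elicit HSP-mediated adaptation.",
    ],
    section_keywords={
        OUTLINE[0]: ["hsp70", "load", "resistance"],
        OUTLINE[1]: ["heat"],
        OUTLINE[2]: ["periodization", "blocks", "scheduled"],
    },
)
print(example_plan)
```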
As shown in Figure 6, the final step consists of Med-R1 delivering a synthesized conclusion with inline citations; source links appear next to each claim, so the user can trace the evidence back to the original papers and reviews. In this example, the key takeaways summarize which HSPs are induced by which training loads and how those findings translate into periodization guidance.
Why this matters: citations are central to deep research—they enable verification, reproducibility, and quick follow-up reading.
Figure 6: Final Answer
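Rendering an answer with traceable citations can be as simple as numbering each claim's sources and appending a reference list. The claim/source pairs below are illustrative placeholders, not Med-R1's output format:

```python
from typing import Dict, List, Tuple


def render_with_citations(claims: List[Tuple[str, List[str]]]) -> str:
    """Turn (claim, [source_url, ...]) pairs into text with [n] markers
    plus a numbered source list, so every statement stays traceable."""
    source_ids: Dict[str, int] = {}
    lines: List[str] = []
    for claim, urls in claims:
        markers = []
        for url in urls:
            if url not in source_ids:
                source_ids[url] = len(source_ids) + 1
            markers.append(f"[{source_ids[url]}]")
        lines.append(f"{claim} {''.join(markers)}")
    references = [f"[{i}] {url}" for url, i in source_ids.items()]
    return "\n".join(lines + ["", "Sources:"] + references)


print(render_with_citations([
    ("HSP70 expression rises with higher training loads.",
     ["https://example.org/hsp-resistance-training-study"]),
    ("Added heat stress amplifies the exercise-induced HSP response.",
     ["https://example.org/heat-and-exercise-review"]),
]))
```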
MedBrowseComp is the first purpose-built benchmark to evaluate deep research agents in the biomedical domain. Released in 2025 by researchers from Harvard, MIT, and others, it contains over 1,000 physician-curated questions that simulate real-world clinical research tasks.
These aren’t synthetic trivia questions—they mimic the workflows of actual medical analysts and researchers. For example: “Identify a 55-year-old male patient with [DISEASE] → find clinical trials within 20 miles of [Zip Code] that are currently recruiting → select the trial that is geographically nearest → retrieve its eligibility criteria → check whether a myocardial infarction 5 months ago excludes the patient.”
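To make the hop structure explicit, here is a toy version of that chain; the trial records, distances, and exclusion rule are invented for illustration, and nothing below queries ClinicalTrials.gov:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    trial_id: str
    recruiting: bool
    distance_miles: float           # distance from the patient's zip code
    excludes_recent_mi_months: int  # an MI within this many months is excluded


def nearest_eligible_trial(trials: List[Trial], months_since_mi: int,
                           max_distance: float = 20.0) -> Optional[Trial]:
    """Hop 2: filter to recruiting trials within range.
    Hop 3: pick the geographically nearest.
    Hops 4-5: check whether the patient's prior MI triggers an exclusion."""
    in_range = [t for t in trials if t.recruiting and t.distance_miles <= max_distance]
    if not in_range:
        return None
    nearest = min(in_range, key=lambda t: t.distance_miles)
    if months_since_mi < nearest.excludes_recent_mi_months:
        return None  # excluded by the eligibility criteria
    return nearest


# Toy data: the patient had an MI 5 months ago.
candidates = [
    Trial("NCT-A", recruiting=True, distance_miles=12.0, excludes_recent_mi_months=6),
    Trial("NCT-B", recruiting=True, distance_miles=18.5, excludes_recent_mi_months=3),
    Trial("NCT-C", recruiting=False, distance_miles=4.0, excludes_recent_mi_months=6),
]
# Nearest recruiting trial within 20 miles is NCT-A, which excludes MI < 6 months -> None
print(nearest_eligible_trial(candidates, months_since_mi=5))
```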
We evaluated Med-R1 Deep Research agent on a 50-question subset of MedBrowseComp and compared it against leading deep research systems, including Llama3 8B from Meta, Gemini 2.5 Pro from Google, Sonar from Perplexity, and o3 from OpenAI. For fairness and reproducibility, results for comparative systems—including o3, Gemini 2.5 Pro, and Sonar—are re-published directly from the official MedBrowseComp benchmark paper (Chen et al., 2025), ensuring all model comparisons reference standardized evaluation settings and scoring criteria.
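Scoring follows the benchmark's breakdown of 10 questions per hop depth (50 in total). The aggregation behind the cells in Table 1 is simple; the per-question correctness values in the snippet are placeholders, not our actual run logs:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def score_by_depth(results: List[Tuple[int, bool]]) -> Dict[str, str]:
    """Aggregate (hop_depth, is_correct) pairs into 'x/n' cells like Table 1."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for depth, is_correct in results:
        total[depth] += 1
        correct[depth] += int(is_correct)
    cells = {f"{d}-hop": f"{correct[d]}/{total[d]}" for d in sorted(total)}
    cells["Total"] = f"{sum(correct.values())}/{sum(total.values())}"
    return cells


# Placeholder correctness values for a 30-question illustration.
fake_results = ([(1, True)] * 10 + [(2, True)] * 6 + [(2, False)] * 4
                + [(3, True)] * 3 + [(3, False)] * 7)
print(score_by_depth(fake_results))
# {'1-hop': '10/10', '2-hop': '6/10', '3-hop': '3/10', 'Total': '19/30'}
```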
| Question Depth | o3 (search) | o3 (deep) | Gemini 2.5 Pro (search) | Gemini 2.5 Pro (deep) | Sonar (search) | Sonar (deep) | Llama 3 (search) | Llama 3 (deep) | Med-R1 (search) | Med-R1 (deep) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1-hop (n=10) | 10/10 | 10/10 | 10/10 | 10/10 | 8/10 | 10/10 | 8/10 | 8/10 | 8/10 | 10/10 |
| 2-hop (n=10) | 5/10 | 6/10 | 4/10 | 8/10 | 2/10 | 5/10 | 2/10 | 3/10 | 5/10 | 6/10 |
| 3-hop (n=10) | 1/10 | 3/10 | 1/10 | 3/10 | 0/10 | 3/10 | 0/10 | 0/10 | 1/10 | 3/10 |
| 4-hop (n=10) | 3/10 | 5/10 | 0/10 | 3/10 | 0/10 | 2/10 | 0/10 | 0/10 | 1/10 | 2/10 |
| 5-hop (n=10) | 0/10 | 1/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 | 0/10 |
| Total (n=50) | 19/50 | 25/50 | 14/50 | 24/50 | 10/50 | 20/50 | 10/50 | 11/50 | 15/50 | 21/50 |
Table 1: Small model, big results. With only 8B parameters, Med-R1 Deep Research delivers 21/50, outperforming Sonar (20/50) and Llama 3 8B (11/50) while landing within 3–4 answers of o3 (25/50) and Gemini 2.5 Pro (24/50).
As seen across all models, 4-hop and 5-hop questions remain largely beyond the state of the art. Med-R1's “Search” variant answered 15 out of 50 questions correctly, demonstrating its strength as a lightweight, single-pass retriever for simpler biomedical queries and surpassing the search accuracy of Llama 3 and Sonar. In 1-hop tasks, it matched Llama 3 and Sonar. At 2-hop depth, Med-R1 Search achieved 5/10, matching o3 and outperforming Llama 3, Sonar, and Gemini 2.5 Pro. In 3-hop tasks, Med-R1 Search again outperformed Llama 3 and Sonar while achieving results comparable to the best systems, including Gemini 2.5 Pro and o3.
Med-R1's “Deep Research” variant answered 21/50 questions correctly, outperforming Llama 3 and Sonar on deep research tasks. In 1-hop and 2-hop tasks, Med-R1 was on par with the best systems. In 3-hop tasks, where real multi-step reasoning kicks in, Med-R1 Deep Research matched o3, demonstrating its capability to handle multi-step biomedical reasoning with competitive accuracy. In 4-hop tasks, Med-R1 Deep Research achieved results comparable to Sonar and showed consistent gains over its retrieval-only counterpart, Med-R1 Search.
Figure 7: Comparative Performance of Med-R1 Deep Research vs. Frontier Models on MedBrowseComp. Med-R1 (8B) achieves competitive accuracy across 1- to 5-hop biomedical reasoning tasks. Results for comparative systems (o3, Gemini 2.5 Pro, and Sonar) are reproduced directly from the MedBrowseComp benchmark paper (Chen et al., 2025).
Unlike o3, Sonar, and Gemini 2.5 Pro, which are large-scale frontier systems with tens or even hundreds of billions of parameters, Med-R1 is a lean 8B-parameter model purpose-built for biomedical tasks. Its strong performance underscores how domain-specific models, when paired with intelligent agentic strategies, can rival, and in some cases outperform, much larger general-purpose models that rely primarily on scale.
The Med-R1 Deep Research Agent is being developed as part of IQVIA's broader agentic automation strategy. Indeed, IQVIA hosts and curates biomedical data sources from across the world, with over 200 million documents processed by our proprietary language models, including scientific literature, clinical trial records and results, full-text patents, drug labelling information, and much more. These sources are ready to use for high-performance search with deep research models such as Med-R1 Deep Research.
As this work progresses, we acknowledge the need for broader and more diverse benchmarks to rigorously evaluate Deep Research performance. While MedBrowseComp has been instrumental as the first benchmark for multi-hop medical reasoning, it remains limited in scope: it relies on automated judging rather than full clinical review and excludes broader domains such as sports science, genomics, and multimodal reasoning, among others. Future evaluations will need to address these gaps to reflect real-world biomedical research workflows more comprehensively.
As we continue to refine the Med-R1 Deep Research Agent and its evaluation framework, we remain deliberate and transparent in how we compare its performance to general-purpose, out-of-the-box models. Our goal is to better understand where agentic, domain-specialized reasoning provides tangible value, and where important challenges remain. This measured approach, grounded in collaboration across IQVIA and the wider scientific community, ensures that our development remains responsible and aligned with the real needs of healthcare and life sciences.
Contact us at AppliedAIScienceInfo@IQVIA.com for more information and to discuss how Med-R1 8B Deep Research Agent or other IQVIA AI solutions can help you solve your most pressing problems in Healthcare and Life Sciences with Healthcare-grade AI®.
References
Chen, S., Moreira, P., Xiao, Y., Schmidgall, S., Warner, J., Aerts, H., Hartvigsen, T., Gallifant, J., and Bitterman, D.S. MedBrowseComp: Benchmarking Medical Deep Research and Computer Use, 2025. URL https://arxiv.org/abs/2505.14963.