Navigating Data Complexities to Ensure Accuracy and Reliability
Part 2: Real World vs. Synthetic Data
Chris Beggs, Sr. Director of Product and Strategy, Information Solutions, IQVIA
Dec 10, 2025

As previously noted, data integrity is critical in the life sciences industry. When an organization makes decisions that affect patient safety, drug efficacy, and regulatory compliance, it must rely on data and analytics that it can trust and verify. Synthetic data (artificially generated datasets) is increasingly used in tandem with real-world data (collected from actual experiments or patients), and analysts and decision makers find it increasingly difficult to tell the two apart. Yet making commercial decisions based on accurate information has never been more crucial.

Real-World Data vs. Synthetic Data

Synthetic data, which is generated through modeling, is useful in specific scenarios but can also pose significant risks. It may not accurately represent real-world patient experiences and can introduce unintended biases. It is therefore essential to have enough transparency to know when resource allocation and engagement decisions are based on synthetic rather than real-world data.

Synthetic data is only as good as the quality and recency of the data used to train the model, and it may not capture anomalies or outlier events that are critical for decision-making. Relying heavily on synthetic data can lead to inaccurate conclusions and actions, so any findings or models derived from it should be tested and validated against real data by an analyst who specializes in comparing modelled results with reality. This validation supports a more realistic interpretation of what real-world data implies compared with a modelled dataset.

Key Differences When Comparing Real to Synthetic Data

Data Origin - Real vs. Synthetic

Real data is collected from actual clinical studies, patient engagements, or laboratory observations by sources that can be validated. Synthetic data is generated by AI or algorithms that simulate real data patterns.

Authenticity - Authentic or Simulated

Real data contains actual variability, such as noise, uncommon outliers, and errors. Synthetic data may smooth out these irregularities and miss a rare anomaly that real data would reveal.

Privacy Risk - High vs. Low

Real data may include personal identifiers that require the implementation of a tokenization process, while synthetic data is based on modelled individuals. While artificial people can lower overall privacy risk, they also increase the risk of inaccurate business projections.

Bias - Inherent vs. Inherited

Real data should reflect actual population biases, which supports a better understanding of market dynamics. Synthetic data can inherit biases from its source data, and poor model training can lead to false outputs.

Usage - Evidence vs. Simulation

Real, validated data is required for all regulatory submissions and the definitive analyses needed for accurate decision making. Synthetic data can be used for simulations, model training, and augmenting real datasets, but it should never become the standalone proof required for drug approvals.

Techniques That Can Identify Synthetic Data Models
  1. Statistical Fingerprinting and Data Profiling - Profile the data statistically to see if it behaves like real-world data. This involves comparing distributions, summary statistics, and looking for anomalies in the dataset.
  2. Data Visualization and Pattern Recognition - Human intuition should always play a role in pattern recognition as visualizing data can quickly expose anomalies that numbers alone might miss. Analysts are able to use visualization to complement statistical tests.
  3. Machine Learning and AI-Based Detection - As synthetic data generation has grown more sophisticated (especially with AI techniques like GANs), machine learning itself can be employed to discern real from synthetic.
  4. Watermarking and Metadata Cues - A newer approach to differentiation involves cryptographic or algorithmic watermarks – essentially tagging synthetic data at its creation so it can be identified later.
  5. Contextual and Domain Knowledge - An analyst’s domain knowledge and intuition are critical in life sciences, because certain data patterns are plausible only when the data is real and verifiable.
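
As a rough illustration of the first technique, the sketch below profiles two samples and computes a two-sample Kolmogorov-Smirnov statistic, which is the largest gap between their empirical cumulative distributions. The function names and the days-on-therapy values are hypothetical and made up for illustration; they are not drawn from any real dataset or from a method described in this post. In practice an analyst would more likely use a library routine such as `scipy.stats.ks_2samp`, which also reports a p-value.

```python
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical cumulative
    distribution functions: 0.0 means identical, 1.0 means no overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap always occurs at an observed value, so checking
    # every distinct pooled value is sufficient.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def profile(values):
    """Summary statistics for a quick side-by-side comparison."""
    return {
        "n": len(values),
        "mean": mean(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical days-on-therapy values (illustrative only).
real = [12, 30, 31, 29, 90, 28, 30, 33, 7, 180, 30, 27]
synthetic = [29, 30, 31, 30, 32, 28, 30, 31, 29, 30, 31, 30]

print("real:     ", profile(real))
print("synthetic:", profile(synthetic))
print("KS statistic:", round(ks_statistic(real, synthetic), 3))
```

A much narrower spread and missing extremes in one sample, as in the made-up synthetic values here, is the kind of signal that would prompt a closer look. It is not proof on its own that a dataset is synthetic.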

Understanding Data Imputation and Supplier Restrictions

Understanding the percentage of records with imputed payer, healthcare provider (HCP), or patient information is important for several reasons. Imputation involves using business rules, logic, or modeling to fill in missing data points, such as patient ID, HCP, or payer information. The accuracy of these imputations heavily depends on the quality of the business rules and models used. If the imputation logic is flawed, the resulting data may not accurately represent real-world patient experiences, leading to potential misinterpretations and misguided actions.
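
A minimal sketch of how that percentage might be tracked is shown below. It assumes each record carries boolean flags marking which fields were imputed; the flag names (`payer_imputed`, `hcp_imputed`, `patient_imputed`) and the sample claims are hypothetical, since the actual layout depends on the data provider.

```python
from collections import Counter

# Hypothetical flag names; a real feed may expose imputation differently.
FIELDS = ("payer_imputed", "hcp_imputed", "patient_imputed")

# Made-up claim records for illustration only.
claims = [
    {"claim_id": "C1", "payer_imputed": False, "hcp_imputed": False, "patient_imputed": False},
    {"claim_id": "C2", "payer_imputed": True, "hcp_imputed": False, "patient_imputed": False},
    {"claim_id": "C3", "payer_imputed": True, "hcp_imputed": True, "patient_imputed": False},
    {"claim_id": "C4", "payer_imputed": False, "hcp_imputed": False, "patient_imputed": False},
]


def imputation_rates(records, fields=FIELDS):
    """Return the percentage of records flagged as imputed, per field."""
    if not records:
        return {field: 0.0 for field in fields}
    counts = Counter()
    for record in records:
        for field in fields:
            # A missing flag is treated as "not imputed" in this sketch.
            if record.get(field, False):
                counts[field] += 1
    return {field: 100.0 * counts[field] / len(records) for field in fields}


rates = imputation_rates(claims)
for field, pct in rates.items():
    print(f"{field}: {pct:.1f}%")
```

Reporting these rates alongside any analysis makes it easier to see how much of a result rests on imputed rather than observed values.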

Moreover, many data suppliers impose restrictions on the use of HCP and payer information. If customers directly or indirectly assign HCPs or payers to claims where those fields were not populated in the inbound data at the time of delivery, this could be seen as a loose interpretation of supplier restrictions. It could also jeopardize the long-term availability of the data, as suppliers may terminate their relationship with the data provider if they believe their restrictions are being violated.

For example, a specialty pharmacy supplier may restrict the use of payer information on its data and raise concerns about whether payers are being imputed or assigned, even indirectly, to its claims. Such concerns can lead to disputes or disruptions with suppliers, so it is essential to use imputed data judiciously and to assess the quality of the imputation logic regularly, both to maintain data integrity and to comply with supplier guidelines. Data imputation can be useful, but its limitations and potential inaccuracies must be recognized.

Chart showing life sciences data complexity, synthetic data shortcuts, supplier loss, and IQVIA approach.

This chart highlights the shortcuts that aggregators are taking to cover gaps created by supplier loss. These shortcuts include buying data from other aggregators, using synthetic data, loosely interpreting data supplier requirements, and over-relying on technology platforms. IQVIA prioritizes sourcing data close to the patient experience, integrating privacy and governance tightly, and using technology to complement data delivery.

Conclusion

When life sciences organizations make decisions that are driven by data, those decisions can impact everything from patient outcomes to regulatory approvals. To maintain data integrity, it is the responsibility of both the end user and the data supplier to ensure the data they are using is authentic and reliable no matter what the source. Misinterpreting any data type can lead to biased conclusions, poor decision making, compliance issues, and ultimately lost revenue.

When it comes to real versus synthetic data, the key to success is credibility. That means making sure that all stakeholders (analysts, regulators, clinicians) have a clear line of sight into data usage and understand which data are real and which are synthetic, so that confidence in data reliability is maintained.

Ensuring Data Authenticity in Life Sciences

In life sciences, integrity, authenticity, and reliability of data are paramount, and the need for robust data supply, governance and delivery practices has never been more critical. Part 1 of this blog series delves into the intricacies of data authenticity, highlighting key considerations and best practices that leaders should be aware of when assessing a reliable data partner.
