
    Who's Training on Your Trial Data?

    Where Your Data Goes When You Use AI in Clinical Research

    Storm Stillman

    CEO, Curebase

    Storm Stillman presents "Who's Training on Your Trial Data?" at SCOPE Summit 2026

    Key Takeaways

    • A sponsor with 50 CRAs making three AI prompts a day creates 150 daily exposures, and that is before counting CRCs, medical monitors, or site staff.
    • The same AI model accessed through different product tiers has completely different data security properties.
    • A single AI voice agent call can route patient data through five separate vendors, each with different retention policies.
    • "Zero data retention" is a contractual policy, not a technical guarantee.
    • Local-first data processing can reduce data exposure by 98% while delivering the same functional outcome.

    In the last 30 days, how many people at your organization have pasted clinical trial data into ChatGPT? A protocol section, site notes, a patient query. The honest answer is probably more than you think. That act has become second nature, almost like running a Google search. But it is nothing like a Google search.

    Even at organizations with formal AI policies in place, the gap between what is supposed to happen and what actually happens is enormous. A 2025 workforce survey found that 66% of life sciences professionals use unauthorized AI tools on a weekly basis, and 85% of employees who have access to approved AI tools also use unapproved ones alongside them.

    How cloud AI actually works (and what most people miss)

    When you type a prompt into ChatGPT, Claude, or Gemini, your text makes a round trip to the AI provider's servers. It seems simple: your device sends a question, the cloud sends back an answer. What most people do not think about is what happens to that data after the answer comes back.

    On consumer-tier products, it could be stored for 30 days or longer. It could be used to train the next version of the model. Employees at the provider might be reviewing your prompts for quality and safety. And once data enters a training dataset, there is no mechanism to extract it.

    OpenAI's free, Plus, and Pro tiers use conversation data for model training by default. Anthropic reversed its previous stance in 2025 and now trains on consumer Claude conversations by default as well.

    In an ideal setup (and this is what we do at Curebase), you have an enterprise AI account with a Business Associate Agreement in place. Data controls are enabled so the provider is not training on your inputs. You have audit logging so there is some accountability.

    But the reality at most organizations looks different. Many people do not have a BAA in place. They are using a free or cheaper tier where data trains models by default. Sometimes it is on a personal laptop or a personal account.

    Data Exposure Calculator

    A worked example of how quickly AI data exposures scale across an organization: 50 CRAs making 3 AI prompts per day is 150 daily exposures. Over 250 working days, that is 37,500 exposures per year. And that only counts CRAs; it doesn't include CRCs, medical monitors, or site staff.
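    The arithmetic behind those figures is simple enough to sanity-check in a few lines. A minimal sketch, treating the head-count and prompt rate as adjustable assumptions rather than measurements:

    ```python
    # Exposure math from the example above: each prompt that carries trial
    # data to a cloud AI provider counts as one exposure.
    # The staff count and prompt rate here are illustrative assumptions.

    def daily_exposures(staff: int, prompts_per_day: int) -> int:
        """Exposures created per working day by one role."""
        return staff * prompts_per_day

    def annual_exposures(staff: int, prompts_per_day: int, working_days: int = 250) -> int:
        """Exposures created per year by one role."""
        return daily_exposures(staff, prompts_per_day) * working_days

    print(daily_exposures(50, 3))    # 150 daily exposures for 50 CRAs
    print(annual_exposures(50, 3))   # 37,500 per year, CRAs alone
    ```

    Adding CRCs, medical monitors, and site staff means summing another `annual_exposures(...)` term per role, which is why the real organizational number is a multiple of this one.
    
    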

    This is not a theoretical risk

    $670,000

    Average additional cost of a data breach linked to shadow AI, compared to traditional breaches. (IBM 2025 Cost of a Data Breach Report)

    In January 2026, the Acting Director of CISA, the person responsible for cybersecurity across the entire U.S. government, uploaded sensitive government documents marked "For Official Use Only" to the public version of ChatGPT. If the person leading America's cybersecurity agency can make this mistake, you can imagine what is happening at clinical research sites every single day.

    In 2023, Samsung engineers pasted proprietary semiconductor source code into ChatGPT on three separate occasions within 20 days. In July 2025, a ChatGPT user asking about woodworking sandpaper received another person's LabCorp drug test results in the response, pointing to cross-session data contamination.

    Only 25%

    of biopharma executives report having established AI governance structures. (2024 Deloitte survey)


    Why clinical trial data is uniquely dangerous to expose

    Clinical trial data carries dual sensitivity that makes it different from most corporate information. It is both protected health information under HIPAA and competitively valuable intellectual property. Protocol designs reveal novel endpoints. Patient screening notes contain identifiable health data. Enrollment dashboards telegraph site performance.

    The regulatory environment is tightening fast. The FDA issued draft guidance in January 2025 establishing a risk-based credibility assessment framework for AI. The FDA and EMA jointly published ten guiding principles for AI in drug development in January 2026. ICH E6(R3), adopted in January 2025, introduced new data governance requirements that apply to any system processing trial data, AI included. And HHS proposed the first major update to the HIPAA Security Rule in 20 years in late 2024, mandating that organizations include AI tools in their security risk analyses.

    The voice agent pipeline most people don't know about

    When a patient calls an AI-powered prescreening line and has a five-minute conversation, that feels like one product. In reality, that single call might touch five different vendors: an orchestration layer, a telephony provider, a speech-to-text service, a language model, and a text-to-speech engine. One phone call, five vendors. Your data went to every single one of them.


    "Zero data retention" sounds reassuring — but it's a contractual policy, not a technical guarantee. Your data still physically travels to their servers during processing, and policies can change with 30 days' notice.
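    One way to make the audit burden concrete is to model the call path as data. A sketch of the five hops described above; the vendor names and retention values are placeholders, not any real product's stack:

    ```python
    from dataclasses import dataclass

    @dataclass
    class PipelineHop:
        role: str       # function this hop performs in the call
        vendor: str     # placeholder -- actual vendors vary by product
        retention: str  # placeholder -- must be pulled from each vendor's DPA

    # One patient call traverses every hop. Each hop is a separate
    # sub-processor relationship with its own retention policy to audit.
    CALL_PIPELINE = [
        PipelineHop("orchestration", "<orchestrator>", "unverified"),
        PipelineHop("telephony", "<telephony provider>", "unverified"),
        PipelineHop("speech-to-text", "<stt service>", "unverified"),
        PipelineHop("language model", "<llm provider>", "unverified"),
        PipelineHop("text-to-speech", "<tts engine>", "unverified"),
    ]

    print(len(CALL_PIPELINE))  # 5 vendors touched by a single call
    ```

    The point of writing it down this way: a "zero data retention" claim from the product you bought says nothing about the other four rows.
    
    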

    The security spectrum: not all AI is created equal

    The industry conversation too often treats AI data security as binary: AI is risky, or AI is fine with a BAA. The reality is a spectrum with dramatically different risk profiles at each level. The critical insight is that the same AI model accessed through different product tiers has completely different data security properties. GPT-4 through ChatGPT Free trains on your input. GPT-4 through Azure OpenAI never does. The model is the same. The data handling is not.


    AI provider comparison

    The following table summarizes how the major AI providers differ on the compliance dimensions that matter most for clinical trial data. This is current as of early 2026, but policies change frequently. Verify directly with providers before making decisions.


    A few things stand out: AWS Bedrock stores zero data by default and offers self-service BAA activation. Azure OpenAI includes a BAA by default and never uses customer data for training. On the other end, both OpenAI and Anthropic train on consumer-tier data by default, and neither offers a BAA below their enterprise tiers.

    Data minimization: the 98% solution

    Consider a common clinical trial workflow: using AI to prescreen patients by scanning their medical records. A typical cloud-based approach uploads a 50-page medical record to extract maybe three to five fields that are actually relevant to eligibility criteria.

    A local-first approach processes those records on the device, extracts only the fields that matter, and sends just those values to the cloud. That is a 98% reduction in data exposure with the exact same functional outcome. The raw medical records never leave the device.

    98% reduction

    In data exposure by processing records locally and only sending the specific fields you need. Same functional outcome.

    This principle applies well beyond patient prescreening. Any workflow where AI is used to extract or summarize specific information from larger documents is a candidate for data minimization. The question: does the full document need to leave my environment, or can I process it locally and only send what I actually need?
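    A minimal sketch of that local-first pattern, assuming hypothetical eligibility fields and a toy key-value record format. In practice the local extraction step would be a small on-device model or rules engine, but the data-flow property is the same: only the extracted fields are ever eligible to leave the device.

    ```python
    # Hypothetical eligibility criteria -- stand-ins for a real protocol's fields.
    ELIGIBILITY_FIELDS = {"age", "diagnosis_code", "hba1c"}

    def extract_fields(record_text: str, wanted: set[str]) -> dict[str, str]:
        """Parse the full record locally and return only the wanted fields."""
        extracted = {}
        for line in record_text.splitlines():
            key, _, value = line.partition(":")
            if key.strip() in wanted:
                extracted[key.strip()] = value.strip()
        return extracted

    record = "name: Jane Doe\nage: 57\ndiagnosis_code: E11.9\nhba1c: 8.2\nnotes: ..."

    # Only `payload` (three small values) is sent to the cloud.
    # The raw record, including name and notes, never leaves the device.
    payload = extract_fields(record, ELIGIBILITY_FIELDS)
    print(payload)
    ```

    The design choice worth noting: minimization happens before any network call, so it holds regardless of what the downstream provider's retention policy says.
    
    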


    Where this is heading

    Today, sponsors work with many different vendors, and each vendor has its own AI stack, its own providers, its own retention policies. Auditing AI data security across a trial means understanding dozens of different vendor relationships, sub-processors, and policy documents. It is not scalable.

    But there is a better model emerging. Large pharmaceutical companies already have significant cloud infrastructure. Why not host their own LLMs on that infrastructure? They could give vendors API keys and credentials to access the sponsor's models, maintain a single security policy, a single audit trail, and keep all data within their own environment.

    The technical foundations already exist. Open-source models like Meta's Llama and Mistral perform competitively with proprietary alternatives. Cloud platforms like AWS Bedrock and Azure OpenAI support private model deployment with enterprise-grade security. The MELLODDY consortium brought ten major pharmaceutical companies together to train models on combined data without any company sharing its raw datasets.
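    What the vendor side of that arrangement might look like, sketched against the OpenAI-compatible chat request shape that most private deployments (Azure OpenAI, vLLM-served Llama, and similar) expose. The endpoint URL, API key, and model name here are entirely hypothetical:

    ```python
    import json

    # Hypothetical values: in this model, the sponsor hosts the endpoint
    # and issues the key, so prompts stay inside the sponsor's environment.
    SPONSOR_ENDPOINT = "https://llm.sponsor.example/v1/chat/completions"
    VENDOR_API_KEY = "key-issued-by-sponsor"

    def build_request(prompt: str, model: str = "sponsor-hosted-llama") -> dict:
        """Assemble the request a vendor would send to the sponsor's model.

        Any trial data in `prompt` travels only to sponsor infrastructure,
        never to a third-party AI provider.
        """
        return {
            "url": SPONSOR_ENDPOINT,
            "headers": {"Authorization": f"Bearer {VENDOR_API_KEY}"},
            "body": json.dumps({
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            }),
        }

    req = build_request("Summarize protocol deviations for site 014.")
    print(req["url"])
    ```

    Because the request shape is the industry-standard one, vendors need little or no code change to point at a sponsor-hosted model; what changes is who controls the logs, the retention, and the access keys.
    
    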

    This is the direction the industry is heading. One policy to audit instead of dozens. One security framework instead of a patchwork. One set of access controls instead of hoping every vendor in the chain is doing the right thing.

    Three questions to start asking your vendors today

    If you take nothing else away from this piece, start asking these three questions of every AI vendor in your clinical trial ecosystem:


    The bottom line

    The clinical trial industry does not need to avoid AI. The efficiency gains are real and significant. But AI needs to be deployed with the same rigor applied to any other system that handles regulated data. That means enterprise-grade tools with signed BAAs, data minimization architectures that limit exposure by design, governance frameworks with real enforcement, and honest conversations about where data actually goes.

    Stop training your competitors' AI with data you would not email to their CEO.

    Learn how Curebase approaches AI data security

    Local-first AI processing, enterprise BAAs, and full audit trails built into our eClinical platform.

    Schedule a Demo
    Subscribe for more insights
    Storm Stillman

    CEO of Curebase, a clinical trial technology company building an end-to-end eClinical platform with local-first AI capabilities. This article is based on his presentation at SCOPE 2026. Connect on LinkedIn