Where Does the Difficulty Lie in Clinical Trial Design for AI Medical Devices? Sample Size, Gold Standard, Endpoints and Stratified Analysis Cannot Be Neglected

Introduction

After an AI medical device enters the registration application stage, project teams usually focus on algorithm metrics. In registration clinical trials, however, reviewers look beyond algorithm accuracy: they ask whether the device can generate clinical evidence that is reviewable, reproducible and generalizable under its specified intended use, target population, clinical workflow and usage environment.

Figure 1 Four Core Issues in Clinical Trial Design of AI Medical Devices

The Center for Medical Device Evaluation (CMDE) of the NMPA has issued the Guiding Principles for Registration Review of Artificial Intelligence Medical Devices and related clinical evaluation guidance for AI-assisted detection products. These documents indicate that the clinical evaluation of AI products is shifting from technical verification to scenario-based evidence verification. For AI medical device projects planning registration clinical trials, the core concerns in protocol design fall into the following four categories.

Samples: Not "the More the Better", but Representative of Real-World Clinical Scenarios

Sample design for AI products differs from that for conventional medical devices. For conventional devices, the main question is whether the number of subjects meets the statistical requirements of the primary endpoint. AI medical devices must additionally verify that the enrolled samples cover the real data distribution the algorithm will encounter in clinical practice.

For imaging AI products, sample planning covers not only the numbers of positive and negative cases, but also lesion size, pathological classification, disease severity, image quality, scanner model, acquisition parameters, participating institutions and radiologist proficiency. Even when the statistical results are favorable, samples drawn exclusively from a single center, a single equipment model or a highly idealized dataset cannot sufficiently validate the product’s generalization capability in its intended clinical settings.

The Guidelines for Clinical Trial Design of Medical Devices stipulate that sample size shall be determined by the trial objectives, evaluation endpoints, comparison type and statistical hypotheses. For AI devices, sample size estimation must cover not only the overall number of cases, but also adequate allocation of positive cases, negative cases, special subtypes and key strata.
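As an illustration of how the hypothesis drives the case count, the sketch below uses the common normal-approximation formula for a single-group design that tests a proportion (for example sensitivity) against a prespecified target value. The article does not prescribe this formula; the function name and the numbers (expected sensitivity 0.90, target value 0.85, two-sided alpha 0.05, power 0.80) are hypothetical, and an actual protocol should have its calculation confirmed by a biostatistician.

```python
import math
from statistics import NormalDist


def cases_for_target_value(p_expected, p_target, alpha=0.05, power=0.80):
    """Cases needed to show a proportion exceeds a target value.

    Normal-approximation formula for a one-sample proportion test:
    n = (z_{1-alpha/2} * sqrt(p0 * (1 - p0)) + z_{power} * sqrt(p1 * (1 - p1)))^2
        / (p1 - p0)^2
    where p0 is the target value and p1 the expected performance.
    """
    if not 0 < p_target < p_expected < 1:
        raise ValueError("require 0 < p_target < p_expected < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(p_target * (1 - p_target))
        + z_power * math.sqrt(p_expected * (1 - p_expected))
    ) ** 2
    return math.ceil(numerator / (p_expected - p_target) ** 2)


# Hypothetical planning values: sensitivity is estimated on positive cases,
# specificity on negative cases, so each gets its own calculation.
positives_needed = cases_for_target_value(p_expected=0.90, p_target=0.85)
negatives_needed = cases_for_target_value(p_expected=0.92, p_target=0.88)
print("positive cases:", positives_needed)
print("negative cases:", negatives_needed)
```

Because the two counts are computed separately, the total enrollment follows from whichever class is harder to recruit, and key subtypes still need their own allocation on top of these totals.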

Protocol design recommendations

Before the clinical trial starts, a sample distribution matrix shall be drawn up that specifies dimensions such as disease spectrum, disease severity, acquisition equipment, investigational site, target population and image or data quality grade, so that a shortage of cases in key subgroups is not discovered only late in the trial.
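A sample distribution matrix can be kept as a simple cross-tabulation that is checked during enrollment. The following is a minimal sketch, assuming three illustrative dimensions and a hypothetical minimum of 30 cases per cell; the dimension names, levels and threshold are placeholders rather than values from any guideline.

```python
from collections import Counter
from itertools import product

# Hypothetical stratification dimensions and levels for an imaging product.
DIMENSIONS = {
    "severity": ["mild", "moderate", "severe"],
    "scanner": ["vendor_A", "vendor_B"],
    "image_quality": ["good", "acceptable"],
}
MIN_PER_CELL = 30  # hypothetical planning threshold


def find_shortfalls(cases, dimensions, min_per_cell):
    """Return {cell: missing_count} for every cell below the planned minimum."""
    keys = list(dimensions)
    counts = Counter(tuple(case[k] for k in keys) for case in cases)
    shortfalls = {}
    for cell in product(*(dimensions[k] for k in keys)):
        missing = min_per_cell - counts[cell]  # Counter returns 0 if absent
        if missing > 0:
            shortfalls[cell] = missing
    return shortfalls


# Toy enrollment log: one well-filled cell and one under-filled cell.
enrolled = (
    [{"severity": "mild", "scanner": "vendor_A", "image_quality": "good"}] * 30
    + [{"severity": "severe", "scanner": "vendor_B", "image_quality": "acceptable"}] * 10
)
for cell, missing in sorted(find_shortfalls(enrolled, DIMENSIONS, MIN_PER_CELL).items()):
    print(cell, "needs", missing, "more cases")
```

Running such a check at each interim enrollment review makes under-filled subgroups visible while recruitment can still be adjusted.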

Gold Standard: Not Mere "Expert Interpretation", but a Verifiable Reference Standard

AI-assisted detection and diagnostic products often need to be compared against a gold standard or clinical reference standard. The core difficulty is that most AI products perform complex tasks such as lesion localization, risk grading, image segmentation, abnormality flagging and support for treatment decisions, rather than simple binary classification. In such cases, how the ground truth is confirmed, how consensus rules are formulated and how discrepancies are resolved all affect the credibility of the trial conclusions.

In some scenarios, pathological results, surgical findings and follow-up outcomes can serve as robust reference standards. For scenarios including image detection, lesion identification and functional measurement, it is necessary to establish a complete reference standard system consisting of multiple experts, unified interpretation rules, blinded procedures and arbitration mechanisms.

Therefore, the gold standard design for AI products cannot be simply defined as "interpreted by senior physicians". It shall clearly specify the number and professional background of participating physicians, reading procedures, independence requirements, blinding rules, consistency evaluation methods, dispute resolution mechanisms and data traceability approaches. Otherwise, even with sufficient trial data, the whole clinical evidence chain may be undermined due to unstable reference standards.
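Two of the elements listed above, the dispute resolution mechanism and the consistency evaluation, can be sketched in a few lines. The example assumes a hypothetical scheme with two blinded independent readers plus an arbitrator for disagreements, and uses Cohen's kappa as one possible agreement measure; the actual number of readers, the arbitration rule and the agreement statistic are protocol decisions that the article leaves open.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers labelling the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("readers must label the same non-empty case list")
    n = len(reader_a)
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / n**2
    if expected == 1.0:  # both readers used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def reference_label(read_a, read_b, arbitrator_read=None):
    """Return (label, provenance) so every reference label stays traceable."""
    if read_a == read_b:
        return read_a, "consensus"
    if arbitrator_read is None:
        return None, "pending_arbitration"
    return arbitrator_read, "arbitrated"


# Toy data: 1 = lesion present, 0 = lesion absent.
reader_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reader_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print("kappa:", round(cohen_kappa(reader_a, reader_b), 3))
print(reference_label(1, 1))
print(reference_label(1, 0, arbitrator_read=0))
```

Recording the provenance alongside each label is what allows a reviewer to trace later how a given reference value was reached.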

Protocol design recommendations

For AI-assisted detection products, it is advisable to include a separate chapter on "Clinical Reference Standard Establishment" in the trial protocol, clarifying the source of the reference standard, expert panel composition, interpretation rules, arbitration procedures and quality control requirements. For products to which more than one gold standard could apply, the scientific validity and acceptability of the chosen reference standard shall be justified in advance.

Endpoint: Proving Not Only "Algorithm Accuracy", but also "Clinical Utility"

Two deviations are common in the endpoint design of clinical trials for AI medical devices. The first is focusing only on offline algorithm performance while ignoring how the product performs within the clinical workflow. The second is setting too many primary endpoints, which leaves the statistical hypotheses ambiguous and the trial objectives unfocused.

For products positioned for assisted detection, trial endpoints may include lesion detection rate, sensitivity, specificity, AUC, number of false positives, image reading time, and physicians’ diagnostic performance before and after using the AI system. For products positioned for assisted diagnosis or clinical decision-making, evaluation endpoints shall not be limited to image-level accuracy. It is necessary to show, in the context of the clinical diagnosis and treatment pathway, how AI outputs affect physician judgment, patient management and risk stratification.
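For reference, three of the performance endpoints named above can be computed from case-level results as in the sketch below. It is a minimal illustration on toy data: the 0.5 operating threshold and the scores are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) definition, which is one standard way to obtain it.

```python
def sensitivity_specificity(truth, predicted):
    """Sensitivity and specificity from binary truth and binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, predicted))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, predicted))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, predicted))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def auc_mann_whitney(truth, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: reference standard labels and algorithm output scores.
truth = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.4]
THRESHOLD = 0.5  # hypothetical operating point fixed before the trial
predicted = [1 if s >= THRESHOLD else 0 for s in scores]

sens, spec = sensitivity_specificity(truth, predicted)
print("sensitivity:", round(sens, 3), "specificity:", round(spec, 3))
print("AUC:", round(auc_mann_whitney(truth, scores), 3))
```

These figures describe standalone algorithm output only; they do not by themselves show any effect on physician performance or patient management.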

The primary endpoint shall be consistent with the intended use stated in the product’s instructions for use. Secondary endpoints can address physician efficiency, diagnostic consistency, the value of abnormality prompts, false positive burden, usability and safety. If the product claims to "improve physicians’ diagnostic capability", the trial shall adopt a comparative framework of independent physician interpretation versus AI-assisted interpretation, rather than merely presenting the standalone output of the AI algorithm.
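When the same cases are read both without and with AI assistance, the comparison is paired, and only the cases on which the two readings disagree carry information about the difference. The sketch below applies an exact McNemar test to such discordant pairs. It is a simplified illustration for a single reader with hypothetical data; a multi-reader study would normally need an MRMC analysis that accounts for reader and case variability, which this sketch does not attempt.

```python
from math import comb


def mcnemar_exact(unaided_correct, aided_correct):
    """Exact two-sided McNemar test on paired correct/incorrect readings.

    Returns (b, c, p_value) where
      b = cases read correctly unaided but incorrectly with AI assistance,
      c = cases read incorrectly unaided but correctly with AI assistance.
    """
    if len(unaided_correct) != len(aided_correct):
        raise ValueError("paired readings must cover the same cases")
    pairs = list(zip(unaided_correct, aided_correct))
    b = sum(1 for u, a in pairs if u and not a)
    c = sum(1 for u, a in pairs if a and not u)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Hypothetical paired readings of 32 cases by one physician.
unaided = [True] * 20 + [False] * 10 + [True] * 2
aided = [True] * 20 + [True] * 10 + [False] * 2

b, c, p_value = mcnemar_exact(unaided, aided)
print("worse with AI:", b, "better with AI:", c, "p =", round(p_value, 4))
```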

Figure 2 Clinical Evidence Chain of AI Products: From Algorithm Output to Registration Application

Stratified Analysis: Not an Optional Supplement but Critical Evidence for Generalization Verification

The clinical risks of AI medical devices are often hidden behind overall trial results. Acceptable overall sensitivity and specificity do not guarantee stable performance across different sites, equipment, patient populations and disease subtypes. From the review perspective, stratified analysis identifies where algorithm performance degrades under specific scenarios and helps define the product’s applicable scope and usage limitations.

Common stratification dimensions include participating clinical centers, equipment models, acquisition protocols, lesion sizes, disease stages, age groups, sex, image quality grades and physicians’ experience levels. Where the clinical trial data differ markedly from the algorithm's training data, extra attention shall be paid to whether the development datasets are consistent with real clinical settings.

Key stratification factors shall be predefined at the protocol design phase rather than explored ad hoc after the trial is completed. For core factors that affect algorithm performance, pre-planned sample allocation, statistical methods and result interpretation rules help avoid the review risk of acceptable overall results accompanied by unbalanced subgroup outcomes.
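A prespecified stratified analysis can be as simple as reporting the endpoint with a confidence interval per stratum and flagging strata whose lower bound falls under a predefined floor. The sketch below does this for sensitivity by scanner model using Wilson score intervals; the stratum names, counts and the 0.80 floor are hypothetical, and the choice of interval method and flagging rule belongs in the statistical analysis plan.

```python
import math
from collections import defaultdict
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def stratified_sensitivity(positive_cases, stratum_key, floor):
    """Per-stratum sensitivity, CI, and a flag when the lower bound < floor."""
    tally = defaultdict(lambda: [0, 0])  # stratum -> [detected, total]
    for case in positive_cases:
        entry = tally[case[stratum_key]]
        entry[0] += int(case["detected"])
        entry[1] += 1
    report = {}
    for stratum, (hits, n) in tally.items():
        low, high = wilson_interval(hits, n)
        report[stratum] = {
            "n": n,
            "sensitivity": hits / n,
            "ci": (low, high),
            "below_floor": low < floor,
        }
    return report


# Hypothetical reference-positive cases from two scanner models.
cases = (
    [{"scanner": "vendor_A", "detected": True}] * 95
    + [{"scanner": "vendor_A", "detected": False}] * 5
    + [{"scanner": "vendor_B", "detected": True}] * 30
    + [{"scanner": "vendor_B", "detected": False}] * 10
)
for name, row in stratified_sensitivity(cases, "scanner", floor=0.80).items():
    print(name, row["n"], round(row["sensitivity"], 3), row["below_floor"])
```

In this toy data the pooled sensitivity looks acceptable, while the smaller stratum is flagged, which is the pattern stratified analysis is meant to expose.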

Pre-review Checklist for AI Clinical Trial Protocol Design

Verification Dimension	Items to be Clearly Stated in the Protocol	Common Risks
Sample Design	Disease spectrum, positive/negative ratio, center source, equipment model, image quality, key subgroups	Sufficient total sample size, but insufficient sample size in key subgroups
Reference Standard	Expert composition, blinding method, independent interpretation, consistency evaluation, arbitration mechanism, data traceability	Only expert interpretation is stated, without a reviewable process
Evaluation Endpoint	Primary endpoint, secondary endpoint, safety indicators, clinical value interpretation	The endpoint is inconsistent with the intended use or the claims in the instructions for use
Statistical Analysis	Basis for sample size estimation, superiority/non-inferiority hypothesis, MRMC or paired design, missing value handling	Unclear statistical hypothesis; results are difficult to interpret after the trial ends
Stratified Analysis	Prespecified stratification by center, equipment, population, lesion subtype, disease severity, image quality, etc.	Overall targets are met, but evidence of generalization ability is insufficient
Registration Linkage	Linkage between the clinical study report, limitations stated in the instructions for use, risk control, software updates and post-marketing surveillance	Clinical evidence does not directly support the registration documents

Value of Professional CRO Services in Clinical Trials: Translating Algorithm Issues into Registration Evidence Requirements

Clinical trials for AI medical devices are far more than moving algorithm test datasets into a hospital environment. The work requires a thorough understanding of the product's technical characteristics, clinical diagnosis and treatment workflows, statistical design, data management, site implementation, ethical compliance and regulatory review logic.

For project teams, moving clinical trial design forward into early product development and registration pathway evaluation effectively reduces the risk of rework later. Conversely, if sample distribution, gold standard formulation, primary endpoints, reader configuration, stratified analysis and closed-loop data management are not planned adequately in advance, the project may face prolonged timelines, rising costs, or evidence that remains insufficient even after supplementary data are collected.

Deda Medical provides full-spectrum support for registration clinical trials of AI medical devices, covering protocol design, site selection, ethics submission, clinical monitoring, data management, biostatistics, clinical study report drafting and alignment with registration documents. The team identifies core risks before trial initiation and helps convert algorithm performance data into clinical evidence suitable for registration submission.

Official Source Links

1. NMPA CMDE: Guiding Principles for Registration Review of Artificial Intelligence Medical Devices (No. 8, 2022)

2. NMPA CMDE: Guiding Principles for Clinical Evaluation and Registration Review of Artificial Intelligence-assisted Detection Medical Devices (Software) (No. 38, 2023, Reposted by Yangtze River Delta Branch)

3. NMPA & National Health Commission: Good Clinical Practice for Medical Device Clinical Trials (No. 28, 2022)

4. NMPA: 5 Technical Guidelines including Technical Guidelines for Clinical Evaluation of Medical Devices (No. 73, 2021)

5. FDA: Artificial Intelligence in Software as a Medical Device

6. IMDRF: Software as a Medical Device (SaMD): Clinical Evaluation

Application Scenario: Official Website Article of Deda Medicine | Special Topic on Clinical Trial Design of AI Medical Devices