Once an AI medical device enters the registration phase, project teams tend to focus on algorithm performance metrics. In registration clinical trials, however, reviewers look beyond algorithm accuracy: they ask whether the device can produce reviewable, reproducible and generalizable clinical evidence under its defined intended use, target population, clinical workflow and actual use environment.

Figure 1 Four Core Issues in Clinical Trial Design of AI Medical Devices
The Center for Medical Device Evaluation of the NMPA has issued the Guidelines for Registration Review of Artificial Intelligence Medical Devices, along with clinical evaluation guidelines for AI-assisted detection products. Together, these documents indicate that the clinical evaluation of AI products is shifting from technical verification toward scenario-based evidence. For AI medical device projects planning a registration clinical trial, the core protocol design concerns fall into four categories: sample design, reference standard, evaluation endpoints and stratified analysis.
Sample design for AI products differs from that for conventional medical devices. For conventional devices, the main question is whether the number of subjects meets the statistical requirements of the primary endpoint. For AI medical devices, the enrolled sample must also cover the data distribution the algorithm will actually encounter in clinical practice.
For imaging AI products, sample composition covers not only the numbers of positive and negative cases but also lesion size, pathological type, disease severity, image quality, scanner model, acquisition parameters, participating institution and reader experience. Even when the statistical results are favorable, samples drawn from a single center, a single equipment type or a highly idealized dataset cannot adequately demonstrate that the product generalizes to its intended clinical settings.
The Guidelines for Clinical Trial Design of Medical Devices stipulate that sample size shall be determined from the trial objectives, evaluation endpoints, comparison type and statistical hypotheses. For AI devices, sample size estimation must address not only the overall number of cases but also adequate numbers of positive cases, negative cases, special subtypes and key strata.
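As a rough illustration of why positive and negative cases need separate allocation, the sketch below applies the standard normal-approximation formula for estimating a proportion to a chosen precision, then scales the positive-case count by an expected prevalence. The function names and the planning inputs (expected sensitivity, expected specificity, precision, prevalence) are hypothetical placeholders, not values from any guideline. A real protocol would derive them from its own statistical hypotheses and may need a hypothesis-test-based or multi-reader calculation instead.

```python
import math
from statistics import NormalDist


def cases_for_precision(expected: float, half_width: float, alpha: float = 0.05) -> int:
    """Cases needed so that a two-sided (1 - alpha) normal-approximation
    confidence interval around `expected` has roughly the given half-width."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * expected * (1 - expected) / half_width ** 2)


def total_enrollment(n_positive: int, prevalence: float) -> int:
    """Consecutive enrollment needed to expect `n_positive` positive cases."""
    return math.ceil(n_positive / prevalence)


# Hypothetical planning inputs (assumptions for illustration only).
n_pos = cases_for_precision(expected=0.90, half_width=0.05)  # sensitivity
n_neg = cases_for_precision(expected=0.85, half_width=0.05)  # specificity
print(n_pos, n_neg, total_enrollment(n_pos, prevalence=0.20))
```

The same calculation can be repeated for each key subtype or stratum, which usually shows that the rarest subgroup, not the overall total, drives enrollment.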
Protocol design recommendations
Before the trial starts, a sample distribution matrix should be drawn up that specifies dimensions such as disease spectrum, disease severity, acquisition equipment, investigational site, target population and image or data quality grade, so that shortfalls in key subgroups are not discovered only late in the trial.
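A sample distribution matrix can be tracked during enrollment with very little tooling. The sketch below is one possible approach, assuming each enrolled case is recorded as a dictionary and each matrix cell has a planned minimum. The dimension names, levels and targets are hypothetical.

```python
from collections import Counter


def subgroup_shortfalls(cases, dimensions, planned_minimums):
    """Return {cell: missing_count} for matrix cells below their planned minimum.

    cases            -- iterable of dicts, one per enrolled case
    dimensions       -- ordered keys that together define a matrix cell
    planned_minimums -- {cell tuple: minimum number of cases}
    """
    enrolled = Counter(tuple(case[d] for d in dimensions) for case in cases)
    return {
        cell: minimum - enrolled[cell]
        for cell, minimum in planned_minimums.items()
        if enrolled[cell] < minimum
    }


# Hypothetical dimensions, levels and targets.
dimensions = ("severity", "scanner", "image_quality")
planned = {
    ("mild", "vendor_A", "good"): 2,
    ("mild", "vendor_B", "poor"): 1,
    ("severe", "vendor_A", "good"): 1,
}
cases = [
    {"severity": "mild", "scanner": "vendor_A", "image_quality": "good"},
    {"severity": "severe", "scanner": "vendor_A", "image_quality": "good"},
]
shortfalls = subgroup_shortfalls(cases, dimensions, planned)
print(shortfalls)
```

Running such a check at each monitoring visit, rather than at database lock, leaves time to adjust site enrollment before a subgroup gap becomes unrecoverable.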
AI-assisted detection and diagnostic products usually need to be compared against a gold standard or clinical reference standard. The core difficulty is that most AI products perform complex tasks such as lesion localization, risk grading, image segmentation, abnormality alerting and treatment decision support, rather than simple binary classification. In these cases, how the ground truth is established, how consensus rules are defined and how discrepancies are resolved all affect the credibility of the trial conclusions.
In some scenarios, pathological results, surgical findings and follow-up outcomes can serve as robust reference standards. For scenarios such as image-based detection, lesion identification and functional measurement, a complete reference standard system is needed, built on multiple experts, unified interpretation rules, blinded procedures and an arbitration mechanism.
Therefore, the gold standard design for AI products cannot be simply defined as "interpreted by senior physicians". It shall clearly specify the number and professional background of participating physicians, reading procedures, independence requirements, blinding rules, consistency evaluation methods, dispute resolution mechanisms and data traceability approaches. Otherwise, even with sufficient trial data, the whole clinical evidence chain may be undermined due to unstable reference standards.
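To make the arbitration and consistency elements concrete, the following sketch shows one simple scheme, assuming two blinded readers per case and a third arbitrator for discordant reads, with Cohen's kappa as the agreement measure. The two-reader setup, the function names and the labels are illustrative assumptions. Protocols may equally specify larger panels, majority voting or other agreement statistics.

```python
from collections import Counter


def adjudicate(reader_a, reader_b, arbitrator=None):
    """Return (reference_label, source) for one case.

    Two blinded readers interpret independently; the arbitrator's label is
    used only when they disagree. The source is kept for traceability.
    """
    if reader_a == reader_b:
        return reader_a, "consensus"
    if arbitrator is None:
        raise ValueError("discordant reads require an arbitrator label")
    return arbitrator, "arbitration"


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two readers' labels on the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both readers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical reads for four cases.
reads_a = ["lesion", "lesion", "normal", "normal"]
reads_b = ["lesion", "normal", "normal", "normal"]
arbiter = [None, "lesion", None, None]
reference = [adjudicate(a, b, c) for a, b, c in zip(reads_a, reads_b, arbiter)]
print(reference, cohen_kappa(reads_a, reads_b))
```

Recording the source of each reference label alongside the label itself is what makes the process reviewable: the share of arbitrated cases and the inter-reader agreement can then be reported directly.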
Protocol design recommendations
For AI-assisted detection products, the trial protocol should include a dedicated section, "Establishment of the Clinical Reference Standard", that sets out the source of the reference standard, the composition of the expert panel, the interpretation rules, the arbitration procedure and the quality control requirements. Where several gold standards could apply, the scientific validity and acceptability of the chosen reference standard should be justified in advance.
Two pitfalls are common in endpoint design for AI medical device trials. The first is focusing only on offline algorithm performance while ignoring how the product performs within the clinical workflow. The second is setting too many primary endpoints, which leads to ambiguous statistical hypotheses and unfocused trial objectives.
For products positioned for assisted detection, trial endpoints may include lesion detection rate, sensitivity, specificity, AUC, number of false positives, reading time, and physicians’ diagnostic performance with and without the AI system. For products positioned for assisted diagnosis or clinical decision-making, endpoints should not be limited to image-level accuracy. It is necessary to show, in the context of the clinical diagnosis and treatment pathway, how AI outputs affect physician judgment, patient management and risk stratification.
The primary endpoint shall be consistent with the intended use stated in the instructions for use. Secondary endpoints can address physician efficiency, diagnostic consistency, the value of abnormality alerts, false positive burden, usability and safety. If the product claims to "improve physicians’ diagnostic capability", the trial should compare "physician independent interpretation" with "AI-assisted interpretation", rather than merely presenting the standalone output of the AI algorithm.
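The unaided versus AI-assisted comparison can be sketched for a single reader as follows: sensitivity and specificity are computed for both reading conditions on the same cases, and an exact McNemar test is applied to the discordant reads among positive cases. This is a simplified illustration with hypothetical data and function names. It ignores clustering across multiple readers, which a multi-reader multi-case (MRMC) trial would have to handle with dedicated methods.

```python
from math import comb


def sensitivity_specificity(truth, calls):
    """truth/calls: equal-length lists of booleans (True = disease present/called)."""
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    tn = sum(1 for t, c in pairs if not t and not c)
    fp = sum(1 for t, c in pairs if not t and c)
    return tp / (tp + fn), tn / (tn + fp)


def discordant_pairs(truth, unaided, aided, positive=True):
    """Among cases whose reference label equals `positive`, count reads that
    became correct with AI assistance and reads that became incorrect."""
    gained = lost = 0
    for t, u, a in zip(truth, unaided, aided):
        if t != positive:
            continue
        if u != t and a == t:
            gained += 1
        elif u == t and a != t:
            lost += 1
    return gained, lost


def mcnemar_exact(gained, lost):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = gained + lost
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical reads: six positive cases followed by four negative cases.
truth = [True] * 6 + [False] * 4
unaided = [True, True, False, False, False, True, False, False, True, False]
aided = [True, True, True, True, False, True, False, False, False, False]
print(sensitivity_specificity(truth, unaided))
print(sensitivity_specificity(truth, aided))
print(mcnemar_exact(*discordant_pairs(truth, unaided, aided)))
```

The paired structure matters here: because both conditions are read on the same cases, only the cases where the two reads differ carry information about the effect of AI assistance.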

Figure 2 Clinical Evidence Chain of AI Products: From Algorithm Output to Registration Application
The clinical risks of AI medical devices are often hidden behind overall trial results. Acceptable overall sensitivity and specificity do not guarantee stable performance across sites, equipment, patient populations and disease subtypes. From the review perspective, stratified analysis reveals where algorithm performance degrades under specific conditions and helps define the product’s scope of application and usage limitations.
Common stratification dimensions include clinical center, equipment model, acquisition protocol, lesion size, disease stage, age group, sex, image quality grade and physician experience level. Where the training data and the clinical trial data differ markedly, particular attention should be paid to how well the algorithm development datasets match real clinical settings.
Key stratification factors should be predefined at the protocol design stage rather than explored ad hoc after the trial ends. For the core factors that affect algorithm performance, pre-planned sample allocation, statistical methods and rules for interpreting results help avoid the review risk that arises when overall results are acceptable but subgroup performance is uneven.
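A pre-specified stratified check might look like the sketch below, which computes a Wilson score interval for sensitivity in each stratum and flags strata whose lower confidence bound falls under a target. The strata, counts and the 0.80 target are hypothetical, and the choice of the Wilson interval is an assumption made for illustration. The protocol's statistical analysis plan would fix the actual method and thresholds.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def flag_strata(detected_by_stratum, lower_bound_target):
    """detected_by_stratum: {stratum: (true_positives, positive_cases)}.
    Returns {stratum: (sensitivity, lower, upper)} for strata whose lower
    confidence bound falls below the target."""
    report = {}
    for stratum, (tp, n_pos) in detected_by_stratum.items():
        low, high = wilson_interval(tp, n_pos)
        if low < lower_bound_target:
            report[stratum] = (tp / n_pos, low, high)
    return report


# Hypothetical per-stratum results.
results = {"center_1": (95, 100), "center_2": (18, 25), "scanner_B": (45, 50)}
flagged = flag_strata(results, lower_bound_target=0.80)
for stratum, (sens, low, high) in flagged.items():
    print(f"{stratum}: sensitivity {sens:.2f} (95% CI {low:.2f} to {high:.2f})")
```

In this example a stratum with a high point estimate can still be flagged because it contains too few positive cases, which is the same subgroup shortfall that sample allocation at the design stage is meant to prevent.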
| Verification Dimension | Items to be Clearly Stated in the Protocol | Common Risks |
|---|---|---|
| Sample Design | Disease spectrum, positive/negative ratio, center source, equipment model, image quality, key subgroups | Sufficient total sample size, but insufficient sample size in key subgroups |
| Reference Standard | Expert composition, blinding method, independent interpretation, consistency evaluation, arbitration mechanism, data traceability | Only expert interpretation is stated, without a reviewable process |
| Evaluation Endpoint | Primary endpoint, secondary endpoint, safety indicators, clinical value interpretation | The endpoint is inconsistent with the intended use or the claims in the instructions for use |
| Statistical Analysis | Basis for sample size estimation, superiority/non-inferiority hypothesis, MRMC or paired design, missing value handling | Unclear statistical hypothesis, difficult to interpret after the end of the trial |
| Stratified Analysis | Preset stratification by center, equipment, population, lesion subtype, disease severity, image quality, etc. | Overall compliance is achieved, but evidence of generalization ability is insufficient |
| Registration Connection | Connection between clinical report, instructions for use restrictions, risk control, software update and post-marketing surveillance | Clinical evidence cannot naturally support the registration documents |
A clinical trial for an AI medical device is far more than moving an algorithm test set into a hospital environment. It requires a thorough understanding of the product’s technical characteristics, clinical diagnosis and treatment workflows, statistical design, data management, site implementation, ethical compliance and regulatory review logic.
For project teams, moving clinical trial design forward to the early stages of product development and registration pathway assessment effectively reduces the risk of rework later. In particular, inadequate early planning of sample distribution, gold standard definition, primary endpoints, reader configuration, stratified analysis and closed-loop data management can lead to longer timelines, higher costs, or evidence that remains insufficient even after supplementary data are added.
Deda Medical provides end-to-end support for registration clinical trials of AI medical devices, covering protocol design, site selection, ethics submission, clinical monitoring, data management, biostatistics, clinical study report drafting and alignment with the registration dossier. The team identifies core risks before the trial starts and helps turn algorithm performance data into clinical evidence suitable for registration submission.
Application Scenario: Official Website Article of Deda Medicine | Special Topic on Clinical Trial Design of AI Medical Devices