
Effectiveness and clinical impact of using deep learning for first-trimester fetal ultrasound image quality auditing

Abstract

Background

Regular auditing of ultrasound images is required to maintain quality; however, manual auditing is time-consuming and can be inconsistent. We therefore aimed to develop and validate an artificial intelligence-based image quality audit (AI-IQA) system to audit images from the four key planes used in first-trimester scanning.

Methods

The AI-IQA system was developed based on the YOLOv7 structure detection network and a multi-branch image quality regression network using a large multicenter internal dataset. Clinical validation was performed using 567 cases scanned by four radiologists with different experience levels, of which 349 were performed without AI-IQA feedback (clinical test set 1) and 218 were performed after 2–3 rounds of AI-IQA feedback (clinical test set 2). The proportion of standard images obtained and detailed expert audit results were compared to verify whether AI-IQA could objectively and accurately provide feedback on deficiencies in nonstandard images to assist radiologists at different experience levels in improving image quality.

Results

In the internal test set, the AI-IQA system achieved high average accuracy, precision, recall and F1-score in auditing the overall plane quality (0.881, 0.833, 0.842 and 0.837, respectively) and structure quality (0.906, 0.861, 0.857 and 0.859, respectively). In clinical test sets 1 and 2, AI-IQA results showed strong consistency with expert assessment results, with the average Cohen’s kappa coefficient exceeding 0.8 for all four planes. In addition, following AI-IQA feedback, the proportion of standard images obtained by junior and mid-level radiologists increased by 7.7% and 5.1%, respectively. AI-IQA took only 0.05 s to assess each image, whereas experts required more than 20 s (p < 0.001).

Conclusions

The proposed AI-IQA system proved to be a highly accurate and efficient method of automatically auditing first-trimester scanning image quality, providing precise and rapid key plane quality control. This tool can also assist radiologists with different levels of experience to improve the image quality.


Background

Ultrasound (US) has become an important diagnostic tool for initial gestational screening, owing to its instantaneous results, low cost and noninvasive nature [1]. Although prenatal ultrasonography mainly focuses on the second trimester [2], technological advancements have made first-trimester scanning (FTS) increasingly important in modern medicine. Complete early screening requires standard planes from multiple views covering the fetal head, brain, neck, heart, abdomen, limbs, placenta and biometric anatomical regions [3]. It provides vital early fetal information such as fetal size and gestational age (GA), aiding in timely decision-making for subsequent care and interventions [4].

In FTS, there are four key planes closely related to fetal screening for chromosomal abnormalities, structural malformations and biometric measurements: the nuchal translucency (NT) plane (NTP), the midsagittal view of the fetus (MSF), the axial view of the fetal abdomen (AFA) and the axial view of the fetal head in the transventricular plane (AFTP) [5]. NTP allows for the observation of increased NT thickness, predicting the risk of chromosomal abnormalities and structural malformations [6,7,8,9]. MSF is used to measure crown-rump length (CRL) to calculate GA and detect facial malformations such as cleft lip and palate [10, 11]. AFA primarily visualizes the connection between the umbilical cord and abdominal wall as well as detects omphalocele and gastroschisis [12]. AFTP assists in identifying central nervous system malformations such as brain deformity [13, 14]. Obtaining good quality images in these four planes is vital for the reliability of FTS results, although a complete examination involves additional planes as outlined in International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) guidelines [15].

Obtaining standard planes is fundamental for accurate measurements and diagnoses. Therefore, quality control of acquired US images is a critical task in clinical practice. Traditionally, this process relies on manual auditing, in which experienced professionals visually assess image quality. Although this approach can be reliable to some extent, it has several limitations. First, it imposes an additional burden on experts, pulling them away from clinical diagnostic work and hampering the efficient use of medical resources. Second, manual auditing is inherently subjective, making it difficult to eliminate inter-observer variability [16]. Finally, it is inefficient and struggles to meet the demands of large-scale screening or provide timely feedback [17]. Given these limitations, rapidly advancing deep learning (DL) technologies offer a promising solution for US image quality control.

DL has become a vital tool for image quality analysis. Classification-based methods offer advantages in speed and reliability [18, 19], while contrastive learning and anomaly detection have shown potential in feature extraction [20,21,22]. For example, Qu et al. employed a differential convolutional neural network (DCNN) to accurately detect specific fetal brain planes [23]. However, such simple classification methods lack granular feedback. Additionally, some studies have evaluated image quality by detecting key anatomical structures [24, 25], but the presence of a structure does not fully reflect image quality. To address this, Dong et al. proposed a general deep learning framework that incorporates image gain and scaling analysis to identify the standard four-chamber heart view [26]. However, the multi-step nature of the process may lead to error accumulation. Despite significant advancements in US image quality analysis, limitations remain, particularly the lack of specific improvement feedback and the research gap in FTS applications.

To address these issues, we propose an artificial intelligence (AI)-based image quality audit (AI-IQA) system that integrates YOLOv7 for object detection with a multi-branch quality regression network for quality assessment. This integrated design not only improves detection accuracy but also provides granular feedback. We validated the system’s reliability on four key planes in FTS, compared its efficacy with manual expert assessment and analyzed its utility for radiologists with different levels of experience.

Methods

Study design

The study design is illustrated in Fig. 1. This study collected data from multiple centers to form an internal dataset with which to develop the AI-IQA system. Clinical validation was performed using two clinical test sets: clinical test set 1 from four radiologists at different experience levels without AI-IQA feedback and clinical test set 2 from the same four radiologists after 2–3 rounds of AI-IQA feedback. An expert panel (each member with ≥ 15 years of clinical experience) independently evaluated image quality in the two clinical test sets.

Fig. 1

Flowchart summarizing the study design. NTP: nuchal translucency plane, MSF: midsagittal view of the fetus, AFTP: axial view of fetal head in the transventricular plane, AFA: axial view of fetal abdomen, AI-IQA: artificial intelligence-based image quality audit

The AI-IQA system’s audits were then compared against the expert panel’s assessments to evaluate its accuracy and consistency. In addition, we compared the proportion of standard images obtained in the two clinical test sets to verify whether AI-IQA could objectively and accurately provide feedback on the reasons for nonstandard images to improve the quality of images obtained by radiologists at different experience levels. This study also assessed the speed and efficiency of the AI-IQA system.

The study was approved by the Ethics Committee of Shenzhen Futian District Maternity & Child Healthcare Hospital (protocol number: K-2023-04-01) and conducted in accordance with the principles outlined in the Declaration of Helsinki.

Data collection

The inclusion criteria for images in this study were as follows: (1) GA between 11 and 13+6 weeks; (2) clear fetal images, with at least two of the NTP, MSF, AFA and AFTP planes retained; (3) singleton pregnancy with no abnormalities. In accordance with these criteria, more than 17,000 2D images from 5000 fetuses obtained during FTS were collected from multiple centers to form an internal dataset for model development, including 3268 NTP, 3274 MSF, 3768 AFA and 4284 AFTP images. The images were acquired using US machines of various brands, including GE, Mindray, Samsung, Philips and Siemens. The internal dataset was split into training, validation and internal test sets in a 7:2:1 ratio at the patient level, using three-fold cross-validation. The validation set was used for hyperparameter tuning, while the internal test set was used to evaluate model performance.
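A patient-level split of the kind described above can be sketched in a few lines; this is a minimal illustration (the study does not publish its splitting code), and the patient-ID format and seed are hypothetical:

```python
import random

def split_by_patient(patient_ids, seed=0):
    """Split unique patients 7:2:1 into train/validation/test sets.

    Splitting at the patient level ensures that images from the same
    fetus never appear in more than one subset.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n = len(patients)
    n_train = round(n * 0.7)
    n_val = round(n * 0.2)
    return (patients[:n_train],
            patients[n_train:n_train + n_val],
            patients[n_train + n_val:])

# Hypothetical IDs for the 5000 fetuses in the internal dataset
train, val, test = split_by_patient([f"P{i:04d}" for i in range(5000)])
```

For three-fold cross-validation, the same idea would be applied three times with rotated fold assignments.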

Clinical validation was performed at the Department of Ultrasound in Shenzhen Futian District Maternity & Child Health Hospital, with all participants providing written informed consent. Images of 349 normal fetuses were collected and examined by two junior (< 5 years of experience) and two mid-level (5–10 years of experience) radiologists as clinical test set 1, comprising 388 NTP, 397 MSF, 377 AFA and 380 AFTP images, between July 2022 and April 2023. Following 2–3 rounds of AI-IQA feedback, the same four radiologists obtained images of 218 fetuses as clinical test set 2, comprising 249 NTP, 265 MSF, 264 AFA and 244 AFTP images, between November 2023 and January 2024. The distribution of each plane image, along with the corresponding dataset, is illustrated in Table 1.

Table 1 Distribution of images in clinical test sets 1 and 2

Markers and annotations

In this study, image annotation was conducted using the medical imaging intelligent software Pair [27] (version 2.6; Shenzhen, China), developed by Shenzhen RayShape Medical Technology Co., Ltd. The annotation process was divided into three stages:

  1. Image Quality Categorization: Six mid- and senior-level radiologists strictly followed ISUOG guidelines [3] and Fetal Medicine Foundation standards to audit the four-plane image quality. Images were categorized as standard or nonstandard.

  2. Anatomical Structure Annotation: Ten experienced radiologists annotated the main anatomical structures within the planes using bounding boxes, which strengthened the AI-IQA system’s understanding of local anatomical structures.

  3. Detailed Audit Annotation: Six mid- and senior-level radiologists comprehensively annotated audit details across multiple dimensions, as listed in Table 2. Main structures with clear, well-defined boundaries were labeled as good; otherwise, as bad. Mutually exclusive structures were classified as visible or invisible based on their presence, while overall image quality, including image clarity, image zoom and fetal position, was categorized as fit or unfit. These criteria, aligned with ISUOG guidelines and refined by experts with over 15 years of experience, trained the AI-IQA system to identify and provide feedback on nonstandard images.

Before formal annotation, all annotators underwent comprehensive training, including detailed criteria explanations, case studies and practice sessions. Only those who passed a qualification assessment by senior experts proceeded with formal annotation. Discrepancies during annotation were resolved through consensus by two additional experts, who also reviewed all annotations to ensure accuracy. Examples of annotated NTP, MSF, AFA and AFTP images are shown in Fig. 2.

Fig. 2

Examples of standard and nonstandard images and the reasons manually annotated by radiologists for these categorizations. Green text, main structure assessment; red text, mutually exclusive structure assessment; blue text, overall image assessment. SP: standard plane, NSP: nonstandard plane, NTP: nuchal translucency plane, MSF: midsagittal view of the fetus, AFTP: axial view of fetal head in the transventricular plane, AFA: axial view of fetal abdomen, NT: nuchal translucency, IT: intracranial translucency, LV: lateral ventricle, GN: gonadal node, SSS: spinal sagittal section, CBO: cranial bone ossification, UCI: umbilical cord insertion, CP: choroid plexus

Table 2 Evaluation criteria used by radiologists to annotate specific plane details

Development of the AI-IQA system

To achieve a comprehensive evaluation of both global image features (e.g., fetal position, image clarity) and local anatomical structures, the AI-IQA system integrates YOLOv7 and a ResNet50-based regression network for object detection and quality assessment [28, 29], respectively, as shown in Fig. 3. In the initial detection phase, YOLOv7 identifies regions of interest (ROIs) and the key anatomical structures (Table 2) that are crucial for image quality. These detected regions are then input into the ResNet50-based regression network, which uses residual connections and weight sharing to extract complex features. The network is split into a main branch and multiple structural branches, each consisting of a global average pooling layer and two fully connected layers. The structural branches learn and output structure-specific scores, while the main branch combines these weighted features to output the overall quality score. The final score is calculated as a weighted average of the component scores, and each score is binarized using a 60-point threshold. If mutually exclusive structures are detected, the final score is forced below the threshold, indicating a nonstandard image.
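The score-aggregation logic described above can be sketched as follows. The branch names, weights and 0–100 score scale are illustrative assumptions; only the 60-point threshold and the override for mutually exclusive structures follow directly from the text:

```python
def audit_plane(branch_scores, weights, exclusive_detected=False, threshold=60.0):
    """Combine per-structure branch scores into an overall audit result.

    branch_scores / weights: dicts keyed by structure name (0-100 scale).
    exclusive_detected: True if a mutually exclusive structure was found,
    which forces the plane to be audited as nonstandard.
    """
    total_w = sum(weights[k] for k in branch_scores)
    overall = sum(branch_scores[k] * weights[k] for k in branch_scores) / total_w
    if exclusive_detected:  # override: exclusive structure present
        overall = min(overall, threshold - 1.0)
    return overall, ("standard" if overall >= threshold else "nonstandard")

# Hypothetical NTP example with two branches and illustrative weights
score, label = audit_plane({"NT": 80.0, "profile": 70.0},
                           {"NT": 0.6, "profile": 0.4})
```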

Fig. 3

Flowchart illustrating the artificial intelligence-based image quality audit process. ROI: region of interest, FC: fully connected layer

Our experiments were conducted using PyTorch 1.12.1 on a workstation equipped with an NVIDIA GeForce RTX 2080 Ti GPU. During data preprocessing, input images were resized to 416 × 416 pixels and normalized with mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). For online data augmentation, YOLOv7 applied default augmentations such as geometric transformations (scaling: 0.5×–1.5×, translation: 20–40 pixels, rotation: ±10 degrees), color adjustments and noise injection, while ResNet50 applied more conservative augmentations (scaling: 0.9×–1.1×, translation: 10–20 pixels, rotation: ±5 degrees) to preserve fetal anatomy. To avoid gradient conflicts and simplify the process, a staged training strategy was employed. During training, YOLOv7 used the SGD optimizer (momentum: 0.937) with Complete Intersection over Union (CIoU) and Binary Cross-Entropy (BCE) losses, whereas ResNet50 employed the Adam optimizer (betas = (0.9, 0.999)) and Mean Squared Error (MSE) loss. Both networks were trained separately with a batch size of 128 and a learning rate of 1e-3. Early stopping was not employed, as cross-validation indicated no overfitting within 100 epochs.
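The staged training setup above can be summarized as a configuration sketch; the dictionary structure and key names are illustrative, while the values are those reported in the text:

```python
# Illustrative summary of the staged training configuration described above;
# the structure and key names are hypothetical, the values are as reported.
TRAIN_CONFIG = {
    "yolov7": {
        "optimizer": "SGD", "momentum": 0.937,
        "losses": ["CIoU", "BCE"],
        "augment": {"scale": (0.5, 1.5), "translate_px": (20, 40), "rotate_deg": 10},
    },
    "resnet50": {
        "optimizer": "Adam", "betas": (0.9, 0.999),
        "losses": ["MSE"],
        "augment": {"scale": (0.9, 1.1), "translate_px": (10, 20), "rotate_deg": 5},
    },
    "shared": {
        "batch_size": 128, "lr": 1e-3, "epochs": 100,
        "input_size": (416, 416),
        "normalize_mean": (0.485, 0.456, 0.406),
        "normalize_std": (0.229, 0.224, 0.225),
    },
}
```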

Statistical analysis

Statistical analysis was conducted using SPSS software version 22.0 (IBM Corp., Armonk, NY, USA). Accuracy (ACC), precision, recall and F1-score were used as metrics to evaluate the performance of the AI-IQA system. These metrics were calculated by comparing the binary qualitative results derived from the model’s predicted quality scores with the gold standard. The specific calculation formulas are as follows:

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)
$$Precision=\frac{TP}{TP+FP}$$
(2)
$$Recall=\frac{TP}{TP+FN}$$
(3)
$$F1\ score=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(4)

Where TP represents the number of samples correctly classified as positive, FP represents the number of samples incorrectly classified as positive, FN represents the number of samples incorrectly classified as negative and TN represents the number of samples correctly classified as negative.
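These four metrics can be computed from paired binary labels as in the following minimal sketch; the example label lists are hypothetical:

```python
def binary_metrics(y_true, y_pred):
    """Compute ACC, precision, recall and F1 for binary labels (1 = standard)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

# Hypothetical expert labels vs. model predictions
acc, precision, recall, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```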

Consistency between the AI-IQA system and expert audit results was assessed using Cohen’s kappa analysis, with the coefficient interpreted as follows: 0.81–1.00 (strong), 0.61–0.80 (moderate to strong), 0.41–0.60 (moderate), 0.21–0.40 (fair) and ≤ 0.20 (poor). To compare the time consumption of AI-IQA and expert audits, we used the Wilcoxon signed-rank test, as the paired differences were symmetrically distributed. McNemar’s test was used to evaluate differences between paired binary data (AI-IQA vs. expert assessments), as it is appropriate for categorical data with dependent observations. The chi-square test, which requires expected frequencies greater than 5 in each cell, was employed to evaluate the significance of improvements in the proportion of standard images among radiologists. P < 0.05 was considered statistically significant.
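Cohen’s kappa for two binary raters can be computed directly from a 2 × 2 agreement table, as in this sketch; the counts below are hypothetical and are not the study’s confusion matrix:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table:
    a = both raters say standard, b = rater1 standard / rater2 nonstandard,
    c = rater1 nonstandard / rater2 standard, d = both say nonstandard."""
    n = a + b + c + d
    p_o = (a + d) / n                                        # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical AI-IQA vs. expert counts for one plane
kappa = cohens_kappa(90, 4, 3, 23)
```

A value above 0.8 would fall in the "strong" band of the interpretation scale given above.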

Results

Characteristics of the study

The mean age and body mass index of pregnant women and the GA of the fetuses are summarized in Table 3. There were no significant differences (p > 0.05) in these characteristics between clinical test sets 1 and 2. The US devices used by the four radiologists in the clinical validation, including the GE Voluson E8, GE Voluson E10 and Samsung WS80A, are also listed in Table 3.

Table 3 Characteristics of pregnant women undergoing routine prenatal screening in the first-trimester

Performance of the AI-IQA system in the internal test set

Using the internal test set, we assessed the capacity of the AI-IQA system to appraise the holistic quality of plane images and the integrity of the key structures.

The detection model achieved a precision of 0.83, recall of 0.85 and mean Average Precision at IoU 0.50 (mAP50) of 0.85 across the four types of standard plane images. The auditing results of the AI-IQA system, including ACC, precision, recall and F1-score, consistently exceeded 0.8, as summarized in Table 4. Notably, for the appraisal of image quality, the system achieved a remarkable average ACC of 0.881, with individual ACCs for NTP, MSF, AFTP and AFA of 0.879, 0.857, 0.914 and 0.875, respectively. These results suggest that the AI-IQA system is effective and reliable in evaluating both overall image quality and the structural integrity of major components during internal testing.

Table 4 Performance of the artificial intelligence-based image quality audit system in the internal test set

To better visualize the AI-IQA predicted scores alongside the expert annotation results, examples of the four plane images are shown in Fig. 4. In the table on the right side of Fig. 4, the second column lists the AI-IQA predicted scores and the third column lists the expert-annotated results, which served as the gold standard.

Fig. 4

Comparison of artificial intelligence-based image quality audit predicted scores with expert annotated results. The green rectangles represent the region of interest for the plane. The red rectangles delineate the anatomical structures appearing in the plane image, including both main and mutually exclusive structures. NTP: nuchal translucency plane, MSF: midsagittal view of the fetus, AFTP: axial view of fetal head in the transventricular plane, AFA: axial view of fetal abdomen, NT: nuchal translucency, IT: intracranial translucency, LV: lateral ventricle, GN: gonadal node, SSS: spinal sagittal section, CBO: cranial bone ossification, UCI: umbilical cord insertion, CP: choroid plexus

Performance of the AI-IQA system in clinical test sets

For a more in-depth analysis of the performance of the AI-IQA system in the clinical test sets, two experienced experts independently assessed the image quality based on the same criteria; disagreements were resolved by consensus after discussion. Table 5 shows the number of clinical test set images classified as standard and nonstandard by the AI-IQA system and the experts. The AI-IQA system identified 93.6% (596/637), 85.2% (564/662), 88.6% (553/624) and 75.7% (485/641) of NTP, MSF, AFTP and AFA plane images, respectively, as standard; for the experts, these values were 94.2%, 86.4%, 89.7% and 76.9%, respectively. The AI-IQA and expert results were thus strongly consistent, with Cohen’s kappa coefficients exceeding 0.8 for all four planes.

Table 5 Comparison of clinical test set images analyzed by the AI-IQA system and experts

Table 6 summarizes the average time taken by the AI-IQA system and experts to assess the quality of each clinical test set image. The AI-IQA system demonstrated an average evaluation time of 0.05 s per image, over 100 times faster than the experts (9.8–36.7 s). The difference between the two was statistically significant (p < 0.001), as determined by the Wilcoxon signed-rank test.

Table 6 A summary of the time taken to assess each image

For a more systematic validation of whether the AI-IQA results comply with clinical practice standards, we used the expert audit results as the gold standard and calculated the ACC of AI-IQA in the four planes, as shown in Table 7. In clinical test set 1, the ACCs in the four planes were 0.884, 0.841, 0.896 and 0.825, with an average of 0.862. In clinical test set 2, the ACCs were 0.932, 0.964, 0.881 and 0.847, with an average of 0.906. Additionally, for each radiologist group, we compared the percentage of standard images reported by AI-IQA with that reported by the expert audit. All results were closely aligned, with McNemar’s test indicating no statistically significant difference between the AI-IQA and expert results (p > 0.05).

Table 7 Accuracy of the AI-IQA system and a comparison of junior and mid-level radiologists, with tests indicating consistency between AI-IQA and expert results (p > 0.05)

Image quality before and after AI-IQA feedback

To assess the effect of the AI-IQA feedback on image quality, we compared the image quality for the two groups of radiologists before and after receiving AI-IQA feedback. In each feedback round, US images acquired by the radiologists within a recent period were collected and audited by the AI-IQA system. Feedback was delivered as a visual report with quality scores for each item, guiding radiologists on improvements. Expert assessment results were used as gold standards for comparison. As presented in Table 8, both junior and mid-level radiologists exhibited an improvement in the quality of obtained images after AI-IQA feedback training. Specifically, the proportion of standard images obtained by junior radiologists increased from 80.8% to 88.5% and that obtained by mid-level radiologists increased from 86.7% to 91.8%, improvements of 7.7% and 5.1%, respectively. The chi-square test indicated that the differences in pass rates were statistically significant (P < 0.05). Analysis of the various imaging planes demonstrated that both groups achieved a standard image rate exceeding 90% on NTP and MSF post-feedback images. AFA image quality was initially subpar, particularly among junior radiologists, who achieved a standard image rate of 62.5%. However, this value improved to 76.6% after AI-IQA feedback.
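For a 2 × 2 before/after comparison of pass rates, the Pearson chi-square statistic has a simple closed form, sketched below; the counts are hypothetical (chosen to mirror the reported proportions), not the study’s actual data:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (e.g. rows = before/after feedback, cols = standard/nonstandard)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: 323/400 standard before vs. 354/400 after feedback
stat = chi_square_2x2(323, 77, 354, 46)
significant = stat > 3.841  # critical value for df = 1 at alpha = 0.05
```

Comparing the statistic against the df = 1 critical value reproduces the pass/fail significance decision without needing a p-value table.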

Table 8 Proportion of standard images obtained by junior and mid-level radiologists

To further analyze the performance of radiologists, the error proportions of different attributes in all nonstandard images were calculated, as shown in Fig. 5. Both junior and mid-level radiologists made the most errors in the representation of the main structures, with error rates of 58.76% and 52.64%, respectively. This was followed by errors in overall image evaluation, with rates of 35.33% and 38.53%, respectively. Over 90% of nonstandard NTP and MSF images did not clearly display the brain structures. Additionally, the fetal position and display of NT error rates, which can significantly affect CRL and NT measurements, exceeded 70% in both groups. Falx cerebri and choroid plexus were the two structures most likely to be imaged incorrectly in AFTP images, whereas in the worst-performing AFA plane, the lack of clear display of the spine and umbilical cord insertion (UCI) were the two main reasons for images being classified as nonstandard.

Fig. 5

Summary of the error proportions of different attributes in nonstandard images across all four planes. Images obtained by junior radiologists are analyzed in the first subplot, whereas those obtained by mid-level radiologists are analyzed in the second subplot. NTP: nuchal translucency plane, MSF: midsagittal view of the fetus, AFA: axial view of fetal abdomen, AFTP: axial view of fetal head in the transventricular plane, NT: nuchal translucency, IT: intracranial translucency, LV: lateral ventricle, GN: gonadal node, SSS: spinal sagittal section, CBO: cranial bone ossification, CP: choroid plexus, UCI: umbilical cord insertion

Discussion

Principal findings

Many previous studies have emphasized the importance of quality control in fetal ultrasonography, with regular auditing of images and feedback on identified issues contributing to improvements in image quality. However, manual assessments are associated with low consistency and inefficiency, limiting their potential to significantly enhance examination quality. To overcome the deficiencies of manual auditing, we developed an AI model, AI-IQA, which utilizes plane structure detection and a quality regression network to intelligently audit the image quality of the four key planes of FTS.

Recent advancements in DL have demonstrated its potential in fetal US imaging. Chen et al. pioneered the use of CNNs to locate the fetal abdominal standard plane (FASP) in US videos [30], while Zhang et al. focused on recognizing key anatomical structures in the fetal abdomen, head and heart [31]. Other studies have achieved real-time detection in 2D and 3D US videos [32, 33]. However, these approaches either focus on mid-to-late pregnancy images or are limited to a single standard plane in early pregnancy, lacking comprehensive quality assessments. In the work most comparable to ours, Zhen et al. developed a quality control system for early-pregnancy images based on expert scoring [34]. It is important to note that, unlike these studies, which primarily focused on assisting radiologists in detecting standard planes during US screening, our objective is to evaluate the quality of acquired images. This represents a subtle difference from other tasks, and research in this area remains limited. Our work aims to provide a quality auditing tool that offers feedback to help radiologists improve image quality.

To validate performance, we verified the AI-IQA system in an internal test set and two clinical test sets, with results indicating that the AI-IQA system meets clinical practice standards and is consistent with expert audits. Following AI-IQA feedback, both junior and mid-level radiologists exhibited substantial improvements in obtained image quality.

Clinical implications

Previous investigations have shown that comprehensive image quality audits can improve the completeness and quality of US scans conducted by radiologists [35]. However, manual image auditing is labor-intensive and requires experienced experts to ensure the validity of the assessment results. Therefore, hospitals can perform only a limited number of selective audits, which may not fully assess the performance of radiologists or provide timely feedback. In less developed regions, resource shortages may further limit effective auditing, exacerbating healthcare quality disparities.

In theory, an AI-based model can serve as a comprehensive and convenient approach to US image quality control that tackles the above issues. Compared with manual assessment, the AI-IQA system provides a uniform and objective evaluation, eliminating inter-observer variability and addressing human resource disparities between hospitals. The AI-IQA system is also more than 100 times faster than manual assessment, significantly enhancing efficiency. Freed from the constraints of observer experience and time, AI-IQA should enable comprehensive quality audits of all cases, substantially elevating the quality of US screening in hospitals.

Research implications

Standard US plane images are fundamental for assessing anatomical structures and performing biometric measurements in maternal-fetal screening, enabling accurate diagnoses that guide clinical decision-making and pregnancy care. Therefore, this study explored the value of AI-IQA feedback in improving image quality among radiologists with different levels of experience.

NT thickness and CRL are two key first-trimester biometric measurements, performed using NTP and MSF images, respectively. These measurements are crucial for chromosomal abnormality screening and calculating GA. In this study, we found that nonstandard images in these planes were mostly due to radiologists failing to position the fetus in the neutral position and incorrectly believing that showing only a small portion of the NT fulfilled the requirement. However, positional flexion or extension can alter the measured head-to-buttock distance, and NT measurement requires a clear view of the entire cross-section of the fetal neck, including the skin boundaries on both sides of the neck and head.

These issues may reflect a lack of operational experience or thorough understanding of standard planes. Even more experienced mid-level radiologists can sometimes overlook these critical details. In the AFA plane, both junior and mid-level radiologists failed to obtain a clear view of the UCI and spine, resulting in a lower standard rate of AFA images. These findings reveal some common issues that affect the quality of hospital US examinations. Therefore, it is essential that junior and mid-level radiologists receive continuous education and training to enhance the overall quality of US examinations.

The AI-IQA system objectively and clearly identified these issues, and the quality of NTP, MSF and AFA images obtained by both junior and mid-level radiologists improved significantly following AI-IQA feedback. The proportion of standard images obtained by junior radiologists after AI-IQA feedback approached, or even surpassed, that of mid-level radiologists before feedback, suggesting that the AI-IQA system could shorten the hospital training cycle. In the AFA plane in particular, the pre-feedback standard rate differed significantly between the junior and mid-level groups, indicating that this plane may be challenging to scan and that experience is especially important here. By helping radiologists shorten the period over which experience must accumulate, AI-IQA enables junior radiologists to approach the level of senior radiologists within a short time. The intervention of the AI system deepened radiologists’ understanding of standard planes, corrected previous misconceptions and provided continuous education. This study indicates that AI-IQA not only performs accurate quality audits but also has value in improving the quality of images obtained by radiologists with varying levels of expertise.

Strengths and limitations

This investigation represents a pioneering endeavor to introduce an AI-IQA system meticulously tailored to fetal plane images during FTS. Our work has substantiated the viability of this methodology, which may streamline intelligent oversight of quality control protocols in obstetric imaging. It is important to note that, while the AI-IQA system showed a good level of agreement with expert evaluations, it is better suited as a tool for audits and learning rather than as a direct replacement for human clinical assessment.

This study has several limitations. First, our data cannot determine whether the improvement in image quality is attributable more to AI-IQA feedback than to the accumulation of the radiologists’ own experience and skills over time. However, since our mid-level radiologists have 5–10 years of clinical experience, it is unlikely that the significant improvement in their image quality can be attributed solely to a few months of additional experience. Additionally, the study was limited to fetal images with normative anatomical structures, which may restrict the generalizability of the AI-IQA system to more complex populations and real-world clinical settings, where a broader and more heterogeneous patient demographic is typically encountered. Finally, this single-center validation involved a relatively homogeneous sample obtained from only three US machines. The performance of the AI-IQA system may vary across US equipment, as each machine has unique imaging characteristics and signal processing methods. Future work will involve multi-center validation to assess the performance of the AI-IQA system more comprehensively. We also plan to address sample size and imbalance issues using generative methods and to include more heterogeneous samples to enhance the model’s generalizability [36]. Furthermore, we aim to conduct a large-scale controlled study to evaluate the impact of the system on shortening hospital training cycles.

Conclusion

In this study, we developed an AI-IQA system for automatically auditing image quality during FTS, which showed good consistency with expert assessments. Our results indicate that AI-IQA is a useful tool for conducting comprehensive quality audits and has the potential to support radiologists with varying levels of expertise in improving image quality, contributing to maternal and fetal health screening. In clinical practice, the AI-IQA system could significantly improve efficiency and optimize resource utilization, particularly in resource-limited settings. Further multi-center validation and the development of supporting software will be essential to facilitate the clinical adoption and integration of this technology.

Data availability

The datasets and codes are not publicly available because of hospital policy and privacy considerations but are available from the corresponding author upon reasonable request.

Abbreviations

ACC:

Accuracy

AFA:

Axial view of fetal abdomen

AFTP:

Axial view of fetal head in the transventricular plane

AI:

Artificial intelligence

AI-IQA:

Artificial intelligence-based image quality audit

CBO:

Cranial bone ossification

CRL:

Crown-rump length

DL:

Deep learning

GA:

Gestational age

GN:

Gonadal node

ISUOG:

International Society of Ultrasound in Obstetrics and Gynecology

IT:

Intracranial translucency

LV:

Lateral ventricle

MSF:

Midsagittal view of the fetus

NT:

Nuchal translucency

NTP:

Nuchal translucency plane

ROI:

Region of interest

SSS:

Spinal sagittal section

UCI:

Umbilical cord insertion

US:

Ultrasound

IoU:

Intersection over union

BCE:

Binary cross-entropy

MSE:

Mean squared error

mAP50:

Mean average precision 50

CNN:

Convolutional neural network

References

  1. Zaffino P, Moccia S, De Momi E, Spadea MF. A review on advances in Intra-operative imaging for surgery and therapy: imagining the operating room of the future. Ann Biomed Eng. 2020;48:2171–91.

  2. Salomon LJ, Alfirevic Z, Berghella V, Bilardo CM, Chalouhi GE, Da Silva Costa F, et al. ISUOG practice guidelines (updated): performance of the routine mid-trimester fetal ultrasound scan. Ultrasound Obstet Gynecol. 2022;59:840–56.

  3. International Society of Ultrasound in Obstetrics and Gynecology, Bilardo CM, Chaoui R, Hyett JA, Kagan KO, Karim JN, et al. ISUOG practice guidelines (updated): performance of 11–14-week ultrasound scan. Ultrasound Obstet Gynecol. 2023;61:127–43.

  4. Liao Y, Wen H, Ouyang S, Yuan Y, Bi J, Guan Y, et al. Routine first-trimester ultrasound screening using a standardized anatomical protocol. Am J Obstet Gynecol. 2021;224:e3961–39615.

  5. Chinese Society of Ultrasound in Medicine Obstetric Ultrasound Group, National Health Commission Maternal and Child Health Division National Expert Group on Prenatal Diagnosis Medical Imaging Group. Guidelines for prenatal ultrasound screening. Chin J Ultrasonography. 2022;31:1–12.

  6. Sun Y, Zhang L, Dong D, Li X, Wang J, Yin C, et al. Application of an individualized nomogram in first-trimester screening for trisomy 21. Ultrasound Obstet Gynecol. 2021;58:56–66.

  7. Charasson T, Ko-Kivok-Yun P, Martin F, Sarramon MF. Screening for trisomy 21 by measuring nuchal translucency during the first trimester of pregnancy. J Gynecol Obstet Biol Reprod (Paris). 1997;26:671–8.

  8. Chen M, Xue S, Chen J, Chen D, Liu Y, Yan H. OP05.09: correlation between increased nuchal translucency and chromosomal abnormalities. Ultrasound Obstet Gynecol. 2019;54:101–101.

  9. Almeida A, Moura CV, Alves TM, Braga J, Martins LG, Cunha A. VP28.05: predicting chromosomal abnormalities through first trimester screening and nuchal translucency: experience of a tertiary centre. Ultrasound Obstet Gynecol. 2021;58:213.

  10. Napolitano R, Dhami J, Ohuma E, Ioannou C, Conde-Agudelo A, Kennedy S, et al. Pregnancy dating by fetal crown–rump length: a systematic review of charts. BJOG: Int J Obstet Gynecol. 2014;121:556–65.

  11. Chaoui R, Orosz G, Heling KS, Sarut-Lopez A, Nicolaides KH. Maxillary gap at 11–13 weeks’ gestation: marker of cleft lip and palate. Ultrasound Obstet Gynecol. 2015;46:665–9.

  12. Verla MA, Style CC, Olutoye OO. Prenatal diagnosis and management of omphalocele. Semin Pediatr Surg. 2019;28:84–8.

  13. Volpe N, Dall’Asta A, Di Pasquo E, Frusca T, Ghi T. First-trimester fetal neurosonography: technique and diagnostic potential. Ultrasound Obstet Gynecol. 2021;57:204–14.

  14. Zhang N, Dong H, Wang P, Wang Z, Wang Y, Guo Z. The value of obstetric ultrasound in screening fetal nervous system malformation. World Neurosurg. 2020;138:645–53.

  15. Salomon LJ, Bernard JP, Ville Y. Quality control of prenatal ultrasound. A role for biometry. Gynecol Obstet Fertil. 2006;34:683–91.

  16. Chinese Physicians Association Ultrasound Physicians Branch. Expert consensus on standardized training and assessment criteria for obstetric ultrasound (2022 edition). Chin J Ultrasonography. 2022;31:369–78.

  17. Wu L, Cheng J-Z, Li S, Lei B, Wang T, Ni D. FUIQA: fetal ultrasound image quality assessment with deep convolutional networks. IEEE Trans Cybern. 2017;47:1336–49.

  18. He S, Lin Z, Yang X, Chen C, Wang J, Shuang X et al. Statistical Dependency Guided Contrastive Learning for Multiple Labeling in Prenatal Ultrasound. 2022.

  19. Burgos-Artizzu XP, Coronado-Gutiérrez D, Valenzuela-Alcaraz B, Bonet-Carne E, Eixarch E, Crispi F, et al. Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Sci Rep. 2020;10:10200.

  20. Fu Z, Jiao J, Yasrab R, Drukker L, Papageorghiou AT, Noble JA. Anatomy-Aware contrastive representation learning for fetal ultrasound. Comput Vis ECCV. 2022;2022:422–36.

  21. Migliorelli G, Fiorentino MC, Di Cosmo M, Villani FP, Mancini A, Moccia S. On the use of contrastive learning for standard-plane classification in fetal ultrasound imaging. Comput Biol Med. 2024;174:108430.

  22. Zhao H, Zheng Q, Teng C, Yasrab R, Drukker L, Papageorghiou AT, et al. Memory-based unsupervised video clinical quality assessment with multi-modality data in fetal ultrasound. Med Image Anal. 2023;90:102977.

  23. Qu R, Xu G, Ding C, Jia W, Sun M. Standard plane identification in fetal brain ultrasound scans using a differential convolutional neural network. IEEE Access. 2020;8:83821–30.

  24. Lin Z, Li S, Ni D, Liao Y, Wen H, Du J, et al. Multi-task learning for quality assessment of fetal head ultrasound images. Med Image Anal. 2019;58:101548.

  25. Fiorentino MC. A review on deep-learning algorithms for fetal ultrasound-image analysis. Med Image Anal. 2023;83:102629.

  26. Dong J, Liu S, Liao Y, Wen H, Lei B, Li S, et al. A generic quality control framework for fetal ultrasound cardiac four-chamber planes. IEEE J Biomed Health Inf. 2020;24:931–42.

  27. Liang J, Yang X, Huang Y, Li H, He S, Hu X, et al. Sketch guided and progressive growing GAN for realistic and editable ultrasound image synthesis. Med Image Anal. 2022;79:102461.

  28. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023:7464–75.

  29. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:770–8.

  30. Chen H, Ni D, Qin J, Li S, Yang X, Wang T, et al. Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE J Biomed Health Inf. 2015;19:1627–36.

  31. Zhang B, Liu H, Luo H, Li K. Automatic quality assessment for 2D fetal sonographic standard plane based on multitask learning. Medicine. 2021;100:e24427.

  32. Baumgartner CF, Kamnitsas K, Matthew J, Fletcher TP, Smith S, Koch LM, et al. SonoNet: Real-Time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans Med Imaging. 2017;36:2204–15.

  33. Ramirez Zegarra R, Ghi T. Use of artificial intelligence and deep learning in fetal ultrasound imaging. Ultrasound Obstet Gynecol. 2023;62:185–94.

  34. Zhen C, Wang H, Cheng J, Yang X, Chen C, Hu X, et al. Locating multiple standard planes in First-Trimester ultrasound videos via the detection and scoring of key anatomical structures. Ultrasound Med Biol. 2023;49:2006–16.

  35. Yaqub M, Kelly B, Stobart H, Napolitano R, Noble JA, Papageorghiou AT. Quality-improvement program for ultrasound-based fetal anatomy screening using large-scale clinical audit. Ultrasound Obstet Gynecol. 2019;54:239–45.

  36. Lasala A, Fiorentino MC, Bandini A, Moccia S. FetalBrainAwareNet: bridging GANs with anatomical insight for fetal ultrasound brain plane synthesis. Comput Med Imaging Graph. 2024;116:102405.

Acknowledgements

We thank Editage for providing professional English language editing.

Funding

This work was supported by the Research Fund of the Shenzhen Health Economics Association (2023149), Futian District Health and Health System Research Fund (FTWS2023013), Science and Technology Planning Project of Guangdong Province (2023A0505020002), Science and Technology Development Fund of Macao (0021/2022/AGJ) and National Natural Science Foundation of China (Nos.62101343, 62171290).

Author information

Contributions

XYC, XY: Conceptualization; XYC, YC, SKZ, YLY, HLL, TW: Data collection, Annotations, Quality control; XYC, BHL, YSZ, XY, YC, XDH, CYC: Methods, Results analysis; XYC, YC, BHL, YSZ: Manuscript writing; XYC, XY, TT, XDH, CYC: Funding acquisition, Project administration; XYC, XY, BHL, YSZ, LW, DN: Supervision; XYC, XY, BHL, YSZ, YC, XDH, CYC, SKZ, HLL, TW, YLY, TT, LW, DN: Manuscript revision. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lin Wang.

Ethics declarations

Ethics approval and consent to participate

The study was approved by the Ethics Committee of Shenzhen Futian District Maternity & Child Healthcare Hospital (protocol number: K-2023-04-01). Informed consent was obtained from all pregnant women. This study was conducted in accordance with the principles outlined in the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

12884_2025_7485_MOESM1_ESM.docx

Supplementary Material 1: Detailed Description of Development of the AI-IQA System. This supplementary file provides a detailed methodology of the AI-IQA system, including the workflow of the YOLOv7 object detection network and the ResNet50-based quality regression network, along with data processing details. The document supports the model development and experimental setup for image quality assessment in the study.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Cao, X., Li, B., Zhou, Y. et al. Effectiveness and clinical impact of using deep learning for first-trimester fetal ultrasound image quality auditing. BMC Pregnancy Childbirth 25, 375 (2025). https://doi.org/10.1186/s12884-025-07485-4
