Introduction
Colonoscopy is an effective tool for colorectal cancer prevention, enabling the detection and removal of precancerous polyps. Early detection and removal of colorectal polyps can significantly reduce colorectal cancer incidence. However, colonoscopy is a complex procedure, and even experienced endoscopists may miss a substantial fraction of polyps; a recent meta-analysis reported miss rates of approximately 26% for adenomatous polyps.1 To assist clinicians, computer-aided detection (CADe) systems using artificial intelligence (AI) have been developed to help identify polyps.2 While AI-driven systems have shown promising results in adenoma detection, their clinical utility is often limited by a significant challenge: poor generalization across different types of lesions. Most current CADe models are trained on large annotated datasets and tend to perform well only on lesion types similar to their training data. Their performance degrades substantially when encountering unseen lesion types, such as those with different shapes, sizes, or textures.3,4 Parallel advances likewise show that colorectal AI is expanding from real-time colonoscopic recognition and digital pathology quantification to machine-learning-based blood or laboratory screening and AI-supported malignant polyp characterization and surveillance,5 -10 collectively underscoring the need for robust systems that can transfer across heterogeneous clinical settings.
In this study, we investigate the issue of cross-dataset generalization, specifically a zero-shot lesion detection scenario across different types of colonic lesions. We aimed to train a generalizable model on one category of colorectal lesion and then enable it to detect other lesion types without additional training. For example, a model trained only on images of adenomatous polyps should also be able to identify a different lesion type, such as an early cancerous lesion, during testing. This “zero-shot” setting requires the model to recognize patterns from a class it has never explicitly seen during training, a capability that is essential for real-world clinical deployment, where lesion presentations are highly variable.11,12
Recent advances in vision-language models offer a potential solution.13-15 From a translational gastroenterology perspective, recent reviews have further emphasized that advanced endoscopic imaging, optical biopsy, and lesion-tailored resection strategies are central to improving colorectal cancer detection and management.16,17 A notable study introduced Anomaly-Aware CLIP (AA-CLIP), built upon Contrastive Language-Image Pretraining (CLIP) adapting the powerful CLIP model for industrial anomaly detection.13,18 AA-CLIP enhances CLIP’s ability to distinguish between normal and abnormal samples by learning anomaly-focused text “anchors” and aligning patch-level visual features to these anchors using transformer-based adapters. Building on the AA-CLIP architecture, we present a zero-shot lesion detection framework designed to overcome the cross-lesion detection problem. Our approach introduces a set of modifications to enhance its applicability to medical imaging.19-24
To improve the model’s discrimination between pathologic and normal tissue, we develop a contrastive learning strategy using pseudo-normal image patches extracted from lesion-containing images. To enhance semantic understanding, we design attribute-enriched text prompts that describe the visual appearance of lesions using generalizable textual attributes. Finally, to further improve lesion localization, we refine the visual encoder by integrating residual adapters at multiple layers and applying generalized mean (GeM) pooling for superior multi-scale feature aggregation.19-21,24-27
We validate our model under a cross-lesion testing protocol: the model is trained on labeled images from a single dataset and subsequently evaluated on multiple unseen colonoscopy datasets containing a wide range of polyp pathologies, without using any target-domain data for model tuning. This experimental design simulates real-world deployment, where a model encounters data from unseen clinical scenarios, and directly tests the framework’s generalization performance. This study aimed to develop and validate a generalizable AI framework for polyp detection across unseen colonoscopy datasets and to evaluate its performance under a cross-lesion testing protocol that simulates real-world clinical scenarios.
Materials and methods
Dataset and preprocessing
We evaluated the zero-shot lesion detection framework using four publicly available colonoscopy image datasets: CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, and CVC-300. CVC-ClinicDB contains 612 colonoscopy frames extracted from clinical videos, each annotated with expert-provided segmentation masks. CVC-ColonDB comprises publicly available colonoscopy images with corresponding pixel-level polyp annotations. Kvasir-SEG provides a larger public dataset of polyp images with pixel-level masks. The CVC-300 test subset used in this study contains 60 colonoscopy images with ground-truth polyp segmentation masks.
All images were resized to 518 × 518 pixels to match the input resolution of the Vision Transformer. We normalized the images using the standard mean and standard deviation from the original CLIP preprocessing. We adopted a rigorous leave-one-dataset-out evaluation protocol. For each cross-dataset experiment, images from one dataset were used for training (the source domain), and the trained model was then tested on the remaining three datasets (the target domains) in a zero-shot setting (with no fine-tuning on the target data). This protocol evaluates the ability of the model to generalize to unseen polyp pathologies and imaging conditions.
Model architecture
Our approach extends the AA-CLIP framework, which adapts the CLIP model for anomaly detection. CLIP consists of an image encoder, specifically a ViT-L/14 Vision Transformer, and a transformer-based text encoder to map images and text descriptions into a shared embedding space.13 Although CLIP has shown strong zero-shot performance in general vision tasks, its standard formulation is anomaly-unaware, as it cannot inherently distinguish normal from abnormal features in fine-grained medical images. AA-CLIP is designed to address this gap by introducing lightweight transformer adapter modules and an anomaly-focused training strategy.18-22,24 These adapters refine the model features for anomaly detection without disrupting the foundational knowledge acquired during CLIP pretraining.
The training process is structured in two sequential stages designed to align visual features with anomaly-specific textual concepts, as illustrated in Figure 1. In Stage 1, the focus is on refining the textual representations. The pre-trained CLIP text encoder is fine-tuned with lightweight adapters to produce distinct textual anchors for normal and abnormal semantics. This process produces a pair of text embeddings representing contrasting concepts: one for a normal description (e.g., healthy colon tissue) and one for an abnormal description (e.g., a colorectal lesion).
In Stage 2, the visual encoder is adapted to align with these optimized text anchors. Visual adapters are utilized to fine-tune the image encoder such that image patch features from anomalous regions align more closely with the abnormal text anchor than with the normal one. The model learns the correspondence between local, patch-level visual features and the text embeddings in the shared latent space. For inference, the model generates a pixel-wise similarity map by comparing extracted visual features with both normal and abnormal text anchors, thereby enabling localization of anomalous regions by identifying patches that exhibit higher similarity to the abnormal descriptions.
Medical attribute-enriched text prompts
The development of discriminative text prompts is crucial for effective zero-shot detection in vision-language models. Whereas typical approaches use simple prompts (e.g., “a photo of a diseased colon” versus “a photo of a normal colon”), we introduce medical attribute-enriched prompts that incorporate domain-specific descriptors of colon pathologies. We design a comprehensive set of text prompts detailing salient visual characteristics of lesions, moving beyond generic disease labels. For example, an abnormal prompt might be “colon mucosa with ulceration and irregular polypoid lesion” rather than “diseased colon.” These medical-specific descriptions explicitly include visual attributes such as ulceration, bleeding, erosion, or polypoid growth.
This attribute-enriched approach provides a more nuanced semantic grounding for abnormality, leading to more descriptive and discriminative text anchors. Furthermore, because these attributes are generalizable across diverse lesion morphologies, this strategy improves cross-dataset generalization by preventing the model from overfitting to the dataset-specific visual features of a single lesion type. The abnormal text anchor becomes an attribute-aware representation of a lesion, which enhances its effectiveness when detecting novel lesion appearances in unseen test datasets.20,21,24 The full set of prompts is provided in Supplementary Table 1.
Visual adapter refinement and feature aggregation
We propose a lightweight visual adapter strategy that modulates the CLIP ViT-L/14 image encoder with per-layer gating and a GeM pooling layer, designed to enhance cross-dataset generalization without introducing computationally expensive attention blocks.
First, we introduce adaptive per-layer gating for the visual adapters. While the original AA-CLIP design inserts residual adapters primarily in the early layers of the image encoder, we extend this approach by learning a scalar gating parameter ωil for each adapter at layer l. This parameter scales the adapter’s output before it is integrated into the main network via a residual connection:
Fout(l)=Fin(l)+ωi(l)∙Adapter(Fin(l))
These gating weights are learned during training, allowing the model to adaptively modulate the contribution of each adapter layer, optimizing the balance between pretrained knowledge and task-specific feature refinement.
Second, to derive a global image representation that complements the patch-level anomaly map, we apply GeM pooling over the vision encoder’s output feature map. GeM pooling is a differentiable operation that generalizes average and max pooling by taking the p-th power mean of feature activations, with p being a learnable parameter. This enables the model to dynamically prioritize regions with high activation while still aggregating information from the entire image, resulting in a robust global embedding summarizing the image’s abnormal content. We use this pooled embedding for image-level anomaly classification in conjunction with the patch-level segmentation output. By refining patch features at multiple network layers and then pooling them, the model better captures both the fine-grained details and the global context necessary for accurate lesion localization.19,28
Contrastive learning with pseudo-normal patches
In the original AA-CLIP framework, image-level anomaly detection is guided by a binary cross-entropy classification objective. We hypothesize that a patch-level training loss better exploits local lesion cues and improves localization granularity. To this end, we replace the image-level loss with a pseudo-normal patch contrastive (PNC) loss. For a given training image containing a lesion and its corresponding mask, we sample representative background patches from the non-lesion regions of the same image. These patches serve as “pseudo-normal” prototypes (samples of normal tissue from an otherwise abnormal image). We then enforce a contrastive objective that maximizes the distance between the embeddings of these background patches and the abnormal lesion features in the embedding space. This patch-level contrastive approach yields more robust lesion localization, as the model learns to distinguish lesion content from normal tissue context within each image, effectively learning a localized normality model from abnormal samples.25-27
Endoscopy-specific data augmentation
To address the critical challenge of domain shift in medical imaging, we develop a comprehensive data augmentation strategy designed to simulate the diverse visual conditions encountered in clinical colonoscopy. Our pipeline incorporates a set of transformations that mimic common endoscopic artifacts, thereby enhancing the model’s robustness to these variations. These transformations include: (1) Photometric Perturbations: To account for variations in endoscope hardware and lighting, we apply random shifts in white balance and color temperature, as well as gamma correction to model exposure drift. (2) Optical and Motion Artifacts: We simulate specular highlights, a common artifact in mucosal imaging, by adding bright Gaussian blobs to the image. Motion blur is introduced using small directional kernels to reflect artifacts from rapid endoscope or tissue movement. Furthermore, optical effects such as vignetting and mild fisheye distortion are included to model the characteristics of wide-angle endoscope lenses. (3) Compression Artifacts: To mimic the effects of video compression used in clinical archiving systems, we simulate compression-like artifacts by performing a down-sampling and subsequent up-sampling operation. This strategy embeds domain knowledge of the endoscopic procedure into the training process, enhancing the model’s robustness to real-world clinical variations. The complete augmentation pipeline with all probabilities and magnitude ranges is provided in Supplementary Table 2.
Results
Training and implementation details
We implemented the proposed method in PyTorch and used a frozen ViT-L/14-336 CLIP backbone with 518 × 518 inputs on a single NVIDIA A5000 GPU.13 Only the lightweight text and visual adapters were trainable. In Stage 1, we trained the text adapter for 20 epochs with Adam (lr = 1 × 10-5, batch size = 16). The training objective combined segmentation loss (Dice + Focal) with an orthogonality regularizer that separated normal and anomalous embeddings. In Stage 2, we trained the image adapter for 40 epochs with Adam (lr = 5 × 10-4, batch size = 2) and a multi-step scheduler (milestones at 0.5T and 0.75T). The Stage 2 training objective combined fused segmentation loss, an auxiliary per-level segmentation loss (weight = 0.25), and a PNC loss (weight = 1.0). Complete training hyperparameters and tuning policies are summarized in Supplementary Table 3.
PNC learning: The binary mask was downsampled to the ViT patch grid (37 × 37 tokens for patch size 14). A patch was labeled as lesion if its foreground occupancy was ≥ 0.25; pseudo-normal patches were drawn from a 3 × 3 dilated non-lesion ring with occupancy ≤ 0.05. Up to 64 patches of each type were sampled per image. A temperature-scaled (τ = 0.07) patch-to-anchor classification loss was used with a prototype-separation margin of 0.10.
Text prompts: Anchors were built from 5 normal and 5 abnormal prompts combined with 4 templates (20 + 20 text prompts). For colorectal data, the abnormal anchor was enriched with 12 lesion-attribute terms instantiated through 3 templates (36 sentences), fused with the base abnormal anchor at α = 0.5. The full prompts are listed in Supplementary Table 1.
Endoscopy-specific augmentation: The pipeline applied horizontal/vertical flips, white-balance jitter (p = 0.5), color-temperature shift (p = 0.5, ±0.08), gamma correction (p = 0.5, 0.8–1.2), synthetic specular highlights (p = 0.3; 1–3 Gaussian blobs, radius 3%–8%, intensity 0.10–0.35), motion blur (p = 0.2; kernel size 3–7, angle range 0°–180°), vignetting (p = 0.25), fisheye distortion (p = 0.2), and compression simulation (p = 0.3). GeM pooling was initialized at p = 3.0 and learned during training; visual gate scalars used sigmoid parameterization initialized at 0.1.
Overall findings
We evaluated the proposed framework under a rigorous leave-one-dataset-out zero-shot protocol. In each run, the model was trained on a single source dataset and evaluated directly, without any target-domain fine-tuning, on the remaining three unseen datasets. We benchmarked our approach against a standard U-Net baseline using pixel-level area under the ROC curve (AUROC) and average precision (AP) as the primary metrics.
As summarized in Table 1, the proposed CLIP-based framework outperformed the U-Net baseline in all 24 AUROC/AP comparisons (12 transfer scenarios × 2 metrics). Averaged over all transfer settings, our method achieved a mean pixel AUROC of 94.34% and a mean pixel AP of 81.14%, compared with 87.22% and 62.21% for U-Net, respectively. The average gains were therefore +7.12 AUROC points and +18.93 AP points. A pooled paired analysis across all leave-one-dataset-out transfer settings further showed mean per-image improvements of +7.41 AUROC points, +19.47 AP points, and +0.231 Dice, with all overall Holm-corrected p-values smaller than 10−6.
| Protocol | Pixel AUROC ↑ | Pixel AP (PR) ↑ |
|---|
| Train set | Zero-shot test set | Ours | U-Net | P for AUROC | Ours | U-Net | P for AP |
|---|
| CVC-ColonDB | CVC-ClinicDB | 96.38 | 87.70 | <10-6 | 85.31 | 65.32 | <10-6 |
| Kvasir-SEG | 94.87 | 84.88 | <10-6 | 85.06 | 59.50 | <10-6 |
| CVC-300 | 99.96 | 99.11 | 1.34×10-2 | 99.02 | 95.55 | 3.57×10-1 |
| CVC-ClinicDB | CVC-ColonDB | 88.03 | 81.77 | <10-6 | 69.50 | 52.55 | <10-6 |
| Kvasir-SEG | 96.64 | 89.31 | <10-6 | 89.76 | 69.70 | <10-6 |
| CVC-300 | 99.75 | 95.93 | <10-6 | 92.38 | 71.62 | 4.11×10-6 |
| Kvasir-SEG | CVC-ColonDB | 93.20 | 86.34 | <10-6 | 72.20 | 54.38 | <10-6 |
| CVC-ClinicDB | 96.58 | 94.83 | <10-6 | 90.17 | 82.12 | <10-6 |
| CVC-300 | 99.49 | 96.64 | 2.22×10-4 | 90.30 | 77.03 | 1.05×10-3 |
| CVC-300 | CVC-ColonDB | 85.24 | 75.40 | <10-6 | 55.97 | 34.31 | <10-6 |
| CVC-ClinicDB | 92.36 | 76.44 | <10-6 | 72.67 | 41.86 | <10-6 |
| Kvasir-SEG | 89.39 | 78.25 | <10-6 | 71.72 | 42.56 | <10-6 |
The statistical results further suggest that these improvements are not driven by isolated cases. After Holm correction, the AUROC advantage remained significant in all 12 transfer settings, while the AP advantage remained significant in 11 of 12 settings. The only non-significant AP comparison was CVC-ColonDB → CVC-300 (P = 3.57 × 10-1), where both methods were already close to ceiling performance. Because several benchmarks are frame-based video datasets, these P-values should be interpreted as frame-level paired significance rather than patient-level or video-level inference.
Zero-shot cross-dataset performance
We next analyze the transfer behavior by source dataset. Table 1 shows that the strongest average performance was obtained when training on CVC-ColonDB, where our method achieved a mean AUROC of 97.11% and a mean AP of 89.67% across the three unseen targets, compared with 90.56% and 73.46% for U-Net. In this setting, the gains were especially large on CVC-ClinicDB (85.00% vs. 65.32% AP) and Kvasir-SEG (85.13% vs. 59.50% AP). On CVC-300, both methods were close to saturation (98.89% vs. 95.55% AP), which explains why the AP difference in this single transfer was not statistically significant after correction (P = 3.57 × 10-1), despite a still significant AUROC gap (P = 1.34 × 10-2).
When trained on the larger CVC-ClinicDB and Kvasir-SEG source domains, our model also showed stable cross-dataset transfer. For CVC-ClinicDB → Kvasir-SEG, the AP improved from 69.70% to 89.76% (+20.06 points). For Kvasir-SEG as the source, the model maintained AUROCs of 93.20% to 99.49% and APs of 72.20% to 90.30% across all three unseen targets, outperforming U-Net in every case. Importantly, all six AUROC/AP comparisons involving these two source datasets remained significant after Holm correction, with the largest corrected P-values still small (2.22 × 10-4 for AUROC and 1.05 × 10-3 for AP).
The most challenging setting remained training on CVC-300. Even in this case, our method achieved a mean AUROC of 89.00% and a mean AP of 66.79%, compared with 76.70% and 39.58% for U-Net, corresponding to improvements of +12.30 AUROC points and +27.21 AP points. The largest AP gains were observed on CVC-ClinicDB (+30.81 points) and Kvasir-SEG (+29.16 points). All corresponding corrected P-values were extremely small (e.g., CVC-300 → CVC-ClinicDB: AUROC, P < 10-6; AP, P <10-6), supporting that the advantage of the proposed anomaly-aware vision-language model is statistically robust rather than anecdotal. At the same time, because CVC-ClinicDB and CVC-300 are frame-based benchmarks extracted from video sequences, these significance results should be interpreted as frame-level evidence of consistent improvement rather than as direct patient-level or video-level inference.
Comparison with other vision-language backbones
We compare against two additional vision-language backbones, ALBEF and BLIP,29,30 under the same zero-shot protocol (CVC-ColonDB → CVC-ClinicDB, Kvasir-SEG, CVC-300) with identical text prompts. Two settings are used: Setting A (basic VL) trains only a plain segmentation head on the frozen backbone, excluding all proposed modules; Setting B (backbone replacement) swaps CLIP for ALBEF/BLIP while retaining our full pipeline. Both use the native 224 × 224 resolution of the public ALBEF/BLIP checkpoints.
As shown in Table 2, our segmentation modules consistently outperform ALBEF/BLIP in their basic-VL experiments (BLIP mean pixel AP rises from 44.07% to 60.36%). Regarding the experimental results, we have the following two observations: (i) Simply replacing the backbone with another generic vision-language encoder does not improve model performance, confirming that the anomaly-aware prompt design, contrastive patch learning, and multi-scale fusion modules are important for performance improvement. (ii) CLIP’s contrastive pre-training on large-scale image-text pairs provides a stronger pretrained representation for dense anomaly localization than the masked-language modeling used by ALBEF and BLIP.
| Method | CVC-ClinicDB | Kvasir-SEG | CVC-300 | Average |
|---|
| ALBEF-basic VL | 79.27/32.81/0.467 | 80.79/52.53/0.530 | 97.71/62.23/0.396 | 85.92/49.19/0.464 |
| BLIP-basic VL | 81.34/41.29/0.421 | 76.37/46.76/0.480 | 93.23/44.16/0.259 | 83.65/44.07/0.387 |
| ALBEF+Ours | 81.50/42.08/0.452 | 76.32/49.63/0.466 | 98.51/70.85/0.405 | 85.44/54.19/0.441 |
| BLIP+Ours | 86.46/53.42/0.445 | 80.41/53.37/0.516 | 98.19/74.30/0.339 | 88.35/60.36/0.433 |
| Ours (CLIP) | 96.38/85.31/0.787 | 94.87/85.06/0.808 | 99.96/99.02/0.911 | 97.07/89.80/0.835 |
Benchmarking against polyp segmentation and medical baselines
We further compared our method with two strong supervised polyp segmentation models, PraNet and TransUNet,2,31 and two medical vision-language methods, MADCLIP and MedCLIP.12,32 All methods were trained on CVC-ColonDB and evaluated in a zero-shot setting on CVC-ClinicDB, Kvasir-SEG, and CVC-300 under the same protocol. Since the target splits are positive-only, we report pixel AUROC, pixel AP, and Dice.
Table 3 shows that our method achieves the best mean AUROC and AP among all compared methods. PraNet, the strongest supervised baseline, yields a marginally higher mean Dice (0.8388 vs. 0.8353) but lower mean AUROC (-1.21) and AP (-1.30). Both PraNet and TransUNet remain competitive on the relatively easy CVC-300 transfer yet show a clear gap in the more challenging CVC-ClinicDB and Kvasir-SEG settings, where our method obtains the highest AUROC and AP. MADCLIP and MedCLIP rank below all other methods by a wider margin (MedCLIP mean AP: 59.21). Since they are originally designed for image-level classification rather than dense pixel-level segmentation, their feature embeddings lack the spatial representation needed for precise lesion localization.
| Method | Mean AUROC | Mean AP | Mean Dice |
|---|
| Ours | 97.07 | 89.80 | 0.8353 |
| PraNet | 95.86 | 88.50 | 0.8388 |
| TransUNet | 94.95 | 87.57 | 0.8369 |
| MADCLIP | 94.75 | 84.69 | 0.7452 |
| MedCLIP | 90.73 | 59.21 | 0.5151 |
Lesion-size detection performance
To evaluate robustness across lesion scales, we evaluated zero-shot results (CVC-ColonDB → {CVC-ClinicDB, Kvasir-SEG, CVC-300}, epoch 40) by lesion-area ratio r=|Ωmask |/|Ωimage |, defining three bins: small (r < 3%), medium (3% ≤ r < 10%), and large (r ≥ 10%). Since all target sets are positive-only, we report pixel-level AUROC/AP, Dice, and lesion-level recall. A lesion is counted as detected if the binarized anomaly map covers ≥10% of its ground-truth pixels.
Table 4 shows that even for the smallest lesions (r < 3%), recall remains 97.52% on CVC-ClinicDB, 97.40% on Kvasir-SEG, and 100% on CVC-300, with miss rates of at most 2.60%. Segmentation quality is lower for small lesions (Dice 0.59–0.64, AP 62–66 on the two larger datasets), but improves substantially for medium lesions (Dice > 0.83). This suggests that the main failure mode for tiny polyps is inaccurate boundary localization rather than missed detection. Across all three bins and datasets, lesion-level recall stays above 97%, confirming that the framework generalizes well across lesion scales under a strict zero-shot protocol.
| Test set | Size bin | Ratio range | N | Mean ratio (%) | Pixel AUROC | Pixel AP | Dice | Recall | Miss rate |
|---|
| CVC-ClinicDB | Bin-1 | [0.00%, 3.00%) | 121 | 1.94 | 97.51 | 66.41 | 0.6360 | 97.52 | 2.48 |
| CVC-ClinicDB | Bin-2 | [3.00%, 10.00%) | 283 | 5.97 | 98.47 | 89.58 | 0.8384 | 98.59 | 1.41 |
| CVC-ClinicDB | Bin-3 | [10.00%, 100.00%] | 208 | 17.71 | 94.95 | 87.44 | 0.7942 | 99.52 | 0.48 |
| Kvasir-SEG | Bin-1 | [0.00%, 3.00%) | 77 | 2.10 | 97.73 | 62.12 | 0.5853 | 97.40 | 2.60 |
| Kvasir-SEG | Bin-2 | [3.00%, 10.00%) | 355 | 6.22 | 98.78 | 90.15 | 0.8300 | 99.72 | 0.28 |
| Kvasir-SEG | Bin-3 | [10.00%, 100.00%] | 560 | 22.98 | 94.21 | 87.49 | 0.8178 | 99.64 | 0.36 |
| CVC-300 | Bin-1 | [0.00%, 3.00%) | 35 | 2.09 | 99.96 | 98.34 | 0.8828 | 100.00 | 0.00 |
| CVC-300 | Bin-2 | [3.00%, 10.00%) | 24 | 4.60 | 99.96 | 99.30 | 0.9501 | 100.00 | 0.00 |
| CVC-300 | Bin-3 | [10.00%, 100.00%] | 1 | 18.45 | 99.95 | 99.77 | 0.9686 | 100.00 | 0.00 |
Qualitative localization performance
Figure 2 shows representative anomaly heatmaps on three unseen target datasets. The selected examples cover clinically distinct polyp types: small sessile polyps (<5 mm, smooth dome-shaped surface), stalked polyps with irregular heads, and flat elevated lesions (Paris 0-IIa, with subtle color change). Across all three types, the model produces strong activation in the lesion region while suppressing false responses in the surrounding normal tissue. Flat elevated lesions, which are among the most commonly missed findings during colonoscopy, also receive high anomaly scores.
Scoring and thresholding strategy: For threshold-free metrics (pixel AUROC and AP), the raw anomaly maps are normalized across the test set. For Dice computation, each image is independently normalized to [0, 1] via per-image min-max scaling and converted to a binary mask using a fixed threshold of 0.5. This threshold remains constant across all leave-one-dataset-out runs and is not adjusted for any target dataset. In a clinical deployment, threshold calibration on a held-out validation set would be required to balance sensitivity and specificity for the intended operating point.
Engineering efficiency and deployment characteristics
To assess deployment feasibility, we evaluated inference cost for the epoch-40 model in the CVC-ColonDB → CVC-ClinicDB zero-shot setting on a single NVIDIA A5000 GPU at 518 × 518 resolution. Latency was measured with batch size 1 over 30 timed iterations (after 10 warm-up runs), excluding data-loading overhead. End-to-end throughput was additionally measured on the full CVC-ClinicDB test set with batch size 8.
As Table 5 shows, only 2.85% of parameters (12.58M) are introduced by the proposed adapters, fusion module, and GeM pooling; the frozen ViT-L/14 backbone dominates computational cost. The deployed checkpoint is approximately 1 GB with 2 GB peak GPU memory usage. Inference latency is approximately 113 ms per image (~8.9 img/s). At the current stage, the system serves as a research prototype and does not yet meet the 25–30 fps requirement for real-time clinical use. Since the frozen backbone accounts for over 97% of the total computation, reducing inference time should focus on backbone-level optimization (e.g., lighter vision-language encoders, mixed-precision inference, and temporal feature reuse across video frames) rather than removing the lightweight task-specific modules.
| Metric | Value |
|---|
| Total/added parameters | 441.34 M/12.58 M (2.85%) |
| Deployed checkpoint size | ≈1007.84 MB |
| Peak CUDA memory | 2009.42 MB |
| FLOPs/MACs per image | 1041.73 G/520.86 G |
| Mean latency (p50/p90) | 112.90 ms (112.91/113.25) |
| Single-image throughput | 8.86 img/s |
| End-to-end throughput (CVC-ClinicDB) | 8.94 img/s |
Ablation study
We ablated three core components: (i) removing the PNC loss, (ii) replacing attribute-enhanced prompts with generic prompts, and (iii) disabling the enhanced visual branch by reverting to fixed gates, mean fusion, and average pooling. All variants were trained on CVC-ColonDB and evaluated in a zero-shot setting on CVC-ClinicDB, Kvasir-SEG, and CVC-300 using the same protocol and hyperparameters as the full model.
Table 6 shows that the full model achieves the highest mean Dice (0.8329). Removing attribute-enhanced prompts leads to a small but consistent drop in AP (89.87 → 89.52) and Dice (0.8329 → 0.8322), confirming that domain-specific textual descriptions provide complementary semantic guidance beyond generic prompts. Removing the PNC loss slightly increases AP on Kvasir-SEG but lowers CVC-ClinicDB AUROC (96.28 → 94.81) and CVC-300 Dice (0.9112 → 0.8947), resulting in lower mean AUROC and Dice overall. The contrastive objective thus primarily improves robustness under cross-dataset shift. Removing the enhanced visual branch produces the lowest mean Dice (0.8268) and the largest localization drop on CVC-300 (Dice 0.9112 → 0.8942), while AUROC changes minimally, indicating that adaptive fusion and GeM pooling mainly benefit spatial localization rather than coarse anomaly ranking. Each component addresses a different aspect of cross-dataset generalization, and the full combination yields the best balanced performance.
| Variant | Pixel AUROC ↑ | Pixel AP ↑ | Dice ↑ |
|---|
| Full model | 97.07 | 89.80 | 0.8353 |
| w/o PNC loss | 96.72 | 90.05 | 0.8313 |
| w/o attribute prompts | 97.07 | 89.52 | 0.8322 |
| w/o visual enhancement | 97.12 | 89.81 | 0.8268 |
Discussion
Future directions
This work establishes a strong foundation for the development of generalizable AI in colonoscopy. A logical next step is to extend this validation to real-time video sequences. Evaluating the performance of the model on full-length colonoscopy videos will be crucial for assessing its temporal consistency, processing speed, and practical utility in a clinical workflow. More broadly, the intelligent oncology literature is increasingly integrating computational pathology and radiomics-based response modeling into precision cancer workflows, including recent work on digital and intelligent pathology and neoadjuvant-response prediction in rectal cancer.33 ,34 Furthermore, the vision-language framework of the model opens promising avenues for more advanced diagnostic tasks. Future research will explore adapting the model to perform zero-shot classification of detected lesions into clinically relevant subtypes, such as adenomatous versus hyperplastic lesions. Such a capability would provide endoscopists with real-time decision support, enabling the adoption of “resect and discard” protocols for low-risk lesions, which could improve procedural efficiency and reduce pathology costs.11 ,12 ,14 ,23
Limitations
Several limitations should be noted. First, all four benchmark datasets contain only colorectal polyps. Publicly available endoscopy segmentation benchmarks with pixel-level masks for non-polyp colorectal pathologies—such as ulcerative colitis, colorectal tumors, or inflammatory erosions—do not currently exist. Datasets like HyperKvasir provide classification labels for ulcerative colitis (Mayo grades) and colorectal cancer but do not include segmentation masks for these categories. The scope of the present evaluation is therefore limited to polyp detection and localization; whether the anomaly-aware design generalizes to other mucosal abnormalities remains an open question that will require new annotated data, ideally produced through multicenter clinical collaborations. Second, all four datasets originate from European academic centers, introducing potential population and hardware bias. Third, the evaluation uses curated still frames rather than full-procedure video, which does not capture motion artifacts, transient views, or the predominance of polyp-free frames in real colonoscopies. Fourth, the current inference speed (≈8.9 fps) falls short of the 25–30 fps needed for real-time overlay; the bottleneck is the frozen ViT-L/14 backbone. Fifth, there is no same-dataset blinded reader comparison with human endoscopists. Finally, the anomaly-detection formulation treats all abnormal tissue as a single class and does not distinguish adenomas from hyperplastic polyps. Taken together, these factors mean that the results demonstrate zero-shot technical generalization across polyp datasets but do not yet constitute clinical validation.
Clinical workflow integration and regulatory considerations
The intended translational role of the proposed method is adjunctive CADe rather than autonomous diagnosis. In practice, the model would process the live white-light video stream frame by frame and overlay a heatmap on suspicious regions, while the endoscopist retains full responsibility for interpretation, biopsy, and surveillance decisions. This is consistent with existing CADe systems designed to support rather than replace clinical judgment.5 ,9 Clinical deployment would further require threshold calibration and temporal smoothing to suppress false positives across consecutive frames, as well as prospective reader studies to evaluate whether the model reduces miss rates among endoscopists at varying experience levels.
From a regulatory standpoint, the framework falls under Software as a Medical Device (SaMD). International SaMD guidance requires evidence of valid clinical association, analytical validation, and clinical validation in the intended-use population. For a colonoscopy CADe tool, this typically involves multicenter testing under diverse imaging conditions, human factors assessment, and post-deployment monitoring for performance drift. The present work is a retrospective still-image study without a same-dataset reader comparison against human experts; the results should therefore be interpreted as evidence of technical zero-shot generalization rather than as demonstrating superiority over endoscopists.