Introduction
Emerging data suggest that patients with ulcerative colitis (UC) who exhibit persistent histologic activity are at elevated risk for both short-term and long-term complications, including higher rates of relapse, hospitalization, surgery, and neoplasia, even when apparent endoscopic and clinical remission is achieved.1–3 In light of these findings, the Selecting Therapeutic Targets in Inflammatory Bowel Disease II initiative recommends incorporating histologic remission as an adjunct to endoscopic remission to achieve deeper disease control, consistent with its treat-to-target approach.4 While histologic remission is increasingly being adopted as a secondary treatment target in randomized clinical trials and observational studies, the lack of a universally accepted definition complicates its application in routine practice. This poses significant challenges for clinicians seeking to optimize treatment strategies and improve patient outcomes.5
Several histologic scoring systems have been proposed over the years to address the limitations of descriptive pathology reports, which often lack standardization and comparability. Among these systems, the Robarts Histopathology Index, Nancy Histologic Index (NHI), and Geboes Score have undergone the most extensive validation and are increasingly used in clinical trials.6 However, there remains a paucity of data supporting their use in everyday clinical practice. In a global survey of gastroenterologists and pathologists, 77% of respondents reported that a standardized histologic score was not included in their pathology reports, in contrast to more than 90% who reported using a standardized endoscopic scoring system, such as the Mayo Endoscopic Score (MES), in clinical practice.7
The NHI involves a stepwise evaluation of three components: ulceration, acute inflammation, and chronic inflammation. These parameters are used to assign a five-tier grade: Grade 0 (no or mild chronic inflammation), Grade 1 (moderate to severe chronic inflammation), Grade 2 (rare or few neutrophils in the lamina propria or epithelium that are difficult to detect), Grade 3 (multiple clusters of neutrophils in the lamina propria and/or epithelium that are apparent), and Grade 4 (presence of ulceration).8,9 It is one of two indices currently recommended by the European Crohn’s and Colitis Organisation for use in clinical practice, clinical trials, and observational studies due to its validity and simplicity.10 However, its reproducibility in real-world settings has not been thoroughly evaluated. The aim of this study was to assess the performance of the NHI among gastrointestinal pathologists at tertiary care academic centers in the United States, using colorectal biopsies from a prospective adult cohort of treated UC patients.
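The stepwise logic of the NHI described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the published index or of the study protocol. The function name `nancy_grade`, the enum names, and the `erosion` flag are hypothetical; the flag simply mirrors the instruction given to readers in this study to grade erosions as Grade 4.

```python
from enum import IntEnum


class Neutrophils(IntEnum):
    """Acute inflammation categories (hypothetical names for illustration)."""
    NONE = 0
    FEW = 1        # rare or few neutrophils, difficult to detect
    CLUSTERS = 2   # multiple clusters of neutrophils, readily apparent


class Chronic(IntEnum):
    """Chronic inflammation categories (hypothetical names for illustration)."""
    NONE_OR_MILD = 0
    MODERATE_TO_SEVERE = 1


def nancy_grade(ulceration, neutrophils, chronic, erosion=False):
    """Return an NHI grade (0-4) by checking the most severe feature first.

    Assumption: erosion is graded like ulceration, as in this study.
    """
    if ulceration or erosion:
        return 4
    if neutrophils == Neutrophils.CLUSTERS:
        return 3
    if neutrophils == Neutrophils.FEW:
        return 2
    if chronic == Chronic.MODERATE_TO_SEVERE:
        return 1
    return 0


# Example: no ulcer, a few neutrophils, moderate chronic inflammation.
grade = nancy_grade(False, Neutrophils.FEW, Chronic.MODERATE_TO_SEVERE)
```

Because the checks run from most to least severe, chronic inflammation only determines the grade when neither ulceration nor neutrophils are present, which matches the five-tier description.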
Materials and methods
Case selection and histologic assessment
Thirty-seven hematoxylin and eosin (H&E)-stained whole-slide images of colorectal biopsies were included. The biopsies came from 34 treated patients with UC enrolled in the Study of a Prospective Adult Research Cohort with Inflammatory Bowel Disease (SPARC-IBD), a multicenter longitudinal cohort. Cases were selected from this larger cohort to ensure a heterogeneous distribution of anatomic sites (right colon, left colon, sigmoid colon, and rectum) and of macroscopic appearances at colonoscopy. Endoscopic activity was graded using the MES (range, 0–3).11 Biopsy samples were collected in formalin according to the SPARC-IBD protocol. They were formalin-fixed, paraffin-embedded, and stained with H&E at a central biobank (Sampled: https://sampled.com/). All slides were scanned using an Olympus V120 Virtual Slide Microscope at 40× magnification. The .vsi output images were then converted to .jpg files for easier accessibility while preserving image quality.
Twelve pathologists with subspecialty training in gastrointestinal pathology, all practicing at tertiary care academic centers in the United States, reviewed each biopsy twice, with a five-month washout interval between reviews to minimize recall bias. Each pathologist evaluated the biopsies using the NHI, as previously described.8,9 Pathologists were instructed to treat biopsies with erosion in the same way as those with ulceration, that is, to assign NHI Grade 4 at both reviews. They also assessed two additional parameters: crypt architectural distortion and Paneth cell metaplasia in biopsies from the left colon and rectum. Pathologists were informed of the anatomic location of each biopsy to account for regional differences in the chronic inflammatory gradient in the normal colon, though the decision to use this information was left to the discretion of each reader. All pathologists were otherwise blinded to clinical and endoscopic data. They had access to the original paper describing and validating the NHI for self-study but received no group training before the first review.8
The same set of 37 H&E-stained colorectal biopsies was rearranged in a different order and reviewed a second time after five months. Prior to this second round, the pathologists received additional training via an interactive online tutorial (Supplementary Material 1), curated by authors KDS, JNL, and XL, to address any knowledge gaps. The study pathologists recorded the same histologic parameters as in the first review and remained blinded to the clinical and endoscopic data, except for biopsy site.
Statistical analysis
Statistical analysis was performed using R software, version 4.3.1 for macOS (R Foundation for Statistical Computing, Vienna, Austria). Inter-rater reliability and intra-rater agreement were computed using the irr package. Inter-rater reliability (IRR) for all raters was calculated using the intraclass correlation coefficient (ICC) with a two-way random effects model for absolute agreement. Intra-rater agreement was calculated using unweighted Cohen’s kappa. ICC and kappa values were interpreted using the categories proposed by Landis and Koch:12 values less than 0.00 indicated poor reliability/agreement; 0.01 to 0.20, slight; 0.21 to 0.40, fair; 0.41 to 0.60, moderate; 0.61 to 0.80, substantial; 0.81 to 0.99, almost perfect; and 1.00, perfect. The overall concordance rate for each parameter across all cases was calculated and expressed as a percentage. A P-value < 0.05 was considered statistically significant.
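For readers who wish to see what the ICC computation involves, the following Python sketch computes a two-way random effects, absolute agreement ICC from a cases-by-raters table and maps a coefficient to the Landis and Koch categories. It is not the study's analysis code, which used the irr package in R. The single-rater form of the ICC, the treatment of values that fall between the listed category bounds, and the function names `icc_a1` and `landis_koch` are assumptions made for illustration.

```python
from statistics import mean


def icc_a1(ratings):
    """Two-way random effects, absolute agreement, single-rater ICC.

    ratings: list of n cases, each a list of k scores (one per rater).
    Assumes at least two cases, two raters, and some variation in scores.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares for cases (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-case mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def landis_koch(value):
    """Map an ICC or kappa value to a Landis and Koch category.

    Assumption: values between listed bounds (e.g., 0.205) go to the
    lower category, and exactly 0.00 is treated as 'slight'.
    """
    if value < 0.0:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    if value < 1.0:
        return "almost perfect"
    return "perfect"


# Hypothetical toy data: 4 cases graded by 3 raters on the 0-4 NHI scale.
toy = [[0, 0, 1], [2, 3, 3], [4, 4, 4], [1, 0, 0]]
label = landis_koch(icc_a1(toy))
```

The absolute agreement form penalizes systematic differences between raters (the `msc` term), which is why it suits a question about whether different pathologists assign the same grade rather than merely rank cases alike.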
Ethics approval
All biopsies were obtained from patients enrolled in the SPARC-IBD multicenter cohort, a component of the IBD Plexus of the Crohn’s and Colitis Foundation. SPARC-IBD data are available upon approved application to the Crohn’s and Colitis Foundation IBD Plexus (https://www.crohnscolitisfoundation.org/ibd-plexus).
Results
Site of biopsies and endoscopic scores
Of the 37 biopsies, 11 (30%) were from the rectum, 15 (41%) from the sigmoid colon, four (11%) from the descending colon, three (8%) from the ascending colon, and four (11%) from the cecum. Nine biopsies (24%) corresponded to an MES of 0, 19 (51%) to an MES of 1, six (16%) to an MES of 2, and the remaining three (8%) to an MES of 3. Based on majority grading, the distribution of cases at the first reading was as follows: 21 of 37 cases (56.8%) were classified as Grade 0, one case (2.7%) as Grade 1, five cases (13.5%) as Grade 2, seven cases (18.9%) as Grade 3, and three cases (8.1%) as Grade 4. At the second reading, the distribution was 21 cases (56.8%) as Grade 0, one case (2.7%) as Grade 1, six cases (16.2%) as Grade 2, five cases (13.5%) as Grade 3, and four cases (10.8%) as Grade 4.
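Majority grading of this kind can be sketched as follows. The example uses hypothetical grades rather than study data, and the rule for breaking ties (taking the higher grade) is an assumption, since the tie-breaking rule is not stated here. The function names are likewise illustrative.

```python
from collections import Counter


def majority_grade(grades):
    """Return the grade assigned by the most raters for one case.

    Assumption: ties are resolved toward the higher grade.
    """
    counts = Counter(grades)
    top = max(counts.values())
    return max(g for g, c in counts.items() if c == top)


def grade_distribution(cases, n_levels=5):
    """Tabulate majority grades as {grade: (count, percent of cases)}."""
    tally = Counter(majority_grade(g) for g in cases)
    n = len(cases)
    return {
        g: (tally.get(g, 0), round(100 * tally.get(g, 0) / n, 1))
        for g in range(n_levels)
    }


# Hypothetical example: three cases, each graded by four raters.
cases = [[0, 0, 0, 1], [3, 3, 2, 2], [4, 4, 4, 2]]
dist = grade_distribution(cases)
```

With 12 raters, ties are possible (for example, a 6 to 6 split), so whichever tie rule is adopted should be fixed before the grades are tabulated.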
IRR of NHI
The IRR for the overall NHI among the 12 pathologists was substantial at both reviews [Review 1: ICC = 0.79, 95% confidence interval (CI): 0.70–0.87; Review 2: ICC = 0.78, 95% CI: 0.69–0.86]. However, there was considerable variability in IRR among the individual NHI grades. Grades 0 and 1 showed substantial IRR at both reviews (P < 0.001). Grades 3 and 4 demonstrated moderate IRR at both reviews (P < 0.001). Grade 2 had only fair IRR at both reviews (P < 0.001) (Table 1). When Grades 2, 3, and 4 were combined, the IRR remained substantial at both reviews (ICC = 0.76, 95% CI: 0.66–0.85; P < 0.001).
Table 1. Interrater reliability for parameters assessed in grading previously treated ulcerative colitis activity in the cohort of 37 biopsies using the Nancy histologic index
Item | Review 1, ICC (95% CI) | Review 2, ICC (95% CI) |
---|---|---|
Overall NHI Grade | 0.79 (0.70–0.87) | 0.78 (0.69–0.86) |
Grade 0 | 0.74 (0.64–0.83) | 0.75 (0.65–0.84) |
Grade 1 | 0.75 (0.66–0.84) | 0.79 (0.71–0.87) |
Grade 2 | 0.24 (0.15–0.37) | 0.23 (0.14–0.36) |
Grade 3 | 0.42 (0.30–0.56) | 0.47 (0.35–0.61) |
Grade 4 | 0.41 (0.29–0.55) | 0.47 (0.35–0.61) |
Grades 2, 3 and 4 combined (active disease) | 0.76 (0.66–0.85) | 0.76 (0.66–0.85) |
Intra-rater agreements for NHI and its components
The mean intra-rater agreement for NHI and its components, assessed using unweighted Cohen’s kappa, is summarized in Table 2. Grade 2 had the lowest intra-rater agreement (fair). Grade 4 showed moderate intra-rater agreement among the participating pathologists. Figure 1 illustrates the concordant and discordant cases for each NHI component across both reviews, highlighting inconsistency, particularly in assigning Grades 2 and 4. H&E-stained colorectal biopsies from representative cases with the least concordance in Grades 2 and 4 are shown in Figure 2a–d.
Table 2. Mean intra-rater agreement for parameters assessed in grading previously treated ulcerative colitis activity in the cohort of 37 biopsies using the Nancy index
Feature/item | Intra-rater agreement, Cohen’s kappa (range) | Interpretation |
---|---|---|
Overall NHI Grade | 0.57 (0.24–0.80) | Moderate agreement |
Grade 0 | 0.81 (0.52–1.0) | Substantial agreement |
Grade 1 | 0.74 (0.52–1.0) | Substantial agreement |
Grade 2 | 0.31 (−0.08–0.77) | Fair agreement |
Grade 3 | 0.51 (0.08–0.91) | Moderate agreement |
Grade 4 | 0.59 (0.20–1.0) | Moderate agreement |
Combined Grades 2, 3 and 4 (active disease) | 0.75 (0.37–1.0) | Substantial agreement |
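The mean and range of intra-rater kappa values reported in Table 2 can be illustrated with a short Python sketch. It uses hypothetical grades, not study data. The per-grade dichotomization (grade equals g versus any other grade) is an assumption about how grade-specific agreement might be computed, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length rating sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from the marginal frequencies of each review.
    expected = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both reviews are constant
    return (observed - expected) / (1 - expected)


def intra_rater_summary(review1, review2, grade=None):
    """Return (mean, min, max) kappa across raters between two reviews.

    review1, review2: dicts mapping rater ID to a list of grades per case.
    If grade is given, ratings are dichotomized to (grade vs. other) first;
    this dichotomization is an assumption made for illustration.
    """
    kappas = []
    for rater, first in review1.items():
        second = review2[rater]
        if grade is not None:
            first = [g == grade for g in first]
            second = [g == grade for g in second]
        kappas.append(cohen_kappa(first, second))
    return mean(kappas), min(kappas), max(kappas)


# Hypothetical example: two raters, four cases, two reviews.
r1 = {"A": [0, 1, 2, 3], "B": [0, 0, 1, 1]}
r2 = {"A": [0, 1, 2, 3], "B": [0, 1, 1, 1]}
summary = intra_rater_summary(r1, r2)
```

Reporting the range alongside the mean, as in Table 2, exposes individual raters whose self-consistency is far from the group average, which a mean alone would hide.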
IRR and intra-rater agreements for crypt distortion and Paneth cell metaplasia
The IRR for crypt distortion was substantial at both reviews (Review 1: ICC = 0.64, 95% CI: 0.52–0.75, P < 0.001; Review 2: ICC = 0.68, 95% CI: 0.57–0.79, P < 0.001). The IRR for Paneth cell metaplasia was moderate at both reviews (Review 1: ICC = 0.60, 95% CI: 0.49–0.73, P < 0.001; Review 2: ICC = 0.51, 95% CI: 0.39–0.65, P < 0.001). The mean intra-rater agreement for crypt distortion and Paneth cell metaplasia was substantial, with Cohen’s kappa values of 0.70 (range: 0.45–0.95) and 0.62 (range: 0.04–1.0), respectively.
Discussion
We observed substantial IRR for the NHI among 12 practicing pathologists with subspecialty training in gastrointestinal pathology, both before (ICC = 0.79, 95% CI: 0.70–0.87, P < 0.001) and after (ICC = 0.78, 95% CI: 0.69–0.86, P < 0.001) the implementation of a brief online tutorial on the NHI. When analyzing individual NHI components, we found that Grade 1 exhibited the highest IRR at both assessments, while Grade 2 showed the lowest IRR, with no improvement between reviews. Grades 3 and 4 had intermediate IRR values. Notably, combining Grades 2, 3, and 4 (i.e., active disease) yielded substantial IRR.
While Marchal-Bressenot et al.8 demonstrated near-perfect IRR for the NHI (ICC = 0.88, 95% CI: 0.82–0.92) and its component items—except for chronic inflammation (ICC = 0.63, 95% CI: 0.33–0.70)—discrepancies in other studies, including ours, highlight some challenges of using this index. Similar to our findings, Jairath et al.13 reported substantial IRR for final NHI grades (ICC = 0.80, 95% CI: 0.73–0.85) among four pathologists with expertise in inflammatory bowel disease. Le et al.14 also reported substantial IRR for final NHI grades (ICC = 0.70, 95% CI: 0.50–0.82) between two pathologists, including a pathologist-in-training, but noted higher discordance for Grades 1 and 4. Arkteg et al.15 reported substantial IRR for the presence of acute inflammation (ICC = 0.79, 95% CI: 0.64–0.88). However, their study did not provide IRR values for Grades 2 and 3 individually. Like our findings, they reported lower IRR for Grade 4 (ICC = −0.04, 95% CI: −0.74 to 0.41), although it remains unclear whether biopsies with erosions were included in this group. Notably, they also observed low IRR for chronic inflammation (ICC = 0.42, 95% CI: 0.02–0.67). Discrepancies across studies may be attributed to factors such as case selection, use of glass slides versus digital images, practice settings, and the level of experience among participating pathologists. These variations underscore the challenges of assessing certain NHI components.
Subjectivity in distinguishing Grades 2 and 3 may have contributed to the lower IRR: Grade 2 requires identifying a few or rare neutrophils (often difficult to visualize), whereas Grade 3 requires multiple, easily visible clusters. Neutrophils are not normally present in the intestinal mucosa, and the threshold of neutrophilic inflammation that increases the risk of adverse outcomes has yet to be clearly defined. The Geboes Score considers both lamina propria and epithelial neutrophils, providing a quantitative evaluation of the latter. Similarly, the Robarts Histopathology Index, which is largely derived from the Geboes Score, assesses neutrophils in both compartments. These indices, unlike the NHI, also distinguish between erosions and ulcers. In this study, erosions were categorized as NHI Grade 4. Encouragingly, the substantial IRR for active disease (Grades 2, 3, and 4) in our study underscores the NHI’s clinical utility. However, refining the criteria for these grades will be essential for reducing inter-observer variability and enabling more accurate monitoring of treatment endpoints. This may include developing more precise definitions for the amount of acute inflammation that qualifies as Grade 2 or 3, and clarifying the classification of erosions, an issue currently unaddressed by the NHI. It may also be important to specify whether neutrophils exclusively located in the lamina propria should be considered Grade 2. Notably, of the two additional features evaluated in our study, crypt architectural distortion had substantial IRR and Paneth cell metaplasia had moderate IRR in both reviews. Although these parameters are not part of the NHI, they are routinely used in clinical practice as markers of chronic mucosal injury and are included in other scoring systems, such as the Geboes Score.
More recently, artificial intelligence (AI)-powered algorithms have been applied to UC datasets to assist in the histologic grading of biopsies.16–19 Najdawi et al.16 used convolutional neural networks to segment tissue and classify cells on whole-slide H&E-stained biopsies to generate NHI predictions. Their AI model showed strong correlation with increasing NHI scores (ρ = 0.90, P < 0.001) and reliably distinguished between different grades based on the proportion of epithelium with neutrophilic inflammation, the count and density of neutrophils in the epithelium, and the presence of ulcers or combinations thereof (ρ = 0.83–0.90, all P < 0.001). Peyrin-Biroulet et al.17 employed four artificial neural networks to recognize cell types and assign NHI grades. They found that the AI-based grading was reproducible and comparable in performance (ICC = 87.2%) to that of four expert histopathologists (ICC = 89.3%). The PICaSSO Histologic Remission Index, a recently introduced simplified scoring system, focuses on the presence or absence of neutrophils in the epithelium (surface and crypt) and lamina propria. This index has shown stronger correlation with endoscopic activity compared to other histologic indices, including the NHI, and exhibits minimal inter-rater variability.18 It has also been validated using an AI model, which accurately and reliably predicted PICaSSO Histologic Remission Index.
While our study provides valuable insight into the reproducibility of histologic assessments of colorectal biopsies from treated UC patients using the NHI, several limitations should be acknowledged. The small sample size and uneven distribution of biopsy sites may have led to an underestimation of the ICC.19 Additionally, this study focused on reproducibility among academic gastrointestinal pathologists, so results may not be fully generalizable to real-world practice settings where levels of expertise may vary. Notably, only two of the reviewing pathologists had prior experience with a modified version of the NHI. Finally, variations in staining quality and image artifacts may have contributed to interpretation differences.
Conclusions
Our study revealed substantial IRR for active disease (Grades 2, 3, and 4) among 12 pathologists, which underscores the clinical utility of the NHI in the assessment of colorectal biopsies from treated UC patients. However, refinement of the criteria for Grades 2, 3, and 4 may be required to improve reproducibility and enable more accurate monitoring of treatment outcomes in UC, especially as histologic remission is an evolving therapeutic endpoint.
Declarations
Acknowledgement
The results published here are, in whole or in part, based on data from the inflammatory bowel disease (IBD) Plexus program of the Crohn’s and Colitis Foundation. The Study of a Prospective Adult Research Cohort with Inflammatory Bowel Disease (SPARC IBD) is a component of the Crohn’s & Colitis Foundation’s IBD Plexus data exchange platform. SPARC IBD enrolls patients with a new or established diagnosis of IBD from sites across the United States and links data collected from electronic health records and study-specific case report forms. Patients also provide blood, stool, and biopsy samples at designated time points during follow-up. The design and implementation of the SPARC IBD cohort have been previously described.
SPARC-IBD investigators
Richa Shukla: Baylor College of Medicine; Themistocles Dassopoulos: Baylor University Medical Center; Scott B. Snapper, Joshua R. Korzenik: Brigham & Women’s Hospital; Matthew Bohm: Indiana University; Laura Raffals: Mayo Clinic; Poonam Beniwal-Patel: Medical College of Wisconsin; David Hudesman: NYU Langone Medical Center; Mazer Ally, Gauree Konijeti, Rebecca Matro: Scripps Healthcare; Sheldon Lidofsky: Brown University; Kirk Russ: University of Alabama; Loren Brook: University of Cincinnati Medical Center; Joel Pekow: University of Chicago; Raymond Cross: University of Maryland; Shrinivas Bishu: University of Michigan; Meenakshi Bewtra, James D Lewis: University of Pennsylvania; Richard Duerr: University of Pittsburgh; Sumona Saha, Freddy Caldera: University of Wisconsin; Elizabeth Scoville: Vanderbilt University Medical Center; Parakkal Deepak: Washington University School of Medicine.
Ethical statement
This study was approved by the Ethics Committee of Washington University (Approval No. 202206060) and was conducted in accordance with the Declaration of Helsinki (as revised in 2024). As the data were obtained from an existing database, the requirement for informed consent was waived.
Data sharing statement
The dataset used in support of the findings of this study is included within the article.
Funding
This study was completed without financial support.
Conflict of interest
Dr. Parakkal Deepak is supported by a Junior Faculty Development Award from the American College of Gastroenterology and the inflammatory bowel disease (IBD) Plexus program of the Crohn’s & Colitis Foundation. Three of the authors (Dr. Hanlin L. Wang, Dr. Zhaohai Yang, and Dr. Xiuli Liu) are Editorial Board Members, and one author, Dr. Xuchen Zhang, has been an Associate Editor of the Journal of Clinical and Translational Pathology since May 2021. The authors declare no other conflicts of interest.
Authors’ contributions
Study conception, design, statistical data interpretation (XL), data curation (SM, PD, KDS, JNL), statistical analysis and original draft preparation (KDS, JNL), and histopathologic review (DA, SJB, KB, AGD, AKE, RSG, XG, HL, JML, NS, HLW, ZY, XZ). All authors have reviewed and approved the final version of the manuscript.