Self-Teaching AI Uses Pathology Images to Find Similar Cases, Diagnose Rare Diseases

Overview of the SISH pipeline… a, After tissue segmentation, we tile the foreground regions and perform two-stage K-means clustering to select representative patches to include in the WSI mosaic. We first cluster all patches based on their RGB histogram features. In each cluster generated from the first stage (for example, the yellow cluster shown in the figure), we perform K-means clustering again using the spatial coordinates of each patch as features (spatial clustering), extract the patches that correspond to the coordinates of each resulting cluster centre (black dots) and add them to the mosaic of the slide. b, We pretrain a VQ-VAE on tissue patches from slides in the TCGA and save its encoder and codebook for feature extraction. For each patch in the mosaic, the VQ-VAE encoder is used to compute its discrete latent representation and a Densenet121 encoder is used to obtain a binarized texture feature vector. Finally, we feed the discrete latent representation into another pipeline composed of a series of average pooling (AvgPool), shift and summation operations to get an integer index for the patch, then use the vEB tree to construct the index structure for search. c, For a given query slide preprocessed as a mosaic representation, we feed the mosaic into the feature extractor to compute the integer indices and binarized texture features of each patch in the mosaic, then apply our search and ranking algorithm to filter the candidate patches.

Rare diseases are often difficult to diagnose and predicting the best course of treatment can be challenging for clinicians. Investigators from the Mahmood Lab at Brigham and Women’s Hospital, a founding member of the Mass General Brigham healthcare system, have developed a deep learning algorithm that can teach itself to learn features which can then be used to find similar cases in large pathology image repositories. Known as SISH (Self-Supervised Image search for Histology), the new tool acts like a search engine for pathology images and has many potential applications, including identifying rare diseases and helping clinicians determine which patients are likely to respond to similar therapies. A paper introducing the self-teaching algorithm is published in Nature Biomedical Engineering.

“We show that our system can assist with the diagnosis of rare diseases and find cases with similar morphologic patterns without the need for manual annotations, and large datasets for supervised training,” said senior author Faisal Mahmood, PhD, in the Brigham’s Department of Pathology. “This system has the potential to improve pathology training, disease subtyping, tumor identification, and rare morphology identification.”

Modern electronic databases can store an immense amount of digital records and reference images, particularly in pathology through whole slide images (WSIs). However, the gigapixel size of each individual WSI and the ever-increasing number of images in large repositories, means that search and retrieval of WSIs can be slow and complicated. As a result, scalability remains a pertinent roadblock for efficient use.

To solve this issue, researchers at the Brigham developed SISH, which teaches itself to learn feature representations which can be used to find cases with analogous features in pathology at a constant speed regardless of the size of the database.

In their study, the researchers tested the speed and ability of SISH to retrieve interpretable disease subtype information for common and rare cancers. The algorithm successfully retrieved images with speed and accuracy from a database of tens of thousands of whole slide images from over 22,000 patient cases, with over 50 different disease types and over a dozen anatomical sites. The speed of retrieval outperformed other methods in many scenarios, including disease subtype retrieval, particularly as the image database size scaled into the thousands of images. Even while the repositories expanded in size, SISH was still able to maintain a constant search speed.

The algorithm, however, has some limitations including a large memory requirement, limited context awareness within large tissue slides and the fact that it is limited to a single imaging modality.

Overall, the algorithm demonstrated the ability to efficiently retrieve images independent of repository size and in diverse datasets. It also demonstrated proficiency in diagnosis of rare disease types and the ability to serve as a search engine to recognize certain regions of images that may be relevant for diagnosis. This work may greatly inform future disease diagnosis, prognosis, and analysis.

“As the sizes of image databases continue to grow, we hope that SISH will be useful in making identification of diseases easier,” said Mahmood. “We believe one important future direction in this area is multimodal case retrieval which involves jointly using pathology, radiology, genomic and electronic medical record data to find similar patient cases.”