
DocRevive: An End-to-End Framework for Document Text Restoration
In Document Analysis and Recognition, the challenge of reconstructing damaged, occluded, or incomplete text remains a critical yet underexplored problem, and subsequent document understanding tasks can benefit from a document reconstruction process. This paper presents a novel end-to-end pipeline designed to address this challenge, combining state-of-the-art Optical Character Recognition (OCR), advanced image analysis, and the contextual power of large language models (LLMs) and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset simulating diverse document degradation scenarios, establishing a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degradation, and uses LLMs for semantically coherent reconstruction. A diffusion-based module reintegrates text seamlessly, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), incorporating edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction.
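The abstract does not give UCSM's exact formulation, so the sketch below only illustrates one way such a metric could blend edit, semantic, and length similarities with a contextual-predictability penalty. The weights, the penalty form, and the `semantic_sim` and `predictability` inputs (which would come from an embedding model and a language model, respectively) are all assumptions, not the paper's definition.

```python
import difflib

def ucsm(reference: str, restored: str,
         semantic_sim: float, predictability: float,
         weights=(0.4, 0.4, 0.2)) -> float:
    """Hypothetical UCSM-style score in [0, 1]: a weighted blend of edit,
    semantic, and length similarities, scaled so that deviations cost more
    when the correct text is contextually obvious (high predictability)."""
    edit_sim = difflib.SequenceMatcher(None, reference, restored).ratio()
    len_sim = min(len(reference), len(restored)) / max(len(reference), len(restored), 1)
    w_e, w_s, w_l = weights
    base = w_e * edit_sim + w_s * semantic_sim + w_l * len_sim
    # Penalize residual edit errors proportionally to contextual predictability.
    return base * (1.0 - predictability * (1.0 - edit_sim))
```

A perfect restoration scores 1.0 regardless of predictability, while the same edit error is punished harder when the surrounding context made the correct text easy to infer.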
DATR: Domain Agnostic Text Recognizer
Recognizing text extracted from multiple domains is complex and challenging because complexities vary from one domain to another. Most existing methods focus either on natural scene text or a specific text type, but not on text from multiple domains, namely, scene, underwater, and drone texts. In addition, the state-of-the-art models ignore the vital cues that exist in multiple instances of the text. This paper presents a new method called the Student-Teacher-Assistant (STA) network, which involves dual CLIP models to exploit cues in multiple text instances. The model that uses ResNet50 in its image encoder is called the helper CLIP, while the model that uses ViT in its image encoder is called the primary CLIP. The proposed work processes both models simultaneously to extract visual and textual features through image and text encoders. For a randomly chosen input image, our work uses cosine similarity to detect instances similar to it. The input and similar instances are supplied to the primary and helper CLIPs for visual and textual feature extraction. The outputs of the dual CLIPs are fused through an alignment step for recognizing text accurately, irrespective of domain. To demonstrate the proposed model's significance, experiments are conducted on a set of standard natural scene text datasets (regular and irregular), underwater images, and drone images. The results on three different domains show that the proposed model outperforms the state-of-the-art recognition models. The datasets and code for training and testing will be made publicly available on GitHub.
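The instance-retrieval step described above — selecting gallery instances closest to the input image by cosine similarity over encoder features — can be sketched as follows. The feature dimensionality and top-k choice are illustrative, and the CLIP feature extractor that produces these vectors is assumed to exist upstream.

```python
import numpy as np

def retrieve_similar(query_feat: np.ndarray, gallery_feats: np.ndarray,
                     top_k: int = 3) -> np.ndarray:
    """Cosine-similarity retrieval: return indices of the gallery text
    instances whose image features are closest to the query instance.
    Feature extraction (e.g. a CLIP image encoder) happens elsewhere."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    return np.argsort(-sims)[:top_k]  # most similar instances first
```

The retrieved instances would then be fed to the primary and helper CLIPs alongside the input image.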
DITS: A New Domain Independent Text Spotter
Text spotting in diverse domains, such as drone-captured images, underwater scenes, and natural scene images, presents unique challenges due to variations in image quality, contrast, text appearance, background complexity, and external factors like water surface reflections and weather conditions. While most existing approaches focus on text spotting in natural scene images, we propose a Domain-Independent Text Spotter (DITS) that effectively handles multiple domains. We innovatively combine Real-ESRGAN, developed for general image enhancement, with DeepSolo, developed for scene text spotting, in an end-to-end fashion for text detection and spotting on images of different domains. The key idea behind our approach is that improving image quality and text-spotting accuracy are complementary goals. Real-ESRGAN enhances image quality, making the text more discernible, while DeepSolo, a state-of-the-art text spotting model, accurately localizes and recognizes text in the enhanced images. We validate the superiority of our proposed model by evaluating it on datasets from drone, underwater, and scene domains (ICDAR 2015, CTW1500, and Total-Text). Furthermore, we demonstrate the domain independence of our model through cross-domain validation, where we train on one domain and test on others. Our dataset and code will be publicly available on GitHub.
Graphically Residual Attentive Network For Tackling Aerial Image Occlusion
Deep learning has rapidly advanced, and many new applications are being developed for tasks like object detection, text recognition, occlusion handling, etc. However, challenges still exist in the detection of objects in complex environments such as aerial images, where motion blur, low light, and significant occlusion occur. This paper addresses these challenges by introducing a novel framework, the Graphically Residual Attentive Network (GRESIDAN). GRESIDAN integrates three synergistic pipelines for object detection, occlusion detection, and occlusion removal. GRESIDAN uses a residually attentive block combining ResNet-18 and a multi-headed attention mechanism to improve feature extraction and detection accuracy in low-quality, occluded aerial images. A graphically attentive occlusion detection pipeline is implemented to handle occlusion, improve segmentation, and mask out the occluder in the aerial image. The pipelines of the GRESIDAN model are validated on the COCO-2017 dataset and a custom private aerial object detection dataset, outperforming the state-of-the-art methods in handling occlusion and detecting objects. Our contributions provide a robust solution to the problem of detecting and handling occluded objects in aerial imagery, pushing the boundaries of automated visual recognition in challenging real-world scenarios.
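As a rough illustration of the residually attentive idea — multi-head self-attention applied over CNN feature vectors and added back through a residual connection — here is a minimal NumPy sketch. The identity Q/K/V projections and head count are simplifying assumptions; the actual block uses learned projections over ResNet-18 features.

```python
import numpy as np

def multi_head_self_attention(x: np.ndarray, num_heads: int = 4) -> np.ndarray:
    """Sketch: scaled dot-product multi-head self-attention over a set of
    feature vectors x (n_tokens, dim), returned with a residual skip.
    Learned Q/K/V projections are omitted for brevity (an assumption)."""
    n, d = x.shape
    assert d % num_heads == 0
    hd = d // num_heads
    out = np.zeros_like(x)
    for h in range(num_heads):
        q = k = v = x[:, h * hd:(h + 1) * hd]    # identity projections here
        scores = q @ k.T / np.sqrt(hd)           # scaled dot-product scores
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)  # softmax over keys
        out[:, h * hd:(h + 1) * hd] = attn @ v
    return x + out                               # residual connection
```

The residual add preserves the original CNN features, so attention only needs to learn a refinement — the same motivation given for combining ResNet-18 with attention.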
Compound attention embedded dual channel encoder-decoder for ms lesion segmentation from brain MRI
Multiple Sclerosis (MS) lesion segmentation is difficult due to the lesions' varied sizes, shapes, and intensity levels. Besides this, the class imbalance problem and the availability of limited annotated data samples obstruct the building of highly efficient deep learning-based models. Though researchers have made many attempts to design efficient deep learning-based models, the maximum Dice Coefficient achieved by their models remains below the acceptable level of 0.70. A possible reason is the inability to capture sufficient local and global features of the lesions required for accurate segmentation. In this paper, we present a new deep-learning architecture based on compound attention for MS lesion segmentation from magnetic resonance images that handles the challenges of capturing the local and global variable features of the MS lesions. The proposed model is equipped with a dual-channel CNN encoder-decoder structure employing residual connections in one channel and residual channel and spatial attention in the other. The residual connections alleviate the vanishing gradient problem and pass fine-grained information through the channels, which is crucial for pixel-wise prediction. The attention mechanism used in one channel helps to capture long-range dependencies. Thus, the complete model leverages rich global and local information through the two channels for lesion segmentation. The problem of data imbalance is handled by using the Focal Tversky loss function. Through rigorous evaluation using 3-fold cross-validation on the MICCAI 2016 challenge dataset, our model demonstrates superior performance, achieving a Dice Coefficient of 0.73, surpassing state-of-the-art models in both qualitative and quantitative assessments.
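The Focal Tversky loss used above for class imbalance has a standard published form; a common formulation is sketched below. The hyperparameter values are illustrative defaults, not necessarily those used in the paper.

```python
import numpy as np

def focal_tversky_loss(y_true: np.ndarray, y_pred: np.ndarray,
                       alpha: float = 0.7, beta: float = 0.3,
                       gamma: float = 0.75, eps: float = 1e-7) -> float:
    """Common Focal Tversky formulation: the Tversky index weights false
    negatives (alpha) and false positives (beta) asymmetrically, and the
    focal exponent gamma emphasizes hard, poorly segmented examples."""
    tp = np.sum(y_true * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    fp = np.sum((1.0 - y_true) * y_pred)
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - tversky) ** gamma
```

With alpha > beta, missing small lesions (false negatives) costs more than over-segmenting, which is the usual motivation for this loss in imbalanced lesion segmentation.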
Swin-ResNeST: CNN-Transformer Hybrid for Skin Lesion Segmentation
Skin lesion segmentation is a well-studied topic that faces obstacles such as differences in lesion shape, size, color, skin tones, and image noise. This study describes a U-shaped architecture designed for skin lesion segmentation. It uses ResNeSt blocks, Swin Transformer blocks, and Unified Attention modules integrated into skip connections. This CNN-Transformer hybrid model is effective in capturing both local and global features. Squeeze attention and convolution are used in parallel in the bottleneck to balance the extracted local and global features, while the Unified Attention between the encoder and decoder blocks improves critical feature learning. The model's effectiveness is proven through training and testing on three publicly available datasets: ISIC 2016, ISIC 2017, and ISIC 2018. Comparative analysis with state-of-the-art models indicates the suggested model's remarkable performance, emphasizing its ability to delineate skin lesions precisely.
PIRNet: Two-Step Deep Neural Network for Segmentation of Brain MRI with Efficient Loss Functions
Segmentation of brain tissues from MRI scans is a critical first step in diagnosing brain-related issues. Traditionally, this task was performed manually by radiologists, which was time-consuming and prone to errors. Recently, machine learning techniques, especially deep learning, have been explored to automate this process with varying degrees of success. In this research, we propose the Pyramidal Inception Residual Network (PIRNet), a state-of-the-art model that combines Pyramid Pooling, inception blocks, and residual connections for improved brain MRI segmentation. Our model segments key tissues including cerebrospinal fluid (CSF), gray matter (GM), white matter (WM), and background (BG). We evaluate PIRNet on well-known public datasets: MRBrains13, iSeg17, and CANDI, and test it using various loss functions, such as Tversky loss, label-wise dice loss, categorical cross-entropy, and weighted categorical cross-entropy, to optimize segmentation accuracy. The results show that PIRNet outperforms contemporary techniques with average dice scores of 94.55%, 91.28%, and 91.91% on the MRBrains13, iSeg17, and CANDI datasets, respectively. We also highlight the advantages of different loss functions, which enhance the model's robustness. Our method is versatile and can be applied to other segmentation tasks, such as breast cancer detection and liver lesion segmentation. PIRNet successfully tackles the complex challenges of volumetric brain tissue segmentation, offering substantial value for rapid diagnosis in clinical settings and advancing quantitative computer modeling in medical analysis.
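Per-class Dice scores of the kind reported above can be computed directly from predicted and reference label maps; the label-to-tissue assignment in this sketch is an assumption for illustration.

```python
import numpy as np

def labelwise_dice(y_true: np.ndarray, y_pred: np.ndarray,
                   labels=(0, 1, 2, 3)) -> dict:
    """Per-tissue Dice scores (e.g. BG, CSF, GM, WM as labels 0-3;
    that ordering is hypothetical). For each class, Dice = 2|A∩B| /
    (|A| + |B|) over the class's binary masks; an absent class on
    both sides scores a perfect 1.0 by convention."""
    scores = {}
    for c in labels:
        a, b = (y_true == c), (y_pred == c)
        denom = a.sum() + b.sum()
        scores[c] = 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
    return scores
```

Averaging these per-class values yields the kind of mean Dice score quoted for the three datasets.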
Automated Multi-Class Brain Glioma Segmentation Using Cascaded mLinkNet
Glioma is one of the most common and aggressive tumors of the brain, which may lead to a very short life expectancy. There has been considerable interest in the potential of non-invasive Magnetic Resonance Imaging (MRI) to aid the clinician in diagnosis, determination of the tumor extent, treatment planning, and disease management. Automated brain glioma segmentation from MR modalities remains a challenging, time-consuming, computationally intensive, and error-prone task to date. This paper introduces an automatic technique for the diagnosis of brain tumor regions from multi-class brain MRI using a modified version of LinkNet (mLinkNet) based on Convolutional Neural Networks (CNNs), where three types of deep CNN models were trained: W-mLinkNet, C-mLinkNet, and E-mLinkNet. The multi-class segmentation problem has been decomposed into three steps of binary segmentation of the glioma regions, i.e., whole tumor, tumor core, and enhancing tumor core, by cascading the above models one after the other in sequence. We also investigated the use of zero-centering and normalization of intensity for smooth variation of the intensity over the tissues as a pre-processing step. A comparative study demonstrates the efficacy of the proposed CNN model on the Brain Tumor Segmentation Challenge 2015 database (BraTS 2015).
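The pre-processing and cascading steps described above can be sketched as follows. Computing intensity statistics over nonzero (brain) voxels only is an assumption, as is the exact masking used to chain the three binary stages.

```python
import numpy as np

def zero_center_normalize(volume: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-centering and intensity normalization as a pre-processing
    step: subtract the mean and divide by the standard deviation,
    computed over nonzero voxels (an assumed brain mask)."""
    brain = volume[volume > 0]
    return (volume - brain.mean()) / (brain.std() + eps)

def cascade(whole_mask: np.ndarray, core_mask: np.ndarray,
            enhancing_mask: np.ndarray):
    """Sketch of the three-step binary cascade (W -> C -> E models):
    each stage's prediction is restricted to the region passed on by
    the previous stage, so the sub-regions stay nested."""
    core = core_mask & whole_mask      # tumor core inside whole tumor
    enhancing = enhancing_mask & core  # enhancing core inside tumor core
    return whole_mask, core, enhancing
```

Decomposing the multi-class problem this way lets each mLinkNet variant solve an easier binary task while the masking enforces the anatomical nesting of the three glioma regions.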