1. Introduction
Colorectal cancer (CRC) is one of the top three leading causes of cancer death in the United States [1]. Adenomatous polyps carry a significantly higher probability of malignant transformation, and transform at a faster rate, than hyperplastic polyps [2]. Early detection of these colorectal polyps through colonoscopy can significantly reduce the risk of progression to cancer [1]. Computer-aided diagnosis (CAD) can help physicians accurately detect polyps, decide treatment plans, and predict patient prognosis. Hence, early endoscopy enables the excision of nascent polyps before they advance to cancer [3].
Traditional intestinal endoscopy encounters limitations due to factors impacting the field of view, structure, and image resolution. Detecting polyps in low-resolution images undoubtedly poses a greater challenge for physicians. Furthermore, the process of detecting colorectal polyps requires a high degree of physician expertise and is hampered by physician subjectivity [4], resulting in an estimated manual miss rate for colorectal polyps of approximately one-quarter [5].
Deep learning has been applied in many fields. In medical imaging, it has made object detection, recognition, and classification of medical images easier, and it plays an important role in practical medical detection [6,7,8]. Traditional colorectal polyp detection commonly employs the texture, color, and shape features of images, using methods such as the scale-invariant feature transform (SIFT) [9], support vector machines (SVMs) [10], and Gaussian mixture models [11] for classification. However, these traditional methods rely heavily on hand-crafted features, which limits their ability to capture the complex variations in polyp images and leads to poor performance in real-world scenarios. Consequently, these methods are prone to overfitting and exhibit poor generalization.
In the field of medical image processing and cancer detection, super-resolution (SR) reconstruction plays a distinctive role [12,13]. SR reconstructs low-resolution images into high-resolution images, helping to improve image resolution and enhance image details. Pavlou et al. [14] proposed using SRGAN to enhance the resolution of OCT images, distinguishing between BCC lesions and scar tissue in cryoimmunotherapy. Shi et al. [15] proposed utilizing an improved SRGAN to reconstruct phase contrast polarimetry (PCP) images at enhanced resolution, and they incorporated a counting network for added functionality. These studies demonstrate the potential of SR to enhance diagnostic quality in medical image processing.
Currently, researchers are increasingly employing artificial intelligence (AI) methods to assist in colonoscopy for detecting polyps. Zhu et al. [16] improved polyp diagnosis accuracy by introducing PAM-Net and the GWD loss function. Ghose et al. [17] proposed a method that uses data augmentation and fine-tunes parameters to improve polyp detection performance. Yasmin et al. [18] proposed GastroNet to detect and classify gastrointestinal polyps and abnormalities, achieving high accuracy through hyperparameter tuning. These approaches employ detection algorithms for polyp identification. However, several challenges still exist when using these methods to assist in detecting polyps.
First, low-resolution images lead to insufficient accuracy in polyp detection. Second, the intestinal environment is complex during colorectal polyp detection, and some polyps are small, making them more difficult to identify.
To solve the problem of low-resolution images in polyp detection, this paper proposes a polyp detection method based on super-resolution reconstruction. A Super-Resolution Generative Adversarial Network (SRGAN) [19] is used to increase the resolution of colonoscopy images and reduce manual missed detections caused by low-resolution images. The improved You Only Look Once (YOLO) model then uses the reconstructed images as input to help physicians identify polyps.
The main contributions of this study are listed below:
(1) We propose YOLO-SRPD (Super-Resolution Reconstruction for Polyp Detection), which combines super-resolution reconstruction and YOLOv5 to address the problem of low-resolution polyp images during detection.
(2) To enhance partial texture and details of polyps, we introduce attention-based mixed convolution modules (ACmix) in the generator and discriminator of SRGAN.
(3) We propose the improved YOLOv5 algorithm by incorporating the Res2net-based C3 module in the backbone. The Res2net-based C3 module can enlarge the convolutional receptive fields and enhance multiscale feature extraction within the backbone network.
(4) This paper also incorporates the CBAM attention mechanism into head layers of YOLO. CBAM can focus on polyp information and enhance overall detection capability.
This paper proposes an algorithm for polyp detection that uses SRGAN to reconstruct low-resolution images, followed by detection with an improved YOLOv5.
Section 2 reviews the pertinent research on deep learning in medical diagnosis, with a focus on polyp detection. In Section 3, we describe the algorithms and framework employed in this study. Section 4 presents experimental results and algorithm comparisons. Finally, a summary of the paper is provided in Section 5.
2. Related Work
Deep learning has made substantial advancements in cancer detection, particularly in the areas of lesion detection and segmentation. Many studies have shown that deep learning models can significantly enhance the accuracy of cancer detection [20,21,22], enabling effective differentiation between normal and cancerous tissue. Tan et al. [23] proposed a small-target breast mass detection network, introducing an adaptive positive sample selection algorithm to automatically select positive samples. This method significantly improved the detection accuracy of small masses in breast mass detection; however, missed detections may still occur in edge regions during breast cancer detection. Moreover, deep learning models rely heavily on large and accurately annotated datasets for training, and insufficient data can compromise their generalization capabilities.
The majority of traditional colorectal polyp detection relies on physicians' medical expertise. Therefore, problems such as false detections and missed detections may arise during the detection process due to physician inexperience and manual mistakes. Accurate and rapid diagnosis of colorectal polyps using AI technology has become possible in the field of medical detection [24,25,26].
With the application of computer vision in the medical field, convolutional neural networks (CNNs) play a crucial role in the segmentation [27,28,29,30] and detection of colorectal polyps. Ozawa et al. [31] demonstrated the viability of CNNs as a polyp detection support system by proposing a polyp classification architecture based on a single-shot multibox detector (SSD) on a private dataset. In 2020, Kayser et al. [32] employed the RetinaNet network for polyp detection on datasets such as EAD2019 [33], CVC-Clinic [34], ETIS-Larib [35], and Kvasir-SEG [36], aiming to mitigate the impact of image artifacts; they achieved a precision of 53.7% and a recall of 72.6%. Zeng et al. [37] applied a RetinaNet-based model that employed CNNs to capture structural patterns in human colon optical coherence tomography (OCT) images.
One of the critical challenges in gastric polyp detection is the wide range of gastric polyp sizes and shapes. To address this issue, Laddha et al. [38] developed a deep-learning-based feature fusion module, evaluated on the CLV-14SL [39] dataset, achieving a precision of 93%, a recall of 91%, and a mean average precision (mAP) of 91%. Zhang et al. [40] proposed a ResYOLO model that was pretrained on nonmedical data and fine-tuned on colonoscopy images. Tang et al. [41] used a GAN to generate polyp images for YOLO training; accuracy was further improved by using Gaussian blur to simulate blurred images and then deblurring them. Carrinho et al. [42] utilized YOLOv4 and achieved real-time detection through optimization and quantization with NVIDIA TensorRT; however, this optimization may sacrifice generalization ability across different types of images. Tang et al. [43] applied narrow-band imaging (NBI) technology on a private dataset to enhance polyps' contrast and vascular patterns, which positively impacts polyp identification and classification tasks. Chou et al. [44] employed the discrete wavelet transform (DWT) and GAN2 (presumably referring to StyleGAN2) to enhance the discriminative characteristics of polyps. Chen et al. [45] proposed an accelerated R-CNN architecture that leverages self-attention mechanisms for polyp detection, achieving a precision of 94.3%, a recall of 92.5%, and an F1-score of 93.4% on a private dataset.
Deep-learning-based intestinal polyp detection frameworks offer practical benefits for detecting intestinal polyps and reducing missed detection rates. YOLOv5 is the primary framework discussed and employed in this study, as it satisfies the demands for high accuracy and high frame rate in colonoscopy detection scenarios.
3. Materials and Methods
In this paper, we propose a model for intestinal polyp detection that combines SR reconstruction with an improved YOLOv5 algorithm. The overall structure is depicted in Figure 1. The process starts with the low-resolution image ($I^{LR}$) being reconstructed into a super-resolution image ($I^{SR}$) using SRGAN. The $I^{LR}$ images are obtained by applying a Gaussian filter to high-resolution images ($I^{HR}$) and then downsampling. Subsequently, the reconstructed images are input into YOLOv5 to detect colon polyps.
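The degradation step above (Gaussian filtering of the high-resolution image followed by downsampling) can be sketched as follows. This is a minimal NumPy illustration; the kernel size and sigma are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(hr, scale=4, ksize=5, sigma=1.0):
    """Blur a single-channel HR image with a Gaussian filter,
    then downsample by `scale` to obtain the LR counterpart."""
    k = gaussian_kernel(ksize, sigma)
    pad = ksize // 2
    padded = np.pad(hr, pad, mode="edge")
    h, w = hr.shape
    blurred = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            blurred[i, j] = (padded[i:i + ksize, j:j + ksize] * k).sum()
    return blurred[::scale, ::scale]  # strided 4x downsampling

hr = np.random.rand(64, 64)   # stand-in for a high-resolution image
lr = degrade(hr)              # LR image of shape (16, 16)
```

In the actual pipeline, the resulting LR image is fed to the SRGAN generator for 4x reconstruction before detection.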
3.1. Super-Resolution Reconstruction Using SRGAN
Compared to traditional image processing algorithms, SRGAN can generate high-quality images with enhanced details and textures. Furthermore, it achieves visually more realistic results by leveraging deep learning techniques. The SRGAN algorithm is employed to generate high-resolution images from low-resolution inputs. The approach comprises a generator network and a discriminator network. The generator network incorporates multiple residual block structures, followed by sub-pixel convolution layers.
The discriminator network module consists of seven convolutional layers with the LeakyReLU activation function. To differentiate between $I^{SR}$ and $I^{HR}$, two fully connected layers and a sigmoid activation function are added after the convolutional layers. The perceptual loss function of SRGAN [19] is shown in Equation (1):

$$l^{SR} = l_{X}^{SR} + 10^{-3}\, l_{Gen}^{SR} \tag{1}$$

The perceptual loss is composed of two parts: the content loss ($l_{X}^{SR}$) and the adversarial loss ($l_{Gen}^{SR}$).
This study employs the VGG19 network for the content loss, which is represented in Equation (2):

$$l_{VGG/i.j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \right)^2 \tag{2}$$

where $W_{i,j}$ and $H_{i,j}$ are the width and height of the feature map, $i$ and $j$ denote the $j$-th convolutional layer before the $i$-th max-pooling layer, $\phi_{i,j}$ denotes the obtained feature map, and $\phi_{i,j}(I^{HR})_{x,y}$ represents the pixel value at $(x, y)$ in the feature map extracted from $I^{HR}$.
The adversarial loss function is expressed in Equation (3):

$$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big) \tag{3}$$

where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ denotes the probability estimated by the discriminator that the image generated by the generator $G_{\theta_G}(I^{LR})$ is a natural image.
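The loss terms in Equations (1)–(3) can be sketched numerically as follows. The VGG feature extractor and the generator/discriminator networks are stubbed out with plain arrays, so this illustrates the arithmetic only; the $10^{-3}$ weight follows Equation (1).

```python
import numpy as np

def content_loss(phi_hr, phi_sr):
    """Eq. (2): MSE between feature maps of the HR and reconstructed
    images, normalized by the feature-map width and height."""
    w, h = phi_hr.shape
    return np.sum((phi_hr - phi_sr) ** 2) / (w * h)

def adversarial_loss(d_sr):
    """Eq. (3): -log D(G(I_LR)), summed over the discriminator's
    outputs for the generated images."""
    return np.sum(-np.log(d_sr))

def perceptual_loss(phi_hr, phi_sr, d_sr, weight=1e-3):
    """Eq. (1): content loss plus 10^-3-weighted adversarial loss."""
    return content_loss(phi_hr, phi_sr) + weight * adversarial_loss(d_sr)

# stand-in feature maps and discriminator scores
phi_hr = np.ones((8, 8))
phi_sr = np.zeros((8, 8))
d_sr = np.array([0.5, 0.8])   # D's probabilities for two SR images
loss = perceptual_loss(phi_hr, phi_sr, d_sr)
```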
Although SRGAN can improve image quality and increase the pixel count, erroneous detections may still occur, and preserving texture and detail is vitally important in medical image reconstruction. This paper therefore adds the self-attention convolution module ACmix to both the generator and discriminator networks of SRGAN. This addition aims to enhance colorectal polyp detection and help the network produce super-resolution images that closely resemble real images.
The ACmix module effectively combines the advantages of traditional convolution and the self-attention mechanism to improve the network's focus on details. The structure of this module is illustrated in Figure 2.
Figure 3 presents the ACmix architecture, which operates in two stages. In Stage I, the input feature map is projected by three $1 \times 1$ convolutions and reshaped into $N$ pieces, obtaining a rich set of intermediate features. The feature information generated in Stage I is then fed into the convolution and self-attention branches in Stage II. In the convolution branch, the features first pass through a fully connected (dense) layer associated with a convolution kernel of size $k$, after which the feature data are divided into $k^2$ subset feature maps, and, ultimately, a new feature map is generated by shifting and aggregating them. The features in the self-attention branch are separated into $N$ groups, with the three $1 \times 1$ convolution outputs serving as queries, keys, and values for the self-attention computation. The two learnable parameters $\alpha$ and $\beta$ are then used to add the features of the two branches channel-wise. The final feature map output by the ACmix module is given in Equation (4):

$$F_{out} = \alpha F_{att} + \beta F_{conv} \tag{4}$$

where $F_{out}$ represents the final output features, $F_{att}$ represents the output of the self-attention branch, $F_{conv}$ denotes the output of the convolutional branch, and $\alpha$ and $\beta$ are learnable parameters.
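The two-branch fusion of Equation (4) can be sketched on a token matrix as follows. This is a toy illustration: the convolution branch is stubbed as simple local averaging, and the projection weights and the $\alpha$, $\beta$ values are illustrative assumptions, not learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def acmix(x, wq, wk, wv, alpha=0.5, beta=0.5):
    """Toy ACmix fusion for an (N, C) token matrix: three 1x1
    projections feed a self-attention branch and a stand-in
    convolution branch; the outputs are combined per Eq. (4)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[1]), axis=-1)
    f_att = attn @ v                                   # attention branch
    # stand-in for the k x k convolution branch: local averaging
    f_conv = (v + np.roll(v, 1, axis=0) + np.roll(v, -1, axis=0)) / 3
    return alpha * f_att + beta * f_conv               # F_out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))                       # 16 tokens, 8 channels
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
y = acmix(x, wq, wk, wv)                               # shape (16, 8)
```

A real ACmix shares the three projections between both branches, which is the source of its efficiency; the sketch mirrors that by reusing `v` in the convolution stand-in.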
3.2. Improved YOLOv5 Polyp Detection Algorithm
Aiming to enhance the efficiency of the conventional YOLO detection algorithm, this paper proposes improvements to YOLOv5: specifically, a C3 fusion feature extraction module based on Res2Net [46] within the backbone network. Additionally, we incorporate an attention mechanism, CBAM [47], into the detection layer. As a result, an intestinal polyp detection framework based on YOLOv5s is proposed. The precise configuration is exhibited in Figure 4.
3.2.1. C3 Module Fused with Res2Net
The conventional C3 module has limited feature extraction capabilities, as it only employs three convolution layers. The main innovation of Res2Net lies in employing hierarchical cascaded feature group convolutions, which facilitate the enlargement of receptive fields. The finer-grained multibranch structure is used to achieve more effective feature extraction. We propose a new module, the C3_Res2Net module, which combines the C3 module and Res2Net. This module improves the accuracy of the YOLOv5, better extracts features of different scales, and broadens the receptive field. Consequently, it allows for more comprehensive capture of intestinal image feature information. The final C3 convolution module in the backbone network is replaced with the C3_Res2Net module in this article.
The primary architecture of C3_Res2Net is depicted in Figure 5. Initially, the feature map undergoes a $1 \times 1$ convolution, partitioning the features into $s$ subsets $x_i$, with the parameter $s$ set to 4 in this study. Except for $x_1$ and $x_2$, each subset receives the output of the previous branch and performs element-wise addition; a corresponding $3 \times 3$ convolution, denoted by $K_i$, is then applied, thereby expanding the receptive field of the feature convolution. Subsequently, the outputs $y_i$ of the split features are recombined and passed through a $1 \times 1$ convolution to produce the resulting feature. Consequently, within the fused C3_Res2Net module, features are extracted at a finer granularity, allowing for more effective handling of global and local features and thereby improving recognition accuracy. The corresponding mathematical expression is given in Equation (5):

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le s \end{cases} \tag{5}$$
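The hierarchical flow of Equation (5) can be sketched as follows. The $3 \times 3$ convolutions $K_i$ are stubbed as a ReLU so that the split–add–recombine structure stands out; this is an illustration of the data flow, not the trained module.

```python
import numpy as np

def res2net_split(x, s=4, k=None):
    """Res2Net feature flow (Eq. 5) on a (C, H, W) map: channels are
    split into s subsets; each subset after the second adds the
    previous subset's output before its convolution `k` (stubbed as
    a ReLU here), progressively widening the receptive field."""
    if k is None:
        k = lambda t: np.maximum(t, 0)
    xs = np.split(x, s, axis=0)
    ys = [xs[0]]                          # y1 = x1 (identity)
    ys.append(k(xs[1]))                   # y2 = K2(x2)
    for i in range(2, s):
        ys.append(k(xs[i] + ys[i - 1]))   # yi = Ki(xi + y_{i-1})
    return np.concatenate(ys, axis=0)     # recombined before the 1x1 conv

x = np.random.default_rng(0).standard_normal((8, 4, 4))  # 4 subsets of 2 channels
y = res2net_split(x)                                     # shape (8, 4, 4)
```

Because each subset's output feeds the next, the last subset has effectively seen several stacked convolutions, which is how the module enlarges the receptive field without extra layers.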
3.2.2. CBAM Attention Mechanism Module
Attention mechanisms are widely used in machine learning. The accuracy of the network can be affected by the presence of both small polyps and occluded intestinal polyps. To address this problem, we introduce the CBAM before the prediction head, aiming to enhance the accuracy of polyp detection and minimize the impact of the complex intestinal environment. The CBAM is composed of the spatial attention module (SAM) and the channel attention module (CAM). It generates channel and spatial attention feature map information and performs adaptive recalibration of the input feature map.
Figure 6 depicts its primary structure.
The CBAM attention mechanism first passes the input feature values through the CAM, where weighted calculations occur. Subsequently, the processed values are passed to the SAM, where weighted calculations are performed again. The specific calculations are defined by Equations (6) and (7):

$$F' = M_c(F) \otimes F \tag{6}$$

$$F'' = M_s(F') \otimes F' \tag{7}$$

where $F$ represents the input features, $M_c$ denotes the one-dimensional channel attention module, $M_s$ denotes the two-dimensional spatial attention module, $F''$ denotes the output feature value, and $\otimes$ represents element-wise multiplication.
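Equations (6) and (7) can be sketched as follows. The shared-MLP weights are random stand-ins and the SAM's $7 \times 7$ convolution is stubbed as a simple mean of the pooled maps, so this shows the sequential channel-then-spatial weighting rather than a trained CBAM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """M_c: shared MLP over channel-wise average- and max-pooled
    descriptors of a (C, H, W) feature map."""
    avg, mx = f.mean(axis=(1, 2)), f.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)
    return sigmoid(mlp(avg) + mlp(mx))                    # (C,) weights

def spatial_attention(f):
    """M_s: attention over the channel-wise average and max maps
    (the 7x7 convolution is stubbed as a simple mean)."""
    return sigmoid((f.mean(axis=0) + f.max(axis=0)) / 2)  # (H, W) weights

def cbam(f, w1, w2):
    """Eqs. (6)-(7): F' = M_c(F) * F, then F'' = M_s(F') * F'."""
    f1 = channel_attention(f, w1, w2)[:, None, None] * f
    return spatial_attention(f1)[None, :, :] * f1

rng = np.random.default_rng(1)
f = rng.standard_normal((8, 5, 5))
w1 = rng.standard_normal((2, 8))   # channel reduction: 8 -> 2
w2 = rng.standard_normal((8, 2))   # expansion back: 2 -> 8
out = cbam(f, w1, w2)              # shape (8, 5, 5)
```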
3.3. Evaluation Indexes
Evaluation indexes in deep learning are crucial tools for assessing algorithm performance. This study primarily uses the metrics described below, which facilitate the assessment of algorithmic effectiveness.
3.3.1. SR Evaluation Index
Peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) serve as standard metrics for assessing image reconstruction quality. PSNR measures the fidelity of image reconstruction, while SSIM quantifies the similarity between the reconstructed $I^{SR}$ and $I^{HR}$. The mathematical expressions for PSNR and SSIM are provided in Equations (8) and (9).

$$PSNR = 10 \log_{10} \left( \frac{MAX^2}{MSE} \right) \tag{8}$$

where $MSE$ represents the mean square error between the two images and $MAX$ is the maximum value an image pixel can take. The PSNR value is generally within the range of 20 to 50 dB, with higher PSNR values indicating better image quality.

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{9}$$

where $\mu$ represents the gray-level mean, $\sigma^2$ represents the variance, $\sigma_{xy}$ represents the covariance, and $c_1$ and $c_2$ are constants that keep the equation valid. The SSIM value ranges from −1 to 1; in practical applications, it is typically between 0 and 1. SSIM can measure the similarity between two polyp images: the greater their structural similarity, the higher the SSIM value.
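Both metrics can be computed directly from Equations (8) and (9). The sketch below uses a global (single-window) SSIM for brevity, whereas production implementations slide a local window over the image; the constants follow the common choice $c_1 = (0.01 \cdot 255)^2$ and $c_2 = (0.03 \cdot 255)^2$.

```python
import numpy as np

def psnr(ref, rec, max_val=255.0):
    """Eq. (8): PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """Eq. (9), evaluated once over the whole image rather than
    over sliding windows."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

ref = np.full((8, 8), 100.0)   # toy reference image
rec = ref + 1.0                # reconstruction off by 1 gray level
p = psnr(ref, rec)             # ~48.13 dB
```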
3.3.2. Indices for Object Detection Evaluation
In YOLO detection, evaluation metrics such as precision, recall, mAP (mean average precision), and the F1-score are commonly employed. The specific formulas for these metrics [48] are provided in Equations (10)–(13).

$$Precision = \frac{TP}{TP + FP} \tag{10}$$

$$Recall = \frac{TP}{TP + FN} \tag{11}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{12}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_0^1 p_i(r)\, dr \tag{13}$$

where $TP$ (true positive) represents the number of correct positive predictions, $FP$ (false positive) denotes the number of incorrect positive predictions, and $FN$ (false negative) represents the number of positive instances that the model failed to predict correctly. $Precision$ represents the proportion of positive predictions that are correct, and $Recall$ represents the proportion of actual positives that are predicted correctly. The F1-score, the harmonic mean of $Precision$ and $Recall$, ranges from 0% to 100%; an F1-score of 100% represents the best possible classification performance, and the higher the F1-score, the better the model performs. In Equation (13), $mAP$ is the average of the per-class average precision (AP) values, $N$ is the number of classes, and $p_i(r)$ is the precision of class $i$ as a function of the recall $r$.
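The detection metrics of Equations (10)–(13) reduce to simple arithmetic on the confusion counts; the minimal sketch below takes precomputed per-class AP values as input for the mAP step.

```python
def detection_metrics(tp, fp, fn):
    """Eqs. (10)-(12): precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_ap(ap_per_class):
    """Eq. (13): mAP is the mean of per-class AP values, each being
    the area under that class's precision-recall curve."""
    return sum(ap_per_class) / len(ap_per_class)

p, r, f1 = detection_metrics(tp=90, fp=10, fn=30)  # 0.9, 0.75, ~0.818
```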
5. Conclusions
Some polyps may progress to colorectal cancer. Early colonoscopy can detect and remove such polyps. However, the low resolution of colonoscopy images and the small size of some polyps pose a diagnostic challenge. By using SRGAN for super-resolution reconstruction, the image resolution was increased by a factor of 4, improving the clarity and texture of polyp images and thereby enhancing their visibility.
To address the challenge of misdiagnosis of colorectal polyps attributed to low-resolution images during colonoscopy, this study presents a model integrating an enhanced SRGAN for image super-resolution and an improved YOLOv5s model for polyp detection. First, the study addresses the problem of insufficient resolution in colorectal polyp images by performing super-resolution reconstruction: an improved SRGAN model is employed, incorporating mixed self-attention and convolution modules (ACmix) in both the generator and discriminator, which helps the subsequent convolutional neural networks extract features effectively. Second, the YOLOv5s model is improved by integrating the Res2Net module into the C3 module, resulting in the proposed C3_Res2Net fusion module. This modification increases the receptive fields of the convolutional kernels, thereby enhancing the detection rate of polyps of varying sizes. Additionally, a CBAM attention mechanism is incorporated to augment the model's focus on colorectal polyps. The experimental results indicate that the proposed model exhibits high accuracy in detecting colorectal polyps: with an mAP of 94.2% and a precision of 95.2%, it effectively localizes polyps in the colon. Consequently, the proposed model facilitates efficient detection of polyp locations.
Compared to other detection models, the method proposed in this study demonstrates higher accuracy, making it a practical tool for assisting medical professionals in colorectal polyp detection and reducing the rate of missed diagnoses. However, this heightened accuracy comes with an increase in computational complexity, so future research will explore lightweight target detection models to address these computational challenges. In recent years, more effective super-resolution reconstruction algorithms have been proposed; our next research direction will therefore focus on reconstructing and detecting colorectal polyp images using the latest algorithms. In addition, we aim to combine the advantages of super-resolution algorithms with the YOLO series of detection algorithms, integrating them into a hybrid framework for comprehensive detection.