Article

A General Image Super-Resolution Reconstruction Technique for Walnut Object Detection Model

1 School of Information, Yunnan Normal University, Kunming 650500, China
2 Engineering Research Center of Computer Vision and Intelligent Control Technology, Department of Education of Yunnan Province, Kunming 650500, China
3 Centre for Planning and Policy Research, Yunnan Institute of Forest Inventory and Planning, Kunming 650500, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(8), 1279; https://doi.org/10.3390/agriculture14081279
Submission received: 19 July 2024 / Revised: 31 July 2024 / Accepted: 1 August 2024 / Published: 2 August 2024
(This article belongs to the Section Digital Agriculture)

Abstract
Object detection models are commonly used in yield estimation processes in intelligent walnut production. The accuracy of these models in capturing walnut features largely depends on the quality of the input images. Without changing the existing image acquisition devices, this study proposes a super-resolution reconstruction module for drone-acquired walnut images, named Walnut-SR, to enhance the detailed features of walnut fruits in images, thereby improving the detection accuracy of the object detection model. In Walnut-SR, a deep feature extraction backbone network called MDAARB (multilevel depth adaptive attention residual block) is designed to capture multiscale information through multilevel channel connections. Additionally, Walnut-SR incorporates an RRDB (residual-in-residual dense block) branch, enabling the module to focus on important feature information and reconstruct images with rich details. Finally, the CBAM (convolutional block attention module) attention mechanism is integrated into the shallow feature extraction residual branch to mitigate noise in shallow features. In 2× and 4× reconstruction experiments, objective evaluation results show that the PSNR and SSIM for 2× and 4× reconstruction reached 24.66 dB and 0.8031, and 19.26 dB and 0.4991, respectively. Subjective evaluation results indicate that Walnut-SR can reconstruct images with richer detail information and clearer texture features. Comparative experimental results of the integrated Walnut-SR module show significant improvements in mAP50 and mAP50:95 for object detection models compared to detection results using the original low-resolution images.

1. Introduction

In recent years, with the rapid development of artificial intelligence, many deep learning algorithms have been widely applied in agricultural production [1,2]. In large-scale walnut production, accurately predicting walnut yield is crucial for production management and risk prevention. Currently, manual fruit counting in the complex environment of walnut tree plantations is extremely challenging, which is why drone remote sensing technology is used in agricultural production [3]. For walnut yield estimation, object detection algorithms can be employed to detect and count walnuts. However, images captured by drones at high altitudes often have low clarity and contain noise and artifacts [4], affecting the accuracy of object detection algorithms to some extent. Therefore, improving the quality of drone images is more conducive to walnut detection and counting, and is of great significance for walnut production management.
With the rapid development of the agricultural production industry, the increasing demands in production management have been driving the progress of object detection algorithms [5,6]. Currently, many advanced object detection algorithms are applied in agricultural production [7,8]. Jia et al. [9] proposed a convolutional neural network model based on the attention mechanism to estimate the weight of king oyster mushrooms, replacing manual weighing. However, their method requires considerable computational power. Chen et al. proposed an improved YOLOv4 model for detecting and counting bayberry trees in drone images, with experimental results showing that the improved model maintains accuracy while achieving a higher recall rate [10]. Kumar et al. improved the YOLOv5 model for insect identification, with results showing an F1 score close to 0.90 and an mAP of 93% in insect detection tasks [11]. Butera et al. used CNN object detection algorithms to distinguish between pests and nonpest species, achieving an average detection precision of 92.66% [12]. For walnut target detection using drones, model structure improvement or data augmentation can enhance the accuracy of object detection models [13]. However, the quality of the input images still determines whether the object detection model can accurately capture walnut features. In real-world scenarios, high-quality images can be obtained by choosing higher resolution image sensors. However, cost constraints may exist. Thus, image super-resolution reconstruction methods can be used to reconstruct pixels of low-resolution (LR) images, generating high-resolution (HR) images [14]. This approach offers significant advantages in terms of cost and convenience without the need to change image acquisition equipment.
Drone-captured walnut images often suffer from blurriness and low resolution. Performing super-resolution reconstruction on drone images can effectively improve their quality, resulting in clearer and more detailed walnut images. Currently, super-resolution reconstruction is mainly divided into the following two categories: traditional methods and deep learning–based methods [15]. Traditional methods, including interpolation-based [16,17] and learning-based approaches, are relatively simple to implement and run quickly, but the reconstructed images usually exhibit artifacts such as jagged edges and block effects, leading to significant detail loss. In contrast, deep learning–based methods often achieve better reconstruction results. Currently, CNN-based methods have become the mainstream in the field of super-resolution. In recent years, CNN-based super-resolution reconstruction algorithms have been applied in agricultural production [18,19]. Zhao and Chen [20] applied super-resolution methods to reconstruct complex cotton leaf images, and their experimental results showed that SRFBN, which introduced a feedback mechanism, could generate higher-quality images. Li [21] used lightweight enhanced super-resolution CNN, enhanced super-resolution generative adversarial networks, and dual regression networks to restore image quality in the study of the impact of spatial resolution on the internal texture and external shape of crop fields in irrigation areas, successfully reconstructing detailed internal texture information of the fields. He et al. [22] proposed a new generative adversarial model based on the SRGAN model, achieving detailed field weed image acquisition and effectively overcoming the issue of detail loss seen in traditional methods.
CNN-based super-resolution reconstruction methods restore clear high-resolution images from low-resolution images. Due to the powerful feature representation capabilities of deep convolutional neural networks, many researchers have proposed CNN-based image reconstruction algorithms to learn the nonlinear mapping from LR to HR, from SRCNN (super-resolution using convolutional neural networks) [23] to ESPCN (efficient sub-pixel convolutional neural network) [24]. However, as networks deepen, degradation phenomena occur, prompting Kim et al. [25] to propose a deep super-resolution network based on residual learning (very deep super-resolution (VDSR)), which further improved the quality of reconstructed images. In 2014, GANs were introduced [26], and Ledig et al. [27] proposed a super-resolution reconstruction model, SRGAN (super-resolution generative adversarial networks), based on GANs. Although SRGAN significantly improved the overall visual effect of reconstructed images, issues such as artifacts and blurred edges remained. These algorithms still face challenges in effectively extracting information at different scales, resulting in detail loss, blurred edges, and artifacts in the reconstructed images. Although current super-resolution reconstruction algorithms have achieved significant success in processing natural images, their application in agricultural remote sensing images still has many deficiencies. Therefore, designing a super-resolution reconstruction model suitable for drone walnut images is of great importance.
To address a series of challenges, this study designed a CNN-based super-resolution reconstruction module for drone walnut images, named Walnut-SR. This module reconstructs pixels of LR walnut images to generate higher resolution HR images. First, a multilevel depth adaptive attention residual block (MDAARB) was designed as the backbone network of Walnut-SR with dense connections. The nonlinearity of the MDAARB enables deeper feature mapping, helping the module capture complex semantic information in walnut images. Multiple skip connections within the MDAARB can integrate features at different scales, enhancing the module’s ability to capture image details and textures, thereby improving the quality of image reconstruction. Second, the convolutional block attention module (CBAM) [28] was introduced in the shallow feature extraction layer of the Walnut-SR module. The CBAM adaptively adjusts feature map weights based on global and local channel and spatial information, suppressing noise in the image. Lastly, the residual-in-residual dense block (RRDB) [29] was incorporated into the Walnut-SR module, integrating multiple multilevel residual and dense connections. This increases the network’s capacity to capture more complex structures and effectively mitigates artifacts during image reconstruction. Furthermore, this study integrated the Walnut-SR module into walnut object detection models. Extensive experiments were conducted using several mainstream object detection models enhanced by the Walnut-SR module on the walnut data set to verify the effectiveness of this general image super-resolution reconstruction module in improving the performance of walnut target detection.

2. Materials and Methods

In this study, the research area is Changning County, Baoshan City, Yunnan Province, China. We collected walnut image data using drones at low altitudes. After data preprocessing methods such as data annotation and data cleaning, the data were input into the model for training. Additionally, we designed an image super-resolution reconstruction module called Walnut-SR and integrated this module into the walnut object detection model to improve the model’s detection accuracy for small walnut targets.

2.1. Research Process

The research workflow is shown in Figure 1. First, we used the DJI Matrice-300-RTK (DJI, Shenzhen, China) equipped with a Zenmuse P1 Camera sensor to collect the image data of walnut trees. During the flight, the terrain-following mode was used for low-altitude shooting of the walnut trees. After obtaining the image data, we performed data annotation to generate label files corresponding to each walnut image. Then, we improved the walnut object detection model by designing a super-resolution reconstruction module for walnut drone images, called Walnut-SR, and integrated it into the walnut object detection model. Finally, we used the walnut object detection model with the super-resolution reconstruction module to detect walnut images in the test set, verifying the effectiveness of the Walnut-SR module in enhancing the accuracy of the walnut small object detection model.

2.2. Study Area and Data Set

The study area is located in Changning County, Baoshan City, Yunnan Province, China (WGS 84: 24°49′51.51″ N, 99°29′26.02″ E), as shown in Figure 2. This region is situated in southwestern China, with an elevation of approximately 1913 m. It belongs to the subtropical monsoon climate zone, characterized by abundant rainfall, a mild climate, and a long sunshine duration. The area contains a sample of 45 walnut trees, with an average height of 10.83 m and an average basal diameter of 39.81 cm.
In this study, all walnut tree image data were captured by DJI Matrice-300-RTK drones equipped with Zenmuse P1 camera sensors, both manufactured by DJI, China. The image collection took place in August 2022. During the capture process, the drone operated at a preset altitude of 100 m, with a ground sampling distance (GSD) of 1.26 cm per pixel. A total of 180 aerial images were captured, each with dimensions of 5472 × 3648 pixels (Figure 3), saved in JPEG format.
After obtaining the walnut tree image data, we performed data cleaning to remove blurry images and images without walnut fruits. The cleaned walnut images were then cropped to 640 × 640 pixels, yielding a data set of 2490 images in total. Using LabelImg for data annotation, label files were generated to accompany the image data, forming the walnut data set used for training and testing the models [30]. Information about this data set is detailed in Table 1.
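As an illustration of this preprocessing step, the following is a minimal tiling sketch; the folder names and the choice to discard edge remainders that do not fill a full patch are assumptions, since the paper does not describe the exact cropping script.

```python
# Minimal sketch: split each 5472 x 3648 aerial JPEG into non-overlapping 640 x 640 patches.
from pathlib import Path
from PIL import Image

TILE = 640

def tile_image(src: Path, dst_dir: Path) -> None:
    img = Image.open(src)
    w, h = img.size                      # e.g. 5472 x 3648
    dst_dir.mkdir(parents=True, exist_ok=True)
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            patch = img.crop((left, top, left + TILE, top + TILE))
            patch.save(dst_dir / f"{src.stem}_{top}_{left}.jpg")

for path in Path("raw_uav_images").glob("*.jpg"):   # hypothetical folder names
    tile_image(path, Path("walnut_dataset/images"))
```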

2.3. Super-Resolution Reconstruction Module–Walnut-SR

The Walnut-SR module consists of the following three main processes: shallow feature extraction, deep feature extraction, and image reconstruction. This structure is depicted in Figure 4.
In the shallow feature extraction stage, the input LR image $I_{LR} \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width of the low-resolution image in pixels, undergoes a 3 × 3 convolution (Conv) to obtain shallow features $F_0 = \mathrm{Conv}(I_{LR})$. Subsequently, the features are enhanced by the CBAM attention mechanism, which improves channel representation and strengthens the network's ability to extract spatial features across different positions.
The deep feature extraction stage consists of parallel branches composed of multiple RRDB and MDAARB units. The outputs of the RRDB branch and the MDAARB branch are given by Formulas (1) and (2), respectively:

$F_{\mathrm{RRDB}} = f_{\mathrm{RRDB}}^{k}(F_{k-1})$  (1)

$F_{\mathrm{MDAARB}} = f_{\mathrm{MDAARB}}^{k}(F_{k-1})$  (2)

where $k = 1, 2, 3, \ldots, K$, $f_{\mathrm{RRDB}}^{k}(\cdot)$ represents the $k$th function within the RRDB, and $f_{\mathrm{MDAARB}}^{k}(\cdot)$ represents the $k$th function within the MDAARB.
During deep feature extraction, the deep features of the image are obtained from $F_0$. In this process, $F_1, F_2, \ldots, F_k$ denote intermediate features of the image, each of which attends to only a small portion of the information in the input feature map, thereby effectively capturing details and local structural information. Additionally, because the RRDB and MDAARB branches may exhibit strong local responses, a 3 × 3 Conv operation is applied to the output of each branch to further smooth the feature maps, as shown in Formula (3):

$F_{D} = \mathrm{Conv}(F_{\mathrm{RRDB}}) + \mathrm{Conv}(F_{\mathrm{MDAARB}})$  (3)
In the image reconstruction stage, the input consists of the output $F_{\mathrm{CBAM}}$ from the shallow feature extraction network after the CBAM attention mechanism and the output $F_{D}$ from the deep feature extraction network. These are combined via a residual connection, which mitigates the vanishing gradient problem in deep networks to some extent and facilitates the fusion of deep and shallow feature information. The image reconstruction operation producing $I_{SR}$ can be expressed by Formula (4):

$I_{SR} = F_{\mathrm{out}}(F_{\mathrm{CBAM}} + F_{D})$  (4)

where $I_{SR} \in \mathbb{R}^{sH \times sW \times 3}$ and $s$ is the magnification factor for image super-resolution reconstruction. In the reconstruction stage, the spatial resolution of the feature maps is increased by a factor of two at each upsampling step using nearest-neighbor interpolation, followed by convolutional layers that refine the image details. Through this mapping from low resolution to high resolution, the final result is a high-resolution image with rich detail.
In Figure 4, assuming the input image (LR) has a resolution of 160 × 160 pixels, the resolution of the image after passing through the MDAARB branch, CBAM branch, and RRDB branch remains as 160 × 160 pixels. After two rounds of 2× upsampling, the image is reconstructed to a 640 × 640-pixel SR image.
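To make the pipeline above concrete, the following is a minimal PyTorch sketch of the Walnut-SR forward pass corresponding to Formulas (1)–(4); the MDAARB, RRDB, and CBAM branches are passed in as modules, and the feature width of 64 channels is an assumption rather than a value reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WalnutSR(nn.Module):
    """Skeleton of the Walnut-SR forward pass (assumed channel width nf=64)."""
    def __init__(self, mdaarb: nn.Module, rrdb: nn.Module, cbam: nn.Module,
                 nf: int = 64, scale: int = 4):
        super().__init__()
        assert scale in (2, 4)
        self.conv_first = nn.Conv2d(3, nf, 3, padding=1)    # shallow features F0
        self.cbam = cbam                                     # shallow residual branch
        self.mdaarb, self.rrdb = mdaarb, rrdb                # deep branches
        self.conv_mdaarb = nn.Conv2d(nf, nf, 3, padding=1)   # smoothing convs (Formula 3)
        self.conv_rrdb = nn.Conv2d(nf, nf, 3, padding=1)
        self.up_convs = nn.ModuleList(
            [nn.Conv2d(nf, nf, 3, padding=1) for _ in range(scale // 2)])
        self.conv_last = nn.Conv2d(nf, 3, 3, padding=1)

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        f0 = self.conv_first(lr)
        f_cbam = self.cbam(f0)                               # shallow branch with CBAM
        f_deep = self.conv_rrdb(self.rrdb(f0)) + self.conv_mdaarb(self.mdaarb(f0))
        feat = f_cbam + f_deep                               # fused input to F_out (Formula 4)
        for conv in self.up_convs:                           # x2 nearest-neighbor upsampling steps
            feat = conv(F.interpolate(feat, scale_factor=2, mode="nearest"))
        return self.conv_last(feat)                          # SR image
```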

2.3.1. Multilevel Depth Adaptive Attention Residual Block

One of the main structures in Walnut-SR is the MDAARB branch, as shown in Figure 5. The MDAARB consists of multiple multilevel depth adaptive attention (MDAA) modules, achieving flexible and efficient feature extraction. Before the input feature map is fed into the MDAA, only C / 2 of the channels are passed to the MDAA to reduce redundancy when computing feature information. Channel fusion and output generation occur in the final stage. The MDAA receives a feature map of size H × W × C / 2 . First, a Pointwise Conv adjusts the number of channels, and a nonlinear activation function enhances the network’s nonlinear expressive capability. Then, a 3 × 3 Depthwise Conv performs feature extraction, preserving the independence of each channel by not mixing information between different channels. The results from the Depthwise Conv are then fed into the EffectiveSE module [31], followed by another Pointwise Conv to restore the feature map channels before adding it to the original feature map, yielding the MDAA network’s feature extraction result. The MDAARB integrates feature information from different layers to enhance the network’s expressive capability. Additionally, it includes a Dropout layer for regularization, preventing overfitting during neural network training.
The EffectiveSE module generates channel weights through Global Average Pooling and Pointwise Conv, enabling the model to identify important features. When the EffectiveSE module receives an input feature $X \in \mathbb{R}^{H \times W \times C}$, it first undergoes Global Average Pooling to compress spatial dependencies. This is followed by a 1 × 1 Pointwise Conv, which acts as a fully connected layer, mapping global information to channel weights. The calculation is shown in Formula (5):

$W = F_{PC}(F_{Gap}(X))$  (5)

where $F_{Gap}(X) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X(i,j)$ represents the Global Average Pooling operation, and $F_{PC}$ is the process of generating weights through Pointwise Conv. Finally, $W$ is normalized by the Hardsigmoid function, restricting the generated weights to between 0 and 1. The output of the EffectiveSE module is given by Formula (6):

$X' = X \times \mathrm{Hardsigmoid}(W)$  (6)

where $X' \in \mathbb{R}^{H \times W \times C}$ is the output of the EffectiveSE. The EffectiveSE adaptively weights the channel features, enhancing the information representation capability of the MDAA.
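A compact PyTorch sketch of the EffectiveSE attention (Formulas (5) and (6)) and of a single MDAA unit is given below; the expansion ratio, the activation function, and the dropout rate are not specified in the text and are therefore assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EffectiveSE(nn.Module):
    """Channel attention: GAP -> pointwise conv -> hardsigmoid re-weighting."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)   # pointwise conv as FC layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3), keepdim=True)                     # global average pooling
        w = F.hardsigmoid(self.fc(w))                            # weights restricted to [0, 1]
        return x * w

class MDAA(nn.Module):
    """One multilevel depth adaptive attention unit operating on C/2 channels."""
    def __init__(self, channels: int, expansion: int = 2, p_drop: float = 0.1):
        super().__init__()
        hidden = channels * expansion
        self.pw1 = nn.Conv2d(channels, hidden, 1)                # pointwise conv adjusts channels
        self.act = nn.GELU()                                     # nonlinearity (assumed choice)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise 3x3 conv
        self.ese = EffectiveSE(hidden)
        self.pw2 = nn.Conv2d(hidden, channels, 1)                # restore channel count
        self.drop = nn.Dropout(p_drop)                           # regularization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pw2(self.ese(self.dw(self.act(self.pw1(x)))))
        return x + self.drop(y)                                  # residual addition
```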

2.3.2. Residual in Residual Dense Block

In the SRGAN model, the backbone residual block (RB) uses batch normalization (BN) to enhance the model’s fitting and mapping capabilities. However, this comes with the disadvantage of high computational complexity. Therefore, the upgraded version of the SRGAN model, known as the ESRGAN model, removes all BN layers, which not only reduces computational complexity but also effectively mitigates noise during deep feature extraction.
In the RRDB (Figure 6), three dense blocks transmit feature information through residual connections, helping to avoid gradient information loss to some extent. Each dense block consists of multiple convolutional layers and leaky ReLU (LReLU) layers. The dense connectivity keeps the network deep while maintaining low computational cost. Additionally, the RRDB structure integrates residual connections and dense connections, making it a highly flexible structure that effectively handles multiscale target super-resolution reconstruction tasks.
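The RRDB branch can be sketched in PyTorch as follows, following the ESRGAN-style design cited above (no BN layers); the growth channel count gc and the 0.2 residual scaling are the common ESRGAN defaults and are assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Dense block: stacked conv + LReLU layers with dense connections and a local residual."""
    def __init__(self, nf: int = 64, gc: int = 32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(nf + i * gc, gc, 3, padding=1) for i in range(4)])
        self.conv_last = nn.Conv2d(nf + 4 * gc, nf, 3, padding=1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for conv in self.convs:                       # dense connections: reuse all earlier features
            feats.append(self.lrelu(conv(torch.cat(feats, dim=1))))
        return x + 0.2 * self.conv_last(torch.cat(feats, dim=1))

class RRDB(nn.Module):
    """Residual-in-residual dense block: three dense blocks plus an outer residual."""
    def __init__(self, nf: int = 64, gc: int = 32):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualDenseBlock(nf, gc) for _ in range(3)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 0.2 * self.blocks(x)
```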

2.3.3. Convolutional Block Attention Module

CBAM attention consists of the following two separate submodules: the channel attention module (CAM) and the spatial attention module (SAM). This module operates by applying attention in both the channel and spatial dimensions separately, which helps save on parameters and computational cost. The CBAM structure is shown in Figure 7. When shallow feature information is passed to CBAM, it first uses a residual structure that retains the original features while applying channel attention to the original features, allowing the neural network to focus on specific feature channels. After CAM processing, it generates a channel attention feature, which is then combined with the original feature branch using element-wise operations and fed into the SAM module. Once the CAM-processed feature is obtained, SAM further extracts important semantic information from the features and then multiplies it with the channel attention feature to obtain the final generated feature.
The detailed structures of the CAM module and SAM module are shown in Figure 8a,b, respectively. In Figure 8a, the CAM module receives an input feature map $F \in \mathbb{R}^{H \times W \times C}$, which undergoes Global Max Pooling and Global Average Pooling to produce two 1 × 1 × C feature maps. These two feature maps are then fed into a shared MLP with $C/r$ neurons in the first layer and $C$ neurons in the second layer. The outputs of the shared MLP are summed and normalized to generate the final channel attention feature.
SAM takes the channel attention feature $F' \in \mathbb{R}^{H \times W \times C}$, obtained by multiplying the channel attention feature with the original features. $F'$ undergoes Global Max Pooling and Global Average Pooling separately, and the resulting two feature maps are concatenated. They then pass through a 7 × 7 convolutional layer, followed by sigmoid normalization, to obtain the spatial attention feature.
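A minimal PyTorch sketch of CBAM as described above is shown below; the reduction ratio r = 16 is the value commonly used in the original CBAM paper and is assumed here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over pooled descriptors) followed by spatial attention."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CAM: shared MLP over global average- and max-pooled channel descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # SAM: 7x7 conv over concatenated channel-wise max and mean maps
        s = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```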

2.4. Integration of the Walnut-SR Module into the Object Detection Model

Currently, mainstream object detection models typically take training images as input, first performing feature extraction at various scales, and then upsampling to the required feature map size for localization and classification predictions, as shown in Figure 9a. Despite their overall strong performance in many aspects, these models still face challenges in detecting small objects like walnut fruits. Moreover, when using drones to capture images of walnut trees, the quality of camera sensor imaging and the performance of the drone gimbal significantly impact the image quality of walnut tree images. Under these influences, images often suffer from lack of detail, blurred fruit boundaries, and excessive image noise. Such image data sets, when input into models, can interfere with the model’s ability to learn target features, thereby reducing detection reliability.
Integrating the Walnut-SR module into the object detection model, as depicted in Figure 9b, effectively mitigates the issues introduced during image acquisition. Low-resolution (LR) training images are first processed by the Walnut-SR module, whose super-resolution reconstruction enhances features in images with weak information. This process sharpens the boundaries of walnut targets, restores detail in blurry targets, and filters out some noise. After preprocessing with the Walnut-SR module, the LR images are transformed into super-resolution (SR) results, which are then fed into the object detection model for feature extraction. Ultimately, the object detection model detects small walnut targets in the images.
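Conceptually, the integration in Figure 9b can be expressed as a thin wrapper that runs the super-resolution module before an arbitrary detector, as in the sketch below; whether the SR weights are frozen while the detector is trained is not stated in the text, so freezing them here is an assumption.

```python
import torch
import torch.nn as nn

class SRDetector(nn.Module):
    """Run Walnut-SR preprocessing, then pass the SR image to any detection model."""
    def __init__(self, sr_module: nn.Module, detector: nn.Module):
        super().__init__()
        self.sr = sr_module.eval()              # super-resolution preprocessing stage
        for p in self.sr.parameters():          # assumed: SR weights fixed during detection
            p.requires_grad_(False)
        self.detector = detector

    def forward(self, lr_images: torch.Tensor):
        with torch.no_grad():
            sr_images = self.sr(lr_images)      # LR -> SR reconstruction
        return self.detector(sr_images)         # localization + classification on SR input
```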

3. Experimental Results

3.1. Experimental Setup

All experiments in this study were conducted using the PyTorch framework, CUDA 12.4, and Windows 10 operating system, with an NVIDIA GeForce RTX 3090Ti GPU (24 GB). The experiments were divided into two phases as follows: training of Walnut-SR and training of the walnut object detection model.
In the first phase, the Walnut-SR network was trained using 80 walnut images as the training set. The training data underwent random horizontal flips and random 90° rotations for data augmentation. During training, the L1 loss function [32] and the Adam optimizer [33] with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.99$ were employed, with a batch size of 8 and an initial learning rate of $2 \times 10^{-4}$. Training continued for a total of 500,000 iterations.
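For reference, a minimal training-loop sketch using the reported first-phase hyperparameters is shown below; the WalnutSRDataset class yielding (LR, HR) patch pairs with random flips and 90° rotations is assumed and not shown.

```python
import torch
from torch.utils.data import DataLoader

def train_walnut_sr(model, dataset, iters: int = 500_000, device: str = "cuda"):
    """First-phase training: L1 loss, Adam(beta1=0.9, beta2=0.99), batch 8, lr 2e-4."""
    model = model.to(device).train()
    loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
    optim = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
    criterion = torch.nn.L1Loss()
    step = 0
    while step < iters:
        for lr_img, hr_img in loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            loss = criterion(model(lr_img), hr_img)   # pixel-wise L1 between SR and HR
            optim.zero_grad()
            loss.backward()
            optim.step()
            step += 1
            if step >= iters:
                break
    return model
```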
In the second phase, the walnut object detection model integrated with Walnut-SR was trained. The training and validation sets for the walnut object detection model were split in an 8:2 ratio, consisting of 1594 and 398 images, respectively. Training involved 300 epochs for all object detection models to achieve convergence, with a batch size of 2.

3.2. Evaluation Indicators

This study focuses on evaluating the detection performance of the object detection model integrated with the Walnut-SR module. Therefore, metrics such as P (precision), R (recall), mAP (mean average precision), and parameters are used for evaluation. mAP50 denotes the average precision when the IOU threshold for all classes is set to 0.5, while mAP50:95 represents the average precision when the IOU threshold ranges from 0.5 to 0.95 with a step size of 0.05. mAP is the most common and authoritative evaluation metric. Parameters refer to the sum of the weight parameters for each layer of the model, used as an indicator of model complexity. A larger number of parameters indicates a more complex model.
In evaluating the super-resolution reconstructed images, this study uses the following two objective metrics to assess model performance: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). The higher the PSNR value, the better the image quality. For SSIM, values closer to 1 indicate a higher similarity between the reconstructed image and the original high-resolution image. Additionally, this study employs subjective visual judgment to evaluate the visual quality of the reconstructed images. The downsampled structural similarity index (DS-SSIM) is a variant of SSIM that focuses more on the similarity of large-scale structures in images; it is calculated by downsampling the image and then computing the SSIM value, with values close to 1 indicating higher image quality. For the natural image quality evaluator (NIQE), lower values indicate that the reconstructed image appears more natural.
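As a reference for the objective metrics, the sketch below shows a generic PSNR computation for images scaled to [0, 1]; the exact implementation and color space used in the paper are not specified, so this is only an assumed formulation.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR = 10 * log10(max_val^2 / MSE) between the SR result and the HR reference."""
    mse = torch.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```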

3.3. Experimental Results

3.3.1. Comparison Experiments of Walnut-SR Module Integration

In the entire experiment, all object detection models were trained using the walnut data set, and the same parameters were used for training throughout. We conducted object detection performance tests using images downsampled by a factor of 2 (320 pixels) and by a factor of 4 (160 pixels), followed by super-resolution reconstruction with Walnut-SR. The detection performance of the models before and after integrating the Walnut-SR module was compared. The experimental results are shown in Table 2 and Table 3. From the comparison results of YOLOv3 [34], YOLOv5 [35], YOLOv6 [36], YOLOv7 [37], YOLOv8 [38], and YOLOv9 [39], it can be seen that the object detection models integrated with the Walnut-SR module have better detection performance for walnut targets. All models showed significant improvements in mAP50 and mAP50:95 metrics. The Walnut-SR module achieved the best improvement in object detection for 4× downsampled images. The detection comparison before and after integrating Walnut-SR with a series of object detection models is shown in Figure 10. After integrating Walnut-SR, the detection results show a certain degree of improvement in reducing missed detections and false detections.
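As a sketch of how such low-resolution test inputs can be produced, the snippet below downsamples 640 × 640 images by factors of 2 and 4 and reconstructs the 4× case with Walnut-SR; bicubic downsampling and the variable names (hr_batch, walnut_sr) are assumptions, since the paper does not state the downsampling kernel used.

```python
import torch.nn.functional as F

def make_lr(images, factor: int):
    # Downsample a batch of HR images by the given factor (bicubic kernel assumed).
    return F.interpolate(images, scale_factor=1.0 / factor,
                         mode="bicubic", align_corners=False)

lr2 = make_lr(hr_batch, 2)        # 320 x 320 test inputs
lr4 = make_lr(hr_batch, 4)        # 160 x 160 test inputs
sr4 = walnut_sr(lr4)              # reconstructed back to 640 x 640 before detection
```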
In this experiment, we also tested a lightweight model designed specifically for detecting small walnut targets, w-YOLO [30]. The experimental results show that even though this model already achieves high accuracy, its detection precision for small walnut targets can still be further improved by adding the super-resolution reconstruction module. In Table 4, we compare the FPS of the w-YOLO model before and after integrating the Walnut-SR module. Without Walnut-SR, w-YOLOt0 achieves 50 frames per second (FPS). After adding the Walnut-SR module, w-YOLOt0 achieves 1.18 FPS at 2× downsampling and 2.13 FPS at 4× downsampling. Although FPS decreases after integrating the super-resolution module, it remains sufficient for analyzing static images in offline scenarios.

3.3.2. Comparison Experiments of Super-Resolution Networks

This paper demonstrates the superiority of our network by comparing it with other mainstream super-resolution reconstruction networks. As shown in Table 5, in the 2× reconstruction experiments, our network outperforms the ESRGAN [29] algorithm, which also features an RRDB branch, in terms of PSNR and SSIM metrics. The SSIM and DS-SSIM scores are comparable to those of the SRGAN [27], EDSR_L [40], RCAN [41], and SwinIR [42] algorithms. In the 4× reconstruction tasks, our network achieves PSNR, SSIM, and DS-SSIM values of 21.4266 dB, 0.5960, and 0.9402, respectively, surpassing other super-resolution algorithms. The NIQE metric is on par with other algorithms.
In terms of parameter count, our network has 27.73 M parameters, which is higher than the ESRGAN, SRGAN, RCAN, and SwinIR algorithms, and it also requires more inference time than the other models. Although our network falls slightly short in terms of lightweight design, it effectively performs the super-resolution reconstruction task for walnut fruits. In future work, we will further improve the efficiency of the model.

3.4. Ablation Study

To evaluate each component of the Walnut-SR module, we used MDAARB as the backbone part of Walnut-SR and gradually added the RRDB branch and CBAM attention branch. From the experimental results shown in Table 6, it can be seen that adding the RRDB branch to the MDAARB backbone achieved the best improvement for 2× reconstruction, with PSNR and SSIM metrics increasing by 3.46 dB and 0.0797, respectively. In experiment group 3, after adding the CBAM attention branch, there was a slight loss in 2× reconstruction performance, but an improvement in 4× reconstruction performance. This indicates that CBAM attention enhanced the feature fusion capability of the residual branch, providing greater assistance in the reconstruction process of more blurred images. From the reconstruction results of the 2× downsampled images in Experiment Group 2, it can be seen that the RRDB branch improves image quality but somewhat increases artifacts in the walnut fruits. In the 4× downsampled reconstruction results, the decrease in the NIQE metric indicates that the RRDB branch helps restore a more realistic appearance of low-resolution walnut images.
To validate the subjective visual effect of the Walnut-SR module’s super-resolution reconstruction, we used the walnut data set for 2× and 4× super-resolution reconstruction. The comparison between the original images and the reconstructed images is shown in Figure 11. Observations reveal that the edges of the walnut fruits in the original images are blurry and have artifacts. After reconstruction with the Walnut-SR module, the edge blur and artifacts are effectively alleviated, restoring more feature details in the images and producing textures closer to real high-definition images. In Figure 11a, the edges of the walnuts in the original image are relatively blurry, and the edge transitions are not smooth. After reconstruction, the distinction between the walnut fruits becomes more apparent. Figure 11b shows the comparison between the 4× downsampled original image and the reconstructed image. In the original image, the features of the walnut fruits are very indistinct, making it difficult to discern the edges of the walnut fruits with the human eye. After super-resolution reconstruction, the approximate shapes of the walnut fruits in the reconstructed image are fully presented and can be easily identified by the human eye. However, the reconstructed images still exhibit edge blur, and the detailed features of the walnut fruits are not fully restored.
To verify the effectiveness of Walnut-SR in improving the detection performance of the walnut object detection model, we take the w-YOLO walnut object detection model as an example and display a portion of the detection results before and after integrating Walnut-SR (Figure 12). From Figure 12a–f, it can be seen that in the detection without integrating Walnut-SR, there are frequent instances where leaves are incorrectly identified as walnut fruits. However, after super-resolution reconstruction, these issues are improved. Furthermore, at low resolution, w-YOLO without the assistance of the super-resolution model struggles to identify occluded walnut fruits. After the images are reconstructed by the Walnut-SR module, the contour features of the walnut fruits become clearer. Therefore, Walnut-SR enhances the object detection model’s ability to recognize challenging samples to a certain extent.

4. Discussion

4.1. Comparison of Object Detection Model Performance before and after Integrating the Walnut-SR Module

In the 2× downsampled detection results (Table 2), after integrating Walnut-SR, the P of all object detection models significantly increased, while the R saw only a modest improvement. In the 4× downsampled detection results (Table 3), both P and R of all models were notably improved after integrating Walnut-SR. Combined with Figure 11, it can be seen that the distinction between walnut fruits in the 2× downsampled LR images before and after reconstruction is not very significant. Walnut-SR mainly performs edge sharpening and texture reconstruction on the walnut fruits. Therefore, the Walnut-SR module has limited impact on improving the model’s ability to identify positive samples, and more complex model structures or additional training data may be needed to further enhance the detection accuracy of the object detection model. The reconstruction results for the 4× downsampled images are much better, effectively restoring the characteristic information of the walnut fruits. Thus, the Walnut-SR module not only improved the detection accuracy of the object detection models but also alleviated the issue of missed detections.
After integrating Walnut-SR, there is a significant decrease in the FPS of the object detection model. This is due to the high complexity of the MDAARB and RRDB branches in the Walnut-SR network, which results in a large computational load and affects the model’s real-time performance. Although high real-time performance is not required for offline processing of detection results, future research will involve deploying the object detection model with the super-resolution reconstruction module to edge computing devices on drones. Therefore, before deployment, we will optimize the real-time performance of the model.

4.2. Analysis of Super-Resolution Reconstruction Effects under Different Lighting Conditions

In our walnut data set, there are walnut fruits subjected to different lighting conditions. After model reconstruction, walnut fruits exposed to direct light can recover their basic appearance well. As shown in Figure 13, although the characteristic information of walnut fruits under backlighting is severely lost at 4× downsampling, after super-resolution reconstruction, the contours of the walnuts can be restored, similar to those under direct light, making it easy for the human eye to identify them. As shown in Figure 14, the surface of walnut fruits reconstructed at 2× downsampling is smoother.
The model in this study is capable of performing effective pixel reconstruction of walnut fruits under different lighting conditions and significantly improves the detection accuracy of the object detection model. In future research, we will further investigate the pixel reconstruction effects of walnut fruits under different weather conditions and times, as well as the performance of the object detection model. Additionally, we aim to enhance the real-time performance and generalization ability of the object detection model integrated with the super-resolution reconstruction module.

5. Conclusions

In this study, we proposed a super-resolution reconstruction module, Walnut-SR, to enhance the detection accuracy of small walnut targets. This module mainly comprises the MDAARB branch, RRDB branch, and CBAM branch. The MDAARB branch performs multilevel independent feature extraction, where channel information does not interfere with each other, and adaptive channel weighting helps extract features at different levels. The multilevel feature extraction effectively compensates for the features missed between different levels, allowing the detailed information of walnut fruits in the image to be fully captured. The RRDB branch’s residual and dense connections can effectively extract multiscale target features, adaptively integrating useful features from previous and current local features through local feature fusion. With the introduction of the RRDB branch, Walnut-SR can better focus on key information about walnut fruits, making the reconstructed walnut details richer. Adding the CBAM attention mechanism to the residual branch enhances the discriminative learning ability of the residual branch, strengthening Walnut-SR’s ability to capture shallow feature information. Under subjective visual evaluation, the reconstructed images from 2× downsampling show that the details of walnut fruits are well presented, with the edge contours restored effectively. In 4× downsampling image reconstruction, the Walnut-SR’s reconstruction effect is more evident, making walnut fruits easily distinguishable even when the original images are hard to discern.
Integrating Walnut-SR into the object detection model significantly improves the model’s detection performance for walnut fruits, with notable enhancements in both mAP50 and mAP50:95 metrics. Thanks to the super-resolution reconstruction module, the detection model can better capture the contours and detailed information of walnut fruits. Even in lightweight detection models, the super-resolution reconstruction module can still further improve the detection accuracy.
Although the Walnut-SR module demonstrates good super-resolution reconstruction performance, it still has shortcomings in fully restoring the detailed information of walnut fruits in high-magnification super-resolution reconstruction tasks. Therefore, in future work, we will further enhance the reconstruction capability of the super-resolution module for 4× downsampling images and develop an algorithm that can perform arbitrary magnification super-resolution reconstruction based on the current foundation.

Author Contributions

Conceptualization, M.W. and L.Y.; methodology, M.W.; software, M.W.; validation, M.W., Y.X. and Z.C.; formal analysis, C.Y.; investigation, C.Y.; resources, M.W. and X.Y.; data curation, M.W. and X.Y.; writing—original draft preparation, M.W.; writing—review and editing, M.W.; visualization, M.W.; supervision, L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Key Project of Yunnan Basic Research Program (grant number 202401AS070034) and the Yunnan Provincial Forestry and Grass Science and Technology Innovation Joint Project (grant number 202404CB090002).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Reddy, S.R.G.; Varma, G.P.S.; Davuluri, R.L. Optimized convolutional neural network model for plant species identification from leaf images using computer vision. Int. J. Speech Technol. 2023, 26, 23–50. [Google Scholar] [CrossRef]
  2. Prasad, A.; Mehta, N.; Horak, M.; Bae, W.D. A Two-Step Machine Learning Approach for Crop Disease Detection Using GAN and UAV Technology. Remote Sens. 2022, 14, 4765. [Google Scholar] [CrossRef]
  3. Kumar, Y.P.; Alex, T.J.; Hardin, R.; Searcy, S.W.; Braga-Neto, U.; Popescu, S.C.; Martin, D.E.; Rodriguez, R.; Meza, K.; Enciso, J. Detecting volunteer cotton plants in a corn field with deep learning on UAV remote-sensing imagery. Comput. Electron. Agric. 2023, 204, 107551. [Google Scholar] [CrossRef]
  4. Weng, W.; Huang, H.; Du, Z.; Zhang, L.; Wang, J. A GAN-Based UAV Platform Complex Weather Image Restoration Technology. In International Conference on Autonomous Unmanned Systems; Springer Nature Singapore: Singapore, 2022; pp. 2233–2243. [Google Scholar]
  5. Li, Y.; Wang, C.; Wang, C.; Deng, X.; Zhao, Z.; Chen, S.; Lan, Y. Detection of the foreign object positions in agricultural soils using Mask-RCNN. Int. J. Agric. Biol. Eng. 2023, 16, 220–231. [Google Scholar] [CrossRef]
  6. Li, D.; Li, B.; Kang, S.; Feng, H.; Long, S.; Wang, J. E2CropDet: An efficient end-to-end solution to crop row detection. Expert Syst. Appl. 2023, 227, 120345. [Google Scholar] [CrossRef]
  7. Hou, G.; Chen, H.; Jiang, M.; Niu, R. An Overview of the Application of Machine Vision in Recognition and Localization of Fruit and Vegetable Harvesting Robots. Agriculture 2023, 13, 1814. [Google Scholar] [CrossRef]
  8. Yang, J.; Han, M.; He, J.; Wen, J.; Chen, D.; Wang, Y. Object detection and localization algorithm in agricultural scenes based on YOLOv5. J. Electron. Imaging 2023, 32, 052402. [Google Scholar] [CrossRef]
  9. Jia, J.; Hu, F.; Zhang, X.; Ben, Z.; Wang, Y.; Chen, K. Method of Attention-Based CNN for Weighing Pleurotus eryngii. Agriculture 2023, 13, 1728. [Google Scholar] [CrossRef]
  10. Chen, Y.; Xu, H.; Zhang, X.; Gao, P.; Xu, Z.; Huang, X. An object detection method for bayberry trees based on an improved YOLO algorithm. Int. J. Digit. Earth 2023, 16, 781–805. [Google Scholar] [CrossRef]
  11. Kumar, N.; Nagarathna; Flammini, F. YOLO-Based Light-Weight Deep Learning Models for Insect Detection System with Field Adaption. Agriculture 2023, 13, 741. [Google Scholar] [CrossRef]
  12. Butera, L.; Ferrante, A.; Jermini, M.; Prevostini, M.; Alippi, C. Precise agriculture: Effective deep learning strategies to detect pest insects. IEEE/CAA J. Autom. Sin. 2021, 9, 246–258. [Google Scholar] [CrossRef]
  13. Wei, W.; Cheng, Y.; He, J.; Zhu, X. A review of small object detection based on deep learning. Neural Comput. Appl. 2024, 36, 6283–6303. [Google Scholar] [CrossRef]
  14. Wang, X.; Sun, L.; Chehri, A.; Song, Y. A Review of GAN-Based Super-Resolution Reconstruction for Optical Remote Sensing Images. Remote Sens. 2023, 15, 5062. [Google Scholar] [CrossRef]
  15. Al-Mekhlafi, H.; Liu, S. Single image super-resolution: A comprehensive review and recent insight. Front. Comput. Sci. 2024, 18, 181702. [Google Scholar] [CrossRef]
  16. Gavade, A.; Sane, P. Super resolution image reconstruction by using bicubic interpolation. In Proceedings of the National Conference on Advanced Technologies in Electrical and Electronic Systems, Belgaum, India, January 2014; Volume 10. [Google Scholar]
  17. Irfan, M.A.; Khan, S.; Arif, A.; Khan, K.; Khaliq, A.; Memon, Z.A.; Ismail, M. Single image super resolution technique: An extension to true color images. Symmetry 2019, 11, 464. [Google Scholar] [CrossRef]
  18. Tian, Y.; Zhang, J.; Zhang, Z.; Wu, J. Research on Super-Resolution Enhancement Technology Using Improved Transformer Network and 3D Reconstruction of Wheat Grains. IEEE Access 2024, 12, 62882–62898. [Google Scholar] [CrossRef]
  19. Pu, Z.; Koutti, L.; Masmoudi, L.; de Oliveira, J.V. A super resolution method based on generative adversarial networks with quantum feature enhancement: Application to aerial agricultural images. Neurocomputing 2024, 577, 127346. [Google Scholar] [CrossRef]
  20. Zhao, L.; Chen, L. Comparative study of super resolution methods in complex cotton leaf images. Agric. Technol. 2023, 43, 48–50. (In Chinese) [Google Scholar]
  21. Li, G. The Extraction Method of Crop Planting Structure in Irrigation District Based on Multi-Source Optical Remote Sensing Collaboration. Ph.D. Thesis, Northwest A&F University, Xianyang, China, 2023. (In Chinese) [Google Scholar] [CrossRef]
  22. He, Z.; Zhu, R.; Xu, J. Super-resolution reconstruction of images of weeds in the field based on generative adversarial networks. J. Chin. Agric. Mech. 2023, 44, 154–160. (In Chinese) [Google Scholar]
  23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef]
  24. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  25. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  26. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  27. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  30. Wu, M.; Yun, L.; Xue, C.; Chen, Z.; Xia, Y. Walnut Recognition Method for UAV Remote Sensing Images. Agriculture 2024, 14, 646. [Google Scholar] [CrossRef]
  31. Lee, Y.; Park, J. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13906–13915. [Google Scholar]
  32. Barron, J.T. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4331–4339. [Google Scholar]
  33. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  34. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  35. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  36. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  37. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  38. Jocher, G. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 January 2024).
  39. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  40. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  41. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  42. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
Figure 1. Workflow of this study. Walnut tree image data are collected using a DJI Matrice-300-RTK drone equipped with a Zenmuse P1 Camera. After data annotation processing, a training data set is formed. The data set is then input into an object detection model integrated with the super-resolution reconstruction module for training, ultimately detecting small walnut targets.
Figure 2. Study area of this research. The right image shows the geographical location of Yunnan Province (red area) in China, with green dots indicating the locations of walnut sample forests (WGS 84: 24°49′51.51″ N, 99°29′26.02″ E). The left image is a satellite view of the sample forest, with the red area representing the study area of this paper.
Figure 3. Aerial images of walnut trees from a low-altitude drone perspective.
Figure 4. Structure of the Walnut-SR module. It primarily consists of shallow feature extraction, deep feature extraction, and image reconstruction. The shallow feature extraction network incorporates the CBAM attention mechanism. The deep feature extraction network is composed of RRDB branches and MDAARB branches. After the LR image undergoes shallow and deep feature extraction, the image reconstruction network generates the SR image.
Figure 5. MDAARB network structure. The MDAARB is primarily composed of multiple MDAA modules connected in series, followed by channel concatenation to output the final feature map.
Figure 6. Network structure of the residual in residual dense block (RRDB). Each RRDB block contains three dense blocks. The black, red, and blue arrows indicate connections elicited by different stages.
Figure 7. CBAM structure. It mainly consists of the following two submodules: the channel attention module (CAM) and the spatial attention module (SAM).
Figure 8. The two important components of CBAM attention are depicted in (a) for the CAM network structure and (b) for the SAM network structure.
Figure 9. The integration of the Walnut-SR module into the object detection model. (a) shows the structure of a basic object detection model and (b) depicts the structure of the object detection model integrated with the Walnut-SR module. Prior to feature extraction in the object detection model, a super-resolution reconstruction preprocessing step is embedded to reconstruct LR images into SR images with enhanced detail.
Figure 10. Detection comparison of a series of mainstream object detection models before and after integrating Walnut-SR. The first row of results for each model shows the detection results without Walnut-SR, while the second row shows the detection results after integrating Walnut-SR. (a–z) represent YOLOv3, YOLOv3-SPP, YOLOv3-tiny, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, YOLOv6n, YOLOv6s, YOLOv6m, YOLOv6l, YOLOv6x, YOLOv7, YOLOv7-tiny, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, YOLOv8x, YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, w-YOLOt0, w-YOLOt1, respectively.
Figure 11. Walnut-SR super-resolution reconstruction comparison. (a) shows the comparison between the 2× super-resolution reconstruction and the original image and (b) shows the comparison between the 4× super-resolution reconstruction and the original image.
Figure 12. Visualization of detection results for w-YOLO before and after integrating Walnut-SR. The first row of (a–f) shows the detection results without Walnut-SR, while the second row shows the detection results with Walnut-SR integrated. In the figure, the blue and yellow circles represent missed detections and false detections by the model, respectively.
Figure 13. Comparison of the 4× super-resolution reconstruction effects. (a) is the image before reconstruction and (b) is the image after reconstruction.
Figure 14. Comparison of the 2× super-resolution reconstruction effects. (a) is the image before reconstruction and (b) is the image after reconstruction.
Table 1. The detailed information of the walnut data set.
Label Name | Images Number (Pieces) | Target Number (Pieces)
Walnut | 2490 | 12,138
Table 2. Comparison of object detection models integrated with the Walnut-SR module at 2× downsampling.
Downsample Setting | Model | Scale | Walnut-SR | P | R | mAP50 | mAP50:95 | Parameters (M)
down 2 | YOLOv3 | - | - | 0.928 | 0.96 | 0.935 | 0.651 | 61.50
down 2 | YOLOv3 | - | √ | 0.950 | 0.96 | 0.943 | 0.694 | 89.23
down 2 | YOLOv3-spp | - | - | 0.921 | 0.96 | 0.935 | 0.657 | 62.55
down 2 | YOLOv3-spp | - | √ | 0.948 | 0.96 | 0.945 | 0.699 | 90.28
down 2 | YOLOv3-tiny | - | - | 0.944 | 0.96 | 0.925 | 0.597 | 8.67
down 2 | YOLOv3-tiny | - | √ | 0.963 | 0.97 | 0.928 | 0.638 | 36.40
down 2 | YOLOv5 | n | - | 0.916 | 0.98 | 0.944 | 0.658 | 1.77
down 2 | YOLOv5 | s | - | 0.930 | 0.98 | 0.943 | 0.666 | 7.02
down 2 | YOLOv5 | m | - | 0.923 | 0.97 | 0.944 | 0.666 | 20.87
down 2 | YOLOv5 | l | - | 0.928 | 0.96 | 0.936 | 0.668 | 46.14
down 2 | YOLOv5 | x | - | 0.925 | 0.97 | 0.946 | 0.664 | 86.22
down 2 | YOLOv5 | n | √ | 0.937 | 0.98 | 0.951 | 0.683 | 29.50
down 2 | YOLOv5 | s | √ | 0.947 | 0.97 | 0.944 | 0.692 | 34.75
down 2 | YOLOv5 | m | √ | 0.948 | 0.97 | 0.949 | 0.705 | 48.60
down 2 | YOLOv5 | l | √ | 0.940 | 0.97 | 0.944 | 0.698 | 73.87
down 2 | YOLOv5 | x | √ | 0.960 | 0.97 | 0.949 | 0.700 | 113.95
down 2 | YOLOv6 | n | - | 0.872 | 0.97 | 0.933 | 0.659 | 4.24
down 2 | YOLOv6 | s | - | 0.887 | 0.95 | 0.911 | 0.635 | 16.31
down 2 | YOLOv6 | m | - | 0.886 | 0.97 | 0.931 | 0.663 | 52.00
down 2 | YOLOv6 | l | - | 0.879 | 0.95 | 0.913 | 0.652 | 110.90
down 2 | YOLOv6 | x | - | 0.893 | 0.97 | 0.928 | 0.658 | 173.02
down 2 | YOLOv6 | n | √ | 0.899 | 0.97 | 0.947 | 0.706 | 31.97
down 2 | YOLOv6 | s | √ | 0.908 | 0.97 | 0.941 | 0.694 | 44.04
down 2 | YOLOv6 | m | √ | 0.889 | 0.97 | 0.947 | 0.701 | 79.73
down 2 | YOLOv6 | l | √ | 0.891 | 0.96 | 0.935 | 0.693 | 138.63
down 2 | YOLOv6 | x | √ | 0.906 | 0.97 | 0.943 | 0.701 | 200.75
down 2 | YOLOv7 | - | - | 0.917 | 0.99 | 0.943 | 0.653 | 37.20
down 2 | YOLOv7 | - | √ | 0.944 | 0.98 | 0.945 | 0.692 | 64.93
down 2 | YOLOv7-tiny | - | - | 0.985 | 0.99 | 0.910 | 0.576 | 6.01
down 2 | YOLOv7-tiny | - | √ | 0.992 | 0.98 | 0.917 | 0.609 | 33.74
down 2 | YOLOv8 | n | - | 0.868 | 0.97 | 0.933 | 0.661 | 3.01
down 2 | YOLOv8 | s | - | 0.871 | 0.97 | 0.936 | 0.666 | 11.14
down 2 | YOLOv8 | m | - | 0.883 | 0.97 | 0.931 | 0.661 | 25.86
down 2 | YOLOv8 | l | - | 0.874 | 0.97 | 0.926 | 0.654 | 43.63
down 2 | YOLOv8 | x | - | 0.873 | 0.97 | 0.932 | 0.661 | 68.15
down 2 | YOLOv8 | n | √ | 0.890 | 0.97 | 0.943 | 0.701 | 30.74
down 2 | YOLOv8 | s | √ | 0.891 | 0.98 | 0.947 | 0.709 | 38.87
down 2 | YOLOv8 | m | √ | 0.898 | 0.97 | 0.944 | 0.702 | 53.59
down 2 | YOLOv8 | l | √ | 0.907 | 0.97 | 0.938 | 0.697 | 71.36
down 2 | YOLOv8 | x | √ | 0.903 | 0.98 | 0.945 | 0.707 | 95.88
down 2 | YOLOv9 | t | - | 0.871 | 0.98 | 0.949 | 0.661 | 3.72
down 2 | YOLOv9 | s | - | 0.876 | 0.97 | 0.953 | 0.687 | 9.80
down 2 | YOLOv9 | m | - | 0.866 | 0.98 | 0.956 | 0.684 | 32.88
down 2 | YOLOv9 | c | - | 0.884 | 0.96 | 0.952 | 0.700 | 51.18
down 2 | YOLOv9 | t | √ | 0.898 | 0.98 | 0.953 | 0.707 | 31.45
down 2 | YOLOv9 | s | √ | 0.897 | 0.98 | 0.955 | 0.722 | 37.53
down 2 | YOLOv9 | m | √ | 0.904 | 0.98 | 0.955 | 0.722 | 60.61
down 2 | YOLOv9 | c | √ | 0.905 | 0.97 | 0.958 | 0.731 | 78.91
down 2 | w-YOLO | t0 | - | 0.853 | 0.97 | 0.925 | 0.644 | 5.87
down 2 | w-YOLO | t1 | - | 0.856 | 0.96 | 0.922 | 0.618 | 9.51
down 2 | w-YOLO | t0 | √ | 0.907 | 0.98 | 0.943 | 0.698 | 33.60
down 2 | w-YOLO | t1 | √ | 0.888 | 0.98 | 0.946 | 0.701 | 37.24
Note: “√” indicates integration with the Walnut-SR module.
Table 3. Comparison of object detection models integrated with the Walnut-SR module at 4× downsampling.
Downsample Setting | Model | Scale | Walnut-SR | P | R | mAP50 | mAP50:95 | Parameters (M)
down 4 | YOLOv3 | - | - | 0.910 | 0.82 | 0.742 | 0.407 | 61.50
down 4 | YOLOv3 | - | √ | 0.952 | 0.93 | 0.895 | 0.612 | 89.23
down 4 | YOLOv3-spp | - | - | 0.924 | 0.84 | 0.707 | 0.404 | 62.55
down 4 | YOLOv3-spp | - | √ | 0.947 | 0.93 | 0.894 | 0.608 | 90.28
down 4 | YOLOv3-tiny | - | - | 0.890 | 0.88 | 0.734 | 0.397 | 8.67
down 4 | YOLOv3-tiny | - | √ | 0.957 | 0.94 | 0.882 | 0.573 | 36.40
down 4 | YOLOv5 | n | - | 0.915 | 0.89 | 0.764 | 0.454 | 1.77
down 4 | YOLOv5 | s | - | 0.928 | 0.87 | 0.754 | 0.425 | 7.02
down 4 | YOLOv5 | m | - | 0.929 | 0.88 | 0.778 | 0.457 | 20.87
down 4 | YOLOv5 | l | - | 0.929 | 0.87 | 0.790 | 0.452 | 46.14
down 4 | YOLOv5 | x | - | 0.914 | 0.87 | 0.785 | 0.449 | 86.22
down 4 | YOLOv5 | n | √ | 0.926 | 0.96 | 0.898 | 0.599 | 29.50
down 4 | YOLOv5 | s | √ | 0.957 | 0.96 | 0.906 | 0.621 | 34.75
down 4 | YOLOv5 | m | √ | 0.942 | 0.93 | 0.896 | 0.616 | 48.60
down 4 | YOLOv5 | l | √ | 0.940 | 0.94 | 0.901 | 0.618 | 73.87
down 4 | YOLOv5 | x | √ | 0.946 | 0.94 | 0.900 | 0.613 | 113.95
down 4 | YOLOv6 | n | - | 0.857 | 0.88 | 0.751 | 0.435 | 4.24
down 4 | YOLOv6 | s | - | 0.857 | 0.78 | 0.610 | 0.342 | 16.31
down 4 | YOLOv6 | m | - | 0.870 | 0.85 | 0.682 | 0.401 | 52.00
down 4 | YOLOv6 | l | - | 0.856 | 0.81 | 0.655 | 0.387 | 110.90
down 4 | YOLOv6 | x | - | 0.855 | 0.87 | 0.692 | 0.392 | 173.02
down 4 | YOLOv6 | n | √ | 0.900 | 0.95 | 0.892 | 0.613 | 31.97
down 4 | YOLOv6 | s | √ | 0.906 | 0.95 | 0.889 | 0.605 | 44.04
down 4 | YOLOv6 | m | √ | 0.887 | 0.95 | 0.884 | 0.606 | 79.73
down 4 | YOLOv6 | l | √ | 0.884 | 0.93 | 0.878 | 0.604 | 138.63
down 4 | YOLOv6 | x | √ | 0.902 | 0.95 | 0.884 | 0.607 | 200.75
down 4 | YOLOv7 | - | - | 0.870 | 0.88 | 0.731 | 0.400 | 37.20
down 4 | YOLOv7 | - | √ | 0.936 | 0.97 | 0.892 | 0.600 | 64.93
down 4 | YOLOv7-tiny | - | - | 0.927 | 0.96 | 0.729 | 0.388 | 6.01
down 4 | YOLOv7-tiny | - | √ | 0.983 | 0.97 | 0.856 | 0.527 | 33.74
down 4 | YOLOv8 | n | - | 0.871 | 0.82 | 0.662 | 0.370 | 3.01
down 4 | YOLOv8 | s | - | 0.889 | 0.93 | 0.753 | 0.445 | 11.14
down 4 | YOLOv8 | m | - | 0.858 | 0.86 | 0.730 | 0.424 | 25.86
down 4 | YOLOv8 | l | - | 0.865 | 0.83 | 0.695 | 0.373 | 43.63
down 4 | YOLOv8 | x | - | 0.864 | 0.86 | 0.728 | 0.420 | 68.15
down 4 | YOLOv8 | n | √ | 0.904 | 0.95 | 0.890 | 0.612 | 30.74
down 4 | YOLOv8 | s | √ | 0.890 | 0.95 | 0.892 | 0.613 | 38.87
down 4 | YOLOv8 | m | √ | 0.897 | 0.96 | 0.897 | 0.618 | 53.59
down 4 | YOLOv8 | l | √ | 0.894 | 0.95 | 0.884 | 0.604 | 71.36
down 4 | YOLOv8 | x | √ | 0.897 | 0.95 | 0.887 | 0.610 | 95.88
down 4 | YOLOv9 | t | - | 0.866 | 0.87 | 0.781 | 0.439 | 3.72
down 4 | YOLOv9 | s | - | 0.823 | 0.84 | 0.772 | 0.442 | 9.80
down 4 | YOLOv9 | m | - | 0.888 | 0.86 | 0.791 | 0.479 | 32.88
down 4 | YOLOv9 | c | - | 0.868 | 0.76 | 0.764 | 0.497 | 51.18
down 4 | YOLOv9 | t | √ | 0.893 | 0.96 | 0.912 | 0.625 | 31.45
down 4 | YOLOv9 | s | √ | 0.893 | 0.96 | 0.915 | 0.638 | 37.53
down 4 | YOLOv9 | m | √ | 0.896 | 0.95 | 0.916 | 0.637 | 60.61
down 4 | YOLOv9 | c | √ | 0.902 | 0.93 | 0.917 | 0.648 | 78.91
down 4 | w-YOLO | t0 | - | 0.833 | 0.96 | 0.768 | 0.443 | 5.87
down 4 | w-YOLO | t1 | - | 0.805 | 0.9 | 0.751 | 0.396 | 9.51
down 4 | w-YOLO | t0 | √ | 0.886 | 0.96 | 0.901 | 0.611 | 33.60
down 4 | w-YOLO | t1 | √ | 0.886 | 0.95 | 0.893 | 0.607 | 37.24
Note: “√” indicates integration with the Walnut-SR module.
Table 4. Comparison of FPS for w-YOLO before and after integrating the Walnut-SR module.
Model | Walnut-SR | Downsample Setting | FPS
w-YOLOt0 | × | - | 50
w-YOLOt0 | √ | down 2 | 1.18
w-YOLOt0 | √ | down 4 | 2.13
Note: “√” indicates integration with the Walnut-SR module.
Table 5. Comparison experiments with other mainstream image super-resolution reconstruction networks.
Model | Downsample Setting | PSNR (dB) | SSIM | DS-SSIM | NIQE | Parameters (M) | Inference Time (s)
ESRGAN | down 2 | 25.3059 | 0.7821 | 0.9851 | 4.6295 | 16.70 | 0.40
SRGAN | down 2 | 27.5122 | 0.8542 | 0.9911 | 6.7974 | 1.37 | 0.04
EDSR_L | down 2 | 26.9583 | 0.8382 | 0.9903 | 7.1957 | 40.73 | 0.25
RCAN | down 2 | 27.1333 | 0.8456 | 0.9908 | 8.3306 | 15.44 | 0.20
SwinIR | down 2 | 27.1493 | 0.8458 | 0.9907 | 7.2899 | 11.75 | 0.80
Our | down 2 | 26.6227 | 0.8435 | 0.9828 | 8.0534 | 27.73 | 0.83
ESRGAN | down 4 | 21.3371 | 0.5947 | 0.9394 | 6.9394 | 16.70 | 0.06
SRGAN | down 4 | 21.3525 | 0.5914 | 0.9384 | 6.8006 | 1.52 | 0.01
EDSR_L | down 4 | 21.1442 | 0.5742 | 0.9350 | 6.6678 | 43.09 | 0.15
RCAN | down 4 | 21.1961 | 0.5816 | 0.9365 | 6.9260 | 15.44 | 0.06
SwinIR | down 4 | 21.3868 | 0.5910 | 0.9384 | 6.4596 | 11.90 | 0.15
Our | down 4 | 21.4266 | 0.5960 | 0.9402 | 6.9663 | 27.73 | 0.39
Table 6. Ablation study of Walnut-SR module components.
Group | MDAARB | RRDB | CBAM | ×2 PSNR (dB) | ×2 SSIM | ×2 DS-SSIM | ×2 NIQE | ×4 PSNR (dB) | ×4 SSIM | ×4 DS-SSIM | ×4 NIQE | Number of Parameters
1 | √ | | | 21.28 | 0.7248 | 0.9351 | 5.7271 | 18.94 | 0.4657 | 0.9395 | 6.8131 | 27,733,571
2 | √ | √ | | 24.74 | 0.8045 | 0.9896 | 7.3237 | 18.95 | 0.4867 | 0.9163 | 6.6401 | 27,733,571
3 | √ | √ | √ | 24.66 | 0.8031 | 0.9828 | 8.0548 | 19.26 | 0.4991 | 0.9402 | 6.9663 | 27,734,182
Note: “√” means that the model has the module.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, M.; Yang, X.; Yun, L.; Yang, C.; Chen, Z.; Xia, Y. A General Image Super-Resolution Reconstruction Technique for Walnut Object Detection Model. Agriculture 2024, 14, 1279. https://doi.org/10.3390/agriculture14081279
