1. Introduction
High-voltage cable conduit installation has seen widespread adoption in cable routes due to its benefits of easy construction, cost-effectiveness, and minimal impact on subsequent maintenance [1,2].
However, underground cable conduits often encounter issues such as misalignment and the accumulation of gravel and other foreign materials during construction. The process of dragging cables during installation poses a risk to the integrity of the cable insulation layer, potentially leading to underground accidents during grid operation. To address these challenges, pipeline robots have been developed and deployed for defect detection within the conduits. Currently, the prevailing method for defect detection involves personnel manually inspecting pipeline interiors by capturing images with these robots. Consequently, the automation level of defect identification remains inadequate. There is an urgent need to develop an efficient and reliable pipeline defect identification algorithm. Given that cable conduits are typically narrow, the hardware used for inspection is limited in capacity. Therefore, it is crucial to optimize the algorithm model to reduce its complexity while maintaining high accuracy, ensuring it can be deployed effectively within the hardware constraints.
In earlier research, some scholars employed traditional machine learning techniques, such as those based on morphological, geometric, and surface texture features, to detect and diagnose defects [3,4,5]. With the advancement of science and technology, machine learning has profoundly influenced the fields of experimental solid mechanics and industrial surface defect detection, driving the progress of related technologies [6,7]. However, in recent years, rapid advancements in computer vision and artificial intelligence have led to the emergence of deep learning-based image recognition methods, which have proven to be potent tools for surface defect detection and improving detection processes [8].
For instance, Kumar et al. [9] utilized convolutional neural networks (CNNs) for defect recognition in underground drainage pipes, achieving an average testing precision of 86.2%. Qi Li et al. [10] employed a CNN with variously sized convolutional kernels and pooling layers to classify and recognize a two-dimensional matrix converted from a time series, achieving a precision of 98.67%.
Among the plethora of target detection algorithms, the YOLO series of single-stage detection models has shown promising results in defect detection [11,12,13]. Lv et al. [14] reduced the model size of YOLOv7 by replacing conventional convolutional blocks with lightweight modules and added attention mechanisms and SPD convolutional modules, demonstrating high performance in strip steel surface defect detection tasks. Xu et al. [15] enhanced YOLOv5 by integrating an attention mechanism and modifying the loss and activation functions to improve small target detection, achieving a recognition precision of 92.2% for welding defects inside pipelines, which is 9% higher than the original model. Additionally, Yin et al. [16] proposed the VIASP defect identification algorithm based on YOLOv3, which extracts key information from video to achieve automatic defect marking and output an evaluation report.
Compared to other target detection algorithms, YOLOv8 showcases exceptional detection performance and robust generalization capabilities. These attributes render it highly suitable for tackling intricate defect detection scenarios encountered in underground cable conduits. Several scholars have enhanced the performance of YOLOv8 across various tasks by modifying modules and refining the structure [17,18,19].
In this study, we propose a defect recognition algorithm tailored for real-world underground cable conduit scenarios based on an improved YOLOv8 model. This paper further enhances the backbone network and detection head components from the original model. The constructed cable conduit dataset is utilized for both training and testing purposes. Experimental results demonstrate the model’s effectiveness in detecting misalignment and foreign object defects in cable conduits. The main improvements are summarized as follows:
- (1) Defects in underground cable conduits vary widely in scale. To address this, we replace the original Spatial Pyramid Pooling-Fast (SPPF) module with the Atrous Spatial Pyramid Pooling (ASPP) module. This strengthens the model’s ability to extract features across different scales, thereby improving its capability of detecting multi-scale targets.
- (2) Given the low-light conditions during video acquisition in underground cable conduits and the high noise levels in collected data due to the narrow and unstable environment, we incorporate the Convolutional Block Attention Module (CBAM). This mechanism mitigates noise interference, enabling the model to focus more on key pipeline defect areas, thereby enhancing feature extraction and learning capabilities.
- (3) To offset the increase in model parameters resulting from the aforementioned enhancements, we replace the C2f module in the backbone network with the basic module of ShuffleNet V2. This reduction in model parameters does not significantly impact detection precision and makes the model easier to deploy.
2. Related Work
2.1. YOLOv8 Algorithm
As part of the YOLO series [20], the YOLOv8 target detection network enhances accuracy, efficiency, and robustness compared to its predecessors. The YOLOv8 network architecture comprises four components: the Input layer (Input), the Backbone network (Backbone), the feature fusion layer (Neck), and the Detection layer (Head) [21].
The Input layer preprocesses the image by resizing it to the fixed dimensions the model expects. The Backbone network extracts semantic and spatial features from the input image and forwards them to the subsequent layers for target detection. The feature fusion layer incorporates the C2f module, the upsample layer, and the Concat module, and fuses feature maps of different scales into a richer feature representation that improves model performance. In the Detection layer, the decoupled head structure (Decoupled-Head) separates the classification and bounding-box regression tasks, employing distinct loss functions tailored to each. Additionally, an anchor-free approach is used in the sample matching process, so positive and negative samples are determined without predefined anchor boxes, which improves detection speed.
2.2. Improved YOLOv8 Network Model Construction
In order to effectively extract features from defects of varying scales within the complex underground cable conduit system, this paper employs Atrous Spatial Pyramid Pooling (ASPP). This approach expands the receptive field to capture multi-scale features more comprehensively, leveraging global information to enhance model accuracy with only a marginal increase in computational overhead.
Additionally, the CBAM attention mechanism is integrated into the detection head to enhance feature extraction from both channel and spatial dimensions, mitigating external noise interference and improving model generalization.
To enable real-time and accurate identification of cable conduit obstacles for timely cleanup by relevant authorities, it is imperative to reduce model complexity and computational overhead during runtime. The YOLOv8 model’s introduction of the C2f module, along with the incorporation of ASPP and CBAM attention mechanisms, inevitably increases computational demands. Thus, this paper proposes enhancing the YOLOv8 backbone network by adopting base modules from the ShuffleNet V2 [22] architecture to reduce model parameters and expedite recognition.
The following subsections describe the working principles and technical details of each module. The structure of the improved network is illustrated in Figure 1.
2.2.1. Atrous Spatial Pyramid Pooling
The Atrous Spatial Pyramid Pooling (ASPP) module, originally designed for image semantic segmentation tasks, is employed in this paper to enhance target detection by replacing the Spatial Pyramid Pooling-Fast (SPPF) module in YOLOv8. Unlike traditional pooling operations, ASPP enlarges the receptive field without downsampling, thereby improving the model’s ability to detect and recognize targets.
ASPP conducts multi-scale convolutional operations on input feature maps using convolution kernels with varying dilation rates, merging information from different scales. This approach enhances the network’s capacity to perceive targets and comprehend semantics, thereby improving the model’s ability to detect targets across various scales.
The decision to replace the SPPF module with ASPP in YOLOv8 is primarily motivated by ASPP’s advantage in capturing multi-scale information, which aligns well with the complex scenarios encountered in underground cable conduit defects. This adaptability improves accuracy and robustness in detecting targets of different scales.
The ASPP module, illustrated in Figure 2, first applies multi-scale atrous convolution to the input feature map. Different dilation rates “R” allow features to be extracted at several scales, so the model considers small- and large-scale characteristics at the same time. A global pooling operation is also applied to the input feature map to capture its global information. The features from all branches are then concatenated (Concat) along the channel dimension to produce a more comprehensive feature representation. Finally, a pointwise convolution reduces the dimensionality of the merged features, lowering the computational load and yielding the final feature map.
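The effect of the dilation rate on the receptive field can be checked with a short calculation. The sketch below assumes 3 × 3 kernels and the dilation rates 1, 6, 12, and 18; these rates are illustrative values chosen for this example, not the ones used in this paper, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated (atrous) kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that keeps the output size equal to the input size at stride 1
    (valid for odd kernel sizes)."""
    return dilation * (kernel_size - 1) // 2


if __name__ == "__main__":
    # Hypothetical dilation rates R; the paper does not list its values.
    for rate in (1, 6, 12, 18):
        k_eff = effective_kernel_size(3, rate)
        pad = same_padding(3, rate)
        print(f"R={rate:2d}: covers {k_eff}x{k_eff} pixels, padding={pad}")
```

Each branch still holds only 3 × 3 weights per channel pair whatever the rate, which is why the receptive field grows without downsampling and without additional kernel parameters.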
2.2.2. Convolutional Block Attention Module
CBAM (Convolutional Block Attention Module) is a convolutional neural network attention mechanism that integrates both channel attention and spatial attention. This module dynamically learns channel and spatial information within input feature maps to enhance network performance. Compared to other attention mechanisms, CBAM often yields superior results. The realization flow of CBAM is illustrated in Figure 3.
The CBAM process comprises two steps, as shown in Figure 3. First, in the channel attention module, the input feature map is aggregated over its spatial dimensions by global average pooling and global max pooling. The two pooled descriptors pass through two shared fully connected layers, and the results are summed and passed through an activation function to obtain the importance of each channel. Second, in the spatial attention module, the feature map is average-pooled and max-pooled along the channel dimension; the two resulting maps are concatenated and passed through a 7 × 7 convolutional layer to obtain the importance of each spatial position. The channel and spatial attention weights are applied in sequence, each through element-wise multiplication with the feature map it was computed from, to produce the final feature representation.
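The channel attention module can be expressed as

$$M_c(F) = \sigma\left(W_1\left(W_0\left(\mathrm{AvgPool}(F)\right)\right) + W_1\left(W_0\left(\mathrm{MaxPool}(F)\right)\right)\right) \tag{1}$$

where $\sigma$ represents the activation function, $F$ is the feature map, $W_0$ and $W_1$ represent two convolution operations, and $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent average pooling and maximum pooling, respectively.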
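The spatial attention module can be written as

$$M_s(F) = \sigma\left(f^{7\times 7}\left(\left[\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)\right]\right)\right) \tag{2}$$

where $f^{7\times 7}$ represents the convolution operation with a convolution kernel size of 7 × 7, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation of the average-pooled and max-pooled maps along the channel dimension.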
Ultimately, the channel attention and spatial attention weights are applied to the feature map in sequence by multiplication to yield a weighted feature map, as depicted in Equation (3). This operation enables the network to prioritize essential channels and regions more effectively, enhancing its focus on critical aspects of the data.

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{3}$$

where $M_c$ and $M_s$ denote the channel and spatial attention maps, $F'$ and $F''$ represent the output feature maps after channel attention and spatial attention, respectively, and $\otimes$ represents element-wise multiplication.
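As a rough illustration of the channel attention step, the following pure-Python sketch computes one attention weight per channel and rescales the feature map with it. It is a toy example under stated assumptions: the feature map is a nested list indexed as [channel][row][column], the weights `w0` and `w1` are hand-picked placeholders rather than learned values, a ReLU is placed between the two layers as in the original CBAM design, and the sigmoid plays the role of the activation function σ. The spatial attention step is omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(v, w0, w1):
    """Apply W1(ReLU(W0 v)); w0 has shape (C/r, C) and w1 has shape (C, C/r)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(f, w0, w1):
    """Channel attention weights: one value in (0, 1) per channel."""
    # Global average and max pooling over the spatial dimensions.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in f]
    mx = [max(map(max, ch)) for ch in f]
    # Both descriptors go through the same (shared) two-layer network.
    a = shared_mlp(avg, w0, w1)
    m = shared_mlp(mx, w0, w1)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def apply_channel_attention(f, w0, w1):
    """Scale every channel of f by its attention weight."""
    weights = channel_attention(f, w0, w1)
    return [[[wt * px for px in row] for row in ch] for wt, ch in zip(weights, f)]


if __name__ == "__main__":
    # Hypothetical 2-channel, 2x2 feature map and placeholder weights.
    feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
    w0 = [[0.5, -0.5]]      # reduces 2 channels to 1 hidden unit
    w1 = [[1.0], [-1.0]]    # expands back to 2 channels
    print(channel_attention(feature_map, w0, w1))
```

In a trained network the weights would be learned, so channels that carry defect-related responses would receive weights closer to 1 and noisy channels weights closer to 0.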
This paper introduces the CBAM attention mechanism into the detection head to enhance the model’s capability to discern and localize defects within underground cable conduits. By incorporating this mechanism, the model can concentrate more effectively on the defect area, thereby enhancing sensitivity in detecting small defects. Additionally, it helps to suppress background noise and decrease the false detection rate, leading to reliable and stable defect recognition in the intricate underground cable conduit environment.
2.2.3. ShuffleNet
To achieve model lightweighting, this paper replaces the C2f module in the YOLOv8 backbone network with the base module from ShuffleNet V2. ShuffleNet V2 is a convolutional neural network architecture designed for efficient computation and a compact parameter count. Its core idea is to reduce computational complexity and model size through depthwise separable convolution and channel shuffling.
Depthwise separable convolution decomposes the convolution operation into two sequential steps: a depthwise convolution, which filters each input channel separately, and a pointwise (1 × 1) convolution, which combines the channels. This decomposition significantly reduces the number of parameters and the computational cost. Channel shuffling complements it: the channels are divided into groups for convolution and then interleaved across groups, which enables cross-channel information exchange and feature reorganization and thereby preserves the model’s expressive power.
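The parameter saving from this decomposition is easy to quantify. The sketch below counts the weights of a standard convolution and of its depthwise separable counterpart, ignoring biases; the channel counts (64 in, 128 out) are arbitrary values chosen for illustration and are not taken from the model in this paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a depthwise k x k convolution plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
    std = standard_conv_params(64, 128)
    sep = depthwise_separable_params(64, 128)
    print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For this example layer the separable form needs roughly one eighth of the weights, which indicates the kind of reduction the lightweight backbone relies on.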
These design principles enable the model to maintain high accuracy while exhibiting a smaller model size and faster inference speed. This characteristic makes ShuffleNet V2 suitable for resource-constrained environments and mobile deployments.
The basic module of ShuffleNet V2, illustrated in Figure 4a, comprises two branches. The left branch applies a 3 × 3 depthwise convolution followed by a 1 × 1 pointwise convolution to the input feature map. The right branch applies a depthwise convolution together with two 1 × 1 pointwise convolutions. The outputs of the two branches are then concatenated (Concat) along the channel dimension, followed by group convolution with channel shuffling.
After the group convolution, the channels of the resulting feature map are divided into groups and rearranged. The feature map obtained through this channel shuffling effectively integrates information across different channels, as illustrated in Figure 4b.
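Channel shuffling itself is a fixed permutation and can be sketched on a plain list of channel indices. The example below views the channels as a (groups × channels-per-group) grid, transposes it, and flattens it again; the function name and the choice of six channels are illustrative assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    total = len(channels)
    if total % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = total // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # Channels 0-2 belong to group 0, channels 3-5 to group 1.
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, every block of consecutive channels contains members of both groups, so the next grouped convolution sees information from all groups.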
3. Experimentation and Analysis
3.1. Data Collection
The cable conduit dataset utilized in this study originates from a test site constructed specifically for this purpose. The cable conduits are fabricated from fiberglass material, featuring three distinct inner diameter specifications: 175 mm, 225 mm, and 250 mm. This variety aligns with real-world engineering application scenarios. Various misalignment conditions, foreign body positions, and light intensities were simulated during the placement process, ensuring that the ratio of misalignment images to foreign body images remained balanced. In total, 510 images were collected for the dataset.
3.2. Dataset Construction
Considering the limited quantity of original data collected, and to bolster the robustness of the improved algorithm, data augmentation techniques are applied to the original dataset.
The data augmentation process applies several techniques with specific probabilities: vertical and horizontal flipping with a 50% probability; brightness adjustment, randomly varied between 80% and 120%, with a 70% probability; random grid rearrangement, which divides the image into a 3 × 3 grid and rearranges the cells, with a 30% probability; color jittering, which adjusts contrast and saturation between 80% and 120%, with a 20% probability; and piecewise affine transformation, which distorts the image to mimic real-world image distortions, with a 10% probability. After augmentation, the cable conduit dataset is expanded to 1145 images. The augmentation techniques are applied uniformly to all images, without bias towards any specific type of defect image. Consequently, the proportion of foreign matter and misalignment defects remains approximately equal in the augmented dataset. The effect of these data augmentation techniques is illustrated in Figure 5.
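A probability-gated augmentation pipeline of this kind can be sketched as follows. This is a simplified illustration, not the pipeline used in this paper: it treats an image as a nested list of grayscale values in [0, 255], covers only flipping and brightness adjustment, and assumes that the horizontal and vertical flips are each drawn independently with probability 0.5. In a detection setting the bounding boxes would have to be transformed together with the image, which is omitted here.

```python
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def adjust_brightness(img, factor):
    """Scale pixel values by factor and clip to the valid range [0, 255]."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in img]


def augment(img, rng):
    """Apply each transform with its own probability (probabilities from the text)."""
    if rng.random() < 0.5:   # horizontal flip, assumed independent of the vertical flip
        img = hflip(img)
    if rng.random() < 0.5:   # vertical flip
        img = vflip(img)
    if rng.random() < 0.7:   # brightness between 80% and 120%
        img = adjust_brightness(img, rng.uniform(0.8, 1.2))
    return img


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is reproducible
    image = [[10, 20, 30], [40, 50, 60]]
    print(augment(image, rng))
```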
The augmented dataset is annotated using the LabelImg image annotation software. For each annotated image, a corresponding text file is generated, containing the types of targets present in the image along with their bounding box positions and sizes. A total of 1912 valid object labels are obtained through annotation, comprising 724 contaminant labels and 1188 misalignment labels.
The labeled images are then randomly divided into training and validation sets at a ratio of 9:1 for model training and validation purposes. Additionally, a separate test set comprising 80 unlabeled images is collected. Together, these datasets constitute the cable conduit defect recognition dataset, as outlined in Table 1.
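A random 9:1 split can be reproduced with a few lines of standard-library Python. The sketch below is only an illustration: the file names, the fixed seed, and the rounding rule (truncating the training count) are assumptions, so the resulting set sizes need not match those listed in Table 1.

```python
import random


def split_dataset(items, train_ratio=0.9, seed=0):
    """Shuffle a copy of items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Hypothetical file names standing in for the 1145 labeled images.
    images = [f"img_{i:04d}.jpg" for i in range(1145)]
    train, val = split_dataset(images)
    print(len(train), len(val))
```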
3.3. Experimental Deployment Environment
The model utilized in this paper is built upon the PyTorch deep learning framework. The hardware and software environments for conducting the experiments are as follows: Windows 10 operating system, 13th Gen Intel(R) Core(TM) i5-13600KF @3.5GHz CPU, RTX 4070 12G GPU, PyTorch version 2.1.2, and CUDA version 11.8.
3.4. Evaluation Metrics
To quantitatively assess the model’s performance in this paper, three common target detection evaluation metrics, precision, recall, and mean average precision (mAP), are employed.
As illustrated in Figure 6, true positive (TP) represents the number of samples predicted to be positive cases that are indeed positive cases; false positive (FP) denotes the number of samples predicted to be positive cases that are, in reality, negative cases; true negative (TN) signifies the number of samples predicted to be negative cases that are indeed negative cases; and false negative (FN) indicates the number of samples predicted as negative cases that are, in fact, positive cases.
Precision is the ratio of correct positive predictions (TPs) to all positive detections, which include both true positives (TPs) and false positives (FPs). It measures how reliable the model’s detections are, as depicted in Equation (4).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
Recall is the ratio of correct positive predictions (TPs) to all actual positive samples, i.e., true positives plus false negatives (FNs). It quantifies the model’s capability to find all actual defect samples; a higher recall means that fewer defects are missed. The calculation is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
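The mAP is calculated from the precision and recall, as shown in Equations (6) and (7).

$$AP = \int_0^1 P(r)\,dr \tag{6}$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \tag{7}$$

Here, $P(r)$ is the precision as a function of the recall $r$, $AP_i$ is the average precision of the $i$-th target type, and $n$ is the number of target types.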
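These metrics can be computed directly from the counts and from sampled points of the P-R curve. The sketch below is a minimal illustration: the integral of P(r) is approximated with all-point interpolation over a monotone precision envelope, which is a common convention in detection benchmarks but an assumption here, since the exact numerical scheme is not specified in the text. All input values are made up for the example.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when there are no positive detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the P-R curve; inputs must be ordered by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Made-up counts and curve points, for illustration only.
    print(precision(tp=90, fp=10), recall(tp=90, fn=30))
    ap = average_precision([0.5, 1.0], [1.0, 0.5])
    print(ap, mean_average_precision([ap, 0.95]))
```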
3.5. Model Training and Prediction
Once the platform and the cable conduit defect dataset were prepared, formal model training commenced. Considering the computational platform’s capabilities, the number of training epochs was set to 200. Following training, the model’s performance was evaluated by running predictions on the test set. Throughout the training process, the loss value decreased steadily as the number of epochs increased. Convergence was determined when the validation loss value ceased to decrease, as depicted in Figure 7. The performance of the model is more comprehensively evaluated through the P-R curve and confusion matrix, as shown in Figure 8 and Figure 9. The Precision–Recall (P-R) curve illustrates the trade-off between the precision and recall of the model at various thresholds, effectively reflecting the model’s performance across different confidence levels. The confusion matrix provides a detailed evaluation of the model’s classification performance by analyzing specific misclassifications using four indicators: true positive, true negative, false positive, and false negative.
The loss value curve depicted in Figure 7 illustrates a rapid decline in the model’s loss value within the initial 40 epochs, followed by stabilization after 190 epochs. This smooth decrease in the overall loss value curve indicates the model’s strong convergence performance. As shown in Figure 8 and Figure 9, the improved model demonstrates excellent performance in detecting misalignments and contaminants in cable conduits.
Furthermore, Figure 10 visualizes the prediction results of different models on various test sets. Marked sections in the figure indicate defect category, defect location, and prediction confidence. Defect locations are outlined with boxes of varying colors, followed by their categorization and corresponding confidence levels. These results highlight the improved model’s superior overall prediction efficacy.
3.6. Ablation Experiment Analysis
In this study, enhancements are made to both the backbone network and the detection head component of the original model. To validate the effectiveness of these improvements, ablation experiments are conducted on the three enhancement schemes proposed in this paper. The experimental results detailing the impact of different enhancement strategies on the model’s performance are presented in Table 2.
As shown in Table 2, incorporating ASPP into the YOLOv8 model increases the mean average precision by 1.1%, 1.3%, and 0.7% compared to the original model, YOLOv8-Shuffle, and YOLOv8-CBAM, respectively. This underscores the efficacy of the ASPP module in augmenting the model’s feature extraction capacity for multi-scale targets, thereby improving defect detection performance. Furthermore, the integration of the CBAM attention mechanism into the detection head enhances the model’s ability to discern defects within the pipeline by effectively suppressing noise and irrelevant environmental information. Simultaneously, replacing C2f with the ShuffleNet V2 base module in the backbone network results in a significant reduction in model size, indicating the practicality of this modification and facilitating easier deployment of the model. With all three improvements applied together, the model surpasses the original model across all metrics, achieving the highest mean average precision of 97.6%.
These experimental findings underscore the effectiveness of the enhanced underground cable conduit defect detection algorithm, offering substantial advancements in detection capabilities. This provides robust support for the realization of more accurate and efficient cable conduit inspections.
3.7. Comparisons of Different Attention Mechanism Modules
In order to explore which attention mechanism provides the best detection performance in this study, we used YOLOv8 as the benchmark network and inserted SE, CA, and CBAM attention modules at the same location for comparison. The detection results of each module on the cable conduit defect dataset are presented in Table 3. The comparison indicates that the network incorporating the CBAM module performs best in identifying cable conduit defects, achieving an AP value of 94.2%. The network using the CA module shows moderate overall performance in defect detection, though it increases the number of model parameters. The network using the SE module shows minimal improvement in detection performance compared to the original YOLOv8. These differences can be explained as follows. SE attends only to the channel dimension and lacks spatial feature information. The CA attention mechanism calculates attention weights across the entire feature map, resulting in significant computational overhead. In contrast, the CBAM attention mechanism enhances the model’s ability to capture crucial features by simultaneously modeling channel and spatial attention. This approach maintains the integrity of the feature map’s positional and spatial information while effectively capturing the positional information of defect features. Consequently, CBAM enables the model to accurately identify defects such as misalignments and foreign objects in the cable conduit, thereby enhancing overall performance. Overall, the results show that integrating an attention mechanism module into the YOLOv8 network is an effective solution for defect detection.
3.8. Comparison of Detection Capabilities of Different Models
To further validate the superiority of our model, the traditional convolutional neural network algorithm Faster R-CNN, the YOLOv5 algorithm, and the original YOLOv8 algorithm were trained on the cable conduit defect dataset and compared with it. The comparative evaluation primarily includes average precision (AP), mean average precision (mAP), and model inference speed (FPS), as presented in Table 4.
Analysis of the results reveals that our improved model surpasses the original models of Faster R-CNN, YOLOv5, and YOLOv8 in both AP and mAP metrics while also exhibiting higher FPS, enabling efficient completion of the recognition task. In conclusion, our enhanced model proves to be highly effective and outperforms the other three algorithm recognition models in identifying defects in cable conduits.
4. Conclusions
Timely inspection of newly installed cable conduits is crucial for detecting misalignment and accumulated foreign matter, which can lead to cable breakage and subsequent safety hazards.
Traditional detection methods reliant on manual image inspection are inefficient and prone to subjective interpretation; hence, this paper proposes an enhanced algorithm based on YOLOv8 for identifying defects in underground cable conduits. The algorithm integrates three key improvements, namely the ASPP module, the CBAM attention mechanism, and the ShuffleNet V2 lightweight module, to train the model for automated defect detection in urban cable infrastructure. Experimental results demonstrate the efficacy of the proposed model, which achieves a mean average precision of 97.6% on the dataset utilized.
The model’s ability to perform real-time video detection facilitates its practical application in real-world scenarios, offering efficient and precise identification and localization of pipeline defects without relying on manual labor. Nonetheless, the limited scope of the dataset used in this study highlights the need for future research to expand and enhance both the quality and quantity of data, thereby improving the model’s generalizability.