Article

CPF-UNet: A Dual-Path U-Net Structure for Semantic Segmentation of Panoramic Surround-View Images

1 College of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
2 Research Center for Medical Image Computing, Zhongshan Institute, Changchun University of Science and Technology, Zhongshan 528437, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5473; https://doi.org/10.3390/app14135473
Submission received: 10 May 2024 / Revised: 18 June 2024 / Accepted: 19 June 2024 / Published: 24 June 2024

Abstract

In this study, we propose CPF-UNet, a dual-path UNet architecture designed for efficient pixel-level semantic segmentation. The architecture extends the basic structure of the original UNet, chiefly by adding an attention-guided branch in the encoder part, with the aim of enhancing the model’s ability to comprehensively capture and deeply fuse contextual information. The uniqueness of CPF-UNet lies in its dual-path mechanism, which differs from the dense-connectivity strategy adopted in networks such as UNet++. The dual-path structure effectively integrates deep and shallow features without relying on dense connections, achieving a balanced treatment of image details and overall semantic information. Experiments show that CPF-UNet slightly surpasses the segmentation accuracy of UNet++ while using significantly fewer parameters, thereby improving inference efficiency. We conducted a detailed comparative analysis, evaluating CPF-UNet against UNet++ and other related methods on the same benchmark. The results indicate that CPF-UNet achieves a more favorable balance between accuracy and parameter count, two key performance indicators.

1. Introduction

In the field of autonomous and assisted driving, the utilization of panoramic surround-view technology and Bird’s Eye View (BEV) perception capabilities is essential for ensuring safe vehicle operation in complex environments. This technology is particularly significant in underground parking lots with intricate structures and unstable lighting conditions. The panoramic surround-view system captures continuous images of the surrounding environment through a multi-camera system deployed around the vehicle. These images are then stitched into a 360-degree panoramic view using algorithms such as distortion correction and inverse perspective transformation, providing field-of-view coverage with no blind spots. This enables the vehicle to effectively monitor and identify both dynamic and static objects around it, such as pedestrians, other vehicles, road signs [1], and parking spaces [2]. Consequently, the vehicle’s peripheral perception capabilities are significantly enhanced, leading to improved navigation accuracy and safety.
With its excellent ability to understand and analyze images in detail, deep learning technology has shown broad application prospects in many disciplines, effectively promoting research progress in various fields [3,4,5,6,7]. Although semantic segmentation technology has demonstrated its effectiveness in panoramic surround-view applications [8,9,10], it still faces many challenges. First, the lighting conditions in underground parking lots are usually complex, ranging from very dim to suddenly bright areas, posing extremely high demands on the dynamic range and adaptability of the visual processing system. Second, the spatial constraints and high scene complexity in the parking lot require the algorithm to accurately segment and recognize tightly arranged vehicles and irregular obstacles. Additionally, the special layout and reflective surfaces in underground parking lots can lead to image distortion, which further increases the difficulty of semantic segmentation.

1.1. Motivation

In recent years, deep learning techniques, especially Convolutional Neural Networks (CNNs) and Transformer architectures, have made significant progress in the field of semantic segmentation. Among them, UNet [11] and its improved versions are widely popular due to their excellent segmentation performance. For instance, UNet++ [12] and UNet 3+ [13] retain the advantages of the original UNet’s encoder–decoder structure while further enhancing segmentation accuracy by introducing more contextual information and multi-scale feature fusion mechanisms. Nevertheless, issues such as high computational complexity still persist. With the advancement of Transformers in the field of computer vision, more Transformer models have been combined with the UNet framework. Recent examples include TransUNet [14], Swin-UNet [15], and Swin-UNet++ [16], which integrate Transformer architectures with UNet structures to enhance segmentation performance. Although Swin-UNet++ improves on Swin-UNet, its overall architectural and computational complexity remains high, which may limit its deployment in practical applications. This shows that UNet still plays an important role in various fields and highlights the necessity of further improving the UNet network itself.
To address the above issues, we propose a dual-path UNet architecture. First, to address the problem of contextual information fusion in UNet, we fuse its third and last layers, integrating low-level and high-level information. In this context, we introduce the CAM (Channel Attention Module) and PAM (Position Attention Module) [17] attention modules, which significantly improve the segmentation results by modeling the rich contextual dependencies of local features. The fusion of the third and last layers is then carried out through the FFM (Feature Fusion Module) [18], which utilizes the contextual information from the semantic branch to guide the feature responses of the detail branch. By providing guidance at different scales, we capture feature representations at different scales, inherently encoding multi-scale information. This guidance method facilitates effective communication between the two branches.

1.2. Contribution

This paper’s important contributions are summarized below:
  • We propose the CPF-UNet network, which integrates the dual-path structure into the UNet network, achieving a better balance between accuracy and the number of parameters.
  • To overcome the problem of blurring and jagged edges in BEV images, we propose a cross-layer connection-based feature-fusion method that connects the third and last layers of the decoder in the UNet network, thereby enhancing the fusion of deep and shallow features.
  • We have explored methods to replace dense skip connections, aiming to achieve a better balance between network accuracy and real-time performance.

1.3. Related Work

In the field of semantic segmentation, dual-path structures and encoder–decoder architectures represent two major trends in network design. Dual-path structures achieve the goal of maintaining segmentation accuracy while improving computational efficiency by processing feature information at different resolutions in parallel. For instance, the BiSeNet [19] proposed by Yu Changqian et al. utilizes a dual-path structure to extract detailed features at high resolution and semantic features at low resolution, balancing high-speed inference with high-precision segmentation.
Encoder–decoder architectures are another effective approach to semantic segmentation network design. The Fully Convolutional Network (FCN) [20] proposed by Long et al. marked a milestone in this architecture, paving the way for the application of deep learning in semantic segmentation. Subsequently, Ronneberger et al. introduced UNet [11], which features a symmetric encoder–decoder U-shaped structure and significantly improves segmentation accuracy through skip connections that effectively integrate features from different levels. To further enhance the performance of UNet, Zhou et al. proposed UNet++ [12], featuring denser skip connections that enable deep fusion of spatial information across scales, surpassing the original UNet. With the development of Transformers in computer vision, more Transformer-based models have been integrated into UNet. Chen et al. presented TransUNet [14], a novel model combining Transformers with the UNet structure, tailored for medical image segmentation; it addresses the limitations of existing models in modeling long-range dependencies with local convolutions and in capturing low-level details with Transformer architectures. Cao et al. introduced Swin-UNet [15], a new Transformer-based deep learning architecture for medical image segmentation tasks; this model, a pure Transformer architecture, adopts a UNet-like encoder–decoder design and utilizes skip connections to mitigate the loss of spatial information during downsampling. Liu et al. proposed Swin-UNet++ [16], a model that combines the advantages of the Transformer architecture and UNet++, aiming to better capture long-range dependencies in images.
Ranjan et al. [7] proposed a hybrid neural network architecture combining CNN, LSTM networks, and Transpose CNNs for city-wide traffic congestion prediction. Their work demonstrates the effectiveness of integrating different neural network components to handle complex prediction tasks. This underscores the potential for innovative neural network architectures to significantly improve task-specific performance. However, as network structures become more complex, the number of model parameters and computational requirements also increase, limiting the deployment of these models in practical applications. To overcome this challenge, this study proposes a novel dual-path UNet structure, inspired by the theoretical achievements of dual-path structures [18,19,21,22,23,24,25,26] and attention mechanisms [27]. This structure combines the encoder-decoder backbone of UNet with the characteristics of dual-path structures, and introduces attention mechanisms and feature fusion modules to achieve the goal of maintaining high segmentation accuracy while reducing the number of model parameters. Experiments on the SUPS [28] dataset have validated the practicality and significant effects of this approach, demonstrating a slight improvement in segmentation accuracy while reducing the number of parameters by up to 23.27%. This achievement provides new ideas and methods for research in the field of semantic segmentation.

2. Methods

2.1. Attention Mechanisms

In the early days of deep learning, when networks such as the Fully Convolutional Network (FCN) were used for semantic segmentation, no explicit attention mechanism was introduced; nevertheless, the multi-scale feature fusion built into these network structures already implicitly addressed how to highlight the significant areas of an image during segmentation. In 2017, Hu et al. [29] introduced SENet (Squeeze-and-Excitation Networks), which was the first network to explicitly incorporate an attention mechanism in semantic segmentation. The SE module captures global context from the entire feature map through global average pooling, followed by two fully connected layers that learn weights for each channel, thereby achieving adaptive recalibration of the feature channel responses. CBAM (Convolutional Block Attention Module), proposed in 2018 by Woo et al. [30], integrates both channel attention and spatial attention; its spatial attention map is formed by average pooling and max pooling the feature maps along the channel axis and passing the pooled descriptors through a convolution, directing the network to focus on particular spatial regions. CCNet [31] introduced Criss-Cross Attention, which propagates information along the criss-cross path (the row and column) of each position, effectively exploiting long-range dependencies. DANet [17] combines position attention and channel attention, enabling the model to capture global contextual information in both the spatial and channel dimensions while refining local feature representations. In this paper, we adopt the PAM and CAM modules of DANet.
CAM focuses on meaningful information in the input image. This module reorganizes and transforms the input feature maps, calculates the interdependencies between various channels, and extracts attention scores that reflect the importance of features. These scores are then nonlinearly normalized using the softmax function, converting them into attention weight distributions with probabilistic properties. This process further guides the model to assign different levels of importance to different channels of the input feature maps. Next, CAM multiplies the resulting attention weights with the original feature maps channel by channel, achieving a weighted operation on the content of each channel in the feature maps. Finally, the weighted feature maps are integrated with the unmodified original feature maps, and the integration strength is regulated with a learnable parameter gamma. This ensures that the model can automatically learn the most appropriate attention distribution during the training process. This channel-attention-based mechanism enables the model to intelligently identify and amplify those feature channels that are decisive for the target task, resulting in significant performance improvements in various computer vision tasks such as image classification, object detection, and semantic segmentation.
The structure of the channel attention module is illustrated in Figure 1a. It directly computes the channel attention map $X \in \mathbb{R}^{C \times C}$ from the original features $A \in \mathbb{R}^{C \times H \times W}$. In this process, we reshape $A$ to $\mathbb{R}^{C \times N}$, and then perform a matrix multiplication between $A$ and its transpose. Finally, a softmax layer is applied to the result, yielding the final channel attention map $X \in \mathbb{R}^{C \times C}$, as follows:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}$$

where $x_{ji}$ measures the $i$-th channel’s impact on the $j$-th channel. In addition, we perform a matrix multiplication between the transpose of $X$ and $A$, and reshape the result to $\mathbb{R}^{C \times H \times W}$. Then, we multiply the result by a scale parameter $\beta$ and perform an element-wise sum with $A$ to obtain the final output $E_K \in \mathbb{R}^{C \times H \times W}$ [31]:

$$E_K^{j} = \beta \sum_{i=1}^{C} x_{ji} A_i + A_j$$

where $\beta$ is initialized to 0 and gradually learns an appropriate weight.
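For illustration, the following is a minimal PyTorch sketch of this channel attention computation. It follows the formulas above directly; the module name and the omission of DANet's implementation-level details (e.g., its numerical-stability trick before the softmax) are our own simplifications.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of CAM: channel-wise self-attention over a (B, C, H, W) feature map."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # beta starts at 0, so the module is initially an identity
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, A):                          # A: (B, C, H, W)
        b, c, h, w = A.shape
        A_flat = A.view(b, c, -1)                  # (B, C, N), N = H*W
        energy = torch.bmm(A_flat, A_flat.transpose(1, 2))   # (B, C, C), entries A_i . A_j
        X = self.softmax(energy)                   # channel attention map, rows sum to 1
        out = torch.bmm(X, A_flat).view(b, c, h, w)           # sum_i x_ji * A_i
        return self.beta * out + A                 # E_K
```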
Figure 1. Structure diagram of CAM (a) and PAM (b) [17] (Copyright © 2019, IEEE).
PAM is a position attention module that incorporates the core idea of the self-attention mechanism, and aims to effectively explore and model the complex dependencies between various positions in the input feature maps. The module first applies convolutional operations to extract spatial features from the input feature maps, generating three-dimensional feature maps representing Query, Key, and Value. By comparing the interaction information between Query features and Key features, PAM calculates an energy score matrix that represents the relative importance between positions. These scores are then normalized using the softmax function, converting them into continuous attention weights for each position, which intuitively reflect the relative attention between different positional features. Subsequently, the obtained positional attention weights are multiplied by the corresponding Value features through a dot-product operation, generating position-weighted feature representations. This process enhances the signals at critical positions in the original feature maps while weakening less important positional information. Finally, to achieve optimal feature fusion, the position-weighted features are combined with the original feature maps, with the degree of fusion controlled by a learnable parameter. This allows the model to adaptively adjust its attention to different positional information during training. Overall, the position attention module greatly enhances the model’s spatial understanding and reasoning abilities in solving computer vision tasks by dynamically capturing and emphasizing the relationships and structural information of key positions in the input feature maps, especially for tasks that highly rely on global context and local details.
As illustrated in Figure 1b, given a local feature $A \in \mathbb{R}^{C \times H \times W}$, we first feed it into convolution layers to generate two new feature maps $B$ and $C$, where $\{B, C\} \in \mathbb{R}^{C \times H \times W}$. We then reshape them to $\mathbb{R}^{C \times N}$, where $N = H \times W$ is the number of pixels. After that, we perform a matrix multiplication between the transpose of $C$ and $B$, and apply a softmax layer to calculate the spatial attention map $S \in \mathbb{R}^{N \times N}$ [31]:

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$

where $s_{ji}$ measures the $i$-th position’s impact on the $j$-th position; the more similar the feature representations of two positions are, the greater the correlation between them. Meanwhile, we feed feature $A$ into a convolution layer to generate a new feature map $D \in \mathbb{R}^{C \times H \times W}$ and reshape it to $\mathbb{R}^{C \times N}$. Then, we perform a matrix multiplication between $D$ and the transpose of $S$ and reshape the result to $\mathbb{R}^{C \times H \times W}$. Finally, we multiply it by a scale parameter $\alpha$ and perform an element-wise sum with the features $A$ to obtain the final output $E_M \in \mathbb{R}^{C \times H \times W}$, as follows:

$$E_M^{j} = \alpha \sum_{i=1}^{N} s_{ji} D_i + A_j$$

where $\alpha$ is initialized to 0 and gradually learns to assign more weight.
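A corresponding sketch of the position attention computation is given below; the factor-of-8 channel reduction for the query and key projections follows the DANet reference design and, like the class name, is an assumption rather than part of this paper.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of PAM: spatial self-attention over the N = H*W positions of a feature map."""
    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)   # B
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)     # C
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)        # D
        self.alpha = nn.Parameter(torch.zeros(1))   # initialized to 0, learned during training
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, A):                           # A: (B, C, H, W)
        b, c, h, w = A.shape
        n = h * w
        q = self.query(A).view(b, -1, n).permute(0, 2, 1)   # (B, N, C')
        k = self.key(A).view(b, -1, n)                       # (B, C', N)
        S = self.softmax(torch.bmm(q, k))                    # (B, N, N) spatial attention map
        v = self.value(A).view(b, -1, n)                     # (B, C, N)
        out = torch.bmm(v, S.permute(0, 2, 1)).view(b, c, h, w)   # sum_i s_ji * D_i
        return self.alpha * out + A                          # E_M
```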

2.2. FFM

The FFM module is a neural network module for fusing different feature maps. Its primary function is to integrate feature maps from various sources or layers, fully leveraging the information in each and improving the performance of the model. The FFM receives feature maps from two paths: the Detail Branch and the Semantic Branch. Within the module, DetailBranch_1 and SemanticBranch_2 focus on extracting and processing features at the original scale, while DetailBranch_2 and SemanticBranch_1 extract features from smaller feature maps. This multi-scale processing helps the model attend to different levels of feature information at the same time. Detail_x and Semantic_x are processed through four separate convolutional sequences to generate four different feature branches. Subsequently, feature fusion is carried out using matrix multiplication to combine DetailBranch_1 with SemanticBranch_2, and DetailBranch_2 with SemanticBranch_1. This fusion approach combines features at the channel level and captures the interaction information between feature maps. The specific structure is shown in Figure 2.

2.3. Network Architecture

In this study, we propose a novel network architecture, as illustrated in Figure 3, which aims to improve and optimize the fusion of deep and shallow features. Targeting the decoder part of the original UNet framework, we designed two parallel branches at its third layer: a semantic feature branch, primarily responsible for extracting high-level semantic features from images with deep layers and low channel dimensions, and a detail feature branch, which focuses on capturing local detail information from images with wide channels and shallow layers. After a series of deep convolutions, the semantic feature branch is split into two streams, which are fed into the CAM and PAM for refinement and enhancement by the attention mechanisms. The resulting outputs $E_K$ and $E_M$ are concatenated along the channel dimension to form $\tilde{E} = \mathrm{concat}(E_K, E_M)$. Afterwards, a $1 \times 1$ convolution is applied to $\tilde{E}$ to adjust the number of channels to match that of the Detail Branch.

2.3.1. Main Branch

The resulting feature maps are designated as the Semantic Branch, while the third layer of the decoder is used as the Detail Branch; both are directed into the FFM for feature integration. Inside the FFM, a thorough interaction and fusion take place between the deep features and the refined shallow features. Upon entering the FFM, the Semantic Branch is split into two pathways, referred to as SemanticBranch_1 and SemanticBranch_2. Likewise, the Detail Branch is divided into DetailBranch_1 and DetailBranch_2. Depthwise convolutions are individually applied to DetailBranch_1 and SemanticBranch_1, with their respective input feature maps denoted as $X_{Detail} \in \mathbb{R}^{H \times W \times C}$ and $X_{Semantic} \in \mathbb{R}^{H/4 \times W/4 \times C}$. Taking $X_{Semantic}$ as an example:

$$X_{DW}^{(c)}[m, n] = \sum_{i=1}^{K} \sum_{j=1}^{K} W_{ij}^{(c)} \cdot X_{Semantic}^{(c)}[m + i, n + j]$$

here, $X_{DW}^{(c)}[m, n]$ represents the value of the $c$-th channel at position $(m, n)$ in the output feature map, $W_{ij}^{(c)}$ denotes the element of the depthwise convolution kernel corresponding to channel $c$, and $X_{Semantic}^{(c)}[m + i, n + j]$ is the value of channel $c$ in the input feature map at the relative position $(i, j)$. Subsequently, the feature maps resulting from the depthwise convolution undergo a pointwise convolution to adjust the number of channels:

$$P_j[m, n] = \sum_{i=1}^{C_{in}} w_{ij} \, X_{DW}[m, n, i]$$

where $w_{ij}$ represents the weight of the pointwise convolution and $P_j$ denotes the $j$-th channel of the output feature map. After obtaining $P_S$ and $P_D$ through depthwise separable convolution from the Semantic Branch and Detail Branch, respectively, both primary branches undergo batch normalization. Taking the Semantic Branch as an example:

$$X_{BN\_S} = \gamma \frac{P_S - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

followed by a convolution with a $1 \times 1$ kernel:

$$X_{Conv\_Semantic} = \sum_{i=1}^{1} \sum_{j=1}^{1} W_{ij} \cdot X_{BN\_S}[i, j]$$

following these operations, we obtain $X_{Conv\_Semantic}$ and $X_{Conv\_Detail}$, respectively. Subsequently, $X_{Conv\_Semantic}$ is passed through a sigmoid activation function:

$$X_{Sigmoid\_Semantic} = \sigma(X_{Conv\_Semantic})$$

where $\sigma$ represents the sigmoid activation function, $\mu$ denotes the mean of the batch, $\sigma^2$ the variance, $\gamma$ the scaling factor, and $\beta$ the offset parameter.
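To make this path concrete, the sketch below implements it in PyTorch (depthwise convolution, pointwise convolution, batch normalization, a 1 × 1 convolution, and the sigmoid on the semantic side). Channel counts, padding choices, and the class name are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MainBranchPath(nn.Module):
    """Sketch of one FFM main-branch path: depthwise separable conv -> BN -> 1x1 conv (-> sigmoid)."""
    def __init__(self, channels, apply_sigmoid=False):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)       # X_DW
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False) # P
        self.bn = nn.BatchNorm2d(channels)                                        # X_BN
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)               # X_Conv
        self.apply_sigmoid = apply_sigmoid   # True gives X_Sigmoid_Semantic of Section 2.3.1

    def forward(self, x):
        x = self.pointwise(self.depthwise(x))
        x = self.conv1x1(self.bn(x))
        return torch.sigmoid(x) if self.apply_sigmoid else x
```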

2.3.2. The Additional Semantic Branch

Considering the input feature map $X_{Semantic} \in \mathbb{R}^{H/4 \times W/4 \times C}$, a convolution is performed with a kernel size of 3:

$$X_{Semantic2}^{(c)}[m, n] = \sum_{i=0}^{2} \sum_{j=0}^{2} \sum_{k=0}^{C_{in} - 1} W_{c,k,i,j} \cdot X_{Semantic}^{(k)}[m + i, n + j] + B_c$$

here, $W_{c,k,i,j}$ represents the weight of the $3 \times 3$ convolution kernel corresponding to the $c$-th output channel and the $k$-th input channel, while $B_c$ denotes the bias term for the $c$-th output channel.

Batch normalization:

$$X_{BN\_Semantic2} = \gamma \frac{X_{Semantic2} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Upsampling:

$$X_{4\times Upsample\_Semantic2} = \mathrm{Upsample}(X_{BN\_Semantic2})$$

subsequently, $X_{4\times Upsample\_Semantic2}$ is passed through a sigmoid activation function:

$$X_{Sigmoid\_Semantic2} = \sigma(X_{4\times Upsample\_Semantic2})$$

2.3.3. The Detail Additional Branch

The input feature map is denoted as $X_{Detail} \in \mathbb{R}^{H \times W \times C}$, and a convolution is performed with a kernel size of 3, a stride of 2, and padding of 1:

$$X_{Detail2}^{(c)}[m, n] = \sum_{i=0}^{2} \sum_{j=0}^{2} \sum_{k=0}^{C_{in} - 1} W_{c,k,i,j} \cdot X_{Detail}^{(k)}[(2m - 1) + i, (2n - 1) + j] + b_c$$

Batch normalization:

$$X_{BN\_Detail2} = \gamma \frac{X_{Detail2} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Average pooling:

$$X_{AP\_Detail2}^{(c)}[m, n] = \frac{1}{9} \sum_{i=0}^{2} \sum_{j=0}^{2} X_{BN\_Detail2}^{(c)}[2m + i, 2n + j]$$
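A minimal sketch of these two auxiliary paths (Sections 2.3.2 and 2.3.3) is shown below; the kernel sizes, strides, and 4× upsampling follow the equations above, while bilinear interpolation, channel counts, and class names are our own assumptions.

```python
import torch
import torch.nn as nn

class SemanticAuxPath(nn.Module):
    """3x3 conv -> BN -> 4x upsample -> sigmoid (Section 2.3.2)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, x):                       # x: (B, C, H/4, W/4) -> (B, C, H, W)
        return torch.sigmoid(self.up(self.bn(self.conv(x))))

class DetailAuxPath(nn.Module):
    """3x3 stride-2 conv -> BN -> 3x3 stride-2 average pooling (Section 2.3.3)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                       # x: (B, C, H, W) -> (B, C, H/4, W/4)
        return self.pool(self.bn(self.conv(x)))
```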

2.3.4. Feature Fusion

Through the aforementioned process, we obtain the output results from all branches of both the Detail Branch and the Semantic Branch. These outputs are then combined through element-wise operations, such as pointwise multiplication. The output from the Detail Branch is given by:

$$X_{Detail\_Output} = X_{Conv\_Detail} \times X_{Sigmoid\_Semantic2}$$

the output from the Semantic Branch is given by:

$$X_{Semantic\_Output} = X_{Conv\_Semantic} \times X_{AP\_Detail2}$$

the Semantic Branch output is then upsampled:

$$X_{4\times Upsample\_Semantic3} = \mathrm{Upsample}(X_{Semantic\_Output})$$

and fused with the Detail Branch output through element-wise addition:

$$X_{Sum} = X_{Detail\_Output} + X_{4\times Upsample\_Semantic3}$$

finally, the fused $X_{Sum}$ undergoes an additional convolution followed by batch normalization to yield the final output, which is then concatenated with the third layer of the decoding stage to proceed with further decoding operations.
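Putting the pieces together, the following sketch wires the four paths into the fusion described by the equations above, assuming the hypothetical MainBranchPath, SemanticAuxPath, and DetailAuxPath classes sketched earlier are in scope. It is one plausible reading of Figure 2, not the authors' implementation; in particular, the fusion equations multiply by X_Conv_Semantic directly, which is what the sketch follows.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of the FFM: cross-guided fusion of the Detail and Semantic branches."""
    def __init__(self, channels):
        super().__init__()
        self.detail_main = MainBranchPath(channels)       # -> X_Conv_Detail (full resolution)
        self.detail_aux = DetailAuxPath(channels)         # -> X_AP_Detail2 (1/4 resolution)
        self.semantic_main = MainBranchPath(channels)     # -> X_Conv_Semantic (1/4 resolution)
        self.semantic_aux = SemanticAuxPath(channels)     # -> X_Sigmoid_Semantic2 (full resolution)
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)
        self.out = nn.Sequential(                         # final conv + BN on X_Sum
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, detail, semantic):
        # detail: (B, C, H, W) from the decoder's third layer; semantic: (B, C, H/4, W/4)
        detail_out = self.detail_main(detail) * self.semantic_aux(semantic)     # X_Detail_Output
        semantic_out = self.semantic_main(semantic) * self.detail_aux(detail)   # X_Semantic_Output
        fused = detail_out + self.up(semantic_out)                              # X_Sum
        return self.out(fused)   # then concatenated with the third decoder layer
```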

3. Results

3.1. Dataset

The SUPS dataset is an open-source simulation dataset developed by Fudan University for automatic underground parking. It supports multi-task learning with multiple sensors by aligning multiple semantic labels with continuous images based on timestamps. The dataset consists of 5255 images, which are divided into training, validation, and test sets in a 7:2:1 ratio. It covers eight semantic categories, such as drivable areas and parking lines, and all images have a resolution of 1024 × 1024.
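For reference, such a split can be produced as follows; the wrapper function, seed, and exact rounding are illustrative assumptions and not part of the SUPS toolkit.

```python
import torch
from torch.utils.data import Dataset, random_split

def split_sups(dataset: Dataset, seed: int = 0):
    """Split a dataset into train/val/test subsets in a 7:2:1 ratio."""
    n_total = len(dataset)                 # 5255 for SUPS
    n_train = int(0.7 * n_total)           # 3678 under this rounding
    n_val = int(0.2 * n_total)             # 1051
    n_test = n_total - n_train - n_val     # 526
    return random_split(dataset, [n_train, n_val, n_test],
                        generator=torch.Generator().manual_seed(seed))
```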

3.2. Training Details

In this experiment, Mean Intersection over Union (MIoU) and Recall were selected as the evaluation metrics. For network training, the input images were of size 1024 × 1024 × 3. Stochastic Gradient Descent (SGD) was used as the optimization algorithm, with a learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0001. Training proceeded for 120 epochs, and the loss function was the cross-entropy loss. The experiments were carried out on a Windows system using the PyTorch framework, with Python 3.9 and CUDA 11.7, on an NVIDIA A100 40 GB GPU.
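The corresponding optimizer and loss configuration can be written as below; the training-loop wrapper and variable names are placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, epochs: int = 120, device: str = "cuda"):
    """Training loop with the settings above: SGD(lr=0.01, momentum=0.9, weight_decay=1e-4), cross-entropy loss."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:   # images: (B, 3, 1024, 1024); labels: (B, 1024, 1024) class ids
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```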

3.3. Evaluation

To evaluate the performance of our model and other models mentioned in the literature, we use MIoU and Recall as evaluation metrics [32].

3.3.1. MIoU

MIoU is calculated by averaging, over all categories, the ratio of the intersection to the union of the predicted result and the ground truth.
The calculation formula is as follows:
$$MIoU = \frac{1}{k + 1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$

where $i$ represents the true class, $j$ represents the predicted class, and $p_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$. This is equivalent to:

$$MIoU = \frac{1}{k + 1} \sum_{i=0}^{k} \frac{TP}{FN + FP + TP}$$
where TP stands for True Positives (the number of samples that the model correctly predicts as positive), FP stands for False Positives (the number of samples that the model incorrectly predicts as positive), and FN stands for False Negatives (the number of samples that the model incorrectly predicts as negative).

3.3.2. Recall

Recall represents the proportion of actual positive instances that are correctly predicted as positive by the model. It measures the ability of the model to find all positive instances. The calculation formula is:
$$Recall = \frac{TP}{TP + FN}$$
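Both metrics can be computed from a per-class confusion matrix, as in the sketch below; the function names and the small epsilon guard against empty classes are our own choices.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """conf[i, j] counts pixels whose true class is i and predicted class is j (p_ij above)."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_recall(conf: np.ndarray):
    """Mean IoU and mean per-class recall from a confusion matrix."""
    tp = np.diag(conf).astype(float)       # p_ii
    fn = conf.sum(axis=1) - tp             # sum_j p_ij - p_ii
    fp = conf.sum(axis=0) - tp             # sum_j p_ji - p_ii
    iou = tp / (tp + fp + fn + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    return iou.mean(), recall.mean()
```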

3.4. Segmentation Result

The experimental data presented in Table 1 and Table 2 demonstrate that the proposed CPF-UNet model achieves excellent accuracy in image segmentation tasks. Its Mean Intersection over Union (MIoU) reaches 89.95%, a significant improvement over previous methods, and its Recall increases to 93.33%, demonstrating the outstanding performance of the model in segmenting the entire target region. Notably, despite the significant improvement of CPF-UNet over UNet, it still maintains a relatively low parameter count. Compared to the UNet architecture, the method presented in this paper achieves a substantial performance enhancement at only a limited additional parameter cost (from 32.089 M to 36.202 M), and it requires far fewer parameters than UNet++, as shown in Table 3. This comparison highlights the efficiency of CPF-UNet in balancing segmentation accuracy and model complexity.
As shown in Figure 4, the intuitive visualization comparison further validates the significant improvement in segmentation quality of CPF-UNet compared to the traditional UNet model, while also highlighting its subtle performance advantages over the more advanced UNet++ in certain aspects.
An analysis of the experimental results demonstrates that the proposed CPF-UNet model achieves significant precision advantages over the baseline UNet model in semantic segmentation tasks. This is evident in the network’s capability to handle jagged edges in the fisheye-derived images, as illustrated in Figure 4 (with red arrows indicating the critical areas): in the distant parts of the image, CPF-UNet manages to segment lane lines that are indiscernible to UNet, demonstrating an improved contextual processing capability. Specifically, CPF-UNet achieves a remarkable increase in the core metric of Mean Intersection over Union (MIoU), outperforming UNet by 5.65 percentage points. The Recall has also increased by 2.97 percentage points, reflecting the remarkable improvement in the model’s ability to completely capture the target region. In comparison to the sophisticated UNet++ model, CPF-UNet achieves slightly higher segmentation accuracy, as evidenced by marginal improvements in MIoU and Recall, while also providing faster inference. This suggests that, in enhancing segmentation precision, CPF-UNet has also gained an edge in computational resource utilization, thereby offering a more competitive solution for segmentation tasks that require real-time feedback or operate within constrained resource environments. In summary, CPF-UNet demonstrates superior performance in balancing the two crucial dimensions of segmentation accuracy and computational efficiency.

4. Ablation Experiment and Discussion

In this paper, we have extended the classic UNet architecture by integrating three components: the Channel Attention Module (CAM), the Position Attention Module (PAM), and the Feature Fusion Module (FFM). To systematically investigate the performance gains brought by these modules individually and in combination, we conducted a series of ablation experiments. First, we removed the FFM while retaining CAM and PAM, to explore their contribution to global feature fusion. Second, we eliminated CAM and PAM while retaining only the FFM, to independently validate the role of the FFM feature-fusion mechanism in overall network performance. Both configurations achieved good results, with only about a 0.7 percentage-point drop in MIoU compared to the network with all three modules. Since both of these ablation settings still involve feature-fusion operations, we speculated that the connection between the third and last layers itself brings significant benefits. We therefore focused on a third ablation experiment, completely discarding the three modules and adopting a simplified feature-fusion strategy: we upsampled the last-layer features of the original UNet by a factor of four, while compressing the channel dimension of these lowest-level features with a 1 × 1 convolution to match the dimensions of the third-layer features; the resulting features were then concatenated with the corresponding third-layer features in the decoder. The results showed that even without the CAM, PAM, and FFM modules, by combining high-level semantic information with low-level detail information in this way, we could still achieve segmentation accuracy similar to UNet++. This strongly supports our hypothesis that the third-layer and lowest-level features in the decoder play a crucial role in information transmission and fusion. As seen from the results, the individual CAM, PAM, and FFM modules did not bring significant benefits on their own. We believe that, without the FFM, the attention-weighted features extracted by CAM and PAM cannot be effectively fused and utilized, while without CAM and PAM, the FFM can only fuse the original features and cannot perform targeted feature selection based on task requirements. Therefore, only when the three are used jointly do they form a powerful attention and feature-fusion capability, improving the model’s performance and surpassing the accuracy of UNet++. The specific results are shown in Table 4 and Table 5.
Among them, UNet-f represents adding only the FFM module, UNet-cp represents adding the CAM and PAM modules, and UNet-cat represents connecting only the third layer and the last layer.
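As one possible reading of the simplified strategy used in the UNet-cat variant, the sketch below upsamples the deepest decoder features by 4×, compresses their channels with a 1 × 1 convolution to match the third-layer features, and concatenates the two; the class name and channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleCrossLayerFusion(nn.Module):
    """Sketch of the UNet-cat ablation: fuse the deepest features with the third-layer decoder features."""
    def __init__(self, deep_channels: int, third_channels: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)
        self.reduce = nn.Conv2d(deep_channels, third_channels, kernel_size=1)  # match channel dimensions

    def forward(self, deep_feat, third_feat):
        # deep_feat: (B, C_deep, H/4, W/4) relative to third_feat: (B, C_third, H, W)
        fused = self.reduce(self.up(deep_feat))
        return torch.cat([fused, third_feat], dim=1)   # passed on to the remaining decoder stages
```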

5. Conclusions

Inspired by the bilateral path mechanism adopted by BiSeNet, this study innovatively designed a dual-stream architecture to collaboratively enhance the basic performance of UNet. This architecture successfully mitigates the loss of spatial detail information caused by downsampling operations in deep networks by effectively integrating deep and shallow feature maps. In this study, we utilized the representative SUPS dataset to extensively train and compare multiple semantic segmentation models, including the newly proposed CPF-UNet model, as well as the UNet++ and baseline UNet models as control groups. The experimental results show that, compared to the original UNet, the proposed CPF-UNet achieves a significant improvement in the key metric of Mean Intersection over Union (MIoU), from 84.3 to 89.95, despite only a small increase in the parameter burden (from 32.089 M to 36.202 M). At the same time, compared to UNet++, CPF-UNet not only demonstrates a slightly higher segmentation accuracy, but also significantly reduces the total number of parameters required (the parameter size of UNet++ is 47.193 M), thereby significantly improving the model’s inference efficiency while maintaining high accuracy. Overall, this dual-path architecture design concept possesses a certain degree of universality, making it especially suitable for network structures that may have deficiencies in capturing contextual information, and provides an effective solution for improving the performance of such models.
In our experimental exploration, we discovered that the deep fusion of the third-layer feature map and the last-layer features of UNet plays a pivotal role in integrating shallow detail information with deep semantic information. Notably, by merely concatenating these two specific layers, we were able to achieve performance comparable to UNet++, while significantly reducing the number of model parameters. This innovative insight not only contributes to enhancing the model’s operational efficiency, but also provides strong support for optimizing the model’s inference speed. More broadly speaking, this strategy is also applicable to other densely connected neural network architectures. By leveraging this idea, we can further explore the potential key roles played by different layers in the network and conduct targeted network pruning operations, aiming to further accelerate inference speed and optimize overall performance.

Author Contributions

Dataset search and selection, F.Q.; Network design, F.Q.; Project administration, F.Q.; Paper review, F.Q.; Writing original draft, Q.S.; Visualization, Q.S.; Network construction, Q.S.; Data analysis and interpretation, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jilin Provincial Natural Science Foundation, grant number YDZJ202101ZYTS050.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Peng, L.; Chen, Z.; Fu, Z.; Liang, P.; Cheng, E. BEVSegFormer: Bird’s Eye View Semantic Segmentation From Arbitrary Camera Rigs. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 5924–5932. [Google Scholar]
  2. Lai, C.; Yang, Q.; Guo, Y.; Bai, F.; Sun, H. Semantic Segmentation of Panoramic Images for Real-Time Parking Slot Detection. Remote Sens. 2022, 14, 3874. [Google Scholar] [CrossRef]
  3. Papadeas, I.; Tsochatzidis, L.; Amanatiadis, A.; Pratikakis, I. Real-Time Semantic Image Segmentation with Deep Learning for Autonomous Driving: A Survey. Appl. Sci. 2021, 11, 8802. [Google Scholar] [CrossRef]
  4. Khan, M.Z.; Gajendran, M.K.; Lee, Y.; Khan, M.A. Deep Neural Architectures for Medical Image Semantic Segmentation: Review. IEEE Access 2021, 9, 83002–83024. [Google Scholar] [CrossRef]
  5. Anilkumar, P.; Venugopal, P. Research Contribution and Comprehensive Review towards the Semantic Segmentation of Aerial Images Using Deep Learning Techniques. Secur. Commun. Netw. 2022, 2022, 1–31. [Google Scholar] [CrossRef]
  6. Asgari Taghanaki, S.; Abhishek, K.; Cohen, J.P.; Cohen-Adad, J.; Hamarneh, G. Deep Semantic Segmentation of Natural and Medical Images: A Review. Artif. Intell. Rev. 2021, 54, 137–178. [Google Scholar] [CrossRef]
  7. Ranjan, N.; Bhandari, S.; Zhao, H.P.; Kim, H.; Khan, P. City-Wide Traffic Congestion Prediction Based on CNN, LSTM and Transpose CNN. IEEE Access 2020, 8, 81606–81620. [Google Scholar] [CrossRef]
  8. Ng, M.H.; Radia, K.; Chen, J.; Wang, D.; Gog, I.; Gonzalez, J.E. BEV-Seg: Bird’s Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud. arXiv 2020, arXiv:2006.11436. [Google Scholar]
  9. Li, K.; Wu, X.; Zhang, W.; Yu, W. Bird’s-Eye View Semantic Segmentation for Autonomous Driving through the Large Kernel Attention Encoder and Bilinear-Attention Transform Module. World Electr. Veh. J. 2023, 14, 239. [Google Scholar] [CrossRef]
  10. Liang, T.; Pan, W.; Bao, H.; Fan, X.; Li, H. Bird’s Eye View Semantic Segmentation Based on Improved Transformer for Automatic Annotation. KSII Trans. Internet Inf. Syst. 2023, 17, 1996–2015. [Google Scholar]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  12. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. [Google Scholar]
  13. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNET 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  14. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  15. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  16. Liu, P.; Song, Y.; Chai, M.; Han, Z.; Zhang, Y. Swin-unet++: A Nested Swin Transformer Architecture for Location Identification and Morphology Segmentation of Dimples on 2.25cr1mo0.25v Fractured Surface. Materials 2021, 14, 7504. [Google Scholar] [CrossRef] [PubMed]
  17. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  18. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2020, 129, 3051–3068. [Google Scholar] [CrossRef]
  19. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11217, pp. 325–341. [Google Scholar]
  20. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651. [Google Scholar]
  21. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet For Real-Time Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9711–9720. [Google Scholar]
  22. Tsai, T.H.; Tseng, Y.W. BiSeNet V3: Bilateral Segmentation Network with Coordinate Attention for Real-Time Semantic Segmentation. Neurocomputing 2023, 532, 33–42. [Google Scholar] [CrossRef]
  23. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2020; pp. 9514–9523. [Google Scholar]
  24. Wang, Y.; Zhou, Q.; Liu, J.; Xiong, J.; Gao, G.; Wu, X.; Latecki, L.J. LEDNet: A Lightweight Encoder-Decoder Network for Real-Time Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1860–1864. [Google Scholar]
  25. Fenglei, R.; Lu, Y.; Haibo, Z.; Shiyv, Z.; Xin, H.; Wenxue, X. Real-Time Semantic Segmentation Based on Improved BiSeNet. Opt. Precis. Eng. 2023, 31, 1217–1227. [Google Scholar]
  26. Xu, Q.; Ma, Y.; Wu, J.; Long, C. Faster BiSeNet: A Faster Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–8. [Google Scholar]
  27. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention Mechanisms in Computer Vision: A Survey. Comput. Vis. Media 2021, 8, 331–368. [Google Scholar] [CrossRef]
  28. Hou, J.; Chen, Q.; Cheng, Y.; Chen, G.; Xue, X.; Zeng, T.; Pu, J. SUPS: A Simulated Underground Parking Scenario Dataset for Autonomous Driving. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 2265–2271. [Google Scholar]
  29. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  31. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 6896–6908. [Google Scholar] [CrossRef] [PubMed]
  32. Ying, Y.; Chunping, W.; Qiang, F.; Renke, K.; Weiyi, W.; Tianyong, L. Survey of Evaluation Metrics and Methods for Semantic Segmentation. Comput. Eng. Appl. 2023, 59, 57–69. [Google Scholar]
Figure 2. Feature fusion module.
Figure 3. Overall network architecture diagram.
Figure 4. The first line shows the original BEV image, the second line indicates the ground truth after segmentation, the third line shows the inference result of UNet, and the last line shows the inference result of CPF-UNet. The difference between UNet and CPF-UNet is indicated by the red arrow in the figure.
Table 1. The MIoU of different models on the SUPS dataset.

| Model | MIoU | Drivable Area | Wall | Static Vehicle | Parking Lines | Lane Lines | Collision Avoidance Strips | Speed Bumps | Arrows |
|---|---|---|---|---|---|---|---|---|---|
| UNet | 84.3 | 98.02 | 97.37 | 97.81 | 64.73 | 74.28 | 93.93 | 64.14 | 84.16 |
| UNet++ | 89.7 | 98.56 | 98.1 | 98.72 | 73.42 | 83.1 | 95.43 | 80.14 | 90.11 |
| CPF-UNet | 89.95 | 98.6 | 98.14 | 98.79 | 73.78 | 83.47 | 95.46 | 81.05 | 90.32 |
Table 2. The recall of different models on the SUPS dataset.

| Model | Recall | Drivable Area | Wall | Static Vehicle | Parking Lines | Lane Lines | Collision Avoidance Strips | Speed Bumps | Arrows |
|---|---|---|---|---|---|---|---|---|---|
| UNet | 90.36 | 99.3 | 98.08 | 98.46 | 76.89 | 83.0 | 95.8 | 81.18 | 90.17 |
| UNet++ | 93.06 | 99.57 | 98.53 | 99.08 | 80.72 | 87.98 | 96.62 | 88.66 | 93.34 |
| CPF-UNet | 93.33 | 99.58 | 98.53 | 99.14 | 81.29 | 88.2 | 96.78 | 89.39 | 93.71 |
Table 3. The FLOPs and the parameters for various models.

| Model | FLOPs (G) | Parameters (M) |
|---|---|---|
| UNet | 51.017 | 32.089 |
| UNet++ | 200.383 | 47.193 |
| CPF-UNet | 59.289 | 36.212 |
Table 4. The Mean Intersection over Union (MIoU) in ablation studies of various methods under the same number of training epochs.

| Model | MIoU | Drivable Area | Wall | Static Vehicle | Parking Lines | Lane Lines | Collision Avoidance Strips | Speed Bumps | Arrows |
|---|---|---|---|---|---|---|---|---|---|
| UNet-f | 89.21 | 98.54 | 98.1 | 98.73 | 72.15 | 82.71 | 95.32 | 78.58 | 89.52 |
| UNet-cp | 89.20 | 98.56 | 98.14 | 98.85 | 71.51 | 82.67 | 95.26 | 79.17 | 89.44 |
| UNet-cat | 89.19 | 98.53 | 98.08 | 98.72 | 72.09 | 82.61 | 95.28 | 78.75 | 89.45 |
Table 5. The recall in ablation experiments of different methods under the same number of training epochs.

| Model | Recall | Drivable Area | Wall | Static Vehicle | Parking Lines | Lane Lines | Collision Avoidance Strips | Speed Bumps | Arrows |
|---|---|---|---|---|---|---|---|---|---|
| UNet-f | 93.11 | 99.53 | 98.51 | 99.12 | 80.81 | 87.72 | 96.73 | 89.21 | 93.29 |
| UNet-cp | 93.07 | 99.54 | 98.51 | 99.16 | 80.51 | 87.81 | 96.64 | 89.17 | 93.23 |
| UNet-cat | 93.05 | 99.53 | 98.50 | 99.13 | 80.63 | 87.84 | 96.71 | 88.86 | 93.23 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, Q.; Qu, F. CPF-UNet: A Dual-Path U-Net Structure for Semantic Segmentation of Panoramic Surround-View Images. Appl. Sci. 2024, 14, 5473. https://doi.org/10.3390/app14135473

