Article

Fire-RPG: An Urban Fire Detection Network Providing Warnings in Advance

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
* Author to whom correspondence should be addressed.
Fire 2024, 7(7), 214; https://doi.org/10.3390/fire7070214
Submission received: 23 April 2024 / Revised: 30 May 2024 / Accepted: 25 June 2024 / Published: 26 June 2024

Abstract

Urban fires are characterized by concealed ignition points and rapid escalation, making traditional methods of detecting early stage fire accidents inefficient. We therefore focused on the features of early stage fire accidents, such as faint flames and thin smoke, and established a dataset; we found that these features mostly correspond to medium-sized and small-sized objects. We propose Fire-RPG, a detection model based on YOLOv8s. Firstly, we introduced an extra very small object detection layer to enhance the detection performance for early fire features. Next, we optimized the model structure with the bottleneck of GhostV2Net, which reduced the inference time and the number of parameters. The Wise-IoUv3 loss function was utilized to decrease the harmful effects of low-quality data in the dataset. Finally, we integrated the low-cost yet high-performance RepVGG block and the CBAM attention mechanism to enhance learning capabilities. The RepVGG block strengthens the feature extraction ability of the backbone and neck structures, while CBAM focuses the attention of the model on objects of specific sizes. Our experiments showed that Fire-RPG achieved an mAP of 81.3%, an improvement of 2.2% over the baseline YOLOv8s. In addition, Fire-RPG maintained high detection performance across various fire scenarios. Therefore, our model can provide timely warnings and accurate detection services.

1. Introduction

Fires have long been a significant issue, causing economic losses and endangering both the people at the scene and the firefighters engaged in rescue efforts. According to the China National Fire Rescue Bureau [1,2,3,4,5], fire data have historically been collected on four indicators: the total number of fires, the total number of deaths, the number of fatalities in residential fires, and economic losses.
According to Figure 1a, the total number of domestic fires in 2022 grew to approximately 2.5 times the 2018 level, and the economic loss from fires in 2022 was nearly double that of 2018. These data show that, with the development of the economy, potential fire hazards have significantly increased, leading to more destructive fire incidents and greater property damage. Figure 1b shows that over half of all fatalities were caused by residential fires, demonstrating that these fires pose a significant danger to residents’ safety.
The causes of urban fire incidents [6] are relatively complex, including the improper use of fire and the disposal of flammable items. Moreover, the complex urban environment makes it challenging to detect fires in time and facilitates the rapid spread of fires. Therefore, urban fire accidents are characterized by concealed ignition points and rapid escalation.
Manual inspection is the earliest method of fire detection, involving the arrangement of personnel to inspect potential fire sites regularly. However, with the development of society, the limitations of manual inspection, such as long response times and difficulty in catching fires in their early stages, have become increasingly apparent, making it unsuitable for current needs.
The sensor-detection method [7] requires sensors installed in advance and detects fire accidents through smoke, flames, or high temperatures. This method has many limitations: a single sensor can only detect the presence of a fire, not the severity of the fire or the specific ignition point, and it may even produce false alarms. In addition, fire sensors have a restricted detection range, requiring proximity to the fire to activate an alarm. As a result, sensors may be damaged by the fire, leading to higher maintenance costs for the fire-detection system.
The process of urbanization has also led to the widespread deployment of CCTV (closed-circuit television) monitoring systems throughout streets and inside buildings, making it possible to monitor fire hazards at a low cost. Fire-detection methods can be built into monitoring systems [8], utilizing artificial intelligence based on computer vision technology to detect fires.
The initial intelligent fire-detection methods [9] utilized image-processing technology. These methods extract features from images and use machine learning algorithms to produce a model capable of detecting fires. Features such as color, edges, and texture can help to identify flame objects in an image. However, due to the complexity of real-world fire scenes, these models lack generalization ability and cannot handle natural fire scenes.
Advancements in deep learning have led to significant progress in object-detection algorithms based on CNNs (convolutional neural networks). Compared to traditional detection algorithms, CNNs extract richer object features and thus achieve significantly higher accuracy. CNN object-detection algorithms fall into two categories, two-stage and one-stage, which differ in their detection steps. Two-stage algorithms typically begin by generating region proposals through an RPN (region proposal network) before using a CNN to identify objects in the image and assign them to specific categories; examples include Mask R-CNN [10] and Faster R-CNN [11]. One-stage algorithms, such as SSD [12], RetinaNet [13], and YOLO [14], directly predict the category and location of objects with a CNN. Hence, the inference time of one-stage algorithms is shorter than that of two-stage algorithms.
Among the one-stage algorithms, the YOLO series stands out for its simple structure, high accuracy, and real-time detection capability. By comparing mainstream object-detection algorithms, Li [15] found that YOLOv3 achieves higher average precision than the alternatives and possesses robust detection capabilities. This experiment demonstrated that YOLO-series networks perform excellently in fire-detection tasks.
Fire accidents typically progress from the initial ignition to fire escalation. Therefore, fire accidents usually go through a phase that is easy to control but difficult to detect. In order to detect this phase as early as possible, researchers typically choose to reduce the parameters and the required inference time of the model while adding high-performance modules to enhance detection capabilities.
For example, Wang [16] introduced an extra very small object detection layer into the YOLOv5 network and used ghost convolutions to reduce the computational cost caused by this layer. The network achieved improved detection capabilities for small objects, such as faint flames, with a relatively low computational cost. Lin [17] introduced a coordinate attention mechanism into the network to focus the attention of the model on forest fires and improve the learning ability for key objects. Zhao [18] found that the backbone structure of YOLOv3 Darknet-53 has the disadvantage of imbalances in depth, width, and resolution. They replaced Darknet-53 with an improved EfficientNet and proposed an improved model based on YOLOv3 with faster speed, fewer parameters, and greater detection capabilities. Zhang et al. [19] proposed a ship fire-detection network based on the YOLOv8n network. The team introduced an extra object-detection layer and used spatial and channel reconstruction convolutions to reduce redundant features. These modules can improve the detection capabilities of small objects. However, the harsh hardware conditions in the ship environment constrained the detection performance of the extra detection layer. Chen [20] utilized ghost shuffle convolutions in the YOLOv7 network deployed on UAV platforms. Ghost shuffle convolution can improve the network’s expression capabilities and decrease the parameters. Chen also replaced the CIoU loss function with the highly fitting SIoU loss function, accelerating the convergence speed and improving the detection capabilities of the model without additional cost.
Many researchers have proposed excellent detection algorithms. However, the characteristics of urban fire accidents, like concealed ignition points and rapid escalation, still make it difficult for these algorithms to provide timely warnings. Current fire detection networks do well in the detection of strong flames and dense smoke, but their detection performance for faint flames and thin smoke is not reliable. Faint flames and thin smoke are often considered early warning signals of fire accidents because they usually appear in the initial stages. These objects should be given significant attention, but they are usually small and difficult to detect, and the complex urban environment makes timely detection even more challenging.
Hence, detecting these objects earlier without a huge computational burden on the network is a major issue for urban fire detection systems. This study has established an urban fire dataset and proposed an urban fire detection network based on YOLOv8s [21] named Fire-RPG, where RPG stands for Reparameterization and GhostV2. This network focuses on the early warning signals of fire accidents and aims at high accuracy without much computational cost.
This study makes the following contributions:
  • RepVGG replaces complex structures with a simple equivalent structure in the inference phase, improving accuracy without too much cost.
  • The bottleneck of GhostV2Net can save significant hardware resources while maintaining accuracy.
  • The Wise-IoUv3 loss function can reduce the negative impact of low-quality data in the fire dataset.
  • CBAM can highlight the objects from channel and spatial dimensions and enable the model to focus more on early warning signals of fire accidents.
The remaining sections are organized as follows. The modules, mechanisms, and their functions are explained in Section 2. The dataset and experimental settings are described in Section 3. Then, Section 4 introduces the experimental results and the evaluation metrics. Finally, Section 5 concludes the paper.

2. Method

2.1. YOLOv8

YOLOv8 stands out as an excellent object-detection network. It is well suited for various computer vision tasks, including object detection, instance segmentation, and image classification.
As depicted in Figure 2, YOLOv8 has three parts: backbone, neck, and head. The backbone and neck include C2f, CBS, and SPPF modules. These modules excel at learning key features in images, offering strong feature-representation capabilities. The CBS module is a basic convolution module. The C2f module uses the CSP [22] (cross-stage partial) technique to enhance the learning ability of the model and make the model lighter. The SPPF (spatial pyramid pooling fast) module is a faster variant of the SPP [23] (spatial pyramid pooling) module: it conducts pooling operations at different scales and combines the results into a fixed-size feature map. In SPPF, the parallel pooling layers of SPP are rearranged into a serial structure, and all pooling layers share the same kernel size, effectively reducing the computational cost.
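To make the SPPF idea concrete, the following PyTorch sketch chains three max-pooling layers with a shared 5 × 5 kernel and concatenates their outputs, as in common Ultralytics-style implementations; the Conv-BN-SiLU block and the hidden channel width are illustrative assumptions rather than the exact YOLOv8 configuration.

```python
# Minimal SPPF sketch (PyTorch). The serial 5x5 max pooling and the
# Conv-BN-SiLU block mirror common Ultralytics-style implementations;
# the hidden channel width is an illustrative assumption.
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Serial max pooling with a shared kernel size, then concatenation."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1)
        self.cv2 = ConvBNSiLU(c_hidden * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)              # receptive field of one 5x5 pooling
        y2 = self.pool(y1)             # equivalent to a 9x9 pooling
        y3 = self.pool(y2)             # equivalent to a 13x13 pooling
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

print(SPPF(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # torch.Size([1, 512, 20, 20])
```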
The neck is built on the FPN (feature pyramid network) and PAN (path aggregation network) structures, which generate feature maps of different scales containing both semantic and spatial information. These two structures enhance the detection capabilities of the model for objects of varying sizes. The head consists of three object-detection layers that perform inference on the outputs of the neck, locating objects and determining their categories.

2.2. Fire-RPG

YOLOv8 has three detection layers. When the input image size is 640 × 640, these layers can detect large objects above 32 × 32, medium objects above 16 × 16, and small objects above 8 × 8. However, in a real urban scene, the camera may be far from the ignition point, which means that flame and smoke objects smaller than the smallest detection threshold of YOLOv8 cannot be detected immediately. The photographic distance also affects the detection performance of the model: the farther the distance, the less reliable the network’s prediction. Thus, the detection capability for small objects is crucial for providing timely warnings of fire accidents.
To enable the network to detect these potential fire hazards early, we integrated an extra detection layer for very small objects above 4 × 4 into the model. Due to its narrow channel widths, the extra detection layer actually decreases the parameters. However, it also increases the computational cost, affecting the real-time detection capability that is a fundamental requirement of fire detection.
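The arithmetic behind these thresholds is simple: a head working at stride s on a 640 × 640 input produces a (640/s) × (640/s) grid, and each cell is responsible for objects of roughly s × s pixels and larger. The short sketch below just prints this mapping for the four heads (an illustration, not part of the model code).

```python
# Print the grid size and approximate minimum object size for each head,
# assuming a 640x640 input and the strides implied by the thresholds above.
input_size = 640
strides = {"P2 (extra, very small)": 4, "P3 (small)": 8,
           "P4 (medium)": 16, "P5 (large)": 32}

for name, s in strides.items():
    grid = input_size // s
    print(f"{name:>22}: stride {s:2d} -> {grid}x{grid} grid, objects from ~{s}x{s} px")
```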
To offset this cost, five computationally expensive C2f modules were replaced with GhostV2C2f modules, significantly reducing the parameters and computational cost. To extract feature information more efficiently, the standalone CBS modules were replaced with high-performance RepVGG blocks. Moreover, three CBAM blocks were deployed in the backbone, where they can exploit the abundant spatial and channel information to highlight medium and small objects. The structure of Fire-RPG is illustrated in Figure 3.

2.3. RepVGG Block

In order to provide a reliable detection service, current object-detection algorithms typically use many complex structures in their models. Complex convolutional structures have stronger feature learning and representation capabilities compared to simple convolutional structures, so they have higher accuracy. However, complex structures also have significant drawbacks, such as reduced inference speed and hardware unfriendliness.
For fire-warning tasks, complex convolutional structures can improve the accuracy of fire warnings and reduce error rates. However, the reduced inference speed and strict deployment conditions decrease the usefulness of the overall model. The introduction of complex convolutional structures poses a dilemma for fire detection networks. A high-performance and low-cost module is needed to further improve detection performance.
Ding [24] proposed a convolutional structure called RepVGG. This convolutional structure uses the structural reparameterization technique to transform complex network structures into simple equivalent network structures. During the training phase, RepVGG uses complex structures to participate in training, enabling the model to acquire learning capabilities beyond those of simple structures. In the inference phase, the complex network structures are fused into simple equivalent network structures, reducing the model’s inference time.
The structure reparameterization process of RepVGG includes two fusions, as shown in Figure 4.
The initial state consists of three branches, each containing a BN (batch normalization) layer. Two of the three branches additionally contain a 3 × 3 convolution layer and a 1 × 1 convolution layer, respectively. None of the convolution layers have a bias.
In the first fusion, the convolution layer and the BN layer on the same branch are fused into a 3 × 3 convolution layer with bias. Before this fusion, each branch needs a bias-free 3 × 3 convolution layer, so adjustments are necessary on the second branch, which has a 1 × 1 convolution layer, and on the third branch, which has no convolution layer.
For the second branch, the 1 × 1 convolution layer can be converted into a 3 × 3 convolution layer by padding its kernel with 0. The third branch constructs a 3 × 3 convolution, with the values of its convolutional kernel shown in Figure 5.
The parameters of a BN layer are μ, σ², β, and γ. A feature map x is input into the BN layer to produce a new feature map y using the following computation, where a small constant ε is introduced to prevent division by zero:
y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta
The convolution layer without bias has only kernel parameters W. The parameters of a BN layer can therefore be converted into a set of weight parameters and bias parameters: the weight parameters derived from the BN layer are multiplied with W to produce the fused convolution kernel parameters W_fusion, while the bias parameters derived from the BN layer serve as the fused convolution bias parameters B_fusion.
W_{fusion} = \frac{\gamma}{\sigma} \cdot W
B_{fusion} = \beta - \frac{\mu \gamma}{\sigma}
Now, each branch has a 3 × 3 convolution layer with bias. The final result y is the sum of the results of these convolution layers. Thus, a feature map x needs to go through all branches. The calculation process is as follows:
y = (W_1 \ast x + B_1) + (W_2 \ast x + B_2) + (W_3 \ast x + B_3) = (W_1 + W_2 + W_3) \ast x + (B_1 + B_2 + B_3)
The second fusion merges the three branches into a single 3 × 3 convolution layer with bias. This 3 × 3 convolution layer is the inference structure of RepVGG, with kernel parameters W_inference and bias parameters B_inference as follows:
W_{inference} = W_1 + W_2 + W_3
B_{inference} = B_1 + B_2 + B_3
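As a concrete illustration of the two fusions, the following PyTorch sketch folds each Conv-BN branch into a biased 3 × 3 convolution and sums the branches into a single layer. It assumes bias-free convolutions followed by BN on every branch, equal input and output channels, and no grouping, so the identity branch can be written as the kernel of Figure 5; it is a minimal sketch, not the authors’ implementation.

```python
# Sketch of the two fusions (PyTorch): fold each Conv-BN branch into a
# biased 3x3 convolution, then sum the branches into one layer. Assumes
# bias-free convolutions, equal input/output channels, and no grouping.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(weight, bn):
    """First fusion: fold a BN layer into the preceding bias-free conv."""
    std = torch.sqrt(bn.running_var + bn.eps)                  # sigma
    w_fused = weight * (bn.weight / std).reshape(-1, 1, 1, 1)  # (gamma / sigma) * W
    b_fused = bn.bias - bn.running_mean * bn.weight / std      # beta - mu * gamma / sigma
    return w_fused, b_fused

def reparameterize(conv3, bn3, conv1, bn1, bn_id, channels):
    """Second fusion: sum the three fused branches into a single 3x3 conv."""
    w3, b3 = fuse_conv_bn(conv3.weight, bn3)
    # 1x1 branch: pad its kernel with zeros to 3x3 before fusing.
    w1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)
    # Identity branch: the 3x3 kernel of Figure 5 (a 1 at the centre of each
    # channel's own map, zeros elsewhere), followed by its BN layer.
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    w0, b0 = fuse_conv_bn(w_id, bn_id)
    fused = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
    fused.weight.data = w3 + w1 + w0
    fused.bias.data = b3 + b1 + b0
    return fused

# Usage with randomly initialized layers (8 channels).
c = 8
fused = reparameterize(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c),
                       nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c),
                       nn.BatchNorm2d(c), c)
print(fused.weight.shape, fused.bias.shape)  # torch.Size([8, 8, 3, 3]) torch.Size([8])
```

The fused layer is exactly equivalent to the three-branch block only in evaluation mode, when BN uses its running statistics.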
The RepVGG block is a high-performance convolutional module without a high computational cost. It satisfies the requirements for fire detection. Thus, we replaced the CBS modules with RepVGG blocks, as shown in Figure 6.

2.4. GhostV2 Bottleneck

The extra detection layer significantly improved the detection performance but also increased the computational cost and inference time. Lightweight convolutional structures can reduce the parameters and computational cost, enhancing the deployment capabilities of the model. However, an issue that cannot be ignored is that lightweight convolutional structures may reduce the accuracy of the model.
Tang [25] optimized the bottleneck structure of GhostNet and introduced the DFC (decoupled fully connected) attention mechanism, proposing the latest version of GhostNet, GhostV2Net. The bottleneck of GhostV2Net has the advantages of low computational overhead and few parameters while maintaining high accuracy.
The GhostV2 bottleneck consists of two components: the DFC attention mechanism and the Ghost module, as shown in Figure 7.
The Ghost module produces feature maps through its cheap operations, as shown in Figure 8. A standard convolution generates only a small part of the feature maps, called intrinsic feature maps. Then, a depth-wise convolution uses the intrinsic feature maps to generate the rest of the feature maps, called redundant feature maps. The final output of the Ghost module is the concatenation of the intrinsic feature maps and the redundant feature maps.
Tang et al. found that fully connected layers can generate attention maps with a global receptive field. Compared to self-attention mechanisms, fully connected layers are easier to implement and have lower computational costs. They also discovered that decoupling the fully connected layers into horizontal and vertical directions can further reduce computational complexity while preserving the completeness of the spatial information. Therefore, the researchers implemented the DFC attention mechanism by using two decoupled, fully connected layers. The DFC attention mechanism has the advantages of a simple structure, low computational complexity, and hardware friendliness.
Although the cheap operations allow the Ghost module to significantly improve computational efficiency, these operations inevitably weaken the representation capabilities of the Ghost module due to the loss of spatial information.
Therefore, the DFC attention mechanism has been introduced to enhance the output features of the Ghost module. Unlike typical attention mechanisms, the DFC attention mechanism and the Ghost module use the same input and operate in parallel branches. To reduce the burden of parallel computation, the DFC attention mechanism employs downsampling in both vertical and horizontal directions, compressing the feature map to 25% of the original size.
The bottleneck structure of GhostV2Net uses the inverted bottleneck structure. The first Ghost module is used to expand features and increase the number of channels, while the second module reduces the number of channels. This inverted bottleneck naturally decouples the parameters and computational cost from representational capabilities. The structure of GhostV2C2f is shown in Figure 9.
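To make this structure concrete, the following PyTorch sketch assembles a simplified GhostV2 bottleneck from a Ghost module and a DFC attention branch that share the same input. The 1 × 5 / 5 × 1 decoupled kernels and the 2× downsampling follow the GhostNetV2 paper; the expansion ratio, the even split between intrinsic and redundant maps, and the residual shortcut are illustrative assumptions, not the exact configuration used in Fire-RPG.

```python
# Condensed sketch of the GhostV2 bottleneck pieces described above (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostModule(nn.Module):
    """Primary conv makes the intrinsic maps; a cheap depthwise conv makes the rest."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_intrinsic = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_intrinsic, 1, bias=False),
            nn.BatchNorm2d(c_intrinsic), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(c_intrinsic, c_out - c_intrinsic, 3, padding=1,
                      groups=c_intrinsic, bias=False),
            nn.BatchNorm2d(c_out - c_intrinsic), nn.ReLU(inplace=True))

    def forward(self, x):
        intrinsic = self.primary(x)
        redundant = self.cheap(intrinsic)
        return torch.cat([intrinsic, redundant], dim=1)

class DFCAttention(nn.Module):
    """Decoupled fully connected attention on a 2x-downsampled feature map."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.horizontal = nn.Conv2d(c_out, c_out, (1, 5), padding=(0, 2),
                                    groups=c_out, bias=False)
        self.vertical = nn.Conv2d(c_out, c_out, (5, 1), padding=(2, 0),
                                  groups=c_out, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, 2)                     # compress to 25% of the area
        a = self.vertical(self.horizontal(self.reduce(a)))
        a = torch.sigmoid(a)
        return F.interpolate(a, size=(h, w), mode="nearest")

class GhostV2Bottleneck(nn.Module):
    """Inverted bottleneck: expand, weight by DFC attention, then project back."""
    def __init__(self, c, expand=2):
        super().__init__()
        c_mid = c * expand
        self.ghost1 = GhostModule(c, c_mid)
        self.attn = DFCAttention(c, c_mid)
        self.ghost2 = GhostModule(c_mid, c)

    def forward(self, x):
        out = self.ghost1(x) * self.attn(x)        # Ghost module and DFC share the input
        return x + self.ghost2(out)                # residual shortcut

print(GhostV2Bottleneck(32)(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 32, 40, 40])
```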
While the C2f module can effectively extract features from images, it brings about problems of high computational cost and slow inference speed when the model depth is increased or its width is expanded.
We selected the five C2f modules with the largest parameter counts and replaced them with GhostV2C2f modules, which contain GhostV2 bottlenecks in their structure. To avoid the overfitting caused by the continuous use of attention mechanisms, two types of GhostV2 bottlenecks were used in the GhostV2C2f modules. As a result, the GhostV2C2f modules decreased the parameters and computational cost while maintaining high accuracy. The introduction of GhostV2 bottlenecks significantly reduced the hardware resources required by the fire-detection model, providing ample space for further improvements.

2.5. Wise-IoUv3

Annotation work is generally carried out manually, so the annotation principles of different annotators can affect the distribution of data in the dataset. It is inevitable that annotators will include large amounts of irrelevant background information or miss some parts of the objects. This issue is particularly obvious in fire datasets because there are often blurred boundaries between objects and the background in real-world fire accident images. For example, the boundaries between thin smoke and the background are often indistinct. A single object may have different annotations, as shown in Figure 10.
Regardless of whether the data omit some object information or include too much irrelevant background information, these low-quality data can have a detrimental effect on the model.
Many existing studies on IoU loss functions focus on fitting capabilities, like EIoU and SIoU. However, IoU loss functions with high fitting capabilities may decrease the model’s accuracy rather than improve it when the dataset quality is unreliable. Therefore, fire-detection tasks require an IoU loss function that effectively reduces the harmful impact of low-quality data.
Wise-IoU [26] (WIoU) reduces the impact of both high-quality and low-quality anchor boxes, allowing the model to focus on normal-quality anchor boxes. WIoU has three versions: WIoUv1 combines the IoU loss with a distance attention term, while WIoUv2 and WIoUv3 build on WIoUv1 by adding different focusing mechanisms. The WIoUv3 loss function is as follows:
L_{WIoUv3} = L_{WIoUv1} \cdot r
The first factor of WIoUv3 is the WIoUv1 loss function L_WIoUv1. It includes the IoU loss L_IoU and the distance attention term R_WIoU. R_WIoU increases the IoU loss of normal-quality anchor boxes, while L_IoU reduces the distance attention of high-quality anchor boxes.
The formulation of WIoUv1 is as follows:
L_{WIoUv1} = L_{IoU} \cdot R_{WIoU}
L_{IoU} = 1 - IoU
R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)
In the distance attention term R_WIoU, x and y are the center coordinates of the anchor box, while x_gt and y_gt are the center coordinates of the ground-truth bounding box. W_g and H_g are the width and height of the smallest enclosing box that contains both the anchor box and the ground-truth bounding box; they are detached from the computational graph to prevent them from interfering with convergence.
The second factor of WIoUv3 is the non-monotonic gradient gain coefficient r, which is determined by the outlier degree β and the hyperparameters α and δ. The formula for r is as follows:
r = \frac{\beta}{\delta \alpha^{\beta - \delta}}
The outlier degree β is the ratio of the detached IoU loss L_IoU* to its exponential running average L̄_IoU, which is updated with momentum after the IoU loss of each batch is calculated. The formula for β is as follows:
\beta = \frac{L_{IoU}^{*}}{\overline{L}_{IoU}}
Low-quality anchor boxes have a high IoU loss. Therefore, a high outlier degree is assigned to low-quality anchor boxes, while high-quality anchor boxes obtain a low outlier degree. The gradient gain coefficient r can serve as a penalty term to adjust the loss function value for anchor boxes of different qualities.
Our dataset also had issues with low-quality data, so we replaced the CIoU loss function with the WIoUv3 loss function, setting the hyperparameters α and δ to 1.9 and 3, respectively. The resulting relationship between gradient gain and outlier degree is shown in Figure 11.
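For reference, a minimal sketch of this loss is given below (PyTorch), with α = 1.9 and δ = 3 as above. The (x1, y1, x2, y2) box format, the momentum value, and keeping the running average as a module-level variable are illustrative assumptions; in practice the running average of the IoU loss would live inside the loss class.

```python
# Minimal WIoUv3 sketch following the formulas above.
import torch

alpha, delta = 1.9, 3.0
iou_loss_running_mean = torch.tensor(1.0)   # exponential running average of L_IoU
momentum = 0.01

def wiou_v3(pred, target):
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
    global iou_loss_running_mean
    # Intersection over union.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # Distance attention R_WIoU; the enclosing-box size is detached.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = (torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])).detach()
    hg = (torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])).detach()
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / (wg ** 2 + hg ** 2 + 1e-7))
    l_wiou_v1 = l_iou * r_wiou

    # Outlier degree and non-monotonic gradient gain.
    beta = l_iou.detach() / iou_loss_running_mean
    r = beta / (delta * alpha ** (beta - delta))
    iou_loss_running_mean = (1 - momentum) * iou_loss_running_mean + momentum * l_iou.detach().mean()
    return (l_wiou_v1 * r).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
target = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(wiou_v3(pred, target))
```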

2.6. CBAM

Attention mechanisms play an important role in enhancing the learning capabilities of models. They highlight key features in the original feature maps and ultimately generate attention maps. These attention maps are usually multiplied with the original feature maps as the weight factor of feature information.
In fire scenes, small objects can be enhanced through attention mechanisms. However, each attention mechanism works in a different manner. For instance, the SE [27] (Squeeze-and-Excitation) attention mechanism focuses on channel information, using fully connected layers to generate attention maps. However, SE attention does not consider spatial information, limiting the performance of the SE. The CA [28] (Coordinate Attention) mechanism takes both the channel information and the spatial information into account. However, its computational process is complex and requires permuting operations on the feature maps, which actually prolongs the inference time and significantly reduces the network detection speed at high resolutions.
CBAM (Convolutional Block Attention Module) [29] has a simple structure, low computational cost, and plug-and-play characteristics. It can be seamlessly and cost-effectively integrated into various CNN models. It captures the interdependencies between the feature information in fire images.
CBAM sequentially applies channel and spatial attention mechanisms. The channel attention focuses on “what” is meaningful, while the spatial attention mechanism focuses on “where” a meaningful part is located. These two attention mechanisms are complementary to each other. The structure of CBAM is shown in Figure 12.
The channel attention mechanism uses average pooling and max pooling layers to squeeze a feature map of dimensions C × H × W into feature vectors of size C × 1 × 1; each pooling layer produces its vector independently. A shared network extracts channel information from these vectors, and the channel attention map is obtained by applying the sigmoid function to the sum of the two results. Similarly, the spatial attention mechanism squeezes the feature map with two pooling layers along the channel dimension, generating two feature maps of size 1 × H × W. These two maps are concatenated and fed into a convolution layer to produce the spatial attention map, which is also processed by the sigmoid function. The attention mechanisms in CBAM are shown in Figure 13.
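The two sub-modules can be sketched in a few lines of PyTorch, as below. The reduction ratio of 16 in the shared network and the 7 × 7 spatial kernel follow the original CBAM paper and are assumptions here, not values reported for Fire-RPG.

```python
# Sketch of the CBAM channel and spatial attention sub-modules (PyTorch).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared = nn.Sequential(                 # shared MLP for both pooled vectors
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.shared(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.shared(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)               # C x 1 x 1 channel attention map

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)     # 1 x H x W
        mx, _ = torch.max(x, dim=1, keepdim=True)    # 1 x H x W
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Apply channel attention first, then spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```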
Although many powerful and novel attention mechanisms have been proposed, we found that CBAM is still effective. This attention mechanism can generate dual-dimensional attention maps at a low cost. Therefore, we integrated CBAMs into the model to enhance the learning capabilities for medium and small objects.

3. Dataset and Experimental Settings

We established a dataset of 7052 images. There were two sources of images in our dataset: fire videos and pictures from the Internet. The frames were extracted from the fire accident videos that recorded the entire process from fire accident occurrence to escalation. These fire video frames contained the features of early stage fire accidents and formed a major part of our dataset. In order to ensure the generalization ability of the model, we also randomly obtained images from the Internet by using a web crawler. Several images from the dataset are depicted in Figure 14.
Labeling software was utilized to annotate the categories of objects in the images: fire and smoke. The labels were saved in txt format. The dataset was split randomly into training, validation, and test sets at an 8:1:1 ratio, as shown in Table 1.
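As a hypothetical illustration of the 8:1:1 split (the authors’ actual directory layout and file names are not specified), a few lines of Python suffice:

```python
# Hypothetical random 8:1:1 split for a folder of images with YOLO-style
# txt labels; paths and file layout are illustrative assumptions.
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in files))
```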
According to the severity of the fire accident in the image, fire objects were classified into faint flames and strong flames. Faint flames are usually small-sized objects, but in some images taken from a close distance, faint flames appear as medium-sized objects. A similar issue also occurs in the image of strong flames due to the photographic distance.
Smoke objects are classified according to the degree to which they blur the background. Smoke that completely obscures the background is classified as dense smoke, while smoke that allows some background information to be seen is classified as thin smoke. Due to the diffusion of smoke and other factors within the scene, smoke objects are usually medium-sized or large-sized.
We collected statistics on the dataset for the numbers of faint flames, strong flames, thin smoke, and dense smoke, as shown in Table 2.
The bounding box information in the dataset was visualized. The distributions of the center and sizes of the bounding boxes are displayed in Figure 15.
In Figure 15a, the x and y coordinates represent the center position of the bounding box relative to the bottom left corner of the image, where (0,0) represents the bottom left corner of the image and (0.5,0.5) represents the center of the image. Likewise, (1,1) represents the top right corner of the image. From this, it is evident that the centers of the bounding boxes tend to cluster in the middle of the image, which corresponds to the reality that photographers usually center their lenses on an object. In Figure 15b, the height and width indicate the size of the bounding box relative to the image. The distribution of the bounding box sizes is concentrated in the lower left corner of the coordinate system, while the rest is spread evenly throughout. This indicates that a large number of objects are small, and the rest of the objects are different in size. Overall, the objects in our dataset are more concentrated in the middle of the image, and most are small objects, which aligns with the natural situation of urban early stage fire accidents.
The experimental environment and hyperparameters are displayed in Table 3 and Table 4 to facilitate the replication of our experimental results.

4. Results and Evaluation

This section provides a detailed introduction to the evaluation metrics used and the experiments implemented in our study.

4.1. Evaluation Metrics

In this study, we utilize classical object-detection metrics, including precision (P), recall (R), mean average precision (mAP), floating-point operations (FLOPs), and parameters. The first three metrics were used to evaluate the detection performance of the model, while FLOPs and parameters assessed the time and space complexity, respectively.
Precision measures the proportion of correctly predicted positive samples among all predicted positive samples. TP (true positives) are the samples correctly predicted as positive, while FP (false positives) are the samples incorrectly predicted as positive. The number of predicted positive samples is the sum of TP and FP. The formula for precision is as follows:
P = \frac{TP}{TP + FP}
Recall measures the proportion of ground-truth positive samples that are correctly predicted. FN (false negatives) are the samples incorrectly predicted as negative, so they are actually positive samples. All ground-truth positive samples are the sum of TP and FN, and the formula for recall is as follows:
R = \frac{TP}{TP + FN}
AP (average precision) is a metric that measures the detection performance for a specific category. A PR curve (precision–recall curve) can be plotted by using precision and recall values across various confidence thresholds. AP is equal to the area underneath the PR curve. The formula for AP is as follows:
AP = \int_{0}^{1} P(R) \, dR
mAP is the mean value of the AP for all categories in the dataset. When the number of categories is N, the formula for mAP is as follows:
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
In actual experiments, mAP50 and mAP50-95 are usually calculated to assess the comprehensive detection performance of the model. mAP50 is the mAP value at an IoU threshold of 50% and is commonly reported simply as mAP. The IoU threshold is used to judge the accuracy of localization: if the IoU between a predicted bounding box and the ground truth exceeds the threshold, the localization is considered accurate.
mAP50-95 calculates the average of the mAP values at the IoU thresholds ranging from 50% to 95% with a stride of 5%, providing a broader assessment of the object-detection algorithm performance.
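The sketch below illustrates these definitions numerically (Python/NumPy): detections are sorted by confidence, precision and recall are accumulated, and AP is taken as the area under the resulting precision-recall curve at a single IoU threshold. The scores and match flags are made-up values.

```python
# Numeric illustration of precision, recall, and AP at one IoU threshold.
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    order = np.argsort(-scores)                        # sort detections by confidence
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(~is_true_positive[order])
    recall = tp / num_ground_truth
    precision = tp / (tp + fp)
    # Make precision monotonically decreasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.trapz(precision, recall)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
matched = np.array([True, True, False, True, False])   # IoU above the threshold?
print(f"AP at this IoU threshold: {average_precision(scores, matched, num_ground_truth=4):.3f}")
```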

4.2. Comparison of Models

The hardware devices commonly used in urban environments do not reach laboratory-level performance. However, this does not mean that detection performance must be sacrificed purely for better deployment capabilities; urban systems can select a model that balances parameters, computational cost, and detection performance well. With this in mind, this study adopted YOLOv8s as the baseline model rather than a smaller YOLOv8 variant, as shown in Table 5 and Figure 16.
We conducted comparative experiments between Fire-RPG and various mainstream models under the same experimental settings. Besides YOLOv8s and Fire-RPG, the models involved included Faster R-CNN, SSD, RetinaNet, and YOLOv10s [30]. Compared to YOLOv10s, the latest YOLO-series algorithm, YOLOv8s has more parameters and a higher computational cost, but it also achieves higher accuracy. It is evident that Fire-RPG achieved the highest mAP with relatively low parameters and FLOPs.

4.3. Comparison of IoU Loss Functions

This study used YOLOv8s to compare the influence of WIoUv3 with that of other loss functions. Changing the loss function does not affect the parameters or FLOPs, so these two metrics are not considered here.
As shown in Table 6, IoU loss functions with high fitting capabilities, such as SIoU and EIoU, negatively influence detection performance, whereas simpler, lower-fitting IoU loss functions such as GIoU and DIoU achieve high mAP scores. This shows that the influence of data quality cannot be ignored. WIoUv3 maintains high detection performance, with its mAP reaching 80.0%, demonstrating that the focusing mechanism based on the outlier degree can effectively reduce the harmful impact of low-quality examples.

4.4. Comparison of Attention Mechanisms

We also compared the detection performance between CBAM and other attention mechanisms. The attention mechanisms usually function across the channel or spatial dimensions. The same deployment location affects the performance of different attention mechanisms to different degrees, so we deployed different attention mechanisms at different locations. However, their deployment locations were still confined to the backbone in order to keep the conditions as consistent as possible.
As shown in Table 7, the results indicate that CBAM provides an excellent enhancement in detection performance, with an mAP50 of 80.0% and an mAP50-95 of 48.0%. Although CBAM also has the largest parameter increase, this increase is acceptable compared to that of non-attention modules.

4.5. Ablation Experiment

An ablation experiment was conducted to validate the necessity of each module introduced. The enhancement in Fire-RPG is segmented into five stages. The impact of each module is evaluated by comparing the metrics between the current stage and the preceding stages. The results are shown in Table 8.
In stage one, the CIoU loss function is replaced with WIoUv3. This change does not affect the parameters or FLOPs; recall and mAP50 increase by 0.5% and 0.9%, respectively, while precision and mAP50-95 decrease. In stage two, the extra object-detection layer is added, reducing the parameters but increasing the FLOPs; the detection performance metrics rise significantly, except for a 2.3% drop in precision. In stage three, the CBS modules are replaced with RepVGG blocks, bringing a slight decrease in parameters and FLOPs; precision rises by 3%, although it is only 0.1% higher than that of YOLOv8s. Stage four introduces the GhostV2 bottleneck, greatly reducing the parameters and FLOPs; although mAP50 decreases by 0.2%, mAP50-95 increases by 0.4%. In stage five, the introduction of CBAM slightly increases the parameters and FLOPs, and mAP50 and mAP50-95 improve by 0.6% and 0.5%, respectively.

4.6. Detection Results

The detection capabilities of the Fire-RPG model can be explicitly shown by comparing the detection results. The results are shown in Figure 17 and Figure 18.
For small and difficult-to-detect objects in images, such as faint flames and thin smoke, Fire-RPG exhibits strong accuracy. It captures extremely small flames and thin smoke sharply in images and assigns them a high confidence value. In contrast, YOLOv8s not only misses these objects but also provides lower confidence scores. This demonstrates that Fire-RPG has better detection capabilities for the early warning signals of fire accidents.
For large-scale and easily detectable objects, such as strong flames and dense smoke, the detection results for Fire-RPG and YOLOv8s are similar. Both models can accurately locate the objects in the images and effectively use bounding boxes to enclose them.

4.7. Detection Performance in Different Scenes

To demonstrate the generalization ability of the model, three extra datasets were used to display the performance of our model in different scenes. We compared the detection performance of YOLOv8s and Fire-RPG to illustrate the effectiveness of our model, as shown in Figure 19.
The D-Fire dataset focuses on open wilderness scenes, with data primarily sourced from long-term surveillance cameras monitoring the wilderness. This dataset annotates two objects: smoke and flame. To avoid excessive time expenditure, we removed the images without annotations from this dataset, after which it contained 11,689 images.
The ForestFire dataset focuses on forest scenes. To improve the quality of the dataset, the dataset creators used data-augmentation methods such as flipping, rotation, and HSV transformation. This dataset only annotated flame objects, and all images were resized to 640 × 640.
The DFS dataset contains fire images from various scenes, with all images collected from real scenarios and annotated in a standardized manner. Considering the potential for false-positives caused by objects similar to flames or smoke, the dataset creators annotated these objects as “other.” Thus, there are three labels in this dataset: flame, smoke, and other.
We collected the performance indicators on the three datasets and display the results in Table 9.
The results show that Fire-RPG surpasses YOLOv8s in detection performance in the above datasets. Specifically, in terms of mAP50, Fire-RPG achieved an improvement of over 1% across all three datasets. In terms of mAP50-95, Fire-RPG achieved a 0.8% improvement on the DFS dataset. Hence, Fire-RPG demonstrates high generalization ability and can adapt to various scenarios such as forest fires and wilderness fires.

5. Conclusions

The initial stage of fire accidents includes objects such as faint flames and thin smoke, which are considered early warning signals of fire. Common CNN models fail to provide reliable early warnings due to their poor capacity to detect these signals. We established a fire dataset with numerous images of early stage fire accidents and proposed a fire detection network based on YOLOv8, named Fire-RPG.
In this model, we introduced an extra detection layer for smaller objects, enabling the model to detect smaller flame objects. Next, the lightweight CNN structure, GhostV2 bottleneck, was used to reduce the inference time and optimize the model structure. To address the issue of low-quality data, the Wise-IoUv3 loss function was used during the training phase. Then, we introduced the RepVGG block to improve the network’s learning capabilities. Finally, we employed the CBAM attention mechanism to highlight the key objects in feature maps, especially small- and medium-sized smoke fire objects.
In comparative experiments with mainstream CNN models, Fire-RPG achieved the best detection performance, with an mAP of 81.3%, while keeping the parameters and FLOPs at a low level. In addition, Fire-RPG showed significant improvement over YOLOv8s in other fire scenarios, proving the effectiveness of our improvements. Fire-RPG is therefore suitable as a detection model for urban fire alarm systems: it can utilize the large number of cameras installed in urban systems to capture images and detect faint flames and thin smoke. With more powerful small-object detection capabilities, Fire-RPG can detect fire accidents in their early stages and respond in time, providing more response time for firefighters and effectively protecting people’s lives and social property.

Author Contributions

Conceptualization, X.L. and Y.L.; methodology, X.L.; software, X.L.; validation, X.L. and Y.L.; formal analysis, Y.L.; investigation, X.L.; resources, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, Y.L.; visualization, X.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Statistics on Fire and Police Situation across the Country in 2018. Available online: https://www.119.gov.cn/gk/sjtj/2022/54.shtml (accessed on 17 April 2024).
  2. In 2019, 233,000 Fires Were Reported Nationwide. Available online: https://www.119.gov.cn/gk/sjtj/2022/386.shtml (accessed on 17 April 2024).
  3. National Fire and Police Response Situation in 2020. Available online: https://www.119.gov.cn/gk/sjtj/2022/13721.shtml (accessed on 17 April 2024).
  4. Firefighting Calls Hit a New High in 2021, with 745,000 Fires Put Out. Available online: https://www.119.gov.cn/gk/sjtj/2022/26442.shtml (accessed on 17 April 2024).
  5. National Police and Fire Situation in 2022. Available online: https://www.119.gov.cn/qmxfxw/xfyw/2023/36210.shtml (accessed on 17 April 2024).
  6. Ten Types of Fire Sources That Often Cause Fires. Available online: https://www.gov.cn/ztzl/djfh/content_436344.htm (accessed on 17 April 2024).
  7. Gaur, A.; Singh, A.; Kumar, A.; Kulkarni, K.S.; Lala, S.; Kapoor, K.; Srivastava, V.; Kumar, A.; Mukhopadhyay, S.C. Fire sensing technologies: A review. IEEE Sens. J. 2019, 19, 3191–3202. [Google Scholar] [CrossRef]
  8. Lestari, D.P.; Kosasih, R.; Handhika, T.; Sari, I.; Fahrurozi, A. Fire hotspots detection system on CCTV videos using you only look once (YOLO) method and tiny YOLO model for high buildings evacuation. In Proceedings of the 2019 2nd International Conference of Computer and Informatics Engineering (IC2IE), Banyuwangi, Indonesia, 10–11 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 87–92. [Google Scholar]
  9. Ko, B.C.; Cheong, K.H.; Nam, J.Y. Fire detection based on vision sensor and support vector machines. Fire Saf. J. 2009, 44, 322–329. [Google Scholar] [CrossRef]
  10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016. Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  13. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. Li, P.; Zhao, W. Image fire detection algorithms based on convolutional neural networks. Case Stud. Therm. Eng. 2020, 19, 100625. [Google Scholar] [CrossRef]
  16. Wang, X.; Cai, L.; Zhou, S.; Jin, Y.; Tang, L.; Zhao, Y. Fire Safety Detection Based on CAGSA-YOLO Network. Fire 2023, 6, 297. [Google Scholar] [CrossRef]
  17. Lin, J.; Lin, H.; Wang, F. A Semi-Supervised Method for Real-Time Forest Fire Detection Algorithm Based on Adaptively Spatial Feature Fusion. Forests 2023, 14, 361. [Google Scholar] [CrossRef]
  18. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A Small Target Object Detection Method for Fire Inspection. Sustainability 2022, 14, 4930. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Tan, L.; Tiong, R.L.K. Ship-Fire Net: An Improved YOLOv8 Algorithm for Ship Fire Detection. Sensors 2024, 24, 727. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, G.; Cheng, R.; Lin, X.; Jiao, W.; Bai, D.; Lin, H. LMDFS: A Lightweight Model for Detecting Forest Fire Smoke in UAV Images Based on YOLOv7. Remote Sens. 2023, 15, 3790. [Google Scholar] [CrossRef]
  21. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 April 2024).
  22. Wang, C.H.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capabilities of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  24. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  25. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  26. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  29. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  31. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  32. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  33. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  34. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  35. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2235–2239. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  37. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  38. D-Fire. Available online: https://github.com/gaiasd/DFireDataset (accessed on 24 May 2024).
  39. ForestFire. Available online: https://universe.roboflow.com/smokedetection-lfdtr/forestfire-aepov (accessed on 24 May 2024).
  40. DFS. Available online: https://github.com/siyuanwu/DFS-FIRE-SMOKE-Dataset (accessed on 24 May 2024).
Figure 1. National fire situation in recent years. (a) Number of fires and economic losses; (b) fire death toll.
Figure 2. YOLOv8.
Figure 3. Structure of Fire-RPG.
Figure 4. Structural reparameterization process.
Figure 5. The convolutional kernel in the third branch.
Figure 6. RepVGG block.
Figure 7. The DFC attention mechanism and Ghost module.
Figure 8. Cheap operation of the Ghost module.
Figure 9. Structure of GhostV2C2f.
Figure 10. Two annotation principles. (a) The bounding box encloses the main part of the object but misses the other parts; (b) the bounding box encloses the whole object but also includes too much irrelevant information.
Figure 11. Relationship between gradient gain and outlier degree.
Figure 12. Structure of CBAM.
Figure 13. Attention mechanisms in CBAM.
Figure 14. Examples in the dataset. (a) Frames extracted from the videos; (b) images obtained from the Internet.
Figure 15. Distribution of bounding box information. (a) Distribution of bounding box center position; (b) distribution of bounding box size.
Figure 16. Comparison of mAP: (a) mAP50; (b) mAP50-95.
Figure 17. Results of the detection of faint flames and thin smoke. (a) YOLOv8; (b) Fire-RPG.
Figure 18. Results for the detection of strong flames and dense smoke. (a) YOLOv8; (b) Fire-RPG.
Figure 19. Detection results for different datasets. (a) D-Fire and YOLOv8; (b) D-Fire and Fire-RPG; (c) ForestFire and YOLOv8; (d) ForestFire and Fire-RPG; (e) DFS and YOLOv8; (f) DFS and Fire-RPG.
Table 1. Dataset information.
Dataset | Train | Val | Test | Total
Image | 5642 | 705 | 705 | 7052
Table 2. Object information.
Object | Train | Val | Test | Total
Faint flame | 5424 | 750 | 654 | 6828
Strong flame | 3599 | 362 | 323 | 4284
Thin smoke | 1436 | 192 | 141 | 1769
Dense smoke | 3437 | 424 | 468 | 4329
Table 3. Environment settings.
Operating System | CPU | GPU | Programming Language | Deep Learning Framework
Windows 10 | Intel i7-13700KF | RTX 3070Ti | Python 3.11 | PyTorch 2.0.1
Table 4. Hyperparameter settings.
Batch | Epochs | Patience | Image Size | Optimizer
8 | 300 | 50 | 640 | SGD
Table 5. Comparison of models.
Model | mAP50 | mAP50-95 | Parameters | GFLOPs
Faster RCNN | 73.0% | 46.2% | 41755286 | 134.4
SSD | 71.6% | 44.3% | 35641314 | 34.9
RetinaNet | 77.7% | 46.6% | 34014999 | 151.5
YOLOv10s [30] | 78.9% | 46.1% | 8067900 | 24.8
YOLOv8s | 79.1% | 47.7% | 11136374 | 28.6
Fire-RPG | 81.3% | 48.9% | 7808446 | 32.4
Table 6. Comparison of IoU loss functions.
Model | P | R | mAP50 | mAP50-95
YOLOv8s-CIoU | 84.8% | 70.9% | 79.1% | 47.7%
YOLOv8s-GIoU [31] | 85.1% | 70.4% | 79.6% | 47.6%
YOLOv8s-DIoU [32] | 82.3% | 71.6% | 79.8% | 47.5%
YOLOv8s-SIoU [33] | 82.6% | 71.4% | 79.1% | 47.2%
YOLOv8s-EIoU [34] | 82.5% | 71.9% | 79.2% | 47.4%
YOLOv8s-WIoUv1 | 85.1% | 70.3% | 79.0% | 47.0%
YOLOv8s-WIoUv2 | 84.7% | 70.7% | 79.6% | 47.6%
YOLOv8s-WIoUv3 | 84.2% | 71.4% | 80.0% | 47.3%
Table 7. Comparison of attention mechanisms.
Model | P | R | mAP50 | mAP50-95 | Parameters | GFLOPs
YOLOv8s | 84.8% | 70.9% | 79.1% | 47.7% | 11136374 | 28.6
YOLOv8s-SE | 82.7% | 71.7% | 79.6% | 47.9% | 11309526 | 28.7
YOLOv8s-SA [35] | 84.6% | 70.4% | 79.4% | 48.0% | 11136542 | 28.7
YOLOv8s-ECA [36] | 82.1% | 71.1% | 79.2% | 47.4% | 11136383 | 28.7
YOLOv8s-CA | 83.5% | 71.5% | 79.8% | 48.0% | 11142270 | 28.7
YOLOv8s-SimAM [37] | 82.0% | 71.1% | 79.3% | 47.8% | 11136374 | 28.6
YOLOv8s-CBAM | 83.2% | 72.9% | 80.0% | 48.0% | 11223132 | 28.7
Table 8. Results of ablation experiments.
Model | P | R | mAP50 | mAP50-95 | Parameters | GFLOPs
YOLOv8s | 84.8% | 70.9% | 79.1% | 47.7% | 11136374 | 28.6
YOLOv8s-WIoUv3 | 84.2% | 71.4% | 80.0% | 47.3% | 11136374 | 28.6
YOLOv8s-WIoUv3-Detect Layer | 81.9% | 72.2% | 80.6% | 48.1% | 10637496 | 37.0
YOLOv8s-WIoUv3-Detect Layer-RepVGG | 84.9% | 72.9% | 80.9% | 48.0% | 10626968 | 36.6
YOLOv8s-WIoUv3-Detect Layer-RepVGG-GhostV2C2f | 85.5% | 73.2% | 80.7% | 48.4% | 7380504 | 31.3
Fire-RPG | 85.9% | 72.1% | 81.3% | 48.9% | 7808446 | 32.4
Table 9. Results of the different scenes.
Dataset | Images | Model | P | R | mAP50 | mAP50-95
D-Fire [38] | 11689 | Fire-RPG | 78.7% | 71.7% | 79.2% | 46.1%
D-Fire [38] | 11689 | YOLOv8s | 76.6% | 72.9% | 78.0% | 45.6%
ForestFire [39] | 7763 | Fire-RPG | 63.2% | 60.0% | 63.1% | 27.8%
ForestFire [39] | 7763 | YOLOv8s | 63.0% | 58.5% | 61.8% | 27.5%
DFS [40] | 9462 | Fire-RPG | 61.2% | 49.2% | 54.9% | 28.1%
DFS [40] | 9462 | YOLOv8s | 60.7% | 48.8% | 53.8% | 27.3%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
