Article

Open-Pit Mining Area Extraction Using Multispectral Remote Sensing Images: A Deep Learning Extraction Method Based on Transformer

1 Natural Resources Survey and Monitoring Research Centre, Chinese Academy of Surveying and Mapping, Beijing 100830, China
2 School of Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
3 Guangxi Zhuang Autonomous Region Institute of Natural Resources Remote Sensing, Nanning 530201, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6384; https://doi.org/10.3390/app14146384
Submission received: 12 June 2024 / Revised: 18 July 2024 / Accepted: 19 July 2024 / Published: 22 July 2024

Abstract

In the era of remote sensing big data, the intelligent interpretation of remote sensing images is a key technology for mining the value of remote sensing big data and promoting a number of major applications, mainly including land cover classification and extraction. Among these, the rapid extraction of open-pit mining areas plays a vital role in current practices for refined mineral resources development and management and ecological–environmental protection in China. However, existing methods are not accurate enough in classification, not fine enough in boundary extraction, and poorly adapted to multiple scales. To address these issues, we propose a new Transformer-based semantic segmentation model, Segmentation for Mine (SegMine), which consists of a Vision Transformer-based encoder and a lightweight attention mask decoder. The experimental results show that SegMine enhances the network's ability to obtain local spatial detail information, mitigates the problems of disappearing small-scale object features and insufficient information expression, and better preserves the boundary details of open-pit mining areas. Using the metrics of mIoU, precision, recall, and Dice, experimental areas were selected for comparative analysis, and the results show that the new method is significantly better than six other major Transformer variants.

1. Introduction

For a long time, due to the relative scarcity of remote sensing data, intelligent classification and element extraction methods have remained immature, meaning that they cannot realize the intelligent application of remote sensing on a large scale and at a deep level. With the growing availability of remote sensing images, remote sensing interpretation has begun to play an important role in environmental monitoring [1], resource investigation [2,3], urban planning [4], etc. As a subfield of artificial intelligence, machine learning is expected to help achieve accurate and efficient remote sensing image classification and is therefore considered a reliable method [5]. With the breakthroughs in deep learning (e.g., ResNet, attention mechanisms, Transformers, etc.), new opportunities have arisen for the industrial application of the intelligent interpretation of remote sensing. It is well known that the intelligent interpretation of land cover based on high-resolution remote sensing images, especially classification with accurate types and clear boundaries, is a worldwide challenge. Compared with artificial facilities, natural features are characterized by irregular geometry, complex structure, multiple size scales, and large texture differences, which make it difficult to realize high-precision classification and extraction [6,7,8,9]. In recent years, deep learning [10,11,12] has been regarded as an important means of replacing manual interpretation, which presents broad prospects for the automatic interpretation, analysis, and content understanding of utility-level remote sensing. If automatic interpretation can be truly realized, combined with high-resolution remote sensing, the current status and spatial distribution of land cover at the patch level can be quickly and accurately obtained, which is of great significance for applications such as the detailed management of natural resources and ecological–environmental restoration.
Intelligent interpretation of all kinds of land cover elements from remote sensing images at one time is undoubtedly the most ideal solution. If the conditions are ripe, all of the geographical elements in a region can be systematically and comprehensively classified and extracted, and the one-time processing can meet the practical application needs of various industries. However, current modeling methods are still immature, the computational power of computers is still insufficient, and the labeled samples are not rich enough. Therefore, the intelligent interpretation of all the elements of remote sensing images is still far from meeting the needs of refined practical applications. At the same time, with the deepening of the digital economy and the construction of digital China, the demand for industry applications is growing, and each industry is most concerned about its own management objects, which are often of one or a few types. For example, the Ministry of Water Resources is concerned about surface water, the Ministry of Agriculture is concerned about crop cultivation, the Ministry of Transportation is concerned about the construction of transportation facilities, and so on. Therefore, under the existing conditions, a more feasible approach is to classify and extract the various types of elements separately. On the one hand, this can promote in-depth research on and rapid development of remote sensing intelligent interpretation in practical applications, and on the other hand, it can better and faster serve the application needs of industry fields.
Open-pit mining is the process of removing the overburden from an ore body in order to obtain the desired minerals, and it is used to extract minerals such as sand and gravel, coal, and iron ore. This type of mining has the advantages of low technical requirements, low investment, and low cost, but it also causes the most serious damage to the environment, as in the case of the open-pit mining of sand and gravel. Sand and gravel resources are widely distributed in nature and are strategic resources that have an important impact on national defense security, economic security, and sustainable economic and social development. They are mainly used for the construction of roads and buildings. Sand and gravel resources are easy to survey and mine, and open-pit mining is the predominant method. The phenomena of unauthorized mining, over-exploitation, and transboundary mining have never been eradicated in China, which not only disrupts the order of the exploitation and utilization of mineral resources but also causes serious damage to the ecological environment. Therefore, the rapid implementation of the regulation of open-pit mining and illegal mining has become one of the important tasks of the natural resources management authorities.
However, it is well known that the current remote sensing intelligent classification and extraction technology is not mature; the accuracy is not high enough, and the processing speed is not fast enough. Therefore, the accuracy, efficiency, and automation level of the large-scale remote sensing classification and extraction of open-pit mining areas are far from the requirements of practical applications. There is an urgent need to conduct continuous and in-depth research on intelligent extraction methods for open-pit mining areas based on machine learning in order to significantly improve the extraction accuracy and processing speed. This will provide technical support for the Ministry of Natural Resources to comprehensively grasp the current situation of open-pit mining in China and quickly implement dynamic monitoring, analysis, and evaluation.
Because the spatial data of artificial facilities are widely used and such facilities have relatively simple texture features and relatively clear boundaries, studies on their automatic classification and extraction are more numerous and more mature. These artificial facilities include houses and buildings [13,14], roads [15,16], stadiums [17,18], and so on. However, it is difficult to intelligently classify and extract natural elements with high accuracy from high-resolution remote sensing images, especially when the elements are covered or affected by vegetation, such as open-pit mining areas, which are characterized by complex internal textures and sticky boundaries. Adhesion manifests itself here in the form of neighboring open-pit mining areas that should be separated by vegetation, etc., sticking together to varying degrees at different locations, resulting in extremely elongated areas or nearly overlapping boundaries. An open-pit mine also evolves over its lifetime, with the mining area expanding from nothing as production continues. For example, the open-pit mining area of the Fushun West Open-Pit Coal Mine approximates an east–west rectangle, measuring 6.6 km east–west and 2.2 km north–south [19]. This huge difference in scale also brings difficulties in extracting open-pit mining areas with high accuracy. Although the rapid development of image processing, pattern recognition, and computer vision has created conditions for improving the level of image classification and feature extraction, their application in the extraction of open-pit mining areas is not mature enough. All these characteristics of open-pit mining areas place high demands on deep learning methods, and the existing methods are not strong enough to support them, so in-depth research is needed.
In order to optimize the remote sensing extraction of open-pit mining areas in complex environments, we propose a deep learning-based semantic segmentation model called Segmentation for Mine—SegMine—which consists of a Vision Transformer-based encoder and a lightweight attention mask decoder. Our innovations and main contributions can be summarized as follows:
(1)
In order to resolve the segmentation errors caused by the disappearance of local area features such as edges and textures in the mining area, this article proposes a multi-scale local spatial feature complementary module. The module learns multi-scale local spatial features and supplements them into the global features of the Transformer blocks so as to enhance the network's ability to obtain local spatial detail information and mitigate the problems of disappearing small-scale object features and insufficient information expression.
(2)
The decoder for pixel-by-pixel classification ignores the importance of contextual learning when assigning labels to each pixel, and the upsampling of features is prone to blur the edges of the mining area. In order to fully utilize the learned contextual semantic features to solve the problem of sticky and blurred boundaries in mining areas, this article proposes the attention mask decoder. It is able to better retain the edge details of the mined area and mitigate the accuracy degradation that may result from downsampling and then upsampling the feature map.
(3)
This article demonstrates the promising application of Transformer for the intelligent extraction of open-pit mining areas in complex environments. Considering Transformer’s good parallel computing and global feature acquisition capabilities, as well as SegMine’s good performance, it would be a promising model for the classification and extraction of open-pit mining areas.
The remainder of this article is organized as follows: The related works are described in Section 2. In Section 3, the proposed methodology for open-pit mining area extraction in remote sensing is described. The experimental design and the discussion of the experiment results are introduced in Section 4. Section 5 presents further discussions and gives a summary of our work.

2. Related Works

2.1. Land Cover Extraction by Deep Learning

For a long time, technicians have been implementing fully manual or semi-automated remote sensing extraction of land cover in China through human–computer interaction. Manual extraction of geographic elements from high-resolution remote sensing images is inefficient, time-consuming, costly, and unable to meet processing requirements in a timely manner, so semi-automatic and fully automatic methods have been preferred. Currently, land cover classification has evolved from pixel-based image analysis [2,20] to object-based image analysis [21]. Compared with pixel-based image analysis, object-based image analysis makes full use of various features of objects, such as spectral features, texture features, and geometric features. Recently, deep learning has been proposed to automatically learn abstract features and has proven to be a state-of-the-art approach. It can provide an end-to-end framework from raw data to target results, reduce human–computer interaction, and greatly improve the efficiency and automation of remote sensing classification [22]. In particular, convolutional neural networks (CNNs) have dominated many areas of computer vision, achieving good results in object recognition, detection, and segmentation. To some extent, these research results can be directly used for remote sensing classification and feature extraction. These applications usually use RGB images, while current multispectral images acquired by satellite remote sensing, which offer lower cost, the widest coverage, higher resolution, and more application fields, also contain RGB bands. Given the practical application requirements, this study also adopts RGB image data as the basic data.
The emergence of technologies such as CNNs, attention mechanisms, and the Transformer has accelerated research on practical intelligent classification and extraction methods for land cover [23]. The CNN is one of the representative models for deep learning applications on remote sensing data [24,25], aiming to utilize the multidimensional grid structure of the input image [26]. Through weight sharing and local connectivity, CNNs build deep networks that can learn features from the pixel level to the semantic level for remote sensing applications. The features extracted by a CNN can also be fed into classifiers such as SVM and random forest in order to obtain better classification results. Sharma proposed a deep patch-based CNN framework for medium-resolution remote sensing data, which targets 5 × 5 patches and improves the classification accuracy compared to pixel-based classification methods [27]. Waldner and Diakogiannis used a deep CNN with a fully connected U-Net backbone to extract field boundaries from remote sensing images [28].
CNNs rely on the assumption of independence between instances, which is unacceptable for temporally or spatially correlated data [29]. Local receptive fields in CNNs limit the modeling of long-distance dependencies in images, and the convolution is content-independent: since the convolutional filter weights are fixed, their values are the same regardless of the attributes of the input. More recently, Transformer-based architectures, originally introduced in natural language processing, have penetrated the field of computer vision, in which the self-attention mechanism is used as an alternative to the popular convolutional operator to capture long-distance dependencies. Due to its attention mechanism, the Vision Transformer (ViT) can efficiently capture global interactions by learning the relationships between sequence elements [30]. After the success of ViT in the field of computer vision, the remote sensing community has also introduced it into research on ultra-high-resolution image classification [31], change detection, etc. A two-stream Swin Transformer network (TSTNet) was proposed to address the challenges of complex backgrounds and stochastically arranged objects in remote sensing images, which make it difficult to focus on the target objects in the scene [32]. Ma proposed a Homo–Heterogenous Transformer Learning (HHTL) framework for remote sensing image classification in order to overcome limitations such as the context relationships hidden in remote sensing scenes not being thoroughly mined, homogenous information not receiving the attention it deserves, and the similarities between images and semantic labels not being considered in depth [33].

2.2. Open-Pit Mining Area Extraction by Deep Learning

With the increasingly serious land encroachment and vegetation destruction resulting from open-pit mining, the contradiction between mining activities and environmental protection has gradually intensified [34]. There is an increasing demand for the automatic identification and extraction of open-pit mining areas. Xiang proposed an improved UNet dual network structure that skips the corresponding layers at the encoder end, and this structure finally realizes end-to-end open-pit mining area change detection in remote sensing images [35]. Wang proposed an open-pit mining area extraction model based on the Improved Mask R-CNN (Regional Convolutional Neural Network) and Migration Learning (IMRT) and designed an automated batch production process for open-pit mining areas [36]. Zhang improved full CNN with dense blocks and realized the automatic extraction of open-pit mining areas in Tongling [37]. However, open-pit mining areas are often affected by vegetation cover, and not only do the boundaries tend to be overgrown, but parts of the interior may also be overgrown due to being unmined or to the short-term cessation of mining. Intelligent extraction of fine-grained open-pit mining areas requires continued in-depth research.

3. Methodology

3.1. Model Architecture

Aiming at the challenges of complex texture and uneven scale in open-pit mining areas in remote sensing images, as well as the sticky boundaries of extracted mining areas, a Segmentation for Mine (SegMine) model is proposed, and its overall architecture is shown in Figure 1. The SegMine model consists of a Vision Transformer-based encoder and a lightweight attention mask decoder, which are responsible for extracting image features and learning the segmentation mapping, respectively. On the encoder side, a multi-scale prior information complementary structure is designed to minimize the error due to the lack of prior spatial information in the Vision Transformer and to mitigate the segmentation error due to the loss of features in small open-pit mining areas. On the decoder side, a lightweight attention mask module is designed to obtain the final segmentation output through masking operations in order to mitigate the low segmentation accuracy caused by the blurring of mine boundaries that results from feature upsampling.
In SegMine, the whole segmentation process consists of three steps: feature encoding, feature decoding, and segmentation computation. First, the Vision Transformer containing multi-scale prior information is utilized to encode the features of the target image, and each Transformer block outputs features with global semantic information and rich boundary detail information. Second, the multi-layer attention maps and global semantic features output from the Transformer blocks are collected and integrated, and the segmentation mask and multi-layer global semantic features are obtained using the attention mask decoder. Finally, the global semantic features are multiplied with the segmentation mask in order to obtain high-quality segmentation results. Figure 1 shows the overall architecture of the proposed SegMine model.
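For illustration only, the NumPy sketch below walks through this three-step flow with random stand-ins for the learned tensors; the shapes, variable names, and the simple argmax at the end are assumptions made for the example, not the released implementation.

```python
import numpy as np

# Illustrative sizes only: N patch tokens, D-dimensional features,
# C = 2 classes (mining area / background), L Transformer blocks.
N, D, C, L = 1024, 768, 2, 12
rng = np.random.default_rng(0)

# Step 1: feature encoding -- each Transformer block outputs global semantic
# features enriched with multi-scale local spatial detail (random stand-ins here).
block_features = [rng.standard_normal((N, D)) for _ in range(L)]   # stand-ins for F_1..F_L

# Step 2: feature decoding -- the attention mask decoder converts every block's
# features into a per-class mask; the masks are summed into the segmentation mask.
masks = [1.0 / (1.0 + np.exp(-rng.standard_normal((C, N)))) for _ in block_features]
segmentation_mask = np.sum(masks, axis=0)               # shape (C, N)
class_prob = np.array([0.1, 0.9])                       # stand-in class probability prediction

# Step 3: segmentation computation -- multiply class probabilities with the mask
# and take the per-token argmax as the final prediction.
seg = class_prob[:, None] * segmentation_mask           # shape (C, N)
prediction = seg.argmax(axis=0)                         # class index per patch token
print(prediction.shape)                                 # (1024,)
```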

3.2. Encoder

The encoder of the SegMine model is constructed based on the Vision Transformer. In order to resolve the segmentation errors caused by the disappearance of local area features such as edges and textures in the mining area, as well as the loss of small-scale features in the mining area, a multi-scale local spatial feature complementary module is proposed. This module learns the multi-scale local spatial features of the image through a convolutional bypass branch and then injects the multi-scale local spatial features into the global features of the Transformer blocks by using the cross-attention mechanism. In this way, the ability of the model to obtain local spatial detail information is enhanced, and problems such as the disappearance of small-scale features and insufficient information expression in the mining area are largely overcome.
The encoding process consists of three main steps: multi-scale local spatial feature acquisition, global semantic feature construction, and feature inter-injection. First, multiple convolution operations are performed on the original image in order to aggregate the multi-scale correlations of the image and obtain multi-scale local spatial features. Second, the original images are passed into Vision Transformer to construct global semantic features. Finally, a cross-attention operation is performed on the multi-scale local spatial features and global semantic features, which can achieve the injecting of local spatial features into the global semantic features and, at the same time, the injecting of global semantic information into the multi-scale local spatial features. The multi-scale local spatial feature complementary module is interspersed between Transformer blocks in order to realize the information interaction between the multi-scale local spatial features and the global semantic features so that the global semantic features output from each Transformer block contain multi-scale local spatial features.
In SegMine, the Vision Transformer consists of a patch embedding layer and L Transformer blocks. The patch embedding layer divides the input image into equal-sized image patches, and the linear projection layer maps the features into D dimensions, converting image feature extraction into a sequence-to-sequence task. The Transformer blocks encode the image features and learn the long-range semantic information of the image by using the multi-head attention mechanism.
The input image of the patch embedding layer is $x \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels of the input image, respectively. Through the patch embedding layer, the original image is partitioned into a sequence of $N$ fixed-size image patches $\{x_p^i \mid i = 1, 2, \ldots, N\}$, where $N = HW/P^2$ and $P$ is the patch size. Then, through the linear mapping layer, the patch sequence is mapped into $D$-dimensional vectors in order to accommodate the computation of multi-head self-attention. Finally, a learnable positional embedding $P_{pos} \in \mathbb{R}^{N \times D}$ is added to the $D$-dimensional sequence in order to preserve the positional information between image patches. The input sequence of the Transformer blocks can be represented as (1):

$$F_0 = \left[\mathcal{F}(x_p^1); \mathcal{F}(x_p^2); \ldots; \mathcal{F}(x_p^N)\right] + P_{pos} \tag{1}$$

where $\mathcal{F}$ denotes the linear mapping of image patches into $D$-dimensional features that can be processed by the Transformer.
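As a worked example of Equation (1), the following NumPy sketch partitions a 512 × 512 × 3 image into 16 × 16 patches and projects them, plus a positional embedding, into a 1024 × 768 token sequence; the patch size and embedding dimension are assumed values chosen for illustration, and the weights are random stand-ins.

```python
import numpy as np

# Assumed sizes: H = W = 512, C = 3, patch size P = 16, embedding dimension D = 768.
H, W, C, P, D = 512, 512, 3, 16, 768
N = (H * W) // (P * P)                                  # 1024 patches

rng = np.random.default_rng(0)
x = rng.random((H, W, C))

# Partition the image into N non-overlapping P x P patches and flatten each one.
patches = (x.reshape(H // P, P, W // P, P, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(N, P * P * C))                    # (N, P^2 * C)

# Linear mapping F(.) into D dimensions plus a learnable positional embedding P_pos.
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
P_pos = rng.standard_normal((N, D)) * 0.02
F0 = patches @ W_embed + P_pos                          # input sequence of the Transformer blocks
print(F0.shape)                                         # (1024, 768)
```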
The output sequence $F_0 \in \mathbb{R}^{N \times D}$ of the patch embedding layer is fed into the Transformer blocks to encode and learn global semantic features. Each Transformer block consists of a multi-head self-attention (MSA) block and a multilayer perceptron (MLP) block applied in an alternating manner. Layer normalization (LN) is applied before each block, and a residual connection is applied after each block.
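A minimal single-head NumPy sketch of one such pre-norm block is given below; the real encoder uses multi-head attention, learned weights, and (typically) a GELU activation, so this is only a structural illustration with assumed sizes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(F, Wq, Wk, Wv, W1, W2):
    """One pre-norm block: F <- F + MSA(LN(F)); F <- F + MLP(LN(F)).
    Single-head attention and ReLU are used for brevity."""
    X = layer_norm(F)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))          # self-attention weights
    F = F + A @ V                                        # residual connection after MSA
    X = layer_norm(F)
    F = F + np.maximum(X @ W1, 0.0) @ W2                 # residual connection after MLP
    return F

# Illustrative usage with N = 1024 tokens and D = 64 features.
rng = np.random.default_rng(0)
N, D = 1024, 64
F0 = rng.standard_normal((N, D))
attn_w = [rng.standard_normal((D, D)) * 0.02 for _ in range(3)]
W1 = rng.standard_normal((D, 4 * D)) * 0.02
W2 = rng.standard_normal((4 * D, D)) * 0.02
F1 = transformer_block(F0, *attn_w, W1, W2)
print(F1.shape)                                          # (1024, 64)
```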
In SegMine, the multi-scale local spatial feature complementary module interspersed between Transformer blocks consists of a bypass convolution branch and multiple feature injection modules. The bypass convolution branch learns the multi-scale local space features of the image. The feature injection module injects the multi-scale local space features into the input features of Transformer blocks through the cross-attention mechanism and injects the global semantic information output from Transformer blocks into the multi-scale local space features.
The convolution kernel has inductive biases such as translation invariance and local sensitivity and is able to capture local spatial features of the image by means of sliding convolutions with shared weights. Multi-scale convolution extracts features of the mining area at different scales, which favors the feature expression of small-scale open-pit mining areas. In order to improve the model's responsiveness to local visual information and multi-scale features of the open-pit mining area image, the multi-scale local spatial features are captured through the serial computation of the bypass convolutional branch. First, the input image is passed through three consecutive 3 × 3 convolutional downsampling layers in order to obtain features with local image relevance at three different scales, constituting the feature pyramid $\{Z_1, Z_2, Z_3\}$. Second, the feature pyramid is sequentially flattened (Flatten), concatenated (Concat), and activated (SiLU). Here, Flatten is used to unfold a tensor in a specific way; for example, a convolutional feature map must be flattened before it is passed to a fully connected layer. Concat is used to stitch different feature maps together along a given dimension. SiLU is an improved version of the Sigmoid and ReLU activation functions that converts an input signal into an output signal; it is unbounded above and bounded below, smooth, and non-monotonic, and it outperforms ReLU in deep models. Finally, the mapped $D$-dimensional multi-scale local correlation features are used as inputs to the information injection module. The process can be represented as (2) and (3):

$$\hat{Z} = \mathrm{Concat}\left(\mathrm{Flatten}(Z_1, Z_2, Z_3)\right) \tag{2}$$

$$Z = \mathcal{F}\left(\mathrm{SiLU}\left(\mathrm{Norm}(\hat{Z})\right)\right) \tag{3}$$
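The following NumPy sketch illustrates Equations (2) and (3) under assumed sizes (a 64 × 64 tile, 8 intermediate channels, D = 32): three stride-2 3 × 3 convolutions build the feature pyramid, which is then flattened, concatenated, normalized, activated with SiLU, and linearly projected. The naive convolution loop and all weights are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3_s2(x, w):
    """Naive 3 x 3 convolution with stride 2 and padding 1.
    x: (H, W, Cin), w: (3, 3, Cin, Cout) -> output (H//2, W//2, Cout)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    Ho, Wo = H // 2, W // 2
    out = np.zeros((Ho, Wo, w.shape[-1]))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[2 * i:2 * i + 3, 2 * j:2 * j + 3, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def silu(x):
    return x / (1.0 + np.exp(-x))

H = W_img = 64                   # small illustrative tile; the dataset uses 512 x 512 slices
Cmid, D = 8, 32                  # hypothetical channel width and embedding dimension
x = rng.random((H, W_img, 3))

w1 = rng.standard_normal((3, 3, 3, Cmid)) * 0.1
w2 = rng.standard_normal((3, 3, Cmid, Cmid)) * 0.1
w3 = rng.standard_normal((3, 3, Cmid, Cmid)) * 0.1

# Three consecutive stride-2 convolutions build the feature pyramid {Z1, Z2, Z3}.
Z1 = conv3x3_s2(x, w1)           # (32, 32, Cmid)
Z2 = conv3x3_s2(Z1, w2)          # (16, 16, Cmid)
Z3 = conv3x3_s2(Z2, w3)          # ( 8,  8, Cmid)

# Equations (2)-(3): flatten, concatenate along the token axis, normalize,
# apply SiLU, and linearly map into the Transformer dimension D.
Z_hat = np.concatenate([z.reshape(-1, Cmid) for z in (Z1, Z2, Z3)], axis=0)
Z_norm = (Z_hat - Z_hat.mean(-1, keepdims=True)) / (Z_hat.std(-1, keepdims=True) + 1e-6)
W_proj = rng.standard_normal((Cmid, D)) * 0.1
Z = silu(Z_norm) @ W_proj        # multi-scale local spatial features
print(Z.shape)                   # (1344, 32)
```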
The cross-attention mechanism encodes the original sequence and the target sequence separately, and the calculated cross-attention weights represent the attention paid by each element of one sequence to each element of the other. According to the cross-attention weights, certain elements of the original sequence are selectively attended to in order to realize the information interaction between the two sequences without changing the shape of the original sequence. This information interaction is accomplished through the information injection module, whose overall flow is shown in Figure 2.
In order to inject the multi-scale local spatial features output from the convolutional branch into the global semantic features of Transformer blocks and to cause the multi-scale local spatial features to have the long-distance dependency of the image features, this study constructs the information injection module through the cross-attention mechanism.
Global semantic features and multi-scale local spatial features are used as inputs to the information injection module. Before each block, layer normalization (LN) is applied, and after each block, residual connectivity is applied. In order to cause the global semantic features to contain multi-scale local spatial information, the module first injects the multi-scale local spatial features into the global semantic features output from the i-th Transformer block, as shown in (4).
$$\hat{F}_i = \mathrm{CA}\left(\mathrm{LN}(F_i), \mathrm{LN}(Z)\right) + F_i \tag{4}$$

where $i \in \{1, 2, \ldots, L\}$, the image features after encoding by the $L$ Transformer blocks are $\{F_1, F_2, \ldots, F_L\} \subset \mathbb{R}^{N \times D}$, and $\mathrm{CA}$ is the cross-attention operator.
In this cross-attention, the global semantic feature $F_i$ is used as the query Q, and the multi-scale local spatial feature $Z$ is used as the key K and the value V. The global semantic feature and the multi-scale local spatial feature are linearly transformed, as shown in (5)–(7).

$$Q = F_i W_q \in \mathbb{R}^{N \times D} \tag{5}$$

$$K = Z W_k \in \mathbb{R}^{N \times D} \tag{6}$$

$$V = Z W_v \in \mathbb{R}^{N \times D} \tag{7}$$

where $N$ is the length of the sequence, $D$ is the number of feature dimensions, and $W_q$, $W_k$, and $W_v$ are three trainable parameter matrices.
Then, the degree of attention paid by the global semantic features to the multi-scale local spatial features, also known as the cross-attention weight, is obtained using scaled dot-product and SoftMax operations.
$$\mathrm{CrossAttention}(Q, K) = \mathrm{SoftMax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) \tag{8}$$

where $d$ is the number of channels in each self-attention head.
Finally, the cross-attention weights are multiplied with V in order to complete the information interaction between the multi-scale local spatial features and the global semantic features.

$$\mathrm{CA}(F_i, Z) = \mathrm{CrossAttention}(Q, K) \cdot V \in \mathbb{R}^{N \times D} \tag{9}$$
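A compact NumPy sketch of this injection step, composing Equations (4)–(9) with assumed token counts and feature dimensions, is shown below; the projection matrices are random stand-ins for the learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def cross_attention(query_feats, kv_feats, Wq, Wk, Wv):
    """Equations (5)-(9): queries from one sequence, keys/values from the other."""
    Q = query_feats @ Wq
    K = kv_feats @ Wk
    V = kv_feats @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))              # cross-attention weights, Eq. (8)
    return weights @ V                                   # Eq. (9)

# Illustrative sizes: N global tokens, M multi-scale local tokens, D features.
rng = np.random.default_rng(0)
N, M, D = 1024, 1344, 64
F_i = rng.standard_normal((N, D))                        # global semantic features from block i
Z = rng.standard_normal((M, D))                          # multi-scale local spatial features
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

# Equation (4): inject local detail into the global features with a residual connection.
F_hat = cross_attention(layer_norm(F_i), layer_norm(Z), Wq, Wk, Wv) + F_i
print(F_hat.shape)                                       # (1024, 64)
```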
In order to update the multi-scale local spatial features so that they contain global semantic information, the module injects the global semantic features output from the i-th Transformer block into the multi-scale local spatial features.
$$\hat{Z} = \mathrm{CA}\left(\mathrm{LN}(Z), \mathrm{LN}(F_i)\right) + Z \tag{10}$$
The multi-scale local spatial features containing global semantic information are then expressively enhanced by a feed-forward neural network (FFNN). The process can be represented as (11):

$$Z = \mathrm{FFNN}(\hat{Z}) = W_2\, \mathrm{ReLU}(W_1 \hat{Z}) \tag{11}$$

where $W_1 \in \mathbb{R}^{D \times 2D}$ and $W_2 \in \mathbb{R}^{2D \times D}$ are learnable transformation matrices.
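Equation (11) amounts to a two-layer feed-forward network that expands the feature dimension from D to 2D and projects it back; a minimal NumPy sketch with assumed sizes and random weights is:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 1344, 64                                          # illustrative token count and width
Z_hat = rng.standard_normal((M, D))
W1 = rng.standard_normal((D, 2 * D)) * 0.02
W2 = rng.standard_normal((2 * D, D)) * 0.02
Z = np.maximum(Z_hat @ W1, 0.0) @ W2                     # FFNN(Z_hat) = W2 ReLU(W1 Z_hat)
print(Z.shape)                                           # (1344, 64)
```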
In this cross-attention, the multi-scale local spatial feature $Z$ is used as the query Q, and the global semantic feature $F_i$ is used as the key K and the value V. The two features are linearly transformed, and the arithmetic formulas are the same as in (5)–(7).
Then, the degree of attention paid by the multi-scale local spatial features to the global semantic features, also known as the cross-attention weight, is obtained using scaled dot-product and SoftMax operations, and the calculation formula is the same as in (8).
Finally, the cross-attention weights are multiplied with V in order to complete the information interaction between the global semantic features and the multi-scale local spatial features, and the calculation formula is the same as in (9).
The information injection module outputs both global semantic features containing multi-scale local spatial features and multi-scale local spatial features containing global semantic information. The global semantic features are transmitted to the Transformer block through the residual connection in order to achieve long-distance information modeling, which ensures that the global semantic features respond well to mining areas of different scales. The multi-scale local spatial features are input into the next multi-scale local spatial feature complementary module as its multi-scale prior information, which enhances model convergence and improves model performance at the same time.

3.3. Decoder

The decoder of the SegMine model solves the semantic segmentation problem by generating masks. Previous decoders for pixel-by-pixel classification ignore the importance of contextual learning when assigning labels to each pixel, and feature upsampling is prone to blur the boundaries of the mining area. In order to fully utilize the learned contextual semantic features to solve the problem of sticky and blurred boundaries of the mining area, the attention mask decoder is proposed. The attention mask decoder generates a meaningful similarity mapping and then converts the similarity mapping into a segmentation mask. In this way, it obtains segmentation results with semantic information while avoiding the accuracy degradation that may be caused by downsampling and then upsampling the feature maps.
The decoding process consists of three main steps: generating a classification token, generating a segmentation mask, and completing semantic segmentation. First, the attention mask module takes the global semantic features output from Transformer blocks as input and generates a specific classification token that represents a mining area. Second, it calculates the similarity between the classification token and the global semantic features and converts the similarity mapping into a segmentation mask through Sigmoid. Meanwhile, the information interaction between the global semantic features and the classification token is realized through cross-attention calculation so that the classification token contains global semantic information and generates meaningful similarity mapping. Finally, the module outputs the updated classification token and L attention masks. The classification token is linearly transformed using the SoftMax operation to obtain the class probability prediction and thereby determine the presence of the class of open-pit mining areas in the image. The combined L attention masks are then multiplied term by term with the class probability prediction in order to obtain the final segmentation output. In the overall structure, the Attention Mask module is used after each Transformer block, and the final segmentation mask is obtained by summing the Attention Mask outputs from the Attention Mask module. The class probability prediction of the classification token is multiplied with the segmentation mask item by item in order to achieve the semantic segmentation of the mining area. The overall flow of the attention mask module is shown in Figure 3.
First, the module generates a sequence of learnable classification tokens $G \in \mathbb{R}^{C \times D}$, where $C$ is the number of categories, and calculates the feature similarity between the classification tokens and the global semantic features. The two sequences are linearly transformed, with the classification tokens as the query Q and the global semantic features as the key K and the value V.

$$Q = G W_q \in \mathbb{R}^{C \times D} \tag{12}$$

$$K = F_i W_k \in \mathbb{R}^{N \times D} \tag{13}$$

$$V = F_i W_v \in \mathbb{R}^{N \times D} \tag{14}$$

where $F_i$ is the image feature output by the i-th Transformer block, and $W_q$, $W_k$, and $W_v$ are three trainable parameter matrices.
The similarity mapping between the classification tokens and the global semantic features is computed as the scaled dot product of the query Q and the key K.

$$\mathrm{Similar}(Q, K) = \frac{Q K^{T}}{\sqrt{d}} \in \mathbb{R}^{C \times N} \tag{15}$$
The similarity mapping is input to SoftMax to obtain the cross-attention weights, and the cross-attention weights are multiplied with V to obtain the updated classification tokens. The updated classification tokens are used as the classification tokens for the next attention mask decoder.

$$\mathrm{CrossAttention}(Q, K) = \mathrm{SoftMax}\left(\mathrm{Similar}(Q, K)\right) \tag{16}$$

$$\mathrm{CA}(G, F_i) = \mathrm{CrossAttention}(Q, K)\, V \in \mathbb{R}^{C \times D} \tag{17}$$
Second, the similarity mapping is converted into a semantic mask by the Sigmoid function.

$$\mathrm{Mask}(G, F_i) = \mathrm{Sigmoid}\left(\mathrm{Similar}(Q, K)\right) \in \mathbb{R}^{C \times N} \tag{18}$$
Finally, the category probability prediction and the segmentation mask are generated to obtain the semantic segmentation results. The classification token sequence $G \in \mathbb{R}^{C \times D}$ output from the last layer of the decoder is passed through a fully connected layer and the SoftMax function to obtain the category probability prediction.

$$P(G) = \mathrm{SoftMax}\left(\mathrm{Linear}(G)\right) \tag{19}$$
The semantic masks in the decoder are superimposed in order to obtain the segmentation mask.

$$\mathrm{Mask}_L = \sum_{i=1}^{L} \mathrm{Mask}(G, F_i) \tag{20}$$

where $L$ is the number of Transformer blocks.
The category probability prediction is multiplied element by element with the segmentation mask to obtain the semantic segmentation result for the open-pit mining area category.
$$\mathrm{Seg} = P(G) \odot \mathrm{Mask}_L \tag{21}$$
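Putting Equations (12)–(21) together, the NumPy sketch below traces the attention mask decoder with random stand-ins for the learned weights and features. The exact shape of the linear classification head in Equation (19) is not specified in the text, so a scalar score per classification token followed by SoftMax is assumed here, and the token update follows Equation (17) without an additional residual term.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: C classes, N patch tokens, D features, L Transformer blocks.
rng = np.random.default_rng(0)
C, N, D, L = 2, 1024, 64, 4
G = rng.standard_normal((C, D))                          # learnable classification tokens
block_feats = [rng.standard_normal((N, D)) for _ in range(L)]   # stand-ins for F_1..F_L

masks = []
for F_i in block_feats:
    Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
    Q, K, V = G @ Wq, F_i @ Wk, F_i @ Wv                 # Eqs. (12)-(14)
    similar = Q @ K.T / np.sqrt(D)                       # Eq. (15), shape (C, N)
    G = softmax(similar) @ V                             # Eqs. (16)-(17): updated tokens
    masks.append(sigmoid(similar))                       # Eq. (18): semantic mask

# Eq. (19): class probability prediction from the final classification tokens
# (assumed scalar score per token, normalized over the C tokens).
w_cls = rng.standard_normal((D, 1)) * 0.02
P = softmax((G @ w_cls).ravel())                         # shape (C,)

Mask_L = np.sum(masks, axis=0)                           # Eq. (20): summed segmentation mask
Seg = P[:, None] * Mask_L                                # Eq. (21): per-class segmentation scores
prediction = Seg.argmax(axis=0)                          # class index for each of the N tokens
print(prediction.shape)                                  # (1024,)
```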

4. Experiments

4.1. Datasets and Settings

The original data used in this research are collected high-resolution Google images, which have the same resolution and are distributed over different regions. First, the open-pit mining areas in the images were manually labeled according to pre-determined decision rules. Second, the original image data and the labeled data were sliced according to the storage format and size requirements of the model input data to form an original and labeled image database in PNG format with a size of 512 × 512. Each original image slice contains three bands, while the label image has a single band. The label value of each pixel is stored as a number using the UINT8 data type and indicates whether the pixel belongs to a mining area: "1" represents a mining area, and "0" represents a non-mining area. The dataset has 3437 image pairs, which are randomly divided into 2946 training images and 491 test images according to a 6:1 ratio, and there are two classes: the open-pit mining area class and the background class.
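The slicing and splitting procedure can be sketched as follows; the array contents, scene size, and helper names are illustrative placeholders rather than the actual data pipeline (which stores PNG tiles on disk).

```python
import numpy as np

TILE = 512  # slice size used for the PNG database

def slice_pair(image, label, tile=TILE):
    """Cut an image (H, W, 3) and its uint8 label (H, W) in {0, 1} into tile x tile pairs."""
    H, W = label.shape
    pairs = []
    for r in range(0, H - tile + 1, tile):
        for c in range(0, W - tile + 1, tile):
            pairs.append((image[r:r + tile, c:c + tile, :],
                          label[r:r + tile, c:c + tile]))
    return pairs

# Random stand-in for a labeled scene; 1 = mining area, 0 = background.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(2048, 2048, 3), dtype=np.uint8)
lbl = rng.integers(0, 2, size=(2048, 2048), dtype=np.uint8)
pairs = slice_pair(img, lbl)
print(len(pairs))                                        # 16 tiles from a 2048 x 2048 scene

# A random 6:1 train/test split, as used for the 3437 image pairs in this study.
idx = rng.permutation(len(pairs))
split = int(len(pairs) * 6 / 7)
train_idx, test_idx = idx[:split], idx[split:]
```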
The computer used for the experiments was configured with an NVIDIA GeForce RTX 2080 GPU (NVIDIA, Santa Clara, CA, USA), and the Ubuntu operating system, Python 3.6, CUDA 11.4, and the PaddlePaddle 2.4.2 deep learning framework were installed. In the subsequent experiments, all seven models were implemented and run based on PaddlePaddle 2.4.2.

4.2. Quantitative Experimental Results

As metrics broadly used for the performance evaluation of deep learning models, the mean intersection over union (mIoU), precision, recall, and similarity coefficient (Dice) are used in this paper as the evaluation indexes to assess the model's segmentation of open-pit mining areas in remote sensing images, and they are calculated as shown in (22) to (25). IoU is the ratio of the intersection to the union of the real labels and the predicted masks for a given category, and mIoU is the average IoU over all categories, which measures the degree of overlap between the pixel categories predicted by the model and the real labels. Precision indicates the proportion of pixels predicted to be true that are actually true. Recall measures the ability of the model to recognize all positive categories, i.e., the ratio of the number of pixels correctly predicted to be true to the total number of pixels that are actually true. The similarity coefficient (Dice), an evaluation metric combining precision and recall, computes the similarity between the true labels and the predicted masks and is expressed as the ratio of twice the intersection of the predicted and true labels to the sum of the predicted masks and the true labels.
$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP}{FN + FP + TP} \tag{22}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{23}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{24}$$

$$\mathrm{Dice} = \frac{2TP}{FP + 2TP + FN} \tag{25}$$

where $k$ denotes the total number of categories; true positive (TP) denotes the number of pixels that are actually true and predicted to be true; false positive (FP) denotes the number of pixels that are actually false and predicted to be true; true negative (TN) denotes the number of pixels that are actually false and predicted to be false; and false negative (FN) denotes the number of pixels that are actually true and predicted to be false.
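These four metrics can be computed directly from the confusion counts; the sketch below follows Equations (22)–(25) on a tiny illustrative label/prediction pair (the arrays, class count, and epsilon guard are assumptions for the example).

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes=2):
    """Per Equations (22)-(25): mIoU averaged over classes, plus precision,
    recall, and Dice for the positive (mining-area) class."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (label == c))
        fp = np.sum((pred == c) & (label != c))
        fn = np.sum((pred != c) & (label == c))
        ious.append(tp / (tp + fp + fn + 1e-12))
    tp = np.sum((pred == 1) & (label == 1))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    dice = 2 * tp / (fp + 2 * tp + fn + 1e-12)
    return float(np.mean(ious)), float(precision), float(recall), float(dice)

# Tiny worked example on a 3 x 3 label/prediction pair (values are illustrative).
label = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]])
pred  = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1]])
print(segmentation_metrics(pred, label))
```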
In order to verify that the proposed model is superior to the traditional models, six additional semantic segmentation models were selected, and a comparative experiment was designed. During the experiment under the same environment, detailed comparative analyses were carried out from both the overall and local perspectives, respectively. Four evaluation metrics were chosen to evaluate the overall accuracy of each model, and seven scenes were selected to evaluate the local accuracy of each model.
The experimental results are shown in Table 1, which compares the mIoU, precision, recall, and Dice results of each network model, with the bolded data indicating the optimal results. As can be seen from Table 1, all seven Transformer-based models obtain good segmentation performance. This is due to the fact that the Transformer obtains a sufficiently large receptive field through the multi-head self-attention mechanism, which enables originally fuzzy boundaries to be judged more accurately through their overall shape and contextual information. Both Segmenter and RtFormer retain the high-resolution, coarse-grained features and low-resolution, fine-grained features, which makes the pixel classification more accurate and the segmentation of boundaries and other details finer. Among them, the mIoU of Segmenter reaches 85.68%, which is due to the fact that Segmenter utilizes the self-attention mechanism to capture global context information and generates classification masks through attention weights, which, in turn, leads to more complete segmentation of open-pit mining areas.
SegFormer, TopFormer, Mask2Former, and ViT-Adapter all utilize an encoder–decoder structure, and all incorporate multi-scale semantic information. Utilizing features at different scales can improve the model's ability to recognize objects with different sizes and complex shapes. Among them, the mIoU of Mask2Former reaches 86.12%, which is due to the fact that Mask2Former transforms the segmentation task into a combined problem of category prediction and instance segmentation. By combining the predicted pixel categories and instance semantics to generate a classification mask, more advanced semantic information can be provided, enabling the model to better understand and distinguish the segmented objects in the open-pit mining area in the image. SegFormer's recall reaches 85.56%, implying that SegFormer has high integrity in generating CNN-like multi-level features to represent the target object through its hierarchical Transformer encoder, which is conducive to segmenting the open-pit mining area in the presence of complex backgrounds.
SegMine integrates the advantages of Transformer and CNN, injects multi-scale priori information into global semantic features, and integrates multi-layer global semantic information to guide pixel classification by adding an attention mask module. The mIoU of SegMine reaches 86.91%, which is the best performance for this indicator, indicating that the segmentation results of SegMine are more consistent with the real labeling. Additionally, it significantly outperforms other models in precision and dice coefficients. This indicates that the improved encoder can enhance the model’s ability to acquire local spatial detail information, further addressing the problems of disappearing small-scale target features and insufficient information expression. The proposed decoder integrates multi-layer similarity mapping to obtain a more semantically informative segmentation mask and thereby alleviate the problem of low segmentation accuracy due to blurring of the boundaries of the open-pit mining area caused by feature upsampling.

4.3. Qualitative Experimental Results

Figure 4 shows the extraction results of the seven network models on the test image set; segmented images from seven different scenes are selected for analysis, namely Scenes 1 to 7 from left to right. Comparing the overall classification effects, whether against the labeled samples or against the actual open-pit mining areas in the images, all seven models exhibit varying degrees of missed recognition as well as incorrect recognition of the mining areas. Among them, relative to the actual mining area, Segmenter fails to recognize the mining area in most cases, with obvious errors such as failing to recognize the small mining area in Scene 1 and failing to correctly recognize parts of the mining area in Scene 2, Scene 3, Scene 5, and Scene 7. RtFormer fails to identify the small mining area in Scene 1, identifies the small, vegetation-covered area in Scene 3 as a mining area, fails to correctly identify a portion of the mining area in Scene 4, and identifies a small portion of the non-mining area in Scene 7 as a mining area. SegFormer identifies the large vegetation-covered area in Scene 3 as a mining area, fails to identify the small mining area in the center, and identifies the large non-mining area in Scene 7 as a mining area. TopFormer fails to correctly identify the mining area in Scene 2 and Scene 3 on the one hand and identifies the large non-mining area of Scene 7 as a mining area on the other hand. Mask2Former fails to accurately identify the large mining area in Scene 3 and Scene 7. ViT-Adapter generally fails to correctly identify the mining area, e.g., it completely fails to identify the mining area in Scene 1 and Scene 2 and fails to recognize the large-scale mining area in Scene 7, but its recognition results for Scene 4 are the best among the seven models. Overall, Segmenter has poor adaptability to multi-scale targets, resulting in the model not recognizing small open-pit mining areas. RtFormer can only output a fixed-resolution feature map, and it is difficult to realize the segmentation of fine boundaries when the image resolution is not high. TopFormer performs feature enhancement at 64 times the downsampling resolution, which is prone to losing global information and yields poor segmentation accuracy for large-scale mining areas. Mask2Former is unable to obtain complete semantic information when facing complex images, and accurate segmentation fragments are not effectively combined into a complete and coherent object. ViT-Adapter has limited ability to extract and integrate local features to form a global segmentation profile, resulting in the inability to accurately capture the whole shape of the object.
SegMine improves the model’s ability to segment small-scale targets by supplementing multi-scale local spatial features and uses an attention mask decoder to integrate semantic information and reduce pixel classification errors. In Scenes 1 and 5, the open-pit mining areas are small, and the distribution is scattered, yet SegMine effectively retains the detailed information of these targets compared to other models. In Scenes 2, 3, and 6, SegMine utilizes semantic information to generate segmentation masks, which provides clearer target contour boundaries. In Scene 7, the open-pit mining area has a complex environment with obvious terrain undulations, broken and mixed multi-type land cover, fragmented distribution of vegetation, and obvious shadows, which makes it difficult to achieve a clear segmentation of the target contour boundary. However, among all the models, the SegMine segmentation effect performs better and has the ability to deal with the complex open-pit mining area scene.
The previous section mainly compares and analyzes the consistency of the extraction results of the seven models with the label results. In the following section, we continue to analyze the degree of consistency between the SegMine extraction results and the actual environment. It is reasonable to say that accurate labeling is a prerequisite for completing model training with high accuracy and is also the basis for fine-grained prediction in the later stage. However, on the one hand, the actual environment of mining areas is very complex, and usually there are different types of land cover with very different textural characteristics. Especially when there are fuzzy boundaries, it is sometimes very difficult to accurately label the mining area by relying on remote sensing images alone. Even in the same region, the labeling results of two professional technicians will never be exactly the same, and in extreme cases, they may be very different. On the other hand, to a large extent, current sample labeling usually relies on manual work, and high-precision labeling often implies a large amount of manpower input, thus greatly increasing the production cost, which is often unaffordable for specific applications. Therefore, it is impractical and undesirable to build an absolutely accurate sample database for practical applications. The optimal solution should be to let the model absorb the correct knowledge in the imperfect sample and reject the wrong knowledge through continuous learning. The sample database constructed in this study also has many problems. Unsatisfactorily, in Figure 4, some regions are not accurately labeled. When the samples are not rich enough, to a certain extent, there will be conflicted labeling, which can mislead the model training and lead to large differences between prediction and samples. In turn, this leads to inconsistency with the reality. However, in general, among the seven models, the extraction results of SegMine are the closest to the reality and have obvious advantages.
In the following section, we will analyze the extraction results of SegMine for each of the seven scenes. Each scene consists of three images, which, from left to right, are the original image, the labeled image, and the original image with the extraction results overlaid. In the labeled image, the white area is the open-pit mining area, and the black area is the background. In the overlaid image, the brown area is the open-pit mining area, and the green area is the background.
As can be seen in the original image of Scene 1, the upper portion has a small river running from left to right and turning right at 90 degrees downhill in the upper right corner. There is a small area of open-pit mining on the upper-right riverbank. Compared to the original image, the mining area in the labeled image is larger than the actual area, and some parts of the exposed riverbed are also labeled as mining area. In terms of spectral and textural features, the results extracted by SegMine, however, do not recognize the bare riverbed as a mining area, which is more in line with the actual situation.
In Scene 2, there is a small area of mining in the upper part. Compared to the original image, the environment of the mining area is complex, and the lack of green vegetation due to the season causes the boundary of the mining area not to be clear enough. The labeled extent of the mining area in the sample image is inaccurate in local areas, and the technician did not draw along the boundary of the mining area well. In terms of spectral and textural features, the results extracted by SegMine are more in line with the actual conditions; especially in the obvious boundary areas, the extracted boundaries match the actual boundaries very well.
In Scene 3, there is extensive open-pit mining on the left side. Compared to the original image, the boundaries of this mining area are relatively clear, but the extent of the mining area labeled in the sample is not accurate. The two problems of over-extraction (e.g., two small areas covered by vegetation in the top-left and -right) and under-extraction (e.g., strip mining area in the middle-right) coexist. In contrast, SegMine’s extraction results are more consistent with the actual conditions in terms of spectral and textural features, accurately identifying all the mining areas and also accurately excluding all the unmined, vegetation-covered areas.
In Scene 4, there are some mining areas in the upper part which are not yet contiguous. Between the mining areas, there are clearly distributed areas of vegetation cover. However, the samples constructed in this article labeled all the intermediate areas as mining areas. Possibly influenced by the wrong sample, SegMine’s extraction results are basically consistent with the sample labeled region and also identify all the intermediate distributed vegetation cover areas as mining areas. SegMine’s processing here is unreasonable, and further refinement is needed for subsequent studies.
In Scene 5, there are three mining areas in the image, with a road connecting the two mining areas on the right side, and there are obviously no traces of mining on either side of the road. Compared to the original image, the sample labeled this road as a mining area as well, with a width slightly wider than the actual road. Possibly misled by the erroneous sample, SegMine accurately identified the two mining areas but also identified the road connecting them as a mining area, slightly wider than the labeled road, which is unreasonable. Fortunately, the misidentified area is relatively small.
In Scene 6, it can be seen that there is an open-pit mining area in the lower-left. Compared to the original image, the extent of the mining area in the labeled image is inaccurate in local areas, and the boundary line is slightly rough, with excessively long straight-line segments, which is not consistent with the actual mining area boundary. On the other hand, from the spectral and texture features, the results extracted by SegMine are more delicate and match better with the actual boundary, which is more in line with the local reality.
The original image from Scene 7 shows that there is an extensive mining area. Compared to the original image, the internal texture of this mining area is complex, the terrain has obvious undulations, and the shadow influence is obvious. At the same time, the mining area labeled in the sample is obviously too small; the upper part especially omits a large mining area. In terms of spectral and textural features, the results extracted by SegMine are more in line with the actual local reality and are able to identify all the unlabeled mining areas in the sample image.

5. Conclusions

Currently, Transformer has become a new research hotspot, which can parallelize training, can obtain global information, and has the momentum to challenge traditional neural networks. Among them, the performance of ViT in many image classification tasks is directly comparable to that of CNN-based SOTA, which opens up new possibilities in the field of computer vision.
To improve upon the deficiencies of existing models in classifying and extracting open-pit mining areas against complex backgrounds, such as the processing errors for small mining areas, the poor adaptability to multi-scale mining areas, and the sticking of extracted boundaries, a new Transformer-based semantic segmentation model, SegMine, is proposed. SegMine consists of a Vision Transformer-based encoder and a lightweight attention mask decoder. Compared to the Transformer family of models, which are currently outstanding performers in the visual field, this model performs better at classifying and extracting open-pit mining areas, is more friendly to small mining areas, and can extract finer boundary details. In the comparison experiment with seven models, the proposed model had the highest overall performance. Of the four evaluation metrics, SegMine had the highest mIoU (86.91%), precision (89.90%), and Dice (92.27%), and its recall (85.33%) was only slightly lower than that of SegFormer (85.56%). In the detailed processing of seven different scenes, SegMine was able to effectively retain the detail information of fragmented and small-scale targets, achieve clearer contour boundaries, and better resist complex background interference. Considering the Transformer's good parallel computing and global feature acquisition capabilities, as well as SegMine's good performance, it is a promising model for the classification and extraction of open-pit mining areas. In future research, we will pay more attention to determining how to utilize imperfectly labeled sample data to better achieve high-precision intelligent extraction of open-pit mining areas.

Author Contributions

Conceptualization, Q.Q.; data curation, Y.L.; formal analysis, Y.L.; investigation, H.L.; methodology, Q.Q., Y.L. and H.L.; project administration, Q.Q.; resources, H.L. and Y.L.; software, Y.L.; supervision, Q.Q.; validation, Y.L. and H.L.; visualization, Y.L.; writing—original draft, Q.Q.; writing—review and editing, Q.Q., Y.L. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Scientific Research Operating Funds of the Chinese Academy of Surveying and Mapping (No. AR2212, No. AR2203).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We thank Google for providing remote sensing images.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Bauer, M.E. Remote sensing of environment: History, philosophy, approach and contributions, 1969–2019. Remote Sens. Environ. 2020, 237, 111522. [Google Scholar] [CrossRef]
  2. Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. arXiv 2022, arXiv:1807.05713. [Google Scholar] [CrossRef]
  3. Shendryk, Y.; Rist, Y.; Ticehurst, C.; Thorburn, P. Deep learning for multi-modal classification of cloud, shadow and land cover scenes in PlanetScope and Sentinel-2 imagery. ISPRS J. Photogramm. Remote Sens. 2019, 157, 124–136. [Google Scholar] [CrossRef]
  4. Zhou, W.; Ming, D.; Lv, X.; Zhou, K.; Bao, H.; Hong, Z. SO–CNN based urban functional zone fine division with VHR remote sensing image. Remote Sens. Environ. 2020, 236, 111458. [Google Scholar] [CrossRef]
  5. Maxwell, A.E.; Warner, T.A.; Fang, F. Implementation of machine-learning classification in remote sensing: An applied review. Int. J. Remote Sens. 2018, 39, 2784–2817. [Google Scholar] [CrossRef]
  6. Xia, G.S.; Yang, W.; Delon, J.; Yann, G.; Hong, S.; Henri, M. Structural high-resolution satellite image indexing. ISPRS TC VII Symp. 2009, 38, 298–303. Available online: https://hal.science/hal-00458685v1 (accessed on 1 January 2024).
  7. Xia, G.-S.; Liu, G.; Bai, X.; Zhang, L. Texture characterization using shape co-occurrence patterns. IEEE Trans. Image Process. A Publ. IEEE Signal Process. Soc. 2017, 26, 5005–5018. [Google Scholar] [CrossRef]
  8. Anwer, R.M.; Khan, F.S.; van de Weijer, J.; Monlinier, M.; Laaksonen, J. Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification. ISPRS J. Photogramm. Remote Sens. 2018, 138, 74–85. [Google Scholar] [CrossRef]
  9. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. HSF-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161. [Google Scholar] [CrossRef]
  10. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  11. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204. [Google Scholar] [CrossRef]
  12. Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4340–4354. [Google Scholar] [CrossRef]
  13. Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-shaped CNN building instance extraction framework with edge constraint for high-spatial-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6106–6120. [Google Scholar] [CrossRef]
  14. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  15. Belli, D.; Kipf, T. Image-conditioned graph generation for road network extraction. arXiv 2019, arXiv:1910.14388. [Google Scholar] [CrossRef]
  16. Lian, R.; Huang, L. DeepWindow: Sliding window based on deep learning for road extraction from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1905–1916. [Google Scholar] [CrossRef]
  17. Vargas-Munoz, J.E.; Srivastava, S.; Tuia, D.; Falcao, A.X. OpenStreetMap: Challenges and opportunities in machine learning and remote sensing. IEEE Geosci. Remote Sens. Mag. 2020, 9, 184–199. [Google Scholar] [CrossRef]
  18. Zhong, Y.; Han, X.; Zhang, L. Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2018, 138, 281–294. [Google Scholar] [CrossRef]
  19. Gao, Y.; Li, J.; Yang, T.; Deng, W.; Wang, D.; Cheng, H.; Ma, K. Stability analysis of a deep and large open-pit based on fine geological modeling and large-scale parallel computing: A case study of Fushun West Open-pit Mine. Geomat. Nat. Hazards Risk 2023, 14, 2266663. [Google Scholar] [CrossRef]
  20. Zhang, C.; Harrison, P.A.; Pan, X.; Li, H.; Sargent, I.; Atkinson, P.M. Scale Sequence Joint Deep Learning (SS-JDL) for land use and land cover classification. Remote Sens. Environ. 2020, 237, 111593. [Google Scholar] [CrossRef]
  21. Zhou, Y.; Li, J.; Feng, L.; Zhang, X.; Hu, X. Adaptive scale selection for multiscale segmentation of satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3641–3651. [Google Scholar] [CrossRef]
  22. Cao, R.; Tu, W.; Yang, C.; Li, Q.; Liu, J.; Zhu, J.; Zhang, Q.; Li, Q.; Qiu, G. Deep learning-based remote and social sensing data fusion for urban region function recognition. ISPRS J. Photogramm. Remote Sens. 2020, 163, 82–97. [Google Scholar] [CrossRef]
  23. Kemker, R.; Salvaggio, C.; Kanan, C. Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. Remote Sens. 2018, 145, 60–77. [Google Scholar] [CrossRef]
  24. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177. [Google Scholar] [CrossRef]
  25. Ding, L.; Zhang, J.; Bruzzone, L. Semantic segmentation of large-size VHR remote sensing images using a two-stage multiscale training architecture. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5367–5376. [Google Scholar] [CrossRef]
  26. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  27. Sharma, A.; Liu, X.; Yang, X.; Shi, D. A patch-based convolutional neural network for remote sensing image classification. Neural Netw. 2017, 95, 19–28. [Google Scholar] [CrossRef]
  28. Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ. 2020, 245, 111741. [Google Scholar] [CrossRef]
  29. Zhang, X.; Zhou, Y.; Luo, J. Deep learning for processing and analysis of remote sensing big data: A technical review. Big Earth Data 2022, 6, 527–560. [Google Scholar] [CrossRef]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  31. Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Al Dayil, R.; Al Ajlan, N. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  32. Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-stream swin transformer with differentiable sobel operator for remote sensing image classification. Remote Sens. 2022, 14, 1507. [Google Scholar] [CrossRef]
  33. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo–heterogenous transformer learning framework for RS scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2223–2239. [Google Scholar] [CrossRef]
  34. Xiao, W.; Zhang, W.; Lyu, X.; Wang, X. Spatio-temporal patterns of ecological capital under different mining intensities in an ecologically fragile mining area in Western China: A case study of Shenfu mining area. J. Nat. Res 2020, 35, 68–81. [Google Scholar]
  35. Xiang, Y.; Zhao, Y.; Dong, J. Change Detection of Mining Areas in Remote Sensing Imagery Based on Improved UNet Twin Networks. China Coal Soc. 2019, 44, 3773–3780. [Google Scholar]
  36. Wang, C.; Chang, L.; Zhao, L.; Niu, R. Automatic identification and dynamic monitoring of open-pit mines based on improved mask R-CNN and transfer learning. Remote Sens. 2020, 12, 3474. [Google Scholar] [CrossRef]
  37. Zhang, F.; Wu, Y.; Yao, X.; Liang, Z. Opencast mining area intelligent extraction method for multi-source remote sensing image based on improved densenet. Remote Sens. Technol. Appl. 2020, 35, 673–684. [Google Scholar]
  38. Wang, J.; Gou, C.; Wu, Q.; Feng, H.; Han, J.; Ding, E.; Wang, J. Rtformer: Efficient design for real-time semantic segmentation with transformer. arXiv 2022, arXiv:2210.07124. [Google Scholar] [CrossRef]
  39. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar] [CrossRef]
  40. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar] [CrossRef]
  41. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093. [Google Scholar] [CrossRef]
  42. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. arXiv 2021, arXiv:2112.01527. [Google Scholar] [CrossRef]
  43. Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision transformer adapter for dense predictions. arXiv 2022, arXiv:2205.08534. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of SegMine.
Figure 2. The overall flow of the information injection module.
Figure 3. Overall flow of the attention mask module.
Figure 4. Segmentation of seven models on the open-pit mining area dataset.
Table 1. Performance comparison of different Transformer models.

Model               mIoU (%)   Precision (%)   Recall (%)   Dice (%)
Segmenter [39]      85.68      85.29           84.11        90.32
RtFormer [38]       84.23      88.85           81.54        90.59
SegFormer [40]      85.56      85.95           85.56        90.60
TopFormer [41]      85.53      88.77           84.94        91.55
Mask2Former [42]    86.12      84.61           82.68        90.57
ViT-Adapter [43]    85.78      85.63           83.07        90.69
SegMine             86.91      89.90           85.33        92.27

Bolded values indicate the optimal results.
