
HR-MPF: high-resolution representation network with multi-scale progressive fusion for pulmonary nodule segmentation and classification


Accurate segmentation and classification of pulmonary nodules are of great significance to early detection and diagnosis of lung diseases, which can reduce the risk of developing lung cancer and improve patient survival rate. In this paper, we propose an effective network that performs pulmonary nodule segmentation and classification simultaneously, based on an adversarial training scheme. The segmentation network consists of a High-Resolution network with Multi-scale Progressive Fusion (HR-MPF) and a proposed Progressive Decoding Module (PDM) that recovers the final pixel-wise prediction results. Specifically, the proposed HR-MPF incorporates boosted modules into the High-Resolution Network (HRNet) in a progressive feature fusion manner, so that feature communication is augmented among all levels of this high-resolution network. Then, a downstream classification module identifies benign and malignant pulmonary nodules based on the feature map from PDM. In the adversarial training scheme, a discriminator is set to optimize HR-MPF and PDM through back propagation. Meanwhile, a multi-task loss function jointly optimizes the performance of segmentation and classification. To improve the accuracy of boundary prediction, which is crucial to nodule segmentation, a boundary consistency constraint is designed and incorporated in the segmentation loss function. Experiments on the publicly available LUNA16 dataset show that the framework outperforms relevant advanced methods in quantitative evaluation and visual perception.


Timely diagnosis and treatment of lung cancer can reduce mortality rate of patients, and early manifestation of lung cancer is mainly pulmonary nodules [1]. Popularity of Computed Tomography (CT) enables radiologists to diagnose pulmonary nodules with more convenience. However, it is a time-consuming and labor-intensive task for radiologists to make accurate judgements from a number of CT scans. Moreover, the volumes of pulmonary nodules are relatively small, and the shapes are easily confused with surrounding blood vessels in most cases. Nowadays, Computer-Aided Diagnosis (CAD) has been widely used to assist radiologists for diagnosis, and pulmonary nodule segmentation and classification of benign and malignant are crucial steps in CAD systems [2]. The shape, size, growth position and other characteristics can be observed from precise segmentation results of various types of pulmonary nodules in CT images, which also provide a reference for the classification of benign and malignant pulmonary nodules. Therefore, the accurate segmentation and classification crucial in early detection of pulmonary nodules have raised great research value.

In this paper, we design a network based on adversarial training scheme for pulmonary nodule segmentation and classification at one time. High-Resolution network with Multi-scale Progressive Fusion (HR-MPF) is proposed based on High-Resolution Network (HRNet) [3]. In comparison with HRNet, HR-MPF is a high-resolution network reduced to only three stages with several modifications. In HR-MPF, modified boosted module is inserted in the network in multi-scale progressive feature fusion manner to deliver spatial and context information from different resolutions. Specifically, the modified boosted module engages residual blocks that can resolve vanishing gradient problem and promote effective feature learning. Corresponding to the feature extraction network HR-MPF, a Progressive Decoding Module (PDM) is proposed to recover the pixel-wise segmentation prediction from the output of HR-MPF. Then, fused feature map is fed to classification module to distinguish types of pulmonary nodules. In the adversarial training scheme, a discriminator is integrated to optimize HR-MPF and PDM, inspired by the generative adversarial network. Hence, the optimized segmentation module enhances segmentation and classification performance in general. In addition, a simple but practical loss function with boundary consistency constraint is proposed in the segmentation loss. This constraint measures the inconsistency between the boundaries of segmentation prediction and ground truth such that accuracy of boundary prediction is improved. Experiments on LUNA16 dataset show superior results both for segmentation and classification in quantitative evaluation and visual perception.

The main contributions of this paper are as follows:

  • An HR-MPF enhancing feature communication of all scales is proposed in which the boosted module is introduced to HRNet based on multi-scale progressive fusion strategy;

  • A PDM is proposed to recover final pixel segmentation prediction in progressive fusion manner and refine the output from HR-MPF;

  • A discriminator is set to provide additional supervision to the training process of HR-MPF and PDM;

  • A loss function with boundary consistency constraint is proposed that improves the accuracy of boundary segmentation. At the same time, a reasonably designed multi-task loss function jointly optimizes the whole framework.

Related works

Pulmonary nodule segmentation

With the rapid development of deep learning in recent years, Convolutional Neural Networks (CNNs) have been widely applied to medical image processing [4] and computer vision [5]. With feature extraction networks trained in a supervised manner, deep learning-based methods have improved nodule segmentation and classification. Previously, encoder-decoder architectures were widely used in these semantic segmentation tasks. Chen et al. [6] established a framework in which the encoder extracted features of nodules, Atrous Spatial Pyramid Pooling (ASPP) captured multi-scale information through atrous convolution at different atrous rates, and the decoder recovered spatial resolution. In U-Det [7], a Bidirectional Feature Pyramid Network (Bi-FPN) was added between the encoder and decoder for multi-scale feature fusion. The Mish activation function and class weights of masks were also used to improve segmentation accuracy. Other works focus on the context information of pulmonary nodules indicated by the relationship between pixels. For instance, CoLe-CNN [8] accessed the context information of nodules by generating two masks of all background and secondary elements, and introduced an asymmetric loss function that could automatically compensate for errors in annotations of nodules. CDP-ResNet [9] combined the multi-view and multi-scale features of nodules and extracted rich local features and context information. Yan et al. [10] introduced a Mask R-CNN-based segmentation method to deal with the class imbalance problem. At present, more and more studies pay attention to 3D segmentation of pulmonary nodules because spatial structure characteristics are significant for medical diagnosis. Wang et al. [11] adopted a 3D segmentation network to obtain the three-dimensional global features of pulmonary nodules. Sun et al. [12] and Dong et al. [13] took slices of different views in CT scans as input to realize multi-view collaborative learning.

Generative adversarial network-based medical image segmentation

Since medical images involve privacy of patients and labeling is labor-consuming, most of the medical image datasets are limited in quantity and annotation information. Among pulmonary nodule datasets, the number of benign nodules also exceeds malignant nodules. In order to generate more diverse CT images for training, Generative Adversarial Network (GAN) [14] is widely used as a data augmentation method. For example, Conditional Generative Adversarial Network (CGAN) was adopted by Qin et al. [15] to synthesize CT images, and a 3DCNN network with residual units was established for pulmonary nodule segmentation, so that the training speed and segmentation accuracy were greatly improved. Onishi et al. [16] synthesized pseudo-pulmonary nodules on the basis of the three-dimensional regions of pulmonary nodules extracted from CTs. Considering of the diversity of pulmonary nodules, Shi et al. [17] introduced a style-based GAN to synthesize the pulmonary nodules with different styles, and the experiments proved that using augmented samples can obtain more accurate and robust segmentation results.

GAN can not only be used to synthesize images and augment datasets, but also effectively improve the quality of segmentation results. Some researchers [18,19,20] combine a segmentation network, which is regarded as a generator, with a discriminator to process the segmentation task. In the works of Nie et al. [18] and Decourt et al. [19], the generator outputted segmentation predictions and the discriminator outputted confidence maps. The regions with high confidence in the segmentation results were used to further guide the training process of segmentation network. The discriminator in Spine-GAN [20] outputted 0 or 1 representing whether the input was ground truth or prediction result. In some other GAN-based segmentation methods, there is not only a segmentation network, but also a generator. For instance, in Parasitic GAN [21], the segmentation network generated pixel-wise segmentation predictions, the generator synthesized supplementary label maps based on the input random noise, so that the discriminator could learn the more accurate boundaries of ground truth.

Pulmonary nodule classification

Accurate segmentation of pulmonary nodules is of great importance to CAD systems, and the classification of benign and malignant nodules is essential to timely treatment of lung diseases. In order to cope with the varying characteristics of pulmonary nodules, some modifications are made to CNN-based frameworks and applied to nodule classification. Wang et al. [22] proposed a multi-path, multi-scale CNN that is robust to variation in pulmonary nodule volumes and shapes. In STM-Net [2], the scale transfer module and multi-feature fusion operation could enlarge small targets and adapt to images of different resolutions. Multi-planar analyses of pulmonary nodules, which take the features from different planes into consideration, are also adopted in some classification methods. For example, the methods of pulmonary nodule classification proposed by Zhang et al. [23] and Onishi et al. [24] extracted features from CT images in three planes (coronal plane, sagittal plane and axial plane). In addition, researchers consider that the malignancy of a pulmonary nodule is related not only to its morphological characteristics, but also to the patient’s personal condition. Tong et al. [25] put forward an automated pulmonary nodule diagnosis system with a multiple kernel learning algorithm, which combined patients’ age, medical history and other personal information with the shape characteristics of nodules for classification.


Fig. 1

The architecture of our proposed framework for pulmonary nodule segmentation and classification. HR-MPF is a feature extraction network, and PDM generates segmented pulmonary nodules. Classification module differentiates benign and malignant pulmonary nodules. The discriminator jointly optimizes segmentation and classification of pulmonary nodules in which a loss function with boundary consistency constraint is proposed to calculate inconsistency between the boundaries of prediction and ground truth

The architecture of the proposed framework is shown in Fig. 1, which is designed for pulmonary nodule segmentation and classification all at once. In this paper, the proposed HR-MPF with multi-scale progressive fusion scheme is used for segmentation and an adversarial strategy is applied to optimize segmentation module through back propagation. Specifically, the input CT images are preprocessed to assist efficient segmentation of HR-MPF. Meanwhile, through confrontation training between segmentation network and discriminator, the segmentation results can gradually approach to ground truth. To improve boundary pixel prediction accuracy which is a main challenge of pulmonary nodule segmentation, a loss function of boundary consistency constraint is proposed. Finally, the feature map from PDM would be delivered to classification module to discriminate types of the pulmonary nodule (benign/malignant).


As a preliminary step, all images are re-sampled to make the pixel interval uniform, because of the different pixel intervals in different CT scans. The preprocessing process includes three steps: (i) CT scan with an effective value of Hounsfield unit between [− 1000, + 400] is transformed into the range of 0–255 through linear mapping; (ii) lung CT images contain not only lung tissues, but also blood vessels, lung trachea, and other external tissues. However, pulmonary nodules are located inside or at the edge of the lung in most cases, so it is necessary to remove the tissue other than lung parenchyma to reduce interference with pulmonary nodule segmentation. Therefore, we apply a mask to the normalized CT scan for obtaining lung parenchyma shown in Fig. 2; (iii) considering that pulmonary nodule accounts for a small proportion in a whole CT image, a 64\(\times\)64 region of interest (ROI) in the lung parenchyma is cropped, as the part in the red box in Fig. 2, and the cropped images will be input to the proposed feature extraction network.
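The three preprocessing steps can be illustrated with a minimal NumPy sketch. The function names, the clipping of out-of-window values, and the way the crop window is shifted to stay inside the image are assumptions made for illustration; the source specifies only the HU window, the 0–255 target range, the lung mask and the 64\(\times\)64 ROI size.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 400.0  # effective Hounsfield window stated in the text


def window_to_uint8(ct_hu):
    """Linearly map HU values in [-1000, 400] to [0, 255].

    Values outside the window are clipped first (an assumption).
    """
    clipped = np.clip(np.asarray(ct_hu, dtype=np.float64), HU_MIN, HU_MAX)
    scaled = (clipped - HU_MIN) / (HU_MAX - HU_MIN) * 255.0
    return np.round(scaled).astype(np.uint8)


def apply_lung_mask(image, lung_mask):
    """Keep pixels inside the lung parenchyma mask and zero out the rest."""
    return np.where(np.asarray(lung_mask).astype(bool), image, 0).astype(image.dtype)


def crop_roi(image, center_row, center_col, size=64):
    """Crop a size x size ROI around a centre, shifted to stay inside the image."""
    h, w = image.shape
    if h < size or w < size:
        raise ValueError("image is smaller than the requested ROI")
    top = min(max(center_row - size // 2, 0), h - size)
    left = min(max(center_col - size // 2, 0), w - size)
    return image[top:top + size, left:left + size]
```

In this sketch the ROI centre would come from the nodule annotation; how the centre is chosen is not stated in the text.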

Fig. 2

a Lung CT image; b lung parenchyma image. Lung parenchyma image is extracted from CT image in preprocessing

Pulmonary nodule segmentation module

Pulmonary nodule segmentation is performed through the proposed HR-MPF and PDM. As shown in Fig. 3b, the segmentation module of our method adopts classical HRNet (HRNetV2) [3] as baseline, and some boosted modules are merged with multi-scale progressive fusion scheme. By this network architecture, spatial and context information of different resolutions could be comprehensively fused.

Fig. 3

The architecture of the networks: a the HRNetV2 [3]; b the proposed HR-MPF network consists of three stages (blue boxes). This is a multi-scale progressive fusion network with boosted modules incorporated. The fusion is implemented from high-to-low or low-to-high resolution so that the spatial and context information of features of all resolutions are fully engaged. The size of the feature map in each path is also marked

As shown in Fig. 3a, stage 2, stage 3 and stage 4 of HRNetV2 [3] each contain a multi-resolution group convolution and a multi-resolution convolution. In multi-resolution convolution, the resolution of each input subset is adjusted by strided convolution or bilinear up-sampling, and each output subset fuses the features by adding input subsets of the same resolution and channels. This feature fusion manner could merge features from different input subsets, but the pure summation scheme may cause loss of information from different resolutions. Therefore, HR-MPF, which incorporates boosted modules into HRNet in a progressive fusion manner, is designed to reduce the loss of information from different resolutions as far as possible. Then, a corresponding PDM is designed to recover the final pixel-wise predictions, which have the same size as the input images, from the coarse feature maps output by HR-MPF.

1) HR-MPF Low-level features with higher resolution contain much spatial and detailed information, while high-level features are rich in semantic information. Therefore, this study introduces a modified boosted module into the original basic units of HRNet and connects them in a progressive feature fusion manner. In this way, features of different scales can deliver their information to each other.

The proposed HR-MPF extracts features of different resolutions gradually in three stages and maintains a high-resolution representation throughout the network. Different from HRNetV2 with four stages, the size of the input pulmonary nodule image is 64\(\times\)64 such that three stages are enough for feature extraction. As shown in Fig. 3b, HR-MPF starts at a high-resolution stage with only one branch of feature map size 16\(\times\)16, and a branch whose resolution is 1/2 of the lowest resolution in previous stage is added in each stage subsequently. By the progressive fusion strategy applied, semantic information from low-resolution branches and spatial information from high-resolution branches are delivered to each branch with modified boosting method incorporated. Progressive feature fusion is implemented from the highest resolution to the lowest resolution or the lowest to highest. In stage 3 for example, the strided convolution or up-sampling is used for changing the resolution of input to ensure both inputs of the boosted module have the same resolution. Fusion is implemented from high-to-low or low-to-high resolutions so that the output feature map would maximally preserve structural details and semantic information from all three branches. In low-to-high fusion pathway, the 4\(\times\)4 input first undergoes a convolution and up-sampling and is fused with the 8\(\times\)8 input by a boosted module. Then, the fusion result undergoes a convolution and up-sampling, and is finally fused with the 16\(\times\)16 input. Similarly, in the high-to-low fusion pathway, the 16\(\times\)16 input first undergoes a strided convolution for down-sampling so that its resolution becomes 8\(\times\)8. Then, it is fused with the 8\(\times\)8 input with a boosted module, and the output of the boosted module is 8\(\times\)8. The output also undergoes a strided convolution for down-sampling and is finally fused with the 4\(\times\)4 input with a boosted module. 
Ultimately, three feature maps that carry feature information from all scales are conveyed to the proposed PDM for decoding. With this progressive feature fusion strategy, cooperative representations can be promoted [26], and the spatial and context information of multi-scale features is effectively extracted and integrated.
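The resolution flow of the two fusion pathways in stage 3 can be traced with a shape-level NumPy sketch. Everything below is a stand-in chosen for illustration: average pooling replaces the strided convolution, nearest-neighbour repetition replaces convolution plus up-sampling, `boosted_fuse` is a fixed channel average instead of learned residual blocks, and the channel widths (32, 64, 128) are hypothetical because the text gives only the spatial sizes.

```python
import numpy as np


def downsample2(x):
    """Halve the resolution by 2x2 average pooling (stand-in for a strided convolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample2(x):
    """Double the resolution by nearest-neighbour repetition (stand-in for conv + up-sampling)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def boosted_fuse(a, b):
    """Placeholder for the boosted module: concatenate on channels, then mix to a's width."""
    stacked = np.concatenate([a, b], axis=0)
    # Hypothetical fixed averaging weights; the real module is learned.
    weights = np.full((a.shape[0], stacked.shape[0]), 1.0 / stacked.shape[0])
    return np.einsum("oc,chw->ohw", weights, stacked)


def low_to_high(f16, f8, f4):
    """4x4 -> fused with 8x8 -> fused with 16x16."""
    fused8 = boosted_fuse(f8, upsample2(f4))
    return boosted_fuse(f16, upsample2(fused8))


def high_to_low(f16, f8, f4):
    """16x16 -> fused with 8x8 -> fused with 4x4."""
    fused8 = boosted_fuse(f8, downsample2(f16))
    return boosted_fuse(f4, downsample2(fused8))


f16, f8, f4 = np.ones((32, 16, 16)), np.ones((64, 8, 8)), np.ones((128, 4, 4))
print(low_to_high(f16, f8, f4).shape, high_to_low(f16, f8, f4).shape)
```

The sketch only confirms that both inputs of each fusion step share a resolution, as the text requires; it does not model the learned feature transformations.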

Fig. 4

The architecture of the boosted module. The two inputs of the boosted module are first concatenated and passed through two residual blocks. Structure of the residual block is also amplified

The boosting method in our progressive fusion network refines results by fusing the current level of features with another level. The boosted module was initially used in image denoising, but it has also been applied to build a boosted decoder for tasks such as dehazing [27]. The concatenation style of boosted module we use (see Fig. 4) belongs to the “U-Net module”, classified as an alternative to the SOS boosted module used in [27]. Considering the size of the input image, our boosted module is a stack of only two residual blocks, which also reduces the computation. Each residual block consists of two 3\(\times\)3 convolutional layers, batch normalization and ReLU layers. The residual blocks follow the arrangement proposed in [28], in which the batch normalization and ReLU layers are placed before the convolution layers. The residual learning mechanism helps to alleviate the vanishing gradient problem in back propagation as the network becomes deeper.

Fig. 5

Architecture of the proposed high-resolution progressive decoding module (PDM)

2) Progressive decoding module As shown in Fig. 5, a new decoder module, PDM, is designed for HR-MPF to progressively recover integrated feature maps to their original sizes. Different from the representation heads in HRNetV1 [29] and HRNetV2 [3], which recover the final pixel-wise prediction results by direct up-sampling, PDM integrates feature maps by progressive feature fusion among all levels. The generated feature maps with resolutions of 16\(\times\)16, 8\(\times\)8 and 4\(\times\)4 are taken as input, of which the 8\(\times\)8 and 4\(\times\)4 feature maps go through a deconvolution layer and are concatenated with the feature maps of higher resolution (16\(\times\)16 and 8\(\times\)8), respectively. Next, the two concatenated feature maps are separately convolved, and then concatenated together after the 8\(\times\)8 feature map undergoes a deconvolution layer. Finally, a 16\(\times\)16 feature map with 64 channels is obtained and delivered to the classification module to discriminate the types of pulmonary nodule. The optimized segmentation prediction is then passed through DUpsampling for binarization and fed into the discriminator.

Fig. 6

Structure of DUpsampling, scale denotes the up-sampling ratio, W is the inverse projection matrix and N is the dimension of W

3) DUpsampling The last layer of many semantic segmentation networks with encoder–decoder architectures [30, 31] typically uses bilinear up-sampling to recover the final segmentation prediction. However, this data-independent and over-simple bilinear up-sampling may lead to sub-optimal results, since its capability to recover detailed edge and texture features is limited. Our segmentation module adopts DUpsampling [32], a data-dependent up-sampling that considers the correlation among the predictions of pixels. It recovers the pixel-wise prediction from the final high-resolution representation in the PDM. As shown in Fig. 6, DUpsampling has the same effect as applying a 1\(\times\)1 convolutional layer along the spatial dimensions, and the convolutional kernels are stored in a learnable reconstruction matrix W. DUpsampling exploits the redundancy in the label space of pulmonary nodule segmentation, which can be compressed considerably with almost no loss. Two matrices computed in pre-training are suggested in [32]: matrix P compresses segmentation labels by linear projection, and W is the corresponding inverse projection matrix. Two approaches could be selected to compute the segmentation loss in our adversarial framework: one calculates the loss between the coarse outputs of the encoder and the compressed labels, and the other calculates it between the decompressed coarse outputs and the segmentation labels. The PDM takes the second strategy, in which matrix W is used to recover the pixel-wise segmentation prediction from the final 64\(\times\)16\(\times\)16 feature map.
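The decompression step can be sketched in NumPy as a per-pixel matrix product followed by a rearrangement of each output vector into a scale\(\times\)scale block. The function name `dupsample`, the ordering of entries within a block, and the choice of one output class are assumptions for illustration; the source states only that W acts like a 1\(\times\)1 convolution and that a 64\(\times\)16\(\times\)16 map is recovered to the 64\(\times\)64 input size (so the ratio is 4).

```python
import numpy as np


def dupsample(features, W, scale, n_classes):
    """Recover a (n_classes, H*scale, W*scale) prediction from a (C, H, W) feature map.

    W has shape (scale * scale * n_classes, C) and plays the role of the
    inverse projection matrix. The block layout (dy, dx, class) is assumed.
    """
    c, h, w = features.shape
    if W.shape != (scale * scale * n_classes, c):
        raise ValueError("W does not match the feature map and up-sampling ratio")
    # Equivalent to a 1x1 convolution: apply W to the channel vector of every pixel.
    blocks = np.einsum("nc,chw->nhw", W, features)
    blocks = blocks.reshape(scale, scale, n_classes, h, w)
    # Move every scale x scale block to its spatial position in the output grid.
    out = blocks.transpose(2, 3, 0, 4, 1)  # (class, row, dy, col, dx)
    return out.reshape(n_classes, h * scale, w * scale)


rng = np.random.default_rng(0)
coarse = rng.normal(size=(64, 16, 16))        # final PDM feature map
W = rng.normal(size=(4 * 4 * 1, 64))          # random stand-in for the learned matrix
print(dupsample(coarse, W, scale=4, n_classes=1).shape)
```

In training, W would be learned (or initialised from the pre-computed projection) rather than drawn at random as here.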

Pulmonary nodule classification module

Since pulmonary nodules usually occupy a very small region, accurate segmentation benefits classification training. The follow-up classification module classifies the type of pulmonary nodule based on the improved segmentation results from the preceding segmentation module. The input of the classification module has size 64\(\times\)16\(\times\)16 and is transformed into a 128-dimensional feature vector by a global average pooling layer and a 1\(\times\)1 convolutional layer. Finally, the probabilities of benign and malignant are output through a fully connected layer.
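The shape flow of the classification head can be sketched in NumPy. After global average pooling the map is 1\(\times\)1, so the 1\(\times\)1 convolution reduces to a linear layer. The function name `classify`, the absence of an activation between the two layers, and the final softmax are assumptions; the text specifies only the layer types and the 64\(\times\)16\(\times\)16 and 128-dimensional sizes.

```python
import numpy as np


def classify(features, w_conv, b_conv, w_fc, b_fc):
    """Map a (64, 16, 16) feature map to two class probabilities."""
    pooled = features.mean(axis=(1, 2))       # global average pooling -> (64,)
    embedding = w_conv @ pooled + b_conv      # 1x1 conv on a 1x1 map -> (128,)
    logits = w_fc @ embedding + b_fc          # fully connected layer -> (2,)
    exp = np.exp(logits - logits.max())       # numerically stable softmax (assumed)
    return exp / exp.sum()


rng = np.random.default_rng(0)
probs = classify(
    rng.normal(size=(64, 16, 16)),
    rng.normal(size=(128, 64)), np.zeros(128),   # random stand-ins for learned weights
    rng.normal(size=(2, 128)), np.zeros(2),
)
print(probs.shape)
```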


In current segmentation networks, cross-entropy loss and dice loss are commonly used to minimize the differences between prediction result and ground truth. However, a discriminative network can more efficiently guide the learning of segmentation towards desirable results by back propagating mismatches between prediction result and ground truth [33]. Inspired by GAN, the proposed framework takes the segmentation network as generator and adopts a discriminator at the end for adversarial training. In the training process of our discriminator, ground truth or a prediction result of PDM would be input, and the output is a confidence map with the same size as input. As a supervisory signal, confidence map indicates the quality of the segmentation and helps segmentation module to know the regions it can trust during the training [11]. Specifically, the discriminator network is composed of three convolutional layers of 64, 128 and 1 channels, respectively, and convolution kernels of them are all 4\(\times\)4. The structure of discriminator is shown in Fig. 7. The convolution stride of the first two layers is 2, while the stride of the last layer is 1. Batch normalization and ReLU activation function are used after the first two convolutional layers, and the output is scaled to the size of the input through bilinear up-sampling. Finally, the discriminator outputs a 1\(\times\)64\(\times\)64 confidence map, and each pixel represents the probability of ground truth or prediction results from segmentation module [20]. This process is carried out on a pixel-by-pixel basis, which can improve the accuracy of segmentation.

Fig. 7

The structure of the discriminator, whose output is back-propagated to optimize the segmentation modules. The discriminator takes the segmentation predictions from PDM and the ground truth as input alternately

Loss function of networks

In the adversarial training of our method, a segmentation loss with our proposed boundary consistency constraint loss is used for segmentation optimization, and a cross-entropy loss is applied for classification optimization.

1) Segmentation loss Our segmentation task is to classify all pixels of the input images into nodules and background, which is a pixel-level binary classification problem. The loss function \({\mathcal {L}}_{\text {seg}}\) is used to measure the gap between prediction results and ground truth. It adopts cross-entropy loss, which is a practical and effective loss function and often applied in classification problems. In fact, it is a convex function convenient for optimization during training as follows:

$$\begin{aligned} {{\mathcal {L}}_{\text {seg}}} = - \frac{1}{N}\sum \limits _{i,j} {\sum \limits _{c \in C} {y(i,j)\log \left( {{\hat{y}}(i,j)} \right) } }, \end{aligned}$$

where y is the ground truth and \({\hat{y}}\) is the prediction result of the segmentation module. The pixel value at coordinate \((i, j)\) in y is expressed as \(y(i, j)\). N is the total number of pixels in y, and C denotes the set of categories.
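A minimal NumPy sketch of this pixel-wise cross-entropy follows. It assumes that the ground truth is one-hot encoded over the categories and that the prediction holds per-category probabilities, which makes the sum over \(c \in C\) explicit; the clipping constant `eps` is an added numerical safeguard, not part of the formula.

```python
import numpy as np


def segmentation_cross_entropy(y_onehot, y_prob, eps=1e-7):
    """Pixel-wise cross-entropy averaged over the N pixels.

    y_onehot, y_prob: arrays of shape (C, H, W); y_prob sums to 1 over axis 0.
    """
    n_pixels = y_onehot.shape[1] * y_onehot.shape[2]
    log_prob = np.log(np.clip(y_prob, eps, 1.0))  # eps avoids log(0)
    return float(-(y_onehot * log_prob).sum() / n_pixels)


# Two categories (background, nodule) on a 2x2 image with an uninformative prediction.
nodule = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.stack([1.0 - nodule, nodule])
print(segmentation_cross_entropy(y, np.full((2, 2, 2), 0.5)))  # close to log(2)
```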

The adversarial loss function is used to train the segmentation module and fool the discriminator D by maximizing the probability of segmentation prediction, thus the segmentation module output can be closer to the ground truth. Our adversarial loss is defined as:

$$\begin{aligned} {{\mathcal {L}}_{\text {adv}}} = - {\mathbb {E}}[D({\hat{y}})], \end{aligned}$$

where \(D({\hat{y}})\) is the output of the discriminator when the input is a segmentation prediction. As indicated by [34], prediction errors at the boundaries of segmented objects are difficult to handle, and the number of erroneous pixels increases as the distance to the boundary decreases. This means that the prediction of boundary pixels is relatively unreliable in the overall segmentation. Besides, the shapes and sizes of pulmonary nodules vary and most of them are irregular, which also increases the difficulty of boundary prediction. As a result, refining the boundary prediction significantly affects the accuracy of segmentation results.

Fig. 8

Schematic diagram of the boundary extraction method. \(B(i, j)\) is the current pixel, and \(B(i, j +1)\), \(B(i, j -1)\), \(B(i +1, j)\), \(B(i -1, j)\) represent four adjacent pixels

Fig. 9

Pulmonary nodules images, ground truth images and boundary images (top to bottom)

Therefore, this paper applies a loss function with a boundary consistency constraint that calculates the inconsistency between the boundaries of the prediction result and the ground truth, in order to improve boundary pixel segmentation accuracy. Since the segmentation results and ground truth of pulmonary nodules are binary images, we only need to make a judgement on each pixel. If the current pixel \(B(i, j)\) is 1 and one or more of its four adjacent pixels is 0, the pixel is a boundary pixel. Figure 8 is a schematic diagram of the boundary extraction method, in which \(B(i, j)\) is the current pixel, and \(B(i, j+1)\), \(B(i, j-1)\), \(B(i+1, j)\), \(B(i-1, j)\) are the four adjacent pixels. Figure 9 shows several boundary images obtained by the proposed method.

After the boundaries of the prediction result \(B_{\text {pre}}\) and the ground truth \(B_{\text {gt}}\) are obtained, the loss function with boundary consistency constraint \({\mathcal {L}}_b\) is calculated by:

$$\begin{aligned} {{\mathcal {L}}_b} = \frac{{\sum \limits _{i,j} {\left| {{B_{\text {pre}}}(i,j) - {B_{\text {gt}}}(i,j)} \right| } }}{{\sum \limits _{i,j} {{B_{\text {pre}}}(i,j)} + \sum \limits _{i,j} {{B_{\text {gt}}}(i,j)} }}. \end{aligned}$$
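The boundary extraction rule and \({\mathcal {L}}_b\) can be sketched in NumPy as follows. Two choices here are assumptions not fixed by the text: neighbours that fall outside the image are treated as background, and a small `eps` is added to the denominator so that two empty boundaries do not cause a division by zero.

```python
import numpy as np


def extract_boundary(mask):
    """Mark pixels that are 1 and have at least one 4-neighbour equal to 0."""
    m = (np.asarray(mask) > 0).astype(np.uint8)
    # Assumption: pixels outside the image count as background (0).
    padded = np.pad(m, 1, mode="constant", constant_values=0)
    up, down = padded[:-2, 1:-1], padded[2:, 1:-1]
    left, right = padded[1:-1, :-2], padded[1:-1, 2:]
    all_neighbours_fg = up & down & left & right
    return (m & (1 - all_neighbours_fg)).astype(np.uint8)


def boundary_loss(pred_mask, gt_mask, eps=1e-7):
    """Boundary consistency loss: |B_pre - B_gt| summed, over sum(B_pre) + sum(B_gt)."""
    b_pre = extract_boundary(pred_mask).astype(np.float64)
    b_gt = extract_boundary(gt_mask).astype(np.float64)
    return float(np.abs(b_pre - b_gt).sum() / (b_pre.sum() + b_gt.sum() + eps))


gt = np.zeros((5, 5), dtype=np.uint8)
gt[1:4, 1:4] = 1                      # 3x3 nodule: 8 boundary pixels, 1 interior pixel
print(extract_boundary(gt))
print(boundary_loss(gt, gt))          # identical boundaries give zero loss
```

The loss is 0 when the two boundaries coincide and approaches 1 when they share no pixels. The sketch operates on binarized masks as described; how gradients are propagated through the binarization is not covered here.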

In order to improve the performance of pulmonary nodule segmentation, two consistency constraints are introduced in this study. The loss functions \({\mathcal {L}}_{\text {seg}}\) and \({\mathcal {L}}_b\) measure the consistency of pixels and boundaries between the prediction results and ground truth, respectively. Therefore, the loss function needs to be minimized in the training process of the segmentation module, and it is given by:

$$\begin{aligned} {{\mathcal {L}}_S} = {{\mathcal {L}}_{\text {seg}}} + {\lambda _1}{{\mathcal {L}}_{\text {adv}}} + {\lambda _2}{{\mathcal {L}}_b}, \end{aligned}$$

where \({\mathcal {L}}_{\text {seg}}\) is the segmentation loss, \({\mathcal {L}}_{\text {adv}}\) is the adversarial loss, and \({\mathcal {L}}_b\) is the loss function of the boundary consistency constraint. \({\lambda }_1\) and \({\lambda }_2\) are weights that balance \({{\mathcal {L}}_{\text {adv}}}\) and \({{\mathcal {L}}_b}\).

2) Classification loss Since the classification task judges the type of nodule as benign or malignant, we use cross-entropy usually applied for binary classification of discrete target variables to determine how close the actual output is to the expected. Therefore, the loss of classification training could be defined as:

$$\begin{aligned} {{\mathcal {L}}_{\text {cls}}} = - \sum \limits _{k = 1}^n {{p_k}\log {{{\hat{p}}}_k}}, \end{aligned}$$

where \({p_k} \in \{ 0,1\}\) is the ground truth: \({p_k} = 0\) represents a benign nodule, while \({p_k} = 1\) represents a malignant nodule. \({{\hat{p}}_k} \in [0,1]\) is the prediction of the classification module, n is the batch size, and k is the index of samples in a batch.

Since this network architecture achieves segmentation and classification in one model, a multi-task loss function is used for joint optimization as follows:

$$\begin{aligned} {{\mathcal {L}}_{\text {total}}} = {{\mathcal {L}}_S} + {\lambda _3}{{\mathcal {L}}_{\text {cls}}} = {{\mathcal {L}}_{\text {seg}}} + {\lambda _1}{{\mathcal {L}}_{\text {adv}}} + {\lambda _2}{{\mathcal {L}}_b} + {\lambda _3}{{\mathcal {L}}_{\text {cls}}}. \end{aligned}$$

The values of \({\lambda }_1\), \({\lambda }_2\) and \({\lambda }_3\) will be discussed in Section 4.3. Figure 10 displays the loss function mentioned in (6), which indicates the convergence performance of the proposed method.
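The composition of the objective can be written out as a short Python sketch. The function names are hypothetical and the weights in the usage line are placeholders chosen only to show the arithmetic; they are not the values used in the experiments.

```python
def segmentation_objective(l_seg, l_adv, l_b, lambda_1, lambda_2):
    """L_S = L_seg + lambda_1 * L_adv + lambda_2 * L_b."""
    return l_seg + lambda_1 * l_adv + lambda_2 * l_b


def total_objective(l_seg, l_adv, l_b, l_cls, lambda_1, lambda_2, lambda_3):
    """L_total = L_S + lambda_3 * L_cls."""
    return segmentation_objective(l_seg, l_adv, l_b, lambda_1, lambda_2) + lambda_3 * l_cls


# Placeholder loss values and weights, for illustration only.
print(total_objective(1.0, 2.0, 3.0, 4.0, lambda_1=0.5, lambda_2=0.1, lambda_3=2.0))
```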

Fig. 10

Curve for \({\mathcal {L}}_{total}\) of the proposed model as number of epochs increases during training stage

3) Loss function of discriminator The training of a GAN is a process of confrontation between generator and discriminator, and here the segmentation module that outputs segmentation results is regarded as the generator. The objective of the discriminator is to accurately determine the source of its input, which drives the output of the generator towards the real data distribution during the confrontation training. Therefore, the design of a reasonable discriminator loss function is also crucial to the training process of the GAN. WGAN [35] improves the traditional GAN loss function and addresses the problems of unstable training and mode collapse. In our method, the discriminator loss follows that of WGAN and is defined as follows:

$$\begin{aligned} {{\mathcal {L}}_D} = - {\mathbb {E}}[D(y)] + {\mathbb {E}}[D({\hat{y}})], \end{aligned}$$

where D(y) is the output of the discriminator when the input is ground truth. Minimizing \({\mathcal {L}}_D\) maximizes \({\mathbb {E}}[D(y)]\) when the input is ground truth and minimizes \({\mathbb {E}}[D({\hat{y}})]\) when the input is a segmentation result. The smaller \({\mathcal {L}}_D\) is, the smaller the Wasserstein distance between the real distribution and the generated distribution is. After each update of the discriminator parameters, their absolute values are clipped to no more than a fixed constant c, which is set to 0.01 in our experiments.
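The two adversarial terms and the parameter clipping can be sketched in NumPy. The expectations are approximated by means over the discriminator outputs; the function names and the representation of parameters as a list of arrays are assumptions for illustration, while c = 0.01 is the value stated above.

```python
import numpy as np

CLIP_C = 0.01  # clipping constant c stated in the text


def discriminator_loss(d_real, d_fake):
    """L_D = -E[D(y)] + E[D(y_hat)], with expectations taken as means."""
    return float(-np.mean(d_real) + np.mean(d_fake))


def adversarial_loss(d_fake):
    """L_adv = -E[D(y_hat)], minimised by the segmentation module."""
    return float(-np.mean(d_fake))


def clip_parameters(params, c=CLIP_C):
    """Clip every discriminator parameter to [-c, c] after an update."""
    return [np.clip(p, -c, c) for p in params]


d_real = np.full((1, 64, 64), 0.8)   # confidence map for a ground-truth input
d_fake = np.full((1, 64, 64), 0.2)   # confidence map for a predicted mask
print(discriminator_loss(d_real, d_fake), adversarial_loss(d_fake))
print(clip_parameters([np.array([-1.0, 0.005, 3.0])])[0])
```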

Results and discussion

Datasets and evaluation metrics

In this paper, we employ the publicly available LUNA16 dataset to evaluate all methods. LUNA16 is derived from the LIDC-IDRI dataset [36] by excluding CT scans with slice thickness larger than 3 mm and pulmonary nodules with diameters less than 3 mm. It contains 888 CT scans in which 1186 pulmonary nodules are annotated by at least three radiologists. The degree of malignancy of each pulmonary nodule is rated with a score of \(1\sim 5\); the higher the score, the higher the degree of malignancy. Nodules with a mean score of 1 or 2 are classified as benign, those with a mean score of 4 or 5 are classified as malignant, and those with a mean score of 3 are ignored [23]. Ultimately, 835 nodules are obtained, including 539 benign nodules and 296 malignant nodules. The dataset is then augmented by flip, rotation and transpose operations, which enlarges it by 6 times. Together with the original samples, there are a total of 5845 nodules, of which 4968 are used for training (3207 benign nodules, 1761 malignant nodules) and 877 are used for testing (566 benign nodules, 311 malignant nodules).
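The labeling rule and the sample counts above can be checked with a short Python sketch. The function name is hypothetical, and the handling of non-integer mean scores (below 3 as benign, above 3 as malignant) is an assumption, since the text only lists integer means.

```python
def malignancy_label(mean_score):
    """Map a mean radiologist score (1 to 5) to a class label.

    Assumption: means strictly below 3 count as benign and means strictly
    above 3 count as malignant; the text only lists integer means.
    """
    if mean_score < 3:
        return "benign"
    if mean_score > 3:
        return "malignant"
    return None  # a mean score of 3 is ignored


# Sample counts reported in the text.
original = {"benign": 539, "malignant": 296}
train = {"benign": 3207, "malignant": 1761}
test = {"benign": 566, "malignant": 311}

total_original = sum(original.values())
total_augmented = sum(train.values()) + sum(test.values())
# Ratio between the final and the original dataset size.
copies_per_nodule = total_augmented // total_original
```

The ratio of 7 is consistent with each nodule being kept once in its original form plus six augmented versions, and the same ratio holds separately for the benign and malignant classes.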

Several standard metrics, namely Accuracy (Acc), Dice Similarity Coefficient (DSC), Mean Intersection over Union (MIoU), Precision (Prec), Sensitivity (SE) and Specificity (SP), are used to validate the performance of our method. The less common criterion, MIoU, is defined as:

$$\begin{aligned} \text {MIoU} = \left( \frac{\text {TP}}{\text {FN} + \text {TP} + \text {FP}} + \frac{\text {TN}}{\text {FP} + \text {TN} + \text {FN}}\right) /2, \end{aligned}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. For the pulmonary nodule segmentation task, the nodular area is positive and the non-nodular area is negative. For the pulmonary nodule classification task, benign nodules are defined as negative and malignant nodules as positive.
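A minimal Python sketch of these metrics computed from the four confusion counts is given below. Only MIoU is spelled out in the text; the other formulas are the standard definitions and are assumed to match the ones used in the experiments. The function name and the example counts are hypothetical, and all denominators are assumed to be nonzero.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return Acc, Prec, SE, SP, DSC and MIoU from confusion counts."""
    iou_pos = tp / (tp + fp + fn)  # IoU of the positive (nodule) class
    iou_neg = tn / (tn + fp + fn)  # IoU of the negative (background) class
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Prec": tp / (tp + fp),
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "MIoU": (iou_pos + iou_neg) / 2,
    }


# Hypothetical pixel counts for one segmentation result.
m = confusion_metrics(tp=80, tn=900, fp=10, fn=10)
```

Because background pixels dominate, Acc in this example is 0.98 while the positive-class IoU is only 0.8, which is why region-overlap metrics such as MIoU and DSC are more informative for small nodules.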

Implementation details

The proposed method is implemented on the PyTorch platform with Python 3.6. Training and validation of the overall network are performed on a computer with an Intel(R) Core(TM) i7-10700 CPU (2.9 GHz), 16 GB of RAM, and an NVIDIA GeForce RTX 2070 SUPER GPU with 8 GB of memory.

During the experiments, the size of all input images is \(64\times 64\), and the number of epochs and the batch size are set to 80 and 16, respectively. AdaBelief (learning rate=\(2.5 \times 10 ^ {-4}\), eps=\(10 ^ {-6}\), betas=(0.5, 0.999)) is used as the optimization algorithm of the segmentation network, since it converges quickly, reaches high accuracy and remains stable when training a GAN [37]. In addition, the optimizer of the discriminator is RMSProp (learning rate=0.002, eps=\(10 ^ {-8}\)).
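For reference, the reported settings can be collected in a plain Python dictionary, from which the number of optimization steps follows. This is only a bookkeeping sketch: the dictionary keys are hypothetical names, and the step count assumes that the 4968 training nodules are all used in every epoch and that the last, partially filled batch is kept.

```python
import math

# Hyperparameters as reported in the text; the key names are illustrative.
config = {
    "input_size": (64, 64),
    "epochs": 80,
    "batch_size": 16,
    "segmentation_optimizer": {
        "name": "AdaBelief", "lr": 2.5e-4, "eps": 1e-6, "betas": (0.5, 0.999),
    },
    "discriminator_optimizer": {"name": "RMSProp", "lr": 2e-3, "eps": 1e-8},
}

train_samples = 4968  # training nodules after augmentation
# Assumption: the final incomplete batch is not dropped.
steps_per_epoch = math.ceil(train_samples / config["batch_size"])
total_steps = steps_per_epoch * config["epochs"]
```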

Parameter setting

Fig. 11

Influence of the parameter values on experimental results: a \({\lambda }_2=1\), \({\lambda }_3=0.1\); b \({\lambda }_1=0.1\), \({\lambda }_3=0.1\); c \({\lambda }_1=0.1\), \({\lambda }_2=1\)

Since the method performs pulmonary nodule segmentation and classification at one time, the total loss function \({\mathcal {L}}_{\text {total}}\) defined in (6) is balanced by three trade-off parameters. Specifically, \({\lambda }_1\), \({\lambda }_2\) and \({\lambda }_3\) control the weights of the adversarial loss \({\mathcal {L}}_{\text {adv}}\), the boundary consistency constraint \({\mathcal {L}}_b\) and the classification loss \({\mathcal {L}}_{\text {cls}}\), respectively. To evaluate how each individual loss contributes to the total loss, we fix any two parameters and vary the third, measuring MIoU and DSC for the segmentation task and Acc and SE for the classification task. As shown in Fig. 11, the results vary less with the weight of the adversarial loss than with the other two weights, as expected. The most noticeable change is that SE drops from 0.9481 to 0.9365 as \({\lambda }_1\) increases from 0.2 to 0.3. By contrast, a larger \({\lambda }_3\) for \({\mathcal {L}}_{\text {cls}}\) seems to decrease both segmentation and classification performance, although the quantitative results fluctuate only slightly. In addition, a larger weight on the boundary loss generally yields slightly better segmentation results. The classification results show a similar trend: Acc peaks at 0.9768 when \({\lambda }_2\) is 0.6, and SE peaks at 0.9440 when \({\lambda }_2\) is 1. In general, the quantitative results fluctuate by no more than about 0.01 over a wide range of values for each parameter. Based on the results for both segmentation and classification, \({\lambda }_1\), \({\lambda }_2\) and \({\lambda }_3\) are set to 0.1, 1, and 0.1, respectively.
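The one-parameter-at-a-time procedure can be sketched as follows. The helper name and the candidate values in `grid` are hypothetical, since the exact values examined appear only in Fig. 11; the defaults are the final weights chosen above.

```python
def one_at_a_time_sweep(defaults, grid):
    """Return settings that vary one weight while the others keep defaults."""
    runs = []
    for name, values in grid.items():
        for value in values:
            setting = dict(defaults)  # copy so the defaults stay untouched
            setting[name] = value
            runs.append((name, setting))
    return runs


# Final weights reported in the text.
defaults = {"lambda1": 0.1, "lambda2": 1.0, "lambda3": 0.1}
# Hypothetical candidate values, for illustration only.
grid = {"lambda1": [0.1, 0.2, 0.3], "lambda2": [0.6, 1.0], "lambda3": [0.1, 0.2]}
runs = one_at_a_time_sweep(defaults, grid)
```

Each entry of `runs` would correspond to one full training run, after which MIoU, DSC, Acc and SE are recorded for the varied parameter.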

Evaluation on loss function \({\mathcal {L}}_b\)

Fig. 12

From top to bottom: original CT images, ground truth, and segmentation results of the proposed segmentation framework without \({\mathcal {L}}_b\) and with \({\mathcal {L}}_b\), respectively. Images a–e refer to five samples

To verify the effectiveness of the proposed boundary loss function \({\mathcal {L}}_b\), experiments on five instances are conducted with and without \({\mathcal {L}}_b\), as shown in Fig. 12. Notably, the results obtained with \({\mathcal {L}}_b\) present outlines much closer to the ground truth (for example, w/ \({\mathcal {L}}_b\) in Fig. 12a, c). Quantitative results of the segmentation task in terms of Acc, SE, Prec, MIoU and DSC are reported in Fig. 13. In all cases, the boundary loss contributes to an increase in the quantitative results to some extent. Segmentation appears to benefit noticeably, since greater attention to boundary areas contributes directly to segmentation accuracy.

Fig. 13

Comparison of Acc, SE, Prec, MIoU and DSC on segmentation by the proposed framework with (w/) and without (w/o) \({\mathcal {L}}_b\). Graphs a–e correspond to the same images illustrated in Fig. 12

Evaluation on pulmonary nodule segmentation

To evaluate the segmentation performance of the proposed network for pulmonary nodules, we compare it with nine segmentation algorithms based on deep neural networks: ENet [38], SegNet [30], PSPNet [39], UNet++ [40], DeepLabV3+ [31], DFANet [41], Fast SCNN [42], FANet [43] and SPNet [44]. The implementation of these algorithms is based on publicly available code. As shown in Fig. 14, DeepLabV3+ and UNet++ achieve relatively better performance among the comparative approaches. For DeepLabV3+, the reason might be that atrous convolutions capture multi-scale information at different atrous rates, and that the encoder–decoder architecture extracts features and recovers spatial resolution better. In UNet++, the encoder and decoder are connected through nested and dense skip pathways, which reduces the semantic gap between feature maps. However, the segmentation results of the proposed model are more precise than those of the comparative methods because of the progressive fusion strategy and the loss function with the boundary consistency constraint. This is especially visible for nodules with complex boundaries, where the proposed network locates the exact region with boundaries very similar to the ground truth (e.g. the 5th and 8th images of Fig. 14).

Fig. 14

Original images and segmentation results of pulmonary nodules. From top to bottom: original CT images, ground truth, SegNet [30], ENet [38], PSPNet [39], DeepLabV3+ [31], UNet++ [40], Fast SCNN [42], DFANet [41], FANet [43], SPNet [44], Ours

Table 1 Quantitative results of pulmonary nodule segmentation in which highlighted values represent the best result of each metric

Quantitative segmentation results on the LUNA16 dataset obtained by the proposed HR-MPF and recent deep learning-based models are shown in Table 1. Unlike Acc and SE in the parameter experiment, which measure classification, Acc and SE in Table 1 measure segmentation results. The Acc values of all methods differ only slightly and stay at a relatively high level. This is mainly because the non-nodular area makes up the majority of a CT image and is easily classified as non-nodular. MIoU and DSC both measure region overlap with the ground truth, and DeepLabV3+, UNet++ and SPNet all show relatively competitive results, as in the visual comparison. In detail, DeepLabV3+ achieves 0.9222 and 0.9204, and UNet++ achieves even better results at 0.9263 and 0.9227 for MIoU and DSC, respectively. The proposed HR-MPF with progressive feature fusion and the adversarial training scheme achieves slightly better results than the comparative methods in all aspects. The proposed method locates the area of nodules accurately, as MIoU and DSC both stay above 0.937. In addition, the proposed method appears to be particularly sensitive, tending to flag suspicious areas rather than miss them. In general, the proposed architecture is verified to be effective in improving performance for pulmonary nodule segmentation, while DeepLabV3+, UNet++ and SPNet also achieve competitive results.

Table 2 Comparison of parameters (M), training time (min) and FLOPs (G) between our method and other segmentation networks, in which highlighted values represent the best result of each criterion

The computational cost of the proposed method is evaluated in terms of parameters, training time and FLOPs [45], as shown in Table 2. The input size for all comparative methods is set to the same size (\(64\times 64\)) and all experiments are run under the same conditions. Fast SCNN uses the least training time (38 min) and has the lowest model complexity as indicated by FLOPs (0.02G). However, ENet has the fewest parameters (0.36M), although the number of parameters of Fast SCNN also stays at a low level (1.2M). The model complexity of the proposed method stays at a moderate level (3.46M parameters, 0.16G FLOPs), but a relatively long training time is required due to the adversarial training scheme introduced for better results.

Evaluation on pulmonary nodule classification

Fig. 15

Acc curve of classification results by the proposed framework with respect to epochs

Fig. 16

PR curve of pulmonary nodule classification, which plots precision against recall

Figure 15 shows how classification accuracy changes over the training epochs. The Acc curve rises relatively quickly and closely approaches 1 by epoch 30. Therefore, the training of the classifier can be regarded as convergent and stable. Figure 16 illustrates the precision–recall curve, which indicates the influence of the decision threshold on classifier performance. The precision–recall curve stays close to the top-right corner, and the area under curve (AUC) reaches 0.9646.
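To make the construction of a precision–recall curve concrete, the sketch below sweeps the decision threshold over a handful of made-up classifier scores and integrates the curve with the trapezoidal rule. The text does not state how the area was computed, so the trapezoidal rule and the starting point (recall 0, precision 1) are assumptions, and both function names are hypothetical.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_pos = sum(labels)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = malignant, 0 = benign) and malignancy scores.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
points = precision_recall_points(labels, scores)
auc = area_under_curve([(0.0, 1.0)] + points)
```

A curve that hugs the top-right corner keeps precision high while recall grows, which is what pushes the area toward 1.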

Table 3 Quantitative results of pulmonary nodule classification in which highlighted values represent the best result of each metric

To evaluate the classification performance of the proposed method, we compare it with five classification networks in terms of Acc, SE, SP and AUC, as listed in Table 3. The evaluation results are taken from the cited papers, and the results for each metric differ noticeably. However, the values of SP are generally greater than those of SE. This might be because most malignant nodules have more prominent shape characteristics than benign pulmonary nodules, such as a more irregular shape and a larger volume, as shown in Fig. 17. Zuo et al. [46] achieve outstanding results, with Acc and SE both over 0.97. The network established by Zuo et al. only conducts classification and uses a multi-resolution convolutional neural network to extract features. The proposed method also achieves competitive results among the comparative methods. Specifically, it achieves the highest Acc and SP values (0.9768 and 0.9789), because a framework that jointly conducts segmentation and classification can capture malignant nodules well when they have prominent shape characteristics. Overall, the proposed method for joint segmentation and classification provides relatively competitive results among methods designed for classification only.

Fig. 17

Some samples of benign (the first row) and malignant (the second row) pulmonary nodules in classification results

Ablation study

The proposed HR-MPF is built on HRNetV2 [3]. To verify the effectiveness of the improvements introduced in HR-MPF, comparative experiments with HRNetV1 [29] and HRNetV2 [3] are conducted. HRNetV1 and HRNetV2, which originally have four stages, are reduced to three stages for a fair comparison with our network. As shown in Table 4, HRNetV2 achieves slightly higher results than HRNetV1 for all metrics. In comparison, the proposed HR-MPF brings an even more noticeable increase, especially on MIoU (0.9286) and DSC (0.9252). This experiment verifies the advantage of the proposed progressive fusion architecture with the boosted module.

Table 4 Comparison of our method with other versions of HRNet, the highlighted values represent the best results
Table 5 Quantitative results on segmentation and classification by different arrangements of feature fusion in HR-MPF
Fig. 18

Different arrangements of feature fusion in HR-MPF. a–c provide an example of the \(16\times 16\) feature fusion pathway in Stage3 (Fig. 3)

We also compare different architectures of the proposed HR-MPF with multi-scale progressive fusion. Because progressive fusion fuses feature maps of different resolutions with a boosted module whose inputs have the same resolution, up-sampling and convolution need to precede the boosted module. As shown in Fig. 18, the three available combinations are: (a) up-sampling, then convolution + Batch Normalization, followed by a boosted module; (b) up-sampling, then a boosted module, followed by convolution + Batch Normalization; (c) ours, convolution + Batch Normalization, then up-sampling, followed by a boosted module. As shown in Table 5, the change of architecture has a mild effect on both segmentation and classification performance, while the number of model parameters stays the same. The quantitative results of (a) and (c) are similar, which indicates that the order of convolution and up-sampling affects the results only slightly. However, the scores of (b) are relatively lower than those of (a) and (c). In architecture (b), an individual up-sampling placed directly before the boosted module may not guarantee an explicit and constrained fusing process of the two inputs of the boosted module [27].

Table 6 Ablation study of each module in the whole framework, the highlighted values represent the best results

To verify the effectiveness of each module in the whole framework, several ablation experiments are conducted and the results are recorded in Table 6, where S represents only the segmentation network (including HR-MPF and PDM), C denotes the classification module and D stands for the discriminator. The ablation model “S+C” represents the proposed framework without the discriminator and the adversarial loss in Eq. (6). The experimental results show that performing segmentation and classification together does not affect the segmentation results significantly. However, the discriminator we introduce improves both segmentation and classification results. For instance, segmentation Prec rises from 0.9353 to 0.9427 and classification Acc increases from 0.9650 to 0.9768. This may be because the adversarial training mechanism guides the training of HR-MPF and PDM so that the generated segmentation results are closer to the ground truth. The more accurate segmentation in turn improves the subsequent classification results. Therefore, the architecture with two sub-networks and the adversarial training strategy is verified to be effective.


In this paper, an effective multi-task framework is designed for pulmonary nodule segmentation and classification, which can contribute to the clinical diagnosis of pulmonary nodules. Specifically, a widely applicable feature extraction network, HR-MPF, is proposed. This architecture introduces a progressive fusion strategy into HRNet by incorporating modified boosted modules. A corresponding PDM is also designed, which decodes the predictions from HR-MPF and recovers the final pixel-wise segmentation predictions in a progressive fusion manner. Then, a feature map from PDM is fed into the classification module to determine whether a pulmonary nodule is benign or malignant. Joint training of pulmonary nodule segmentation and classification is realized with a discriminator and a reasonably designed multi-task loss function. In particular, a boundary consistency constraint is included in the segmentation loss, which further enhances the boundary accuracy that is crucial in pulmonary nodule segmentation tasks. Compared with the latest methods designed for segmentation or classification individually, the proposed method shows superior results in segmentation and competitive performance in classification.

Availability of data and materials

The dataset and materials can be downloaded from the provided hyperlinks.





Abbreviations

GAN: Generative adversarial network

HR-MPF: High-resolution network with multi-scale progressive fusion

PDM: Progressive decoding module

HRNet: High-resolution network

CT: Computed tomography

CAD: Computer-aided diagnosis

DSC: Dice similarity coefficient

MIoU: Mean intersection over union

TP: True positive

TN: True negative

FP: False positive

FN: False negative

AUC: Area under curve


  1. S. Blandin Knight, P.A. Crosbie, H. Balata, J. Chudziak, T. Hussell, C. Dive, Progress and prospects of early detection in lung cancer. Open Biol. 7(9), 170070 (2017)
  2. J. Zheng, D. Yang, Y. Zhu, W. Gu, B. Zheng, C. Bai, L. Zhao, H. Shi, J. Hu, S. Lu et al., Pulmonary nodule risk classification in adenocarcinoma from CT images using deep CNN with scale transfer module. IET Image Process. 14(8), 1481–1489 (2020)
  3. K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, J. Wang, High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514 (2019)
  4. S. Baghersalimi, B. Bozorgtabar, P. Schmid-Saugeon, H.K. Ekenel, J.-P. Thiran, DermoNet: Densely linked convolutional neural network for efficient skin lesion segmentation. EURASIP J. Image Video Process. 2019(1), 1–10 (2019)
  5. R. Miao, Y. Gao, L. Ge, Z. Jiang, J. Zhang, Online defect recognition of narrow overlap weld based on two-stage recognition model combining continuous wavelet transform and convolutional neural network. Comput. Ind. 112, 103115 (2019)
  6. S. Chen, Y. Wang, Pulmonary nodule segmentation in computed tomography with an encoder-decoder architecture. In: International Conference on Information Technology in Medicine and Education (ITME), pp. 157–162 (2019)
  7. N.V. Keetha, C.S.R. Annavarapu, et al., U-Det: A modified U-Net architecture with bidirectional feature network for lung nodule segmentation. arXiv preprint arXiv:2003.09293 (2020)
  8. G. Pezzano, V.R. Ripoll, P. Radeva, CoLe-CNN: Context-learning convolutional neural network with adaptive loss function for lung nodule segmentation. Comput. Meth. Prog. Bio. 198, 105792 (2021)
  9. H. Liu, H. Cao, E. Song, G. Ma, X. Xu, R. Jin, Y. Jin, C.-C. Hung, A cascaded dual-pathway residual network for lung nodule segmentation in CT images. Phys. Medica 63, 112–121 (2019)
  10. H. Yan, H. Lu, M. Ye, K. Yan, Y. Xu, Q. Jin, Improved Mask R-CNN for lung nodule segmentation. In: International Conference on Information Technology in Medicine and Education (ITME), pp. 137–141 (2019)
  11. W. Wang, R. Feng, J. Chen, Y. Lu, T. Chen, H. Yu, D.Z. Chen, J. Wu, Nodule-plus R-CNN and deep self-paced active learning for 3D instance segmentation of pulmonary nodules. IEEE Access 7, 128796–128805 (2019)
  12. Y. Sun, J. Tang, W. Lei, D. He, 3D segmentation of pulmonary nodules based on multi-view and semi-supervised. IEEE Access 8, 26457–26467 (2020)
  13. X. Dong, S. Xu, Y. Liu, A. Wang, M.I. Saripan, L. Li, X. Zhang, L. Lu, Multi-view secondary input collaborative deep learning for lung nodule 3D segmentation. Cancer Imaging 20(1), 1–13 (2020)
  14. I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014)
  15. Y. Qin, H. Zheng, X. Huang, J. Yang, Y.-M. Zhu, Pulmonary nodule segmentation with CT sample synthesis using adversarial networks. Med. Phys. 46(3), 1218–1229 (2019)
  16. Y. Onishi, A. Teramoto, M. Tsujimoto, T. Tsukamoto, K. Saito, H. Toyama, K. Imaizumi, H. Fujita, Investigation of pulmonary nodule classification using multi-scale residual network enhanced with 3DGAN-synthesized volumes. Radiol. Phys. Technol. 13(2), 160–169 (2020)
  17. H. Shi, J. Lu, Q. Zhou, A novel data augmentation method using style-based GAN for robust pulmonary nodule segmentation. In: 2020 Chinese Control and Decision Conference (CCDC), pp. 2486–2491 (2020)
  18. D. Nie, Y. Gao, L. Wang, D. Shen, ASDNet: Attention based semi-supervised deep networks for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 370–378 (2018)
  19. C. Decourt, L. Duong, Semi-supervised generative adversarial networks for the segmentation of the left ventricle in pediatric MRI. Comput. Biol. Med. 123, 103884 (2020)
  20. Z. Han, B. Wei, A. Mercado, S. Leung, S. Li, Spine-GAN: Semantic segmentation of multiple spinal structures. Med. Image Anal. 50, 23–35 (2018)
  21. Y. Sun, C. Zhou, Y. Fu, X. Xue, Parasitic GAN for semi-supervised brain tumor segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1535–1539 (2019)
  22. Y. Wang, H. Zhang, K.J. Chae, Y. Choi, G.Y. Jin, S.-B. Ko, Novel convolutional neural network architecture for improved pulmonary nodule classification on computed tomography. Multidimens. Syst. Signal Process. 31(3), 1163–1183 (2020)
  23. Y. Zhang, J. Zhang, L. Zhao, X. Wei, Q. Zhang, Classification of benign and malignant pulmonary nodules based on deep learning. In: International Conference on Information Science and Control Engineering (ICISCE), pp. 156–160 (2018)
  24. Y. Onishi, A. Teramoto, M. Tsujimoto, T. Tsukamoto, K. Saito, H. Toyama, K. Imaizumi, H. Fujita, Multiplanar analysis for pulmonary nodule classification in CT images using deep convolutional neural network and generative adversarial networks. Int. J. Comput. Assist. Radiol. Surg. 15(1), 173–178 (2020)
  25. C. Tong, B. Liang, Q. Su, M. Yu, J. Hu, A.K. Bashir, Z. Zheng, Pulmonary nodule classification based on heterogeneous features learning. IEEE J. Sel. Areas Commun. 39(2), 574–581 (2020)
  26. K. Jiang, Z. Wang, P. Yi, C. Chen, B. Huang, Y. Luo, J. Ma, J. Jiang, Multi-scale progressive fusion network for single image deraining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8346–8355 (2020)
  27. H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, M.-H. Yang, Multi-scale boosted dehazing network with dense feature fusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2157–2167 (2020)
  28. K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 630–645 (2016)
  29. K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5693–5703 (2019)
  30. V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
  31. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
  32. Z. Tian, T. He, C. Shen, Y. Yan, Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3126–3135 (2019)
  33. Z. Zhao, Q. Sun, H. Yang, H. Qiao, Z. Wang, D.O. Wu, Compression artifacts reduction by improved generative adversarial networks. EURASIP J. Image Video Process. 2019(1), 1–7 (2019)
  34. Y. Yuan, J. Xie, X. Chen, J. Wang, Segfix: Model-agnostic boundary refinement for segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 489–506 (2020)
  35. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML), pp. 214–223 (2017)
  36. S.G. Armato III, G. McLennan, L. Bidaut, M.F. McNitt-Gray, C.R. Meyer, A.P. Reeves, B. Zhao, D.R. Aberle, C.I. Henschke, E.A. Hoffman et al., The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
  37. J. Zhuang, T. Tang, S. Tatikonda, N. Dvornek, Y. Ding, X. Papademetris, J.S. Duncan, Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468 (2020)
  38. A. Paszke, A. Chaurasia, S. Kim, E. Culurciello, ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)
  39. H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890 (2017)
  40. Z. Zhou, M.M.R. Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11 (2018)
  41. H. Li, P. Xiong, H. Fan, J. Sun, DFANet: Deep feature aggregation for real-time semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9522–9531 (2019)
  42. R.P. Poudel, S. Liwicki, R. Cipolla, Fast-SCNN: Fast semantic segmentation network. arXiv preprint arXiv:1902.04502 (2019)
  43. P. Hu, F. Perazzi, F.C. Heilbron, O. Wang, Z. Lin, K. Saenko, S. Sclaroff, Real-time semantic segmentation with fast attention. IEEE Robot. Autom. Lett. 6(1), 263–270 (2020)
  44. Q. Hou, L. Zhang, M.-M. Cheng, J. Feng, Strip pooling: Rethinking spatial pooling for scene parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4003–4012 (2020)
  45. O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241 (2015)
  46. W. Zuo, F. Zhou, Z. Li, L. Wang, Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access 7, 32510–32521 (2019)
  47. G. Zheng, G. Han, N.Q. Soomro, An inception module CNN classifiers fusion method on pulmonary nodule diagnosis by signs. Tsinghua Sci. Technol. 25(3), 368–383 (2019)
  48. S. Akila Agnes, J. Anitha, Automatic 2D lung nodule patch classification using deep neural networks. In: International Conference on Inventive Systems and Control (ICISC), pp. 500–504 (2020)


No additional acknowledgements.


This work was supported by the National Natural Science Foundation of China under Grant 61872143.

Author information




LZ and PYW came up with the idea of the work and implemented the proposed method. LZ and YY performed the experiments and drafted the manuscript. HQZ, SYY took part in writing the manuscript. All authors read and approved the final manuscript.

Authors information

Ling Zhu received the B.S. degree of electronic information science from Jiangsu University of Science and Technology, Jiangsu, China, in 2019. She is currently pursuing the M.S. degree of information and communication engineering in East China University of Science and Technology, Shanghai, China. Her research interests include medical image processing, deep learning and object detection and segmentation.

Hongqing Zhu received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2000. From 2003 to 2005, she was a Post-Doctoral Fellow with the Department of Biology and Medical Engineering, Southeast University, Nanjing, China. She is currently a professor with the East China University of Science and Technology, Shanghai. Her current research interests include medical image processing, deep learning, computer vision, and pattern recognition. She is a member of IEEE and IEICE.

Suyi Yang is currently pursuing the B.Sc. degree with the Department of Mathematics, Natural, Mathematical & Engineering Sciences, King’s College London. Her interests include mathematical modeling and mathematical problems in image processing, especially partial differential equations.

Pengyu Wang is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, East China University of Science and Technology, Shanghai, China. In 2015 and 2018, he received the B.S. degree in automation and M.S. degree in control engineering from Hebei University in Baoding. His research interests include image processing, deep learning, computer vision, and pattern recognition.

Yang Yu is currently working towards his Ph.D. in East China University of Science and Technology, Shanghai, China, and has received B.S. degree in electronic Information Science from Jiangsu University of Science and Technology in 2018. His research domains include medical image processing, deep learning, and image segmentation.

Corresponding author

Correspondence to Hongqing Zhu.

Ethics declarations

Consent for publication


Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


Cite this article

Zhu, L., Zhu, H., Yang, S. et al. HR-MPF: high-resolution representation network with multi-scale progressive fusion for pulmonary nodule segmentation and classification. J Image Video Proc. 2021, 34 (2021).



  • Pulmonary nodule
  • Segmentation and classification
  • High-resolution network
  • Multi-scale progressive fusion
  • Generative adversarial network