Classification of lung sounds using convolutional neural networks

In the field of medicine, with the introduction of computer systems that can collect and analyze massive amounts of data, many non-invasive diagnostic methods are being developed for a variety of conditions. In this study, our aim is to develop a non-invasive method that uses machine learning algorithms to classify respiratory sounds recorded with an electronic stethoscope and audio recording software. In order to store respiratory sounds on a computer, we developed a cost-effective and easy-to-use electronic stethoscope that can be used with any device. Using this device, we recorded 17,930 lung sounds from 1630 subjects. We employed two types of machine learning algorithms: mel frequency cepstral coefficient (MFCC) features in a support vector machine (SVM) and spectrogram images in a convolutional neural network (CNN). Since using MFCC features with an SVM algorithm is a generally accepted classification method for audio, we utilized its results to benchmark the CNN algorithm. We prepared four data sets for each of the CNN and SVM algorithms to classify respiratory audio: (1) healthy versus pathological classification; (2) rale, rhonchus, and normal sound classification; (3) singular respiratory sound type classification; and (4) audio type classification with all sound types. The accuracy results of the experiments were (1) CNN 86%, SVM 86%; (2) CNN 76%, SVM 75%; (3) CNN 80%, SVM 80%; and (4) CNN 62%, SVM 62%, respectively. As a result, we found that spectrogram image classification with the CNN algorithm works as well as the SVM algorithm, and given the large amount of data, CNN and SVM machine learning algorithms can accurately classify and pre-diagnose respiratory audio.


Introduction
Diagnosis or classification requires recognizing patterns. But most of the time, it is very hard to spot these patterns, especially if the data is very large. Data collected from the environment is usually non-linear, so we cannot use traditional methods to find patterns or create mathematical models. In the past decade, various technologies, such as expert systems, have been used to attempt to solve this problem. However, for critical systems, the error rate for the decision was too high [1].
The latest technology that is attempting to solve this problem is machine learning. Over the years, various successful algorithms have been developed, and with deep learning algorithms, error rates have dropped substantially. In computer vision and speech recognition in particular, machine learning is approaching human levels of detection.
Research in this area attempts to make better representations and create models to learn these representations from large-scale unlabeled data [2]. Some of the representations are inspired by advances in neuroscience and are loosely based on interpretation of information processing and communication patterns in a nervous system, such as neural coding which attempts to define a relationship between the stimulus and the neuronal responses and the relationship among the electrical activities of the neurons in the brain [3,4].
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures, composed of multiple non-linear transformations [3,5]. An observation (e.g., an image) can be represented in many ways including a vector of intensity values per pixel, or in a more abstract way as a set of edges, regions of a particular shape, and various other features. Some representations make it easier to learn tasks (e.g., face recognition or facial expression recognition) from examples [6–8]. One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction [9].
Various deep learning architectures such as deep neural networks, convolutional deep neural networks, deep belief networks, and recurrent neural networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, audio recognition, and bioinformatics where they have been shown to produce state-of-the-art results on various tasks [5,10].
The convolutional network architecture is a remarkably versatile yet conceptually simple paradigm that can be applied to a wide spectrum of perceptual tasks. Convolutional networks are trainable, multistage architectures. The input and output of each stage are sets of arrays called feature maps [11]. Convolutional neural networks (CNNs) are designed to process data that come in the form of multiple arrays. There are four key ideas behind CNN that take advantage of the properties of natural signals: local connections, shared weights, pooling, and the use of many layers. The architecture of a typical CNN is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one [12]. The CNN has been found highly effective and has been commonly used in computer vision and image recognition. More recently, with appropriate changes from designing CNN for image analysis to taking into account speech-specific properties, the CNN is also found effective for speech recognition [13].
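The two layer types described above can be illustrated with a minimal NumPy sketch. It is not the network used in this study: the 3 × 3 filter values, the ReLU non-linearity, and the 2 × 2 pooling window are assumptions chosen only to show local connections, shared weights, and pooling on a single feature map.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connection: each output unit sees only a kh x kw patch,
            # and every unit reuses the same weights (the kernel).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 28 x 28 input with random values (illustrative only).
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # hypothetical vertical-edge filter

feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # ReLU (assumed)
pooled = max_pool2d(feature_map)
print(feature_map.shape, pooled.shape)
```

A real CNN learns many such kernels per layer and stacks several convolution and pooling stages before the final classification layers.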
Auscultation, which is the process of listening to the internal sounds in the human body through a stethoscope, has been an effective tool for the diagnosis of lung disorders and abnormalities. This process mainly relies on the physician. Using a stethoscope, the physician may hear normal breathing sounds, decreased or absent breath sounds, and abnormal breath sounds (e.g., rale, rhonchus, squawk, stridor, wheeze, rub) [14,15]. Auscultation is a simple, patient-friendly, and non-invasive method which is widely used but is of low diagnostic value due to the inherent subjectivity in the evaluation of respiratory sounds and to the difficulty involved in relating qualitative assessments to other people [16].
Murphy et al. built a system for automatically providing an accurate diagnosis based upon an analysis of recorded lung sounds. The sound input comes from a number of microphones that are placed around a patient's chest. The system also has a signal processing circuit to convert data from analog to digital. This data is then recorded, organized, and displayed on a computer monitor using an application program. From each microphone, sound data was gathered both in inspiration and in expiration, combined and separately, so that abnormal sounds could be determined easily. The collected data is then manually analyzed, and a diagnosis is reached [17]. This invention proves that respiratory audio data can be collected from patients in a non-invasive way. However, this invention does not use an automated analysis technique to analyze the data.
In this study, we aim to improve on this invention by analyzing audio data with machine learning algorithms and by classifying respiratory sounds. Our data consists of audio recordings of lung sounds that were recorded by chest physicians. We believe, using machine learning, audio data can be analyzed for patterns that will lead to the detection of various pathological lung sounds and help in the diagnosis of respiratory conditions.

Building the electronic stethoscope
First of all, since we needed a device to record respiratory audio, we started by researching all commercially available electronic stethoscopes. Two models are currently used in medicine: the Littman 2100 electronic stethoscope [18] and the Thinklabs One electronic stethoscope [19]. These devices simply receive audio signals from the head of the stethoscope through a microphone and a series of electronic circuits and transmit this digital signal to the computer through the 3.5-mm microphone jack commonly found on computers and mobile devices. The key difference is that the Littman 2100 electronic stethoscope requires proprietary software and is therefore constrained to certain platforms, whereas the Thinklabs One electronic stethoscope transmits the audio signal to any device using any software [20]. After analyzing the capabilities of these devices, we decided to build our own custom electronic stethoscope, which has a directional microphone strapped inside the head of a stethoscope and a 3.5-mm microphone jack.
Since we did not have signal-enhancing hardware, we needed a good, small, directional microphone to obtain the best possible signal. However, the audio was still noisy for two reasons: (1) hospital environments are naturally very noisy (people talking, phones, noisy devices, ambulance and police sirens, etc.), and (2) there is a scratching noise when the diaphragm of the stethoscope comes in contact with dry skin and body hair.
The first problem is difficult to solve because it is impossible to soundproof the rooms where patients are. But the second problem can be solved simply by lubricating the area of contact. We also discovered that this method increases the reception of low-frequency audio by the microphone.

Software for data acquisition
We needed an application to record audio and save patient data. To this end, we developed a .NET application that creates patient records and uses the open source audio library "NAudio" to record, play, and modify audio. It has two main sections: (1) patient information (first name, last name) and (2) audio recording (audio recordings from 11 areas of the patient's chest) (Fig. 1).
The application and the hardware were tested together by recording respiratory audio and showing the results to the chest physicians.

Data acquisition
After receiving positive feedback from all the chest physicians, we decided to move to data acquisition. In the end, three hospitals agreed to participate in our research in their respiratory diseases department: Ankara University, Yıldırım Beyazıt University, and Yıldırım Beyazıt Education and Research Hospital. To start the data acquisition, we needed a laptop with a good audio card. The Lenovo ThinkPad E550 laptop offered the best audio card for our purposes, so we purchased it. We also purchased two Seagate Expansion 1 TB external hard drives for backup storage. Once we were set with the equipment, we started the data acquisition. We recorded respiratory audio from 11 positions on each of 1630 subjects, totaling 17,930 audio clips, each 10 s long.

Experiments
In this study, we used two feature extraction methods: mel frequency cepstral coefficient (MFCC) feature extraction and spectrogram generation using short-time Fourier transform (STFT).
In sound processing, the mel frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a non-linear mel scale of frequency. MFCCs are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip [17].
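The definition above can be made concrete with a short NumPy/SciPy sketch that follows the textbook steps: framing, power spectrum, mel filterbank, logarithm, and discrete cosine transform. It is not the extraction code used in this study; the 8 kHz sample rate, 25 ms frames, 26 filters, and 13 coefficients are common defaults assumed here for illustration.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs)."""
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(signal) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)                  # windowed short-term frames
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft  # short-term power spectrum
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(np.maximum(energies, 1e-10))       # log on the mel scale
    # Linear cosine transform of the log mel spectrum; keep the first coefficients.
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_coeffs]

sample_rate = 8000                                # assumed, not stated in the paper
t = np.arange(sample_rate) / sample_rate          # 1 s synthetic test tone
tone = np.sin(2 * np.pi * 300.0 * t)
features = mfcc(tone, sample_rate)
print(features.shape)
```

Each row of `features` describes one short frame; a fixed-length input for a classifier could then be obtained, for example, by averaging the rows over the clip.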
A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable. They are used extensively in the fields of music, sonar, radar, and speech processing and seismology [21].
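As an illustration, a spectrogram can be computed with `scipy.signal.spectrogram`, which returns the squared magnitude of a short-time Fourier transform. This is a minimal sketch on a synthetic tone, not the study's pipeline; the 8 kHz sample rate, Hann window, and 256-sample segments are assumptions.

```python
import numpy as np
from scipy import signal

sample_rate = 8000                                    # assumed sample rate
t = np.arange(10 * sample_rate) / sample_rate         # 10 s clip, as in the recordings
clip = np.sin(2 * np.pi * 400.0 * t)                  # synthetic 400 Hz tone

# Rows of `power` are frequency bins, columns are time frames.
freqs, times, power = signal.spectrogram(
    clip, fs=sample_rate, window="hann", nperseg=256, noverlap=128
)
log_power = 10.0 * np.log10(power + 1e-12)            # decibel scale for display

# The strongest frequency bin should sit next to the 400 Hz tone.
peak_hz = freqs[power.mean(axis=1).argmax()]
print(power.shape, peak_hz)
```

Plotting `log_power` with time on the horizontal axis and frequency on the vertical axis gives the familiar spectrogram image.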
Since MFCC features are widely used in audio detection systems, the experiments we ran using the MFCC features enabled us to establish baseline values for accuracy, precision, recall, sensitivity, and specificity. Spectrogram images are also used in audio detection. However, they had never been tested on respiratory audio with CNNs. We wanted to see if we could match or exceed the audio detection accuracies obtained with MFCC features.
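For the two-class case, the metrics named above follow directly from the confusion-matrix counts. The sketch below uses made-up labels (1 = pathological, 0 = normal) purely for illustration; they are not results from this study. Note that recall and sensitivity are the same quantity.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard metrics from true and predicted binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),        # identical to sensitivity
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical labels for ten clips.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```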
MFCC datasets were built using the SciPy library. We used support vector machines to process these datasets. The spectrogram dataset was built using a combination of the open source graph generation library Pylab and various open source image processing libraries. The original spectrograms generated were 800 × 600 RGBA, and they were too large for our computer's memory. We changed the algorithm to generate them as 28 × 28 grayscale images so that they fit into memory for the CNN to process (Fig. 2). We built eight datasets, four for support vector machines (SVMs) and four for convolutional neural networks (CNNs): (1) two datasets to predict whether respiratory sounds were normal or pathological (17,930 audio clips, 2 classes); (2) two datasets for classification of respiratory sounds labeled with rale, rhonchus, or normal (15,328 audio clips, 3 classes); (3) two datasets to classify respiratory sounds into normal, rhonchus, squeak, stridor, wheeze, rales, bronchovesicular, friction rub, bronchial, absent, decreased, aggravation, or long expirium duration (LED) (14,453 audio clips, 13 classes); and (4) two datasets for classification of respiratory sounds with all labels, including ones with multiple labels (17,930 audio clips, 78 classes). The CNN structures that we used in our experiments are shown in Figs. 3, 4, 5, and 6.
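To give a sense of how a spectrogram can be reduced to a 28 × 28 grayscale input, the sketch below block-averages a 2-D array and rescales it to 8-bit gray levels. This is a stand-in technique, not the Pylab-based generation described above; the function name and the random test array are hypothetical, and the input is assumed to have at least 28 rows and 28 columns.

```python
import numpy as np

def to_grayscale_thumbnail(log_power, size=28):
    """Average a 2-D array into a size x size grid and rescale to 0..255."""
    # Split row and column indices into `size` nearly equal groups.
    rows = np.array_split(np.arange(log_power.shape[0]), size)
    cols = np.array_split(np.arange(log_power.shape[1]), size)
    small = np.array([[log_power[np.ix_(r, c)].mean() for c in cols] for r in rows])
    lo, hi = small.min(), small.max()
    # A constant image maps to all zeros instead of dividing by zero.
    scaled = (small - lo) / (hi - lo) if hi > lo else np.zeros_like(small)
    return np.round(255 * scaled).astype(np.uint8)

# Random stand-in for a log-power spectrogram (frequency bins x time frames).
rng = np.random.default_rng(0)
fake_spectrogram = rng.random((129, 624))
thumb = to_grayscale_thumbnail(fake_spectrogram)
print(thumb.shape, thumb.dtype)
```

At one byte per pixel, a 28 × 28 image takes 784 bytes, compared with 800 × 600 × 4 = 1,920,000 bytes for an 800 × 600 RGBA image, which is consistent with the memory constraint described above.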

Results and discussion
Our results are in Table 1.
A number of investigations demonstrating the usefulness of computerized lung sound analysis have been reported [22–24]. However, there is a small number of studies available on the clinical utility of auscultation and computerized lung sound analysis for the classification of abnormal lung sounds (Table 2).
As shown in Table 2, the studies in the literature have very limited datasets with a maximum of 2127 audio samples from 34 subjects [25]. Therefore, their accuracy results were either very high when there was a very distinct set of audio data or very low when the audio data was similar [16,25–37]. This is a major problem as these systems deal with a critical decision in a patient's diagnosis. In our study, we collected 11 audio recordings from each of the 1630 healthy and sick subjects, totaling 17,930 audio clips. Because of the larger size of our dataset, we managed to get consistent results in all our experiments. In the literature, the audio clip length varies between 8 and 16 s. Similarly, we recorded all our audio clips at a length of 10 s, as suggested by the chest physicians with whom we worked. While other studies used commercially available devices and software packages, we developed our own hardware and software using open source libraries. Previous studies did not mention the audio format used. This can be an issue as some audio formats sacrifice quality for disk space. We used the lossless WAV format as we did not want to lose any data.
Rietveld et al. [38] selected clean audio samples, and Baydar et al. [28] recorded their audio clips in a quiet room. However, if one tries to build a system that is trained from these clean data, it would not work in a real environment such as a hospital. Even the quietest hospital rooms have noise that would impact the recording. That is why we developed our electronic stethoscope with as much sound isolation as possible and selected our recording device carefully. In the end, the data we collected had very little external noise but it was collected from a real environment.
In the literature, lung sound classification was made for a maximum of six classes. Kandaswamy et al. [28] implemented a system to classify lung sounds into one of six categories: normal, wheeze, crackle, squawk, stridor, or rhonchus. Forkheim et al. [39] investigated detecting only wheezes in isolated lung sound segments. Bahoura et al. [27], Riella et al. [40], and Hashemi et al. [41] classified sounds as either containing wheezes or being normal respiratory sounds. Lu et al. [42] classified fine crackles and coarse crackles. Kahya et al. [15,30], Flietstra et al. [24], and Serbes et al. [35] classified the presence or absence of a crackle. These studies are very narrow in scope, as they have a limited number of classes. Their results are focused on only a few sound types. In our study, we performed 8 different experiments with 2, 3, 13, and 78 classes, diversifying our results greatly.
Previous studies so far have not used CNNs for the classification of respiratory sounds. In our study, we aimed to use this new classification algorithm on audio and observed that it performs very well and produces consistent results.
Lu et al. [42] acquired their test data set from RALE and ASTRA databases. Riella et al. [40] used lung sounds that were available electronically from different online repositories. The problem with this approach is that the recording hardware and software can be different for each audio clip. This would cause problems in classification because the audio quality is not consistent in all training and test samples. In our study, we used a single recording device and the same recording software on the same device while recording the audio.
While several previous studies [16,30,39,43] compared several algorithms, they did not use a widely accepted audio classification method for benchmarking their neural networks. In our study, we used the classification results of SVMs that use the MFCC features to benchmark our CNN algorithm.
In some of the studies listed in Table 2, the number of audio samples or subjects was not mentioned; therefore, it is impossible to compare the results of these studies with our own [39,40,42,44–46]. Previous studies' results were not geared toward a practical system. In our study, we developed our device and software to fit into a hospital environment workflow. We are also planning to fit this workflow into a telemedicine system we are developing that allows physicians to remotely listen to and share patient audio data for consultation.
While our results seem numerically lower than the state-of-the-art results, our data set (17,930 audio clips) is the biggest data set when compared with those of the studies done in this field, and the audio clips in the data set are not amplified, modified, cleaned, or pre-recorded by a third party, which is the case with many of the studies we looked at. We tested our algorithms on eight datasets and obtained consistent results across the board; this has not been done in any of the state-of-the-art studies so far.

Conclusions
The goal of this project was to design and construct an electronic stethoscope with an associated software system that can transfer respiratory sounds to a PC for recording and subsequent computer-aided analysis and diagnosis. The hardware-software system was used to collect a dataset of respiratory sounds to train SVM and CNN machine learning algorithms for the automated analysis and diagnosis. The complete system can also be used for all types of body sounds (e.g., lung, heart, intestines) and is expected to be in widespread clinical use.
In this study, we experimented with using CNN algorithms in audio classification. Since MFCC features combined with SVM is a generally accepted practice for audio classification, we used it as a benchmark for our CNN algorithm.
As a result, we found out that spectrogram image classification with CNN algorithm works as well as the SVM algorithm, and given the large amount of data, CNN and SVM machine learning algorithms can accurately classify and pre-diagnose respiratory audio. This system can be combined with a telemedicine system to store and share information among physicians. We believe our method can improve the results of previous studies and help in medical research.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Availability of data and materials
The data cannot be shared because patients did not allow the actual data to be released on a repository.

Authors' contributions
MA is responsible for the data collection, experiment design, algorithm design, and documentation. ÖK did the study design and coordination, performed thesis consultation, and revised the paper. BK and SP provided medical expertise in the data analysis and revision of the paper. All authors read and approved the final manuscript.

Ethics approval and consent to participate
This study was approved by the local Human Experiments Ethical Committee of Turgut Özal University (29.12.2015-0123456/0023). The voluntary declaration form was read to the patient and signed with approval for participation in the study.

Consent for publication
The voluntary declaration form was read to the patient and signed with approval for the publication of the study.

Competing interests
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.