Saturday 6 June 2020

Deep Learning Analysis of COVID-19 lung X-Rays using MATLAB: Part 6

*** DISCLAIMER ***


I have no medical training. Nothing presented here should be considered in any way as informative from a medical point-of-view. This is simply an exercise in image analysis via Deep Learning using MATLAB, with lung X-rays as a topical example in these times of COVID-19. 

INTRODUCTION


In this Part 6 of my series of blog articles on exploring Deep Learning applied to lung X-rays using MATLAB, I bring together the results of the analysis of Parts 1, 2, 3, 4 & 5, and suggest a candidate set of composite models which are particularly suited to the task. I also present a live website where anyone can try these composite models by uploading an X-ray image and receiving results on-the-fly. Finally, all the underlying trained networks presented in this series of articles have been posted to GitHub (in MATLAB and ONNX formats) and are openly available for anyone wishing to experiment with them.


COMPOSITE MODELS


After much trial-and-error experimentation with the models presented in Part 3, in combination with the grad-CAM analysis of Parts 4 & 5, the following two composite models proved to be quite effective.

MODEL 1


This is based on a combination of the four-class networks (from Experiment 4) in Parts 1 and 3, with grad-CAM Discrimination Filtering from Part 5. Specifically, the model comprises the following steps, sketched in MATLAB after the list (where the network names refer to the names of the underlying pretrained networks used as the basis for the Transfer Learning):

  1. Apply (i) alexnet; (ii) vgg16; (iii) googlenet (original); and (iv) googlenet (places) (from Experiment 4) to the X-ray-image-under-test. Each network will generate a score for each of the four possible labels [HEALTHY, BACTERIA, COVID, OTHER-VIRUS].
  2. Generate a grad-CAM image map for each of the networks (i)--(iv) in Step 1 using the technique presented in Part 4.
  3. Apply (a) googlenet; (b) darknet19; and (c) mobilenetv2 from Part 5 to the four grad-CAM images from Step 2. From the results for each grad-CAM image, assign a weighting factor as follows: if the majority of (a), (b), (c) agree on INSIDE_LUNGS, set the weighting factor to 0.8 (rather than to 1, because the grad-CAM Discrimination Filter classifiers aren't perfectly accurate). If the majority agree on OUTSIDE_LUNGS, set it to 0.2 (rather than to 0, for the same reason). If the majority agree on RIBCAGE_CENTRAL, set it to 0.5 (i.e., mid-way). In all other cases, set it to 0.3 (i.e., ambiguous).
  4.  Multiply each of the scores from Step 1 by the respective weighting factor from Step 3. This will give a grad-CAM weighted score per label per network.
  5. Take the average of the scores from Step 4 across all networks to give an average score per label. Renormalize these average scores so that they add up to one.
  6. Take the maximum of the resulting normalised averaged scores from Step 5, then assign the output classification to the label corresponding to the maximum score. This will give the resulting class from HEALTHY, BACTERIA, COVID, or OTHER-VIRUS with an accompanying score.
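
By way of illustration, here is a minimal MATLAB sketch of the MODEL 1 scoring scheme. This is not the deployed implementation: the variable names nets, gradCamFilters and img, and the helper gradCamImage (wrapping the grad-CAM generation of Part 4), are placeholders, and each image is assumed to have already been resized/converted to suit each network's input, with the class ordering of the score vectors matching the label list:

labels = ["HEALTHY" "BACTERIA" "COVID" "OTHER-VIRUS"];
numNets = numel(nets);                         % the four Experiment 4 networks
weightedScores = zeros(numNets, numel(labels));

for k = 1:numNets
    % Step 1: classify the X-ray and keep the per-label scores
    [~, scores] = classify(nets{k}, img);

    % Step 2: grad-CAM map for this network (placeholder helper, per Part 4)
    camImg = gradCamImage(nets{k}, img);

    % Step 3: majority vote of the three Part 5 discriminator networks
    votes = strings(1, numel(gradCamFilters));
    for j = 1:numel(gradCamFilters)
        votes(j) = string(classify(gradCamFilters{j}, camImg));
    end
    if sum(votes == "INSIDE_LUNGS") >= 2
        w = 0.8;   % focus mostly inside the lungs: trust this network
    elseif sum(votes == "OUTSIDE_LUNGS") >= 2
        w = 0.2;   % focus mostly outside the lungs: down-weight it
    elseif sum(votes == "RIBCAGE_CENTRAL") >= 2
        w = 0.5;   % central ribcage: intermediate weight
    else
        w = 0.3;   % ambiguous
    end

    % Step 4: grad-CAM weighted score per label for this network
    weightedScores(k,:) = w * scores;
end

% Steps 5 & 6: average across networks, renormalise, and pick the winner
avgScores = mean(weightedScores, 1);
avgScores = avgScores / sum(avgScores);
[finalScore, idx] = max(avgScores);
finalLabel = labels(idx);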

MODEL 2


This is based on a cascade of the two-class networks (from Experiments 1, 2, and 3) in Parts 1 and 3, with grad-CAM Discrimination Filtering from Part 5. Specifically, the model comprises the following steps (sketched in code after the list):

  1. Apply (i) darknet19; (ii) resnet101; (iii) squeezenet; and (iv) resnet18 (from Experiment 1) to the X-ray-image-under-test. Each network will generate a score for each of the two possible labels [YES (pneumonia), NO (healthy)].
  2.  Apply the identical approach to Steps 2--4 in MODEL 1 (above) to give the grad-CAM weighted score per label per network.
  3. Take the maximum of the grad-CAM weighted scores per label per network from the previous step across all networks to give a maximum score per label. Renormalize these maximum scores so that they add up to one.
  4. Take the maximum of the resulting normalised maximum scores from the previous step, then assign the output classification to the label corresponding to the maximum score. This will give the resulting class from YES or NO with an accompanying score.
  5. If the result is  NO, the process terminates with the overall result of HEALTHY (plus accompanying score). If the result is YES, continue to the next step.
  6. Apply (i) vgg19; (ii) inceptionv3; (iii) squeezenet; and (iv) mobilenetv2 (from Experiment 2) to the X-ray-image-under-test. Each network will generate a score for each of the two possible labels [BACTERIA, VIRUS].
  7. Apply the identical approach to Steps 2--4 in MODEL 1 (above) to give the grad-CAM weighted score per label per network.
  8. Take the average of the grad-CAM weighted scores per label per network from the previous step across all networks to give an average score per label. Renormalize these average scores so that they add up to one.
  9. Take the maximum of the resulting normalised average scores from the previous step, then assign the output classification to the label corresponding to the maximum score. This will give the resulting class from BACTERIA or VIRUS with an accompanying score.
  10. If the result is  BACTERIA, the process terminates with the overall result of BACTERIA (plus accompanying score). If the result is VIRUS, continue to the next step.
  11. Apply (i) resnet50; (ii) vgg16; (iii) vgg19; and (iv) darknet53 (from Experiment 3) to the X-ray-image-under-test. Each network will generate a score for each of the two possible labels [COVID, OTHER-VIRUS].
  12.  Apply the identical approach to Steps 2--4 in MODEL 1 (above) to give the grad-CAM weighted score per label per network.
  13. Take the average of the grad-CAM weighted scores per label per network from the previous step across all networks to give an average score per label. Renormalize these average scores so that they add up to one.
  14. Take the maximum of the resulting normalised average scores from the previous step, then assign the output classification to the label corresponding to the maximum score. This will give the resulting class from COVID or OTHER-VIRUS with an accompanying score. The process is complete.
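
For orientation, here is a minimal sketch of the cascade logic. The names are hypothetical: expt1Nets, expt2Nets and expt3Nets stand for the network groups listed in Steps 1, 6 and 11, and ensembleStage is assumed to apply the grad-CAM weighting of Steps 2--4 of MODEL 1, combining the per-network scores with either the maximum ("max") or the mean ("mean") before renormalising and picking the top label:

[label, score] = ensembleStage(expt1Nets, img, ["YES" "NO"], "max");
if label == "NO"
    finalLabel = "HEALTHY"; finalScore = score;        % Step 5: terminate
else
    [label, score] = ensembleStage(expt2Nets, img, ["BACTERIA" "VIRUS"], "mean");
    if label == "BACTERIA"
        finalLabel = "BACTERIA"; finalScore = score;   % Step 10: terminate
    else
        [finalLabel, finalScore] = ensembleStage(expt3Nets, img, ["COVID" "OTHER-VIRUS"], "mean");
    end
end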

Taken together, MODEL 1 and MODEL 2 provide two alternative paths to the classification of the lung X-ray-image-under-test. If the resulting classifications agree, this represents the final classification, with a score given by the average of the scores from the two models. If they disagree, the classification with the higher score is taken as the final classification (with its corresponding score).
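
As a sketch (with runModel1 and runModel2 standing in for the two composite pipelines above):

[label1, score1] = runModel1(img);   % four-class composite model
[label2, score2] = runModel2(img);   % cascaded two-class composite model

if label1 == label2
    finalLabel = label1;
    finalScore = (score1 + score2)/2;             % agreement: average the two scores
elseif score1 >= score2
    finalLabel = label1; finalScore = score1;     % disagreement: higher score wins
else
    finalLabel = label2; finalScore = score2;
end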

These two composite models were hand-crafted (essentially by trial-and-error). They perform well on the validation images. Of course there are many other combinations of networks (from Part 3) that could be considered. 

DEPLOYMENT


The trial-and-error experimentation to determine the combination of the Deep Neural Networks in MODELS 1 & 2 -- as well as the training of all the underlying Deep Neural Networks (via Transfer Learning), and the grad-CAM Discrimination Filtering -- was all performed in MATLAB. 

The next step was to expose the resulting models in a form in which they are generally accessible (for anyone to experiment with) without the need for MATLAB. That is the topic of this section.

MATLAB Compiler


The approach taken was to utilise the MATLAB Compiler (and the accompanying MATLAB Compiler SDK) to generate a shared library (specifically a Microsoft .NET assembly) which contains all the code required to run the models. 

RESTful Web Service


This library was then integrated into a RESTful Web Service application (written in C#), and deployed on a web server (Windows / IIS). 

Web Application Front-End


The RESTful Web Service is exposed to users via a simple ASP.NET Web Application (front end) hosted on a Windows server. Here is the URL and a screenshot of the landing page...

Simply upload a lung X-ray (cropped with no borders, and confined to the rib-cage as far as possible) via the web page, and wait (up to a few minutes) for the analysis to complete. The results look like this:

EXPORTED MODELS


All the Deep Neural Networks presented in Part 3 and Part 5, including the subset used in the deployed composite MODELS 1 & 2 presented above, have been exported in both MATLAB and ONNX formats. Please feel free to retrieve them from my GitHub repositories (here for MATLAB format, and here for ONNX format) for use in your own experiments.
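
By way of orientation, the following minimal sketch shows how a network in MATLAB format can be loaded, exported to ONNX, and re-imported (the file and variable names are placeholders, and the ONNX import/export functions require the corresponding Deep Learning Toolbox support package to be installed):

load('covid_example4_vgg16.mat', 'netTransfer');               % MATLAB format: load a trained network
exportONNXNetwork(netTransfer, 'covid_example4_vgg16.onnx');   % export the same network to ONNX

% Re-import the ONNX copy into MATLAB (classification output assumed)
net = importONNXNetwork('covid_example4_vgg16.onnx', 'OutputLayerType', 'classification');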

POTENTIAL NEXT STEPS

  • Try different combinations of underlying models to generate composite models which perform better than MODELS 1  & 2 presented here. Owing to the large number of possible combinations, this search/optimisation should be performed in an automated manner (rather than manually by trial-and-error as applied here).
  • Re-train and compare all the models with larger image datasets whenever they become available. If you have access to such images, please consider posting them to the open source COVID-Net archive here.

Saturday 30 May 2020

Deep Learning Analysis of COVID-19 lung X-Rays using MATLAB: Part 5

*** DISCLAIMER ***


I have no medical training. Nothing presented here should be considered in any way as informative from a medical point-of-view. This is simply an exercise in image analysis via Deep Learning using MATLAB, with lung X-rays as a topical example in these times of COVID-19. 

INTRODUCTION


In this Part 5 of my series of blog articles on exploring Deep Learning of lung X-rays using MATLAB, the observations from Part 4 -- where the grad-CAM technique was used to identify which regions of the X-ray images were being activated for all 19 network architectures under consideration -- serve as the basis for a new network which discriminates, for a given image-under-test, between models that are utilising the lung regions (as desired) and those that are responding to regions outside the lungs. The resulting network can then be used as a discriminating filter applied to the outputs of the main X-ray classifiers, in order to favour those classifiers which focus (correctly) on the lung regions rather than elsewhere.


DATASET


The image dataset for the grad-CAM Discriminating Filter comprised a set of grad-CAM images as presented in Part 4. Specifically, the dataset was created by generating a total of 14,333 grad-CAM image files across all 19 network types and X-ray sample images. For training of the Deep Neural Network, these were split into three classes: INSIDE_LUNGS (where the grad-CAM activation regions are focused on the interior of one or both lungs -- the desirable scenario); OUTSIDE_LUNGS (where the activation regions are focused outside of the lungs, or even outside of the body -- the undesirable scenario); and RIBCAGE_CENTRAL (where the activation regions are focused on the central part of the ribcage rather than explicitly within either lung -- an intermediate scenario which occurred often enough to warrant its own class). Sample images of each of these are shown below.

A sample image generated from the grad-CAM technique presented in Part 4, applied to a lung X-ray analysed via a (Transfer Learning) Deep Neural Network from Part 3. This example has been assigned the label INSIDE_LUNGS for the purpose of creating a labelled dataset for training the Deep Neural Network Discriminating Filter, the central focus of this current article. Ideally, all the grad-CAM images generated from all the classifiers applied to all the lung X-rays would fall within this INSIDE_LUNGS class. But the results of Part 4 show this not to be the case (hence the motivation for devising the Discriminating Filter to sort the relevant classifications from the less relevant).

A sample grad-CAM image which falls within the OUTSIDE_LUNGS category. The purpose of the Discriminating Filter described in this current article is to identify such cases where the X-ray analysis classifier has wrongly focused on regions outside of the lungs (or indeed the body).

A sample grad-CAM image which falls within the RIBCAGE_CENTRAL category. This occurs quite often with the networks from Part 3. The idea is that such cases can be considered less positively definitive than INSIDE_LUNGS, but better than OUTSIDE_LUNGS, when it comes to determining the validity for lung X-ray classification. 

GROUND TRUTH DATA LABELLING VIA AMAZON SAGEMAKER

In order to assign each of the 14,333 grad-CAM images into the appropriate class (INSIDE_LUNGS, OUTSIDE_LUNGS, or RIBCAGE_CENTRAL) in preparation for training the Deep Neural Network to be used as the Discriminating Filter, the Amazon Mechanical Turk service (part of the Amazon SageMaker Ground Truth for Data Labelling product suite) was utilised. This unique service leverages an on-demand, scalable, human workforce to perform the image labelling. The service employs thousands of human workers willing to do piecemeal work at their convenience, and is a far more attractive solution than attempting to manually label all the images oneself (!)

TRAINING THE DEEP NEURAL NETWORKS VIA TRANSFER LEARNING


Once the grad-CAM images had been sorted (via AWS Mechanical Turk) into the three classes  (INSIDE_LUNGS, OUTSIDE_LUNGS, and RIBCAGE_CENTRAL), all 19 pre-trained networks available in MATLAB were used for Transfer Learning on these grad-CAM images, directly analogous to the approach presented in Part 3 for the underlying X-ray image classifier training. 


RESULTS

The results from the (Transfer Learning) training of all the networks are summarised as follows. From consideration of the classification accuracies on the validation dataset, the "best" performing networks were found to be (where the name refers to the base pre-trained network used in the Transfer Learning): googlenet for determining if INSIDE_LUNGS (75% accuracy); darknet19 for determining if OUTSIDE_LUNGS (85% accuracy); and mobilenetv2 for determining if RIBCAGE_CENTRAL (86% accuracy). The validation Confusion Matrix for each of these is included below.


Confusion Matrix (on the validation dataset) for a network trained on grad-CAM images via Transfer Learning starting with the pre-trained googlenet. Of all the networks that were tried, this one had the highest accuracy (75%) for the INSIDE_LUNGS class.



Confusion Matrix (on the validation dataset) for a network trained on grad-CAM images via Transfer Learning starting with the pre-trained darknet19. Of all the networks that were tried, this one had the highest accuracy (85%) for the OUTSIDE_LUNGS class.
Confusion Matrix (on the validation dataset) for a network trained on grad-CAM images via Transfer Learning starting with the pre-trained mobilenetv2. Of all the networks that were tried, this one had the highest accuracy (86%) for the RIBCAGE_CENTRAL class.


DISCUSSION & NEXT STEPS

The results demonstrate that the technique of Transfer Learning can be used to devise Deep Neural Networks which can successfully assess (with reasonable accuracy) the validity of a given lung X-ray classifier network applied to a given X-ray image, by determining whether the corresponding grad-CAM image focuses on regions INSIDE the lungs (suggesting that the X-ray lung classification is valid), OUTSIDE the lungs (suggesting that the X-ray lung classification is not valid), or in the RIBCAGE CENTRAL region (suggesting that the lung X-ray classification may be of some validity: i.e., more relevant than OUTSIDE the lungs though not as relevant as INSIDE the lungs). The Deep Neural Networks presented here can therefore serve as a Discrimination Filter to assist in choosing between all the various networks (presented in Part 3) for X-ray lung image classification.

The next step will be to combine the results of this article with the results from Part 3 to determine the "best" network (or combination of networks) for lung X-ray image classification.





Thursday 14 May 2020

Deep Learning Analysis of COVID-19 lung X-Rays using MATLAB: Part 4

Update: see Part 5 where the grad-CAM results presented below are used to train another suite of networks to help choose between all the lung X-ray classifiers presented in Part 3.

*** DISCLAIMER ***


I have no medical training. Nothing presented here should be considered in any way as informative from a medical point-of-view. This is simply an exercise in image analysis via Deep Learning using MATLAB, with lung X-rays as a topical example in these times of COVID-19. 

INTRODUCTION


In this Part 4 of my series of blog articles on exploring Deep Learning of lung X-rays using MATLAB, the analysis of Part 3 is revisited to further compare the performance of all the pre-trained networks available via MATLAB as the basis for the Transfer Learning procedure. Specifically, the grad-CAM technique is applied (i) to gain an insight into how the various networks respond to the underlying images and, moreover, (ii) to investigate how the responses of the networks differ from one another. The goal is to provide some guidance as to how to choose the "best" network for the task at hand. Again, all analysis is performed in MATLAB.

grad-CAM


The grad-CAM technique is introduced here, with a MATLAB implementation provided here which is used as the basis for the present analysis. Note that grad-CAM is a more powerful and more general extension of the Class Activation Map (CAM) technique used in Part 2.

The code for generating the results presented in the following sections uses the gradcam function (in MATLAB) provided in the reference example here. The gradcam function presented there is used in precisely the same manner in this analysis, so is not repeated.

That said, the cited reference example is directly applicable only to googlenet. Extending it to each of the other networks requires the appropriate softmax and feature map layers to be identified, by using the analyzeNetwork function to examine the given network and select the correct layers. The softmax layer is easily identified as the last softmax layer before the output. The feature map layer is identified as follows (from here):

 "Specify either the last ReLU layer with non-singleton spatial dimensions, or the last layer that gathers the outputs of ReLU layers (such as a depth concatenation or an addition layer). If your network does not contain any ReLU layers, specify the name of the final convolutional layer that has non-singleton spatial dimensions in the output".

For convenience, I have performed this identification for all the network types, and bundled the results into a function named gradCamLayerNames (available via my GitHub repository).

Note: my gradCamLayerNames function returns the relevant layer names for the unmodified pre-trained networks distributed with MATLAB. For pre-trained networks which have been modified for Transfer Learning (by replacing the final few layers as described in Part 1), the relevant layer names for use with gradcam may be different (unless the original names happen to have been replicated). For example, all the networks used in the present analysis have been modified in the manner described in Part 1 for Transfer Learning, and so the relevant softmax layer name for use with gradcam is 'softmax' rather than that returned by gradCamLayerNames.
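
For illustration, a minimal sketch of how these layer names feed into the grad-CAM call follows. The two-output signature shown for gradCamLayerNames is an assumption for the purposes of this sketch (check the repository for the actual interface); the gradcam helper itself is used unchanged from the cited reference example:

baseName = 'googlenet';
[softmaxName, featureLayerName] = gradCamLayerNames(baseName);  % names for the unmodified network

% For the networks modified for Transfer Learning (Part 1), the final softmax
% layer was replaced, so override the name accordingly:
softmaxName = 'softmax';

% If in doubt, inspect the trained network and select the layers manually:
% analyzeNetwork(netTransfer)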

Image Datasets and Transfer Learning Networks


The lung X-ray image datasets (arranged into Examples 1--4) and the corresponding Transfer Learning trained networks from Part 3 are used here "as is"  without further introduction (refer to Part 3 for the details).

Analysis via grad-CAM

EXAMPLE 1: "YES / NO" Classification of Pneumonia

The grad-CAM analysis has been performed on all of the Example 1 Transfer Learning networks with all of the corresponding validation images. A representative sample of results is displayed via the following links (where the network names pertain to the base networks used in the Transfer Learning):

  1. vgg16  applied to all 224 validation images
  2. darknet53 applied to all 224 validation images
  3. all 19 networks applied to a single representative validation image

EXAMPLE 2: Classification Bacterial or Viral Pneumonia 

The grad-CAM analysis has been performed on all of the Example 2 Transfer Learning networks with all of the corresponding validation images. A representative sample of results is displayed via the following links (where the network names pertain to the base networks used in the Transfer Learning):

  1. darknet53 applied to all 640 validation images
  2. all 19 networks applied to a single representative validation image


EXAMPLE 3: Classification of COVID-19 or Other-Viral 

The grad-CAM analysis has been performed on all of the Example 3 Transfer Learning networks with all of the corresponding validation images. A representative sample of results is displayed via the following links (where the network names pertain to the base networks used in the Transfer Learning):

  1. vgg19  applied to all 260 validation images
  2. all 19 networks applied to a single representative validation image

EXAMPLE 4: Determine if COVID-19 pneumonia versus Healthy, Bacterial, or non-COVID viral pneumonia

The grad-CAM analysis has been performed on all of the Example 4 Transfer Learning networks with all of the corresponding validation images. A representative sample of results is displayed via the following links (where the network names pertain to the base networks used in the Transfer Learning):

  1. inceptionresnetv2  applied to all 44 validation images
  2. all 19 networks applied to a single representative validation image


RESULTS & NEXT STEPS

Looking over all these grad-CAM images for all four Examples (via the links above) confirms that the networks are generally responding to regions within the lungs when making their classifications. This is a positive finding in terms of qualifying the overall Deep Learning approach to the analysis of the lung X-rays, and confirms the results of  the (simpler) CAM approach from Part 2. However, the findings are not completely definitive in that it can be seen that some networks on some images are responding to inappropriate regions in the images (e.g., outside the lungs or even outside the body!), thereby reducing the validity of the approach for classifying the lung X-rays.

It is also interesting to observe how the various networks respond differently to the same image. For example, the grad-CAM images below (taken from the results for Example 4) illustrate how six different networks (base names darknet19, darknet53, densenet201, googlenet [original], googlenet [places], and inceptionresnetv2) respond to the same validation image. It can be seen that the given networks are activated by quite different regions within the image. This is perhaps not too surprising given that the networks generally have quite different layer structures. That said, the googlenet variants ([original] and [places]) have identical layer structures but have been pre-trained on different image sets, then Transfer Trained on identical lung X-ray training images. The activations observed from grad-CAM analysis are nevertheless quite different.

All this goes to show that the optimal choice of networks for the task of lung X-ray classification is somewhat subtle since the various networks respond in different ways to the underlying images. It is not sufficient to only consider the classification accuracy scores (from the classification-accuracy results tables presented in Part 3). It is important to also consider the relevance and validity of the activated regions as exposed via this grad-CAM analysis.

Interesting next steps to consider therefore would be to (i) combine the results of the various networks on the classification task rather than simply trying to choose a single 'optimal' network (per Experiment task); (ii) whilst doing so, eliminate any network whose grad-CAM activations are in inappropriate regions (i.e., outside the lungs) on a given sample-image-under-test. This could result in a more accurate and robust COVID-19 classifier.

Tuesday 28 April 2020

Deep Learning Analysis of COVID-19 lung X-Rays using MATLAB: Part 3



UPDATES: See Part 4 for a grad-CAM analysis of all the trained networks presented below, then Part 5 where the grad-CAM results of Part 4 are used to train another suite of networks to help choose between the lung X-ray classifiers presented below.

*** DISCLAIMER ***


I have no medical training. Nothing presented here should be considered in any way as informative from a medical point-of-view. This is simply an exercise in image analysis via Deep Learning using MATLAB, with lung X-rays as a topical example in these times of COVID-19. 

INTRODUCTION


In this Part 3 of my series of blog articles on exploring Deep Learning of lung X-rays using MATLAB, the analysis of Part 1 is revisited: rather than just using the pre-trained googlenet as the basis of Transfer Learning, the performance of all the pre-trained networks available via MATLAB as the basis for the Transfer Learning procedure is compared.


AVAILABLE PRE-TRAINED NETWORKS


See this overview for a list of all the available pre-trained Deep Neural Networks bundled with MATLAB (version R2020a). There are 19 available networks, listed below, which include two versions of googlenet: the original, and an alternative with identical layer structure but pre-trained on images of places rather than images of objects. 

Available Pre-Trained Networks
squeezenet
googlenet
googlenet (places)
inceptionv3
densenet201
mobilenetv2
resnet18
resnet50
resnet101
xception
inceptionresnetv2
shufflenet
nasnetmobile
nasnetlarge
darknet19
darknet53
alexnet
vgg16
vgg19

TRANSFER LEARNING

Network Preparation


Each of the above pre-trained networks was prepared for Transfer Learning in the same manner as described in Part 1 (and references therein). This involved replacing the last few layers in each network in preparation for re-training with the lung X-ray images. Determining which layers to replace required identifying the last learning layer (such as a convolution2dLayer) in each network and replacing from that point onwards with new layers having the appropriate number of output classes (e.g., 2 or 4, rather than the 1000 pre-trained ImageNet classes). For convenience, I've collected together the appropriate logic for preparing each of the networks (since the relevant layer names generally differ between networks) in the function prepareTransferLearningLayers, which you can obtain from my GitHub repository here.
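
For illustration, here is a minimal sketch of that replacement for a googlenet base network with four output classes. The layer names are those of the stock googlenet, and the learn-rate factors shown are typical Transfer Learning settings rather than a prescription; prepareTransferLearningLayers wraps the equivalent logic for each of the 19 networks:

numClasses = 4;                                  % e.g. Example 4
lgraph = layerGraph(googlenet);

newFc = fullyConnectedLayer(numClasses, 'Name','new_fc', ...
    'WeightLearnRateFactor',10, 'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph, 'loss3-classifier', newFc);                 % last learning layer
lgraph = replaceLayer(lgraph, 'prob',   softmaxLayer('Name','softmax'));
lgraph = replaceLayer(lgraph, 'output', classificationLayer('Name','new_classoutput'));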

Data Preparation

For each of the Examples 1--4 in Part 1, the training and validation image datasets were prepared as before (from all the underlying images available), with one important additional action: for each Example, the respective datasets were frozen (rather than randomly chosen each time) so that each of the 19 networks could be trained and tested on precisely the same datasets as one another, thereby enabling ready comparison of network performance.
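
A minimal sketch of such a frozen split is shown below (the folder layout and random seed are placeholders; the per-class counts correspond to Example 1, i.e., 640 training and 112 validation images per class):

imds = imageDatastore('xrays/example1', 'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
rng(0);                                          % fixed seed => reproducible split
[trainingImages, validationImages] = splitEachLabel(imds, 640, 112, 'randomized');

% Save the split once, then re-load the same datastores for each of the 19 networks
save('example1_split.mat', 'trainingImages', 'validationImages');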


Training Options

The training options for each case were set as follows:

MaxEpochs=1000; % Placeholder, patience will stop well before
miniBatchSize = 10;
numIterationsPerEpoch = floor(numTrainingImages/miniBatchSize);
options = trainingOptions('sgdm',...
  'ExecutionEnvironment','multi-gpu', ...% for AWS ec2 p-class VM
  'MiniBatchSize',miniBatchSize, 'MaxEpochs',MaxEpochs,...
  'InitialLearnRate',1e-4, 'Verbose',false,...
  'Plots','none', 'ValidationData',validationImages,...
  'ValidationFrequency',numIterationsPerEpoch,...
  'ValidationPatience',4);

Note that the ValidationPatience is set to a finite value (e.g., 4 rather than Inf) to automatically halt the training before overfitting occurs. This also enables the training to be performed within a big loop across all 19 network types without user intervention. Also note that ExecutionEnvironment was set to multi-gpu to take advantage of the multiple GPUs available via Amazon Web Services (AWS) p-class instance types, in order to speed up the analysis for all networks across all examples. The screenshot below shows the GPU activity when running the training on an AWS p2.8xlarge instance type. Even with GPUs, some training runs took quite a long time, especially for the larger networks (not surprisingly). For example, nasnetlarge on the Example 2 dataset (3434 training images) took 11 hours to complete. All in all, it took a few days to complete the training for all 76 cases (i.e., the 4 Example Cases across each of the 19 networks).
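
For completeness, the outer loop looked something along the following lines. This is a sketch only: the prepareTransferLearningLayers signature, the gray2rgb colour preprocessing, and the output file names are assumptions, and only a few of the 19 base-network names are listed:

baseNames = ["alexnet" "vgg16" "googlenet" "darknet19"];   % ...extend to all 19
numClasses = numel(categories(trainingImages.Labels));

for n = 1:numel(baseNames)
    lgraph    = prepareTransferLearningLayers(baseNames(n), numClasses);
    inputSize = lgraph.Layers(1).InputSize(1:2);

    % The X-rays are grayscale but the pre-trained networks expect RGB input
    augTrain = augmentedImageDatastore(inputSize, trainingImages,   'ColorPreprocessing','gray2rgb');
    augVal   = augmentedImageDatastore(inputSize, validationImages, 'ColorPreprocessing','gray2rgb');

    opts = trainingOptions('sgdm', 'MiniBatchSize',miniBatchSize, 'MaxEpochs',MaxEpochs, ...
        'InitialLearnRate',1e-4, 'ValidationData',augVal, ...
        'ValidationFrequency',numIterationsPerEpoch, 'ValidationPatience',4, ...
        'Verbose',false, 'Plots','none');

    netTransfer = trainNetwork(augTrain, lgraph, opts);
    save(char("trained_" + baseNames(n) + ".mat"), 'netTransfer');
end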

Deep Learning Network training via MATLAB on an AWS p2.8xlarge instance with 8 NVIDIA Tesla GPUs

RESULTS


Refer to Part 1 for the motivation and background details pertaining to the following examples. 

EXAMPLE 1: "YES / NO" Classification of Pneumonia


The 19 networks were re-trained (via Transfer Learning) on the relevant training dataset for the given example (1280 images, equally balanced across both classes). The following table shows the performance of each trained network when applied to the validation dataset (balanced, 112 each "yes" / "no") and the holdout dataset (unbalanced, 3806 "yes" only). The results are ordered (descending) by (i) Average Accuracy (across both classes), then (ii) Pneumonia Accuracy (i.e., the fraction of "yes" cases correctly diagnosed). The table also includes the Missed Pneumonia rate, i.e., the percentage of the total validation population that should have been diagnosed "yes" (pneumonia) but was missed, i.e., wrongly diagnosed as "no" (healthy).


Base network | Validation: Average Accuracy | Validation: Pneumonia Accuracy | Validation: Healthy Accuracy | Validation: Missed Pneumonia | Holdout: Average Accuracy
vgg16 | 91% | 88% | 95% | 6% | 86%
alexnet | 90% | 86% | 94% | 7% | 85%
darknet19 | 88% | 88% | 88% | 6% | 87%
darknet53 | 88% | 89% | 87% | 5% | 89%
shufflenet | 88% | 84% | 92% | 8% | 84%
googlenet | 88% | 83% | 93% | 8% | 84%
googlenetplaces | 88% | 89% | 86% | 5% | 87%
resnet101 | 88% | 77% | 98% | 12% | 76%
nasnetlarge | 87% | 83% | 91% | 8% | 84%
resnet50 | 87% | 86% | 88% | 7% | 88%
vgg19 | 86% | 90% | 81% | 5% | 91%
xception | 86% | 79% | 93% | 11% | 83%
resnet18 | 85% | 71% | 100% | 15% | 77%
squeezenet | 84% | 92% | 76% | 4% | 91%
densenet201 | 83% | 71% | 96% | 15% | 72%
inceptionresnetv2 | 83% | 92% | 73% | 4% | 86%
nasnetmobile | 72% | 84% | 60% | 8% | 85%
inceptionv3 | 72% | 83% | 61% | 8% | 83%
mobilenetv2 | 69% | 93% | 46% | 4% | 93%

EXAMPLE 2: Classification Bacterial or Viral Pneumonia


The 19 networks were re-trained (via Transfer Learning) on the relevant training dataset for the given example (3434 images, equally balanced across both classes). The following table shows the performance of each trained network when applied to the validation dataset (balanced, 320 each "bacteria" / "virus") and the holdout dataset (unbalanced, 520 "bacteria" only). The results are ordered (descending) by (i) Average Accuracy (across both classes), then (ii) Viral Accuracy (i.e., the fraction of viral cases correctly diagnosed). Also shown is the Missed Viral rate, i.e., the fraction of the total validation population that should have been diagnosed viral but was missed (wrongly diagnosed as bacterial).

Base network | Validation: Average Accuracy | Validation: Viral Accuracy | Validation: Bacterial Accuracy | Validation: Missed Viral | Holdout: Average Accuracy
darknet53 | 80% | 76% | 84% | 12% | 84%
vgg16 | 80% | 73% | 87% | 14% | 83%
squeezenet | 79% | 75% | 83% | 12% | 80%
vgg19 | 78% | 79% | 78% | 10% | 78%
mobilenetv2 | 78% | 81% | 75% | 9% | 71%
googlenetplaces | 78% | 71% | 86% | 15% | 85%
densenet201 | 78% | 70% | 87% | 15% | 85%
inceptionresnetv2 | 78% | 82% | 74% | 9% | 70%
alexnet | 78% | 81% | 75% | 10% | 71%
googlenet | 77% | 71% | 83% | 15% | 83%
nasnetlarge | 77% | 78% | 76% | 11% | 76%
darknet19 | 77% | 62% | 92% | 19% | 89%
inceptionv3 | 76% | 91% | 60% | 4% | 58%
resnet50 | 75% | 68% | 83% | 16% | 81%
nasnetmobile | 74% | 66% | 81% | 17% | 76%
shufflenet | 69% | 50% | 89% | 25% | 88%
xception | 69% | 43% | 94% | 28% | 90%
resnet101 | 65% | 38% | 92% | 31% | 93%
resnet18 | 58% | 80% | 35% | 10% | 39%


EXAMPLE 3: Classification of COVID-19 or Other-Viral 


The 19 networks were re-trained (via Transfer Learning) on the relevant training dataset for the given example (130 images, equally balanced across both classes). The following table shows the performance of each trained network when applied to the validation dataset (balanced, 11 each "covid" / "other-viral") and the holdout dataset (unbalanced, 1938 "other-viral" only). The results are ordered (descending) by (i) Average Accuracy (across both classes), then (ii) COVID-19 Accuracy (i.e., the fraction of COVID-19 cases correctly diagnosed). Also shown is the Missed COVID-19 rate, i.e., the fraction of the total validation population that should have been diagnosed COVID-19 but was missed (wrongly diagnosed as Other-Viral).

Base network | Validation: Average Accuracy | Validation: COVID-19 Accuracy | Validation: Other-Viral Accuracy | Validation: Missed COVID-19 | Holdout: Average Accuracy
alexnet | 100% | 100% | 100% | 0% | 95%
vgg16 | 100% | 100% | 100% | 0% | 96%
vgg19 | 100% | 100% | 100% | 0% | 97%
darknet19 | 100% | 100% | 100% | 0% | 93%
darknet53 | 100% | 100% | 100% | 0% | 96%
densenet201 | 100% | 100% | 100% | 0% | 96%
googlenet | 100% | 100% | 100% | 0% | 95%
googlenetplaces | 100% | 100% | 100% | 0% | 95%
inceptionresnetv2 | 100% | 100% | 100% | 0% | 96%
inceptionv3 | 100% | 100% | 100% | 0% | 96%
mobilenetv2 | 100% | 100% | 100% | 0% | 95%
resnet18 | 100% | 100% | 100% | 0% | 96%
resnet50 | 100% | 100% | 100% | 0% | 96%
resnet101 | 100% | 100% | 100% | 0% | 96%
shufflenet | 100% | 100% | 100% | 0% | 95%
squeezenet | 100% | 100% | 100% | 0% | 94%
xception | 100% | 100% | 94% | 0% | 96%
nasnetmobile | 95% | 100% | 91% | 0% | 94%
nasnetlarge | 95% | 91% | 100% | 5% | 96%


EXAMPLE 4: Determine if COVID-19 pneumonia versus Healthy, Bacterial, or non-COVID viral pneumonia 


The 19 networks were re-trained (via Transfer Learning) on the relevant training dataset for the given example (260 images, equally balanced across all four classes). The following table shows the performance of each trained network when applied to the validation dataset (balanced, 11 each of "covid" / "other-viral" / "bacterial" / "healthy") and the holdout dataset (unbalanced, zero "covid", 1934 "other-viral", 2463 "bacterial", 676 "healthy"). For succinctness, not all four classes are shown in the table (just the key ones of interest which the network should ideally distinguish: COVID-19 and Healthy). The results are ordered (descending) by (i) Average Accuracy (across all four classes), then (ii) COVID-19 Accuracy (i.e., the fraction of COVID-19 cases correctly diagnosed). Also shown is the Missed COVID-19 rate, i.e., the fraction of the total validation population that should have been diagnosed COVID-19 but was missed (wrongly diagnosed as belonging to one of the other three classes).


Base network | Validation: Average Accuracy | Validation: COVID-19 Accuracy | Validation: Healthy Accuracy | Validation: Missed COVID-19 | Holdout: Average Accuracy
alexnet | 82% | 100% | 100% | 0% | 58%
inceptionresnetv2 | 80% | 100% | 100% | 0% | 61%
googlenet | 80% | 91% | 100% | 2% | 61%
xception | 77% | 100% | 100% | 0% | 58%
inceptionv3 | 77% | 91% | 100% | 2% | 58%
mobilenetv2 | 77% | 91% | 100% | 2% | 61%
densenet201 | 75% | 100% | 100% | 0% | 61%
darknet19 | 75% | 100% | 100% | 0% | 59%
nasnetlarge | 75% | 91% | 100% | 2% | 61%
vgg19 | 73% | 100% | 100% | 0% | 52%
nasnetmobile | 73% | 91% | 100% | 2% | 58%
darknet53 | 73% | 91% | 100% | 2% | 63%
vgg16 | 73% | 91% | 100% | 2% | 61%
googlenetplaces | 73% | 82% | 100% | 5% | 57%
resnet18 | 73% | 73% | 100% | 7% | 60%
resnet50 | 70% | 91% | 100% | 2% | 61%
squeezenet | 70% | 73% | 100% | 7% | 55%
shufflenet | 68% | 91% | 91% | 2% | 59%
resnet101 | 52% | 64% | 100% | 9% | 51%

DISCUSSION & CONCLUSIONS

The main points of discussion surrounding these experiments are summarised as follows:

  • It is interesting to observe that the best performing networks (i.e., those near the top of the lists of results presented above) generally differ from Experiment to Experiment. The differences must be due to the nature and number of images being compared in a given Experiment, and to the detailed structure of the networks and their specific response to the respective image sets in training. 
  • For each Experiment, the most accurate network turned out not to be googlenet as used exclusively in Part 1. This emphasises the importance of trying different networks for a given problem -- and it is not at all clear a priori which network is going to perform best. The results also suggest that resnet50, as used here, is not actually the optimal choice when analysing these lung images via Transfer Learning.
  • Since each Example reveals a different preferred network, a useful strategy for diagnosing COVID-19 could be as follows: (i) use a preferred network from Example 1 (e.g., vgg16 at the top of the list, or some other network from near the top of the list) to determine whether a given X-ray-image-under-test is healthy or unhealthy; (ii) if unhealthy, use a preferred network from Example 2 to determine if viral or bacterial pneumonia; (iii) if viral, use a preferred network from Example 3 to determine if COVID-19 or another type of viral pneumonia; (iv) test the same image using a preferred network from Example 4 (which directly assesses whether-or-not COVID-19). Compare the conclusion of step (iv) with that of step (iii) to see if they reinforce one another by being in agreement on a COVID-19 diagnosis (or not, as the case may be). This multi-network cascaded approach should be more robust than just using a single network (e.g., as per Example 4 alone) to perform the diagnosis.
  • Care was taken to ensure that the training and validation sets used throughout were chosen to be balanced i.e., with equal distribution across all classes in the given Experiment. This left the holdout sets i.e., those containing the unused images from the total available pool, comprising an unbalanced set of test images per Experiment representing a further useful test set. Despite the imbalances, the performance on the networks when applied to the holdout images was generally good, suggesting that the trained networks behave consistently.

POTENTIAL NEXT STEPS

  • In the interests of time, the training runs were only conducted once per model per Experiment  i.e., using one sample of training and validation images per Experiment. For completeness, the training should be repeated with different randomly selected training & validation images (from the available pool) to ensure that the results (in terms of assessing favoured models per Experiment, etc) are statistically significant.
  • Likewise, in the interests of time, the training options (hyper-parameter settings) were fixed (based on quick trial-and-error tests, then frozen for all ensuing experiments). Ideally, these should be optimised, for example using Bayesian Optimisation as described here.
  • It would be interesting to gain an understanding of the differences in the performance of the various networks across the various Experiments. Perhaps a comparative Activation Mapping Analysis (akin to that presented in Part 2) could shed some light (?)
  • It would be interesting to compare the performance of the networks presented in this article with the COVID-Net custom network. Unfortunately, after spending many hours in TensorFlow, I was unable to export the COVID-Net -- either as a Keras model or in ONNX format -- in a manner suitable for importing into MATLAB (via importKerasNetwork or importONNXNetwork). Perhaps, then, the COVID-Net would need to be built from scratch within MATLAB in order to perform the desired comparison. I'm not sure if that is possible (given the underlying structure of COVID-Net). Note: I was able to import and work with the COVID-Net model from here in TensorFlow, but could not successfully export it for use within MATLAB.
  • Re-train and compare all the models with larger image datasets whenever they become available. If you have access to such images, please consider posting them to the open source COVID-Net archive here.

Monday 20 April 2020

Deep Learning Analysis of COVID-19 lung X-Rays using MATLAB: Part 2

UPDATE: See Part 4 where I've performed a grad-CAM analysis on all the trained networks from Part 3, in the theme of Part 2.

*** DISCLAIMER ***


I have no medical training. Nothing presented here should be considered in any way as informative from a medical point-of-view. This is simply an exercise in image analysis via Deep Learning using MATLAB, with lung X-rays as a topical example in these times of COVID-19. 

INTRODUCTION


This follows on from my previous post (Part 1) where I presented results of a preliminary investigation into COVID-19 lung X-ray classification using Deep Learning in MATLAB. The results were promising, but I did emphasise my main caveat that the Deep Neural Networks may have been skewed by extraneous information embedded in the X-ray images, leading to exaggerated performance of the classifiers. In this post, I utilise the approach suggested here (another MATLAB-based COVID-19 image investigation), based on the Class Activation Mapping technique described here, to determine the hotspots in the images which drive the classification results. This verification analysis mirrors that presented in the original COVID-Net article (where they utilise the GSInquire tool for a similar purpose). As before, my approach is to use MATLAB for all calculations, and to provide code snippets which may be useful to others.

GOTCHA: In Part 1 I was using MATLAB version R2019b. For this current investigation I upgraded to R2020a for the following reasons:
  • The mean function in R2020a has an additional option for vecdim as the second input argument, as required in the code I utilised from here.
  • The structure of the pre-trained networks (e.g., googlenet, which I use) has changed such that the class names are held in the Classes property of the output layer in R2020a rather than in the ClassNames property as in R2019b. I could have simply modified my code to work around the difference, but given the first reason above (especially), I decided to upgrade the versioning (and hopefully this will avoid future problems).

CLASS ACTIVATION MAPPING

Dataset

Using the validation results from the Deep Neural Net analysis in Example 4 of Part 1 provides a set of 44 sample X-rays and predicted classes, 11 from each of the four classes in question: "healthy", "bacteria", "viral-other", and "covid". By choosing Example 4, we have selected the most challenging case to investigate (i.e., the 4-class classifier trained on relatively few images compared with Examples 1--3, each of which was a 2-class classifier trained on more images than Example 4).

The images are contained in validationImages (the validation imageDatastore) from Example 4 and the trained network (from Transfer Learning) is contained in the netTransfer variable.  The task at hand is to analyse the Class Activation Mappings to determine which regions of the X-rays play the dominant role in assessing the predicted class. 

Code snippet 

The code which performs the Class Activation Mapping using the netTransfer network (in a loop around all 44 images in validationImages) is adapted directly from this example, and presented in full as follows (the utility sub-functions -- identical to those in the example -- are not included here):

net = netTransfer;
netName = "googlenet";
classes = net.Layers(end).Classes;
layerName = activationLayerName(netName);   % utility sub-function from the cited example

for i = 1:length(validationImages.Files)
   h = figure('Units','normalized','Position',[0.05 0.05 0.9 0.8],'Visible','on');

   [img,fileinfo] = readimage(validationImages,i);
   im = img(:,:,[1 1 1]);                   % convert from grayscale to RGB
   imResized = imresize(img,[224 224]);
   imResized = imResized(:,:,[1 1 1]);      % convert to RGB

   % Activations of the final convolutional feature layer
   imageActivations = activations(net,imResized,layerName);

   % Class scores from the (replaced) fully connected layer
   scores = squeeze(mean(imageActivations,[1 2]));
   fcWeights = net.Layers(end-2).Weights;
   fcBias = net.Layers(end-2).Bias;
   scores = fcWeights*scores + fcBias;
   [~,classIds] = maxk(scores,4);           % since 4 classes to compare

   % Class Activation Map for the top-scoring class
   weightVector = shiftdim(fcWeights(classIds(1),:),-1);
   classActivationMap = sum(imageActivations.*weightVector,3);

   % Softmax to convert scores into probabilities
   scores = exp(scores)/sum(exp(scores));
   maxScores = scores(classIds);
   labels = classes(classIds);
   [maxScore, maxID] = max(maxScores);
   labels_max = labels(maxID);

   CAMshow(im,classActivationMap)           % utility sub-function from the cited example
   title("Predicted: "+string(labels_max)+", "+string(maxScore)+ ...
       " (Actual: "+string(validationImages.Labels(i))+")",'FontSize',18);

   drawnow
end

Results & Conclusions 

The resulting Class Activation Maps for all 44 validation images are shown below. The title of each image contains the predicted class (plus the corresponding score) and the actual class. Since the network is not 100% accurate, some of the predictions are incorrect. However, it is clear from these image activation heat-maps that the network is generally using the detail within the lungs (albeit with a few activations in regions further away) rather than extraneous factors and artefacts (embedded text, pacemakers, etc.) to make the predictions. This is an encouraging result, successfully countering the caveat from Part 1 regarding the possibility of the classifier performance being exaggerated by such artefacts, and is in line with the conclusions reported here and here from similar studies.

Class Activation Maps