Commit e7bae7b1 authored by lindawangg

added 2 more test covid samples and 6 more train covid samples

parent 50e7edf9
......@@ -5,7 +5,7 @@
<em>Example chest radiography images of COVID-19 cases from 2 different patients and their associated critical factors (highlighted in red) as identified by GSInquire.</em>
</p>
[Linda Wang and Alexander Wong, "COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images", 2020. (pdf)](https://arxiv.org/abs/2003.09871)
[Linda Wang and Alexander Wong, "COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images", 2020.](https://arxiv.org/abs/2003.09871)
The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. A critical step in the fight against COVID-19 is effective screening of infected patients, with one of the key screening approaches being radiological imaging using chest radiography. It was found in early studies that patients present abnormalities in chest radiography images that are characteristic of those infected with COVID-19. Motivated by this, a number of artificial intelligence (AI) systems based on deep learning have been proposed and results have been shown to be quite promising in terms of accuracy in detecting patients infected with COVID-19 using chest radiography images. However, to the best of the authors' knowledge, these developed AI systems have been closed source and unavailable to the research community for deeper understanding and extension, and unavailable for public access and use. Therefore, in this study we introduce COVID-Net, a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest radiography images that is open source and available to the general public. We also describe the chest radiography dataset leveraged to train COVID-Net, which we will refer to as COVIDx and is comprised of 5941 posteroanterior chest radiography images across 2839 patient cases from two open access data repositories. Furthermore, we investigate how COVID-Net makes predictions using an explainability method in an attempt to gain deeper insights into critical factors associated with COVID cases, which can aid clinicians in improved screening.
By no means a production-ready solution, the hope is that the open access COVID-Net, along with the description on constructing the open source COVIDx dataset, will be leveraged and built upon by both researchers and citizen data scientists alike to accelerate the development of highly accurate yet practical deep learning solutions for detecting COVID-19 cases and accelerate treatment of those who need it the most.
......@@ -43,19 +43,20 @@ We provide jupyter notebooks for [creating the COVIDx dataset](create_COVIDx.ipy
This project is still a work in progress, and these files will be updated continuously.
The COVIDx dataset can be downloaded [here](https://drive.google.com/file/d/1-T26bHP7MCwB8vWeKufjGmPKl8pesM1J/view?usp=sharing).
Preprocessed ready-for-training COVIDx dataset can be downloaded [here](https://drive.google.com/file/d/1zCnmcMxSRZTqJywur7jCqZk0z__Mevxp/view?usp=sharing).
Preprocessed ready-for-training COVIDx dataset can be downloaded [here](https://drive.google.com/file/d/1zCnmcMxSRZTqJywur7jCqZk0z__Mevxp/view?usp=sharing). Note: for the most up-to-date train/test data,
generate it using the [preprocessing script](preprocessing.ipynb).
Chest radiography images distribution
| Type | Normal | Bacterial| Non-COVID19 Viral | COVID-19 Viral | Total |
|:-----:|:------:|:--------:|:-----------------:|:--------------:|:-----:|
| train | 1349 | 2540 | 1355 | 60 | 5304 |
| test | 234 | 246 | 149 | 8 | 637 |
| train | 1349 | 2540 | 1355 | 66 | 5310 |
| test | 234 | 246 | 149 | 10 | 639 |
Patients distribution
| Type | Normal | Bacterial | Non-COVID19 Viral| COVID-19 Viral | Total |
|:-----:|:------:|:---------:|:----------------:|:--------------:|:-----:|
| train | 1001 | 853 | 534 | 41 | 2429 |
| test | 202 | 78 | 126 | 4 | 410 |
| train | 1001 | 853 | 534 | 47 | 2435 |
| test | 202 | 78 | 126 | 5 | 411 |
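The updated rows in the two tables above can be sanity-checked by summing the per-class counts and comparing them with the stated totals. The following is a minimal sketch, not part of the repository; the dictionary layout and the `check_totals` helper are hypothetical names introduced only for illustration, and the numbers are copied from the new (post-commit) rows.

```python
# Per-class counts copied from the updated rows of the two tables.
# Column order: Normal, Bacterial, Non-COVID19 Viral, COVID-19 Viral.
# Each entry is (per-class counts, stated total).
images = {
    "train": ([1349, 2540, 1355, 66], 5310),
    "test": ([234, 246, 149, 10], 639),
}
patients = {
    "train": ([1001, 853, 534, 47], 2435),
    "test": ([202, 78, 126, 5], 411),
}


def check_totals(table):
    """Return True if every row's per-class counts sum to its stated total."""
    return all(sum(counts) == total for counts, total in table.values())


# Grand totals across train and test.
total_images = sum(total for _, total in images.values())
total_patients = sum(total for _, total in patients.values())

print(check_totals(images), check_totals(patients))
print(total_images, total_patients)
```

With the updated rows, the grand totals come to 5949 images and 2846 patients. The abstract quoted earlier still gives 5941 images and 2839 patients, which matches the pre-commit rows (5304 + 637 and 2429 + 410).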
## Training and Evaluation
Releasing soon, but you can already download COVID-Net and start training/inference [here](https://drive.google.com/file/d/1FyfcAkRf-0gQ1nOrDJ9ccGVSZAO9VFP1/view?usp=sharing).
......
......@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 161,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
......@@ -15,14 +15,14 @@
},
{
"cell_type": "code",
"execution_count": 215,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# set parameters here\n",
"savepath = 'data'\n",
"seed = 0\n",
"np.random.seed(seed). # Reset the seed so all runs are the same.\n",
"np.random.seed(seed) # Reset the seed so all runs are the same.\n",
"random.seed(seed)\n",
"MAXVAL = 255 # Range [0 255]\n",
"\n",
......@@ -43,7 +43,7 @@
},
{
"cell_type": "code",
"execution_count": 162,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
......@@ -65,18 +65,9 @@
},
{
"cell_type": "code",
"execution_count": 218,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'normal': 0, 'viral': 11, 'bacteria': 6, 'COVID-19': 68}\n",
"68\n"
]
}
],
"outputs": [],
"source": [
"# get non-COVID19 viral, bacteria, and COVID-19 infections from covid-chestxray-dataset\n",
"# stored as patient id, image filename and label\n",
......@@ -95,24 +86,9 @@
},
{
"cell_type": "code",
"execution_count": 243,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Key: viral\n",
"Test patients: ['8']\n",
"Key: bacteria\n",
"Test patients: ['31']\n",
"Key: COVID-19\n",
"Test patients: ['36', '42', '19', '20']\n",
"test count: {'normal': 0, 'viral': 1, 'bacteria': 4, 'COVID-19': 8}\n",
"train count: {'normal': 0, 'viral': 10, 'bacteria': 2, 'COVID-19': 60}\n"
]
}
],
"outputs": [],
"source": [
"# add covid-chestxray-dataset into COVIDx dataset\n",
"# since covid-chestxray-dataset doesn't have test dataset\n",
......@@ -120,7 +96,7 @@
"# for COVIDx:\n",
"# patient 8 is used as non-COVID19 viral test\n",
"# patient 31 is used as bacterial test\n",
"# patients 19, 20, 36, 42 are used as COVID-19 viral test\n",
"# patients 19, 20, 36, 42, 86 are used as COVID-19 viral test\n",
"\n",
"for key in filename_label.keys():\n",
"    arr = np.array(filename_label[key])\n",
......@@ -150,7 +126,7 @@
},
{
"cell_type": "code",
"execution_count": 244,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
......@@ -194,21 +170,9 @@
},
{
"cell_type": "code",
"execution_count": 245,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final stats\n",
"Train count: {'normal': 1349, 'viral': 1355, 'bacteria': 2540, 'COVID-19': 60}\n",
"Test count: {'normal': 234, 'viral': 149, 'bacteria': 246, 'COVID-19': 8}\n",
"Total length of train: 5304\n",
"Total length of test: 637\n"
]
}
],
"outputs": [],
"source": [
"# final stats\n",
"print('Final stats')\n",
......@@ -220,7 +184,83 @@
},
{
"cell_type": "code",
"execution_count": 246,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run this cell when adding in new covid data from covid-chestxray-dataset\n",
"\n",
"# load in current train/test information\n",
"train_filepath = 'train_split.txt'\n",
"test_filepath = 'test_split.txt'\n",
"# use context managers so the split files are closed after reading\n",
"with open(train_filepath, 'r') as file:\n",
"    trainfiles = np.array([line.split() for line in file])\n",
"with open(test_filepath, 'r') as file:\n",
"    testfiles = np.array([line.split() for line in file])\n",
"\n",
"# find the new entries in csv\n",
"new_entries = []\n",
"for key in filename_label.keys():\n",
"    arr = np.array(filename_label[key])\n",
"    if arr.size == 0:\n",
"        continue\n",
"    for patient in arr:\n",
"        if patient[1] not in trainfiles and patient[1] not in testfiles:\n",
"            # if key is normal, bacteria or viral add to train folder\n",
"            if key in ['normal', 'bacteria', 'viral']:\n",
"                copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'train', patient[1]))\n",
"                train.append(patient)\n",
"                train_count[patient[2]] += 1\n",
"            else:\n",
"                new_entries.append(patient)\n",
"new_entries = np.array(new_entries)\n",
"\n",
"# 10% of new entries should go into test\n",
"if new_entries.size > 0:\n",
"    num_diff_patients = len(np.unique(new_entries[:,0]))\n",
"    num_test = max(1, round(split*num_diff_patients))\n",
"\n",
"    i = 0\n",
"    used_i = []\n",
"    # insert patients who are already in dataset into the respective train/test\n",
"    for patient in new_entries:\n",
"        if patient[0] in trainfiles:\n",
"            copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'train', patient[1]))\n",
"            train.append(patient)\n",
"            train_count[patient[2]] += 1\n",
"            used_i.append(i)\n",
"        elif patient[0] in testfiles:\n",
"            copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'test', patient[1]))\n",
"            test.append(patient)\n",
"            test_count[patient[2]] += 1\n",
"            used_i.append(i)\n",
"        i += 1\n",
"    # delete patients who have already been inserted\n",
"    new_entries = np.delete(new_entries, used_i, axis=0)\n",
"\n",
"    # select num_test number of random patients\n",
"    test_patients = random.sample(list(new_entries[:,0]), num_test)\n",
"    print('test patients: ', test_patients)\n",
"    # add to respective train/test folders\n",
"    for patient in new_entries:\n",
"        if patient[0] in test_patients:\n",
"            copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'test', patient[1]))\n",
"            test.append(patient)\n",
"            test_count[patient[2]] += 1\n",
"        else:\n",
"            copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'train', patient[1]))\n",
"            train.append(patient)\n",
"            train_count[patient[2]] += 1\n",
"\n",
"print('added test count: ', test_count)\n",
"print('added train count: ', train_count)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
......@@ -240,6 +280,13 @@
"\n",
"test_file.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
......
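The notebook cell added in this commit assigns new COVID-19 images to train or test per patient, so that one patient's images never end up on both sides. The idea can be sketched without any file copying as below. This is a simplified illustration under stated assumptions, not the notebook's code: the function name `split_new_patients`, its parameters, and the default `split=0.1` are hypothetical, and unlike the notebook it samples from the set of unique unseen patient IDs and computes the test quota after already-known patients have been set aside.

```python
import random


def split_new_patients(entries, known_train_ids, known_test_ids, split=0.1, seed=0):
    """Assign (patient_id, filename, label) entries to train or test by patient.

    Patients already present in a split keep that split; roughly `split`
    of the remaining unseen patients (at least one) go to test.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    train, test, unseen = [], [], []

    # Patients already in the dataset stay in their existing split.
    for entry in entries:
        patient_id = entry[0]
        if patient_id in known_train_ids:
            train.append(entry)
        elif patient_id in known_test_ids:
            test.append(entry)
        else:
            unseen.append(entry)

    # Sample test patients from the unique unseen patient IDs.
    unseen_ids = sorted({entry[0] for entry in unseen})
    if unseen_ids:
        num_test = max(1, round(split * len(unseen_ids)))
        test_ids = set(rng.sample(unseen_ids, num_test))
        for entry in unseen:
            (test if entry[0] in test_ids else train).append(entry)

    return train, test


# Hypothetical entries: patient "1" has two images, patient "3" is already in train.
entries = [
    ("1", "a.jpg", "COVID-19"),
    ("1", "b.jpg", "COVID-19"),
    ("2", "c.jpg", "COVID-19"),
    ("3", "d.jpg", "COVID-19"),
]
train, test = split_new_patients(entries, known_train_ids={"3"}, known_test_ids=set())
print(len(train), len(test))
```

Sampling from unique IDs keeps a patient with many images from being more likely to land in test, which is a design choice of this sketch rather than something the commit states.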
......@@ -2,12 +2,11 @@
"cells": [
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"import keras\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import os"
......@@ -15,7 +14,7 @@
},
{
"cell_type": "code",
"execution_count": 61,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
......@@ -28,7 +27,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
......@@ -41,15 +40,15 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5304\n",
"637\n"
"Total samples for train: 5310\n",
"Total samples for test: 639\n"
]
}
],
......@@ -60,15 +59,15 @@
},
{
"cell_type": "code",
"execution_count": 62,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(224, 224, 3)\n",
"(224, 224, 3)\n"
"Shape of test images: (224, 224, 3)\n",
"Shape of train images: (224, 224, 3)\n"
]
}
],
......@@ -105,7 +104,7 @@
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
......
......@@ -635,3 +635,5 @@
1000109 person109_bacteria_512.jpeg bacteria
100083 person83_bacteria_410.jpeg bacteria
1000112 person112_bacteria_538.jpeg bacteria
86 B2D20576-00B7-4519-A415-72DE29C90C34.jpeg COVID-19
86 6C94A287-C059-46A0-8600-AFB95F4727B7.jpeg COVID-19
......@@ -5302,3 +5302,9 @@
10001946 person1946_bacteria_4875.jpeg bacteria
10001949 person1949_bacteria_4880.jpeg bacteria
10001954 person1954_bacteria_4886.jpeg bacteria
75 covid-19-pneumonia-19.jpg COVID-19
77 03BF7561-A9BA-4C3C-B8A0-D3E585F73F3C.jpeg COVID-19
78 353889E0-A1E8-4F9E-A0B8-F24F36BCFBFB.jpeg COVID-19
80 figure1-5e75d0940b71e1b702629659-98-right.jpeg COVID-19
81 figure1-5e71be566aa8714a04de3386-98-left.jpeg COVID-19
85 2966893D-5DDF-4B68-9E2B-4979D5956C8E.jpeg COVID-19
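Each line added to the split files above has three whitespace-separated fields: patient ID, image filename, and label. As a rough sketch, per-label counts such as those in the README tables could be recomputed from lines in this format as follows. The `count_labels` helper is a hypothetical name, and the sketch assumes filenames contain no spaces, which holds for the lines shown here but is not guaranteed by the source.

```python
from collections import Counter


def count_labels(lines):
    """Count images per label from 'patient_id filename label' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        _patient_id, _filename, label = parts
        counts[label] += 1
    return counts


# Sample lines copied from the diff above.
sample = [
    "1000112 person112_bacteria_538.jpeg bacteria",
    "86 B2D20576-00B7-4519-A415-72DE29C90C34.jpeg COVID-19",
    "86 6C94A287-C059-46A0-8600-AFB95F4727B7.jpeg COVID-19",
]
counts = count_labels(sample)
print(dict(counts))
```

To run it on a whole file, pass an open file object, for example `count_labels(open('train_split.txt'))`, assuming the split file sits in the working directory as in the notebook.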