Commit c4da4b05 authored by lindawangg's avatar lindawangg

NEW DATASET, along with eval script

parent 31c69380
.ipynb_checkpoints
chest_xray
data
.DS_Store
*.zip
__pycache__
callbacks.py
class_labels.csv
explain/
masks.ipynb
eval.ipynb
get_stats.ipynb
graph.csv
hyperparameter_op.ipynb
models/
output/
train-Copy1.ipynb
train_3000.txt
test_split_v2.txt
gensynth/
train.py
train.ipynb
data.py
export_to_meta.py
model.py
train_tf.py
# COVID-Net Open Source Initiative
<p align="center">
<img src="assets/covid-2p-rca.png" alt="photo not available" width="70%" height="70%">
<img src="assets/covidnet-small-exp.png" alt="photo not available" width="70%" height="70%">
<br>
<em>Example chest radiography images of COVID-19 cases from 2 different patients and their associated critical factors (highlighted in red) as identified by GSInquire.</em>
</p>
[Linda Wang and Alexander Wong, "COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images", 2020.](https://arxiv.org/abs/2003.09871)
**Core COVID-Net team: Linda Wang, Alexander Wong, Zhong Qiu Lin, James Lee, Paul McInnis**
The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. A critical step in the fight against COVID-19 is effective screening of infected patients, with one of the key screening approaches being radiological imaging using chest radiography. It was found in early studies that patients present abnormalities in chest radiography images that are characteristic of those infected with COVID-19. Motivated by this, a number of artificial intelligence (AI) systems based on deep learning have been proposed and results have been shown to be quite promising in terms of accuracy in detecting patients infected with COVID-19 using chest radiography images. However, to the best of the authors' knowledge, these developed AI systems have been closed source and unavailable to the research community for deeper understanding and extension, and unavailable for public access and use. Therefore, in this study we introduce COVID-Net, a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest radiography images that is open source and available to the general public. We also describe the chest radiography dataset leveraged to train COVID-Net, which we will refer to as COVIDx and is comprised of 16,756 posteroanterior chest radiography images across 13,645 patient cases from two open access data repositories. Furthermore, we investigate how COVID-Net makes predictions using an explainability method in an attempt to gain deeper insights into critical factors associated with COVID cases, which can aid clinicians in improved screening. 
By no means a production-ready solution, the hope is that the open access COVID-Net, along with the description on constructing the open source COVIDx dataset, will be leveraged and built upon by both researchers and citizen data scientists alike to accelerate the development of highly accurate yet practical deep learning solutions for detecting COVID-19 cases and accelerate treatment of those who need it the most.
For a detailed description of the methodology behind COVID-Net and a full description of the COVIDx dataset, please click [here](assets/COVID_Netv2.pdf).
Currently, the COVID-Net team is working on COVID-RiskNet, a deep neural network tailored for COVID-19 risk stratification. Stay tuned as we make it available soon.
If you would like to contribute COVID-19 x-ray images, please contact us at linda.wang513@gmail.com and a28wong@uwaterloo.ca or alex@darwinai.ca. Let's all work together to stop the spread of COVID-19!
If you find our work useful, you can cite our paper.
## Requirements
Install requirements using `pip install -r requirements.txt`
The main requirements are listed below:
* Tested with TensorFlow 1.13 and 1.15
* Keras 2.3.1
* Python 3.6
* OpenCV 4.2.0
* PyDicom
## COVIDx Dataset
**Update: we have released the brand-new COVIDx dataset with 16,756 posteroanterior chest radiography images across 13,645 patient cases.**
The current COVIDx dataset is constructed from the following open source chest radiography datasets:
* https://github.com/ieee8023/covid-chestxray-dataset
* https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
We especially thank the Radiological Society of North America and others involved in the RSNA Pneumonia Detection Challenge, and Dr. Joseph Paul Cohen and the team at MILA involved in the COVID-19 image data collection project, for making data available to the global community.
### Steps to generate the dataset
We provide jupyter notebooks for [creating the COVIDx dataset](create_COVIDx.ipynb) and the [preprocessing](preprocessing.ipynb) used for training.
This project is still a work in progress, and we will continuously update these files.
1. Download the datasets listed above
* `git clone https://github.com/ieee8023/covid-chestxray-dataset.git`
* go to this [link](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data) to download the RSNA pneumonia dataset
2. Create a `data` directory and within the data directory, create a `train` and `test` directory
3. Use [create\_COVIDx\_v2.ipynb](create_COVIDx_v2.ipynb) to combine the two datasets to create COVIDx. Remember to change the file paths.
4. We provide the train and test txt files with patientId, image path and label (normal, pneumonia or COVID-19). The description for each file is explained below:
* [train\_COVIDx.txt](train_COVIDx.txt): This file contains the samples used for training.
* [test\_COVIDx.txt](test_COVIDx.txt): This file contains the samples used for testing.
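As a quick illustration of the split-file format described above, each line can be parsed into a `(patientid, image path, label)` tuple. This is a minimal sketch; the sample lines and the `parse_split` helper below are made up for illustration, not taken from the repository.

```python
# Parse COVIDx split-file lines of the form "patientid filename label".
def parse_split(lines):
    return [tuple(line.strip().split()) for line in lines if line.strip()]

# Hypothetical sample lines in the documented format.
samples = parse_split([
    "8 covid-example.png COVID-19",
    "100012 person12_bacteria_1.jpeg pneumonia",
])
print(samples)  # → [('8', 'covid-example.png', 'COVID-19'), ('100012', 'person12_bacteria_1.jpeg', 'pneumonia')]
```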
The COVIDx dataset can be downloaded [here](https://drive.google.com/file/d/1-T26bHP7MCwB8vWeKufjGmPKl8pesM1J/view?usp=sharing).
Preprocessed ready-for-training COVIDx dataset can be downloaded [here](https://drive.google.com/file/d/1zCnmcMxSRZTqJywur7jCqZk0z__Mevxp/view?usp=sharing). Note: for the most up-to-date train/test data, generate it using the [preprocessing script](preprocessing.ipynb).
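Step 2 of the steps above can be scripted in a couple of lines. This is a minimal sketch; the `data/train` and `data/test` paths are the defaults the notebooks assume.

```python
import os

# create data/train and data/test as expected by the notebooks
for split in ('train', 'test'):
    os.makedirs(os.path.join('data', split), exist_ok=True)
```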
### COVIDx data distribution
Chest radiography image distribution

| Type | Normal | Pneumonia | COVID-19 | Total |
|:-----:|:------:|:---------:|:--------:|:-----:|
| train | 7966 | 8514 | 66 | 16279 |
| test | 100 | 100 | 10 | 210 |
Patient distribution

| Type | Normal | Pneumonia | COVID-19 | Total |
|:-----:|:------:|:---------:|:--------:|:------:|
| train | 7966 | 5429 | 48 | 13443 |
| test | 100 | 97 | 5 | 202 |
## Training and Evaluation
The training scripts are releasing soon, but you can download COVID-Net and start training/inference [here](https://drive.google.com/file/d/1FyfcAkRf-0gQ1nOrDJ9ccGVSZAO9VFP1/view?usp=sharing).
The network takes as input an image of shape (N, 224, 224, 3) and outputs the softmax probabilities as (N, 3), where N is the batch size.
If using the TF checkpoints, here are some useful tensors:
* input tensor: `input_1:0`
* output tensor: `dense_3/Softmax:0`
* label tensor: `dense_3_target:0`
* class weights tensor: `dense_3_sample_weights:0`
* loss tensor: `loss/mul:0`
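For context, mapping the (N, 3) softmax output described above to class labels might look like the sketch below. The label ordering here is an assumption for illustration only; the actual mapping lives in `class_labels.csv` in the repository.

```python
import numpy as np

# Hypothetical label order for the 3-class output; check class_labels.csv
# in the repository for the actual mapping.
LABELS = ['normal', 'pneumonia', 'COVID-19']

def predictions_to_labels(probs):
    """probs: (N, 3) array of softmax probabilities -> list of N labels."""
    return [LABELS[i] for i in np.argmax(probs, axis=1)]

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(predictions_to_labels(probs))  # → ['normal', 'COVID-19']
```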
### Steps for training
We will release the TF training script for the pretrained model soon.
<!--1. To train from scratch, `python train.py`
2. To train from an existing hdf5 file, `python train.py --checkpoint output/example/cp-0.hdf5`
3. For more options and information, `python train.py --help`
4. If you have a GenSynth account, to convert hdf5 file to TF checkpoints,
`python export_to_meta.py --weightspath output/example --weightspath cp-0.hdf5`-->
### Steps for evaluation
1. We provide the TensorFlow evaluation script, [eval.py](eval.py)
2. Locate the TensorFlow checkpoint files
3. To evaluate a tf checkpoint, `python eval.py --weightspath models/COVID-Netv2 --metaname model.meta --ckptname model`
4. For more options and information, `python eval.py --help`
5. If evaluating an hdf5 model, evaluation is the same as what is printed at the end of [train.py](train.py)
## Results
These are the final results for COVID-Net Small and COVID-Net Large.
### COVID-Net Small
<p align="center">
<img src="assets/cm-covidnet-small.png" alt="photo not available" width="50%" height="50%">
<br>
<em>Confusion matrix for COVID-Net on the COVIDx test dataset.</em>
</p>
<div class="tg-wrap" align="center"><table class="tg">
<tr>
<th class="tg-7btt" colspan="3">Sensitivity (%)</th>
</tr>
<tr>
<td class="tg-7btt">Normal</td>
<td class="tg-7btt">Pneumonia</td>
<td class="tg-7btt">COVID-19</td>
</tr>
<tr>
<td class="tg-c3ow">95.0</td>
<td class="tg-c3ow">91.0</td>
<td class="tg-c3ow">80.0</td>
</tr>
</table></div>
<div class="tg-wrap"><table class="tg">
<tr>
<th class="tg-7btt" colspan="3">Positive Predictive Value (%)</th>
</tr>
<tr>
<td class="tg-7btt">Normal</td>
<td class="tg-7btt">Pneumonia</td>
<td class="tg-7btt">COVID-19</td>
</tr>
<tr>
<td class="tg-c3ow">91.3</td>
<td class="tg-c3ow">93.8</td>
<td class="tg-c3ow">88.9</td>
</tr>
</table></div>
### COVID-Net Large
<p align="center">
<img src="assets/confusion.png" alt="photo not available" width="50%" height="50%">
<img src="assets/cm-covidnet-large.png" alt="photo not available" width="50%" height="50%">
<br>
<em>Confusion matrix for COVID-Net on the COVIDx test dataset.</em>
</p>
<div class="tg-wrap" align="center"><table class="tg">
<tr>
<th class="tg-7btt" colspan="4">Sensitivity (%)</th>
<th class="tg-7btt" colspan="3">Sensitivity (%)</th>
</tr>
<tr>
<td class="tg-7btt">Normal</td>
<td class="tg-7btt">Bacterial</td>
<td class="tg-7btt">Non-COVID19 Viral</td>
<td class="tg-7btt">COVID-19 Viral</td>
<td class="tg-7btt">Pneumonia</td>
<td class="tg-7btt">COVID-19</td>
</tr>
<tr>
<td class="tg-c3ow">73.9</td>
<td class="tg-c3ow">93.1</td>
<td class="tg-c3ow">81.9</td>
<td class="tg-c3ow">100.0</td>
<td class="tg-c3ow">94.0</td>
<td class="tg-c3ow">90.0</td>
<td class="tg-c3ow">90.0</td>
</tr>
</table></div>
<div class="tg-wrap"><table class="tg">
<tr>
<th class="tg-7btt" colspan="4">Positive Predictive Value (%)</th>
<th class="tg-7btt" colspan="3">Positive Predictive Value (%)</th>
</tr>
<tr>
<td class="tg-7btt">Normal</td>
<td class="tg-7btt">Bacterial</td>
<td class="tg-7btt">Non-COVID19 Viral</td>
<td class="tg-7btt">COVID-19 Viral</td>
<td class="tg-7btt">Pneumonia</td>
<td class="tg-7btt">COVID-19</td>
</tr>
<tr>
<td class="tg-c3ow">95.1</td>
<td class="tg-c3ow">87.1</td>
<td class="tg-c3ow">67.0</td>
<td class="tg-c3ow">80.0</td>
<td class="tg-c3ow">90.4</td>
<td class="tg-c3ow">93.8</td>
<td class="tg-c3ow">90.0</td>
</tr>
</table></div>
## Pretrained Models
Can download COVID-Net tensorflow model from [here](https://drive.google.com/file/d/1FyfcAkRf-0gQ1nOrDJ9ccGVSZAO9VFP1/view?usp=sharing)
| Type | COVID-19 Sensitivity | # Params (M) | Model |
|:-----:|:--------------------:|:------------:|:-------------------:|
| ckpt | 80.0 | 116 |[COVID-Net Small](bit.ly/COVID-Net-Small)|
| ckpt | 90.0 | 126 |[COVID-Net Large](bit.ly/COVID-Net-Large)|
{
"cells": [
{
"cell_type": "code",
"execution_count": 161,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import os\n",
"import random \n",
"from shutil import copyfile"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [],
"source": [
"# set parameters here\n",
"savepath = 'data'\n",
"seed = 0\n",
"np.random.seed(seed). # Reset the seed so all runs are the same.\n",
"random.seed(seed)\n",
"MAXVAL = 255 # Range [0 255]\n",
"\n",
"# path to covid-19 dataset from https://github.com/ieee8023/covid-chestxray-dataset\n",
"imgpath = '../covid-chestxray-dataset/images' \n",
"csvpath = '../covid-chestxray-dataset/metadata.csv'\n",
"\n",
"# path to kaggle chest xray data from https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia\n",
"data_path = 'chest_xray'\n",
"\n",
"# parameters for COVIDx dataset\n",
"train = []\n",
"test = []\n",
"split = 0.1 # train/test split\n",
"test_count = {'normal': 0, 'viral': 0, 'bacteria': 0, 'COVID-19': 0}\n",
"train_count = {'normal': 0, 'viral': 0, 'bacteria': 0, 'COVID-19': 0}"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [],
"source": [
"# adapted from https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py#L814\n",
"csv = pd.read_csv(csvpath, nrows=None)\n",
"idx_pa = csv[\"view\"] == \"PA\" # Keep only the PA view\n",
"csv = csv[idx_pa]\n",
"\n",
"pneumonias = [\"COVID-19\", \"SARS\", \"MERS\", \"ARDS\", \"Streptococcus\"]\n",
"pathologies = [\"Pneumonia\",\"Viral Pneumonia\", \"Bacterial Pneumonia\", \"No Finding\"] + pneumonias\n",
"pathologies = sorted(pathologies)\n",
"\n",
"mapping = dict()\n",
"mapping['COVID-19'] = 'COVID-19'\n",
"mapping['SARS'] = 'viral'\n",
"mapping['MERS'] = 'viral'\n",
"mapping['Streptococcus'] = 'bacteria'"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'normal': 0, 'viral': 11, 'bacteria': 6, 'COVID-19': 68}\n",
"68\n"
]
}
],
"source": [
"# get non-COVID19 viral, bacteria, and COVID-19 infections from covid-chestxray-dataset\n",
"# stored as patient id, image filename and label\n",
"filename_label = {'normal': [], 'viral': [], 'bacteria': [], 'COVID-19': []}\n",
"count = {'normal': 0, 'viral': 0, 'bacteria': 0, 'COVID-19': 0}\n",
"for index, row in csv.iterrows():\n",
" f = row['finding']\n",
" if f in mapping:\n",
" count[mapping[f]] += 1\n",
" entry = [int(row['Patientid']), row['filename'], mapping[f]]\n",
" filename_label[mapping[f]].append(entry)\n",
"\n",
"print('Data distribution from covid-chestxray-dataset:')\n",
"print(count)"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Key: viral\n",
"Test patients: ['8']\n",
"Key: bacteria\n",
"Test patients: ['31']\n",
"Key: COVID-19\n",
"Test patients: ['36', '42', '19', '20']\n",
"test count: {'normal': 0, 'viral': 1, 'bacteria': 4, 'COVID-19': 8}\n",
"train count: {'normal': 0, 'viral': 10, 'bacteria': 2, 'COVID-19': 60}\n"
]
}
],
"source": [
"# add covid-chestxray-dataset into COVIDx dataset\n",
"# since covid-chestxray-dataset doesn't have test dataset\n",
"# split into train/test by patientid\n",
"# for COVIDx:\n",
"# patient 8 is used as non-COVID19 viral test\n",
"# patient 31 is used as bacterial test\n",
"# patients 19, 20, 36, 42 are used as COVID-19 viral test\n",
"\n",
"for key in filename_label.keys():\n",
" arr = np.array(filename_label[key])\n",
" if arr.size == 0:\n",
" continue\n",
" # split by patients\n",
" num_diff_patients = len(np.unique(arr[:,0]))\n",
" num_test = max(1, round(split*num_diff_patients))\n",
" # select num_test number of random patients\n",
" test_patients = random.sample(list(arr[:,0]), num_test)\n",
" print('Key: ', key)\n",
" print('Test patients: ', test_patients)\n",
" # go through all the patients\n",
" for patient in arr:\n",
" if patient[0] in test_patients:\n",
" copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'test', patient[1]))\n",
" test.append(patient)\n",
" test_count[patient[2]] += 1\n",
" else:\n",
" copyfile(os.path.join(imgpath, patient[1]), os.path.join(savepath, 'train', patient[1]))\n",
" train.append(patient)\n",
" train_count[patient[2]] += 1\n",
"\n",
"print('test count: ', test_count)\n",
"print('train count: ', train_count)"
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {},
"outputs": [],
"source": [
"# add kaggle chest xray data into COVID19\n",
"folders = ['train', 'val', 'test']\n",
"\n",
"# train, val, test normal data\n",
"for folder in folders: \n",
" for img in os.listdir(os.path.join(data_path, folder, 'NORMAL')):\n",
" if '.jp' in img:\n",
" new_img = img.strip('IM-')\n",
" new_img = new_img.strip('NORMAL2-IM-')\n",
" # add to current dataset\n",
" patientid = '1000' + new_img.split('-')[0] # add 1000 in front of kaggle patient ids\n",
" if folder == 'train' or folder == 'val':\n",
" # copy files to data folder\n",
" copyfile(os.path.join(data_path, folder, 'NORMAL', img), os.path.join(savepath, 'train', img))\n",
" train.append([patientid, img, 'normal'])\n",
" train_count['normal'] += 1\n",
" else:\n",
" copyfile(os.path.join(data_path, folder, 'NORMAL', img), os.path.join(savepath, 'test', img))\n",
" test.append([patientid, img, 'normal'])\n",
" test_count['normal'] += 1\n",
"\n",
"# train, val, test pneumonia data\n",
" for img in os.listdir(os.path.join(data_path, folder, 'PNEUMONIA')):\n",
" if '.jp' in img:\n",
" new_img = img.strip('person')\n",
" patientid = '1000' + new_img.split('_')[0]\n",
" p_type = 'bacteria' if 'bacteria' in new_img else 'viral'\n",
" if folder == 'train' or folder == 'val':\n",
" copyfile(os.path.join(data_path, folder, 'PNEUMONIA', img), os.path.join(savepath, 'train', img))\n",
" train.append([patientid, img, p_type])\n",
" train_count[p_type] += 1\n",
" else:\n",
" copyfile(os.path.join(data_path, folder, 'PNEUMONIA', img), os.path.join(savepath, 'test', img))\n",
" test.append([patientid, img, p_type])\n",
" test_count[p_type] += 1\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final stats\n",
"Train count: {'normal': 1349, 'viral': 1355, 'bacteria': 2540, 'COVID-19': 60}\n",
"Test count: {'normal': 234, 'viral': 149, 'bacteria': 246, 'COVID-19': 8}\n",
"Total length of train: 5304\n",
"Total length of test: 637\n"
]
}
],
"source": [
"# final stats\n",
"print('Final stats')\n",
"print('Train count: ', train_count)\n",
"print('Test count: ', test_count)\n",
"print('Total length of train: ', len(train))\n",
"print('Total length of test: ', len(test))"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [],
"source": [
"# export to train and test csv\n",
"# format as patientid, filename, label, separated by a space\n",
"train_file = open(\"train_split.txt\",\"a\") \n",
"for sample in train:\n",
" info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + '\\n'\n",
" train_file.write(info)\n",
"\n",
"train_file.close()\n",
"\n",
"test_file = open(\"test_split.txt\", \"a\")\n",
"for sample in test:\n",
" info = str(sample[0]) + ' ' + sample[1] + ' ' + sample[2] + '\\n'\n",
" test_file.write(info)\n",
"\n",
"test_file.close()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (covid)",
"language": "python",
"name": "covid"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import cv2
import numpy as np
import matplotlib.pyplot as plt
import os
np.random.seed(0)
def rotate_image(image, angle):
    # grab the dimensions of the image and then determine the
    # center
    (h, w) = image.shape[:2]
    (cX, cY) = (w // 2, h // 2)
    # grab the rotation matrix (applying the negative of the
    # angle to rotate clockwise), then grab the sine and cosine
    # (i.e., the rotation components of the matrix)
    M = cv2.getRotationMatrix2D((cX, cY), -angle, 1.0)
    cos = np.abs(M[0, 0])
    sin = np.abs(M[0, 1])
    # compute the new bounding dimensions of the image
    nW = int((h * sin) + (w * cos))
    nH = int((h * cos) + (w * sin))
    # adjust the rotation matrix to take into account translation
    M[0, 2] += (nW / 2) - cX
    M[1, 2] += (nH / 2) - cY
    # perform the actual rotation and return the image
    return cv2.warpAffine(image, M, (nW, nH))

def horizontal_flip(image):
    return cv2.flip(image, 1)

def shift_image(image, lr_pixels, tb_pixels):
    num_rows, num_cols = image.shape[:2]
    translation_matrix = np.float32([[1, 0, lr_pixels], [0, 1, tb_pixels]])
    # fixed: translate `image` (was `img`, an undefined name here)
    return cv2.warpAffine(image, translation_matrix, (num_cols, num_rows))
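# Sanity check (illustrative, not part of the pipeline): cv2.flip(image, 1)
# mirrors the column axis, which for a 2-D array is image[:, ::-1].
_demo = np.array([[0, 1, 2], [3, 4, 5]])
_flipped = _demo[:, ::-1]
assert (_flipped == np.array([[2, 1, 0], [5, 4, 3]])).all()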
INPUT_SIZE = (224, 224)
mapping = {'normal': 0, 'bacteria': 1, 'viral': 2, 'COVID-19': 3}
train_filepath = 'train_split.txt'
test_filepath = 'test_split.txt'
num_samples = 3000
# load in the train and test files
with open(train_filepath, 'r') as file:
    trainfiles = file.readlines()
with open(test_filepath, 'r') as file:
    testfiles = file.readlines()
# augment all the train classes to 3000 examples each
# get number of each class
classes = {'normal': [], 'bacteria': [], 'viral': [], 'COVID-19': []}
img_aug = {'normal': [], 'bacteria': [], 'viral': [], 'COVID-19': []}
classes_test = {'normal': [], 'bacteria': [], 'viral': [], 'COVID-19': []}
for i in range(len(trainfiles)):
    train_i = trainfiles[i].split()
    classes[train_i[2]].append(train_i[1])

for i in range(len(testfiles)):
    test_i = testfiles[i].split()
    classes_test[test_i[2]].append(test_i[1])

for key in classes.keys():
    print('{}: {}'.format(key, len(classes[key])))
num_to_augment = {'normal': min(num_samples - (len(classes['normal']) + len(img_aug['normal'])), len(classes['normal'])),
                  'bacteria': min(num_samples - (len(classes['bacteria']) + len(img_aug['bacteria'])), len(classes['bacteria'])),
                  'viral': min(num_samples - (len(classes['viral']) + len(img_aug['viral'])), len(classes['viral'])),
                  'COVID-19': min(num_samples - (len(classes['COVID-19']) + len(img_aug['COVID-19'])), len(classes['COVID-19']))}
print('num_to_augment 1:', num_to_augment)
to_augment = 0
for key in num_to_augment.keys():
    to_augment += num_to_augment[key]
print(to_augment)
while to_augment:
    for key in classes.keys():
        aug_class = classes[key]
        # sample which images to augment
        sample_indexes = np.random.choice(len(aug_class), num_to_augment[key], replace=False)
        for i in sample_indexes:
            # randomly select the degree of each augmentation
            rot = np.random.uniform(-5, 5)
            do_flip = np.random.randint(0, 2)
            shift_vert = np.random.randint(0, 2)
            shift = np.random.uniform(-10, 10)
            # read in image and apply augmentation
            img = cv2.imread(os.path.join('data', 'train', aug_class[i]))
            #img = rotate_image(img, rot)
            #if shift_vert:
            #    img = shift_image(img, 0, shift)
            #else:
            #    img = shift_image(img, shift, 0)
            if do_flip:
                img = horizontal_flip(img)
            # append filename and class to img_aug, save as png
            imgname = '{}.png'.format(aug_class[i].split('.')[0] + '_aug_r' + str(round(rot)) + '_' + str(do_flip) + '_s' + str(shift_vert) + str(round(shift)))
            img_aug[key].append(imgname)
            cv2.imwrite(os.path.join('data', 'train', imgname), img)
    # update num_to_augment numbers
    num_to_augment = {
        'normal': min(num_samples - (len(classes['normal']) + len(img_aug['normal'])), len(classes['normal'])),
        'bacteria': min(num_samples - (len(classes['bacteria']) + len(img_aug['bacteria'])), len(classes['bacteria'])),
        'viral': min(num_samples - (len(classes['viral']) + len(img_aug['viral'])), len(classes['viral'])),
        'COVID-19': min(num_samples - (len(classes['COVID-19']) + len(img_aug['COVID-19'])), len(classes['COVID-19']))}
    to_augment = 0
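# Toy illustration of the augmentation budget used above: each class is
# topped up toward num_samples, but by no more than the number of original
# images available in a single pass. (_augment_budget is a hypothetical
# helper for illustration; the script inlines this expression per class.)
def _augment_budget(target, originals, augmented):
    return min(target - (originals + augmented), originals)

assert _augment_budget(3000, 66, 0) == 66      # small class: can only double per pass
assert _augment_budget(3000, 2540, 0) == 460   # large class: top up to the target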