In this notebook, we are going to do some basic exploration of the dataset. We shall look at some samples from each dataset and count the images in each of them. As the images are obtained from multiple sources, the quality of the segmentation and the dimensions of the images vary a lot. We are going to use the data analysis toolkit pandas to perform our exploration. If you are not familiar with this library, check out its short introduction, 10 Minutes to pandas.
# Importing necessary libraries
import numpy as np
import os
import glob
import cv2
import matplotlib.pyplot as plt
import pandas as pd
# Declare fontsize variables which will be used while plotting the data
FS_AXIS_LABEL=14
FS_TITLE=17
FS_TICKS=12
FIG_WIDTH=16
While writing the code, the files and folders were organized in the following way:
The code folder contains this Jupyter notebook, and the Final_DB folder has the raw image datasets.
project_dir='..'
Let's see what is in the Final_DB folder.
os.listdir(os.path.join(project_dir,'Final_DB'))
# Notice that I am using the os.path.join() function to create the filepaths instead of writing them down explicitly with a
# filepath separator ('\\' for Windows, '/' for Linux). This allows us to run this notebook in both Windows and Linux
# environments without manually changing the filepath separator.
Our dataset has images from sources A, B, and E. The datasets are already split into train and test sets. The labels of the images are saved in the corresponding .csv files. Let's see some samples of the images.
All the images have the .png extension. We are going to get all the filepaths that end in .png by using the glob.glob() function.
paths_train_a=glob.glob(os.path.join(project_dir,'Final_DB','training-a','*.png'))
paths_train_b=glob.glob(os.path.join(project_dir,'Final_DB','training-b','*.png'))
paths_train_e=glob.glob(os.path.join(project_dir,'Final_DB','training-e','*.png'))
paths_test_a=glob.glob(os.path.join(project_dir,'Final_DB','testing-a','*.png'))
paths_test_b=glob.glob(os.path.join(project_dir,'Final_DB','testing-b','*.png'))
paths_test_e=glob.glob(os.path.join(project_dir,'Final_DB','testing-e','*.png'))
path_label_train_a=os.path.join(project_dir,'Final_DB','training-a.csv')
path_label_train_b=os.path.join(project_dir,'Final_DB','training-b.csv')
path_label_train_e=os.path.join(project_dir,'Final_DB','training-e.csv')
path_label_test_a=os.path.join(project_dir,'Final_DB','testing-a.csv')
path_label_test_b=os.path.join(project_dir,'Final_DB','testing-b.csv')
path_label_test_e=os.path.join(project_dir,'Final_DB','testing-e.csv')
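The twelve assignments above all follow one naming pattern, so they can also be generated in a loop. The following is only a sketch of that idea: the function name `collect_dataset_paths` and the dictionary layout are illustrative assumptions, not part of the original notebook, and the folder names are taken from the paths used above.

```python
import glob
import os


def collect_dataset_paths(project_dir, db_dir="Final_DB", sources=("a", "b", "e")):
    """Return {(split, source): (sorted image paths, label csv path)}.

    Hypothetical helper: it assumes the layout used above, i.e. folders named
    'training-a', 'testing-a', ... next to 'training-a.csv', 'testing-a.csv', ...
    """
    datasets = {}
    for split in ("training", "testing"):
        for source in sources:
            name = "{}-{}".format(split, source)
            image_paths = sorted(
                glob.glob(os.path.join(project_dir, db_dir, name, "*.png"))
            )
            label_path = os.path.join(project_dir, db_dir, name + ".csv")
            datasets[(split, source)] = (image_paths, label_path)
    return datasets
```

With this helper, `collect_dataset_paths('..')[('training', 'a')]` would give the same information as `paths_train_a` and `path_label_train_a`, with the image paths sorted so that the order is the same on every run.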
def get_img(path,mode=cv2.IMREAD_GRAYSCALE):
    # read image (if no read mode is defined, the image is read in grayscale)
    return cv2.imread(path,mode)
def imshow_group(paths,n_per_row=10):
    # plot multiple digits in one figure, by default 10 images are plotted per row
    n_sample=len(paths)
    # number of rows; cast to int because plt.subplot() expects integer grid sizes
    j=int(np.ceil(n_sample/n_per_row))
    fig=plt.figure(figsize=(20,2*j))
    for i, path in enumerate(paths):
        img=get_img(path)
        plt.subplot(j,n_per_row,i+1)
        plt.imshow(img,cmap='gray')
        plt.title(img.shape)
        plt.axis('off')
    return fig
def get_key(path):
    # separate the key (the filename) from the filepath of an image
    return path.split(sep=os.sep)[-1]
We are going to randomly choose a few image filepaths from training set A. Then we load them in grayscale and plot them.
paths=np.random.choice(paths_train_a,size=40)
fig=imshow_group(paths)
fig.suptitle('Samples from {} training images in dataset A'.format(len(paths_train_a)), fontsize=FS_TITLE)
plt.show()
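One detail of the sampling step: `np.random.choice` draws with replacement by default, so the same image can appear more than once among the 40 samples. A minimal sketch of drawing distinct, repeatable samples follows. The helper name `sample_paths` and the `seed` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np


def sample_paths(paths, size=40, seed=None):
    """Draw up to `size` distinct paths; passing a seed makes the draw repeatable."""
    rng = np.random.default_rng(seed)
    size = min(size, len(paths))  # cannot draw more distinct items than exist
    return list(rng.choice(paths, size=size, replace=False))
```

The result can be passed to `imshow_group()` in the same way as the output of `np.random.choice`.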
The digits do not fill up the entire image and most of them are not centered.
Next, we are going to check whether the images in dataset A have different shapes. To do this, we are going to put the image shapes in a pandas Series object and use its .value_counts() method to obtain the count of each unique value.
shapes_train_a_sr=pd.Series([get_img(path).shape for path in paths_train_a])
shapes_train_a_sr.value_counts()
All the images have a fixed shape of 180 x 180.
shapes_test_a_sr=pd.Series([get_img(path).shape for path in paths_test_a])
shapes_test_a_sr.value_counts()
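To make the behaviour of `.value_counts()` on shape tuples concrete, here is a small sketch with made-up shapes. The values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up (height, width) tuples standing in for img.shape of grayscale images
toy_shapes = pd.Series([(180, 180), (180, 180), (64, 64)])

# Each distinct tuple becomes one entry, ordered from most to least frequent
toy_counts = toy_shapes.value_counts()
print(toy_counts)
```

A single entry in the result, as for dataset A, means every image has the same shape. Several entries mean the images need resizing before they can be stacked into one array.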
Let's observe the frequency of each digit in dataset A.
Let's read the .csv file which contains the labels of dataset A. We are using the read_csv() function from the pandas library, which will return the content of the .csv file in a dataframe.
df_train_a=pd.read_csv(path_label_train_a)
df_train_a.head() # Observe first five rows
Next, we are going to replace the numerical index of the dataframe with the filename column, which will give us more convenient access to the dataframe elements.
df_train_a=df_train_a.set_index('filename')
df_train_a.head() # Observe first five rows
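The benefit of indexing by filename is that a label can be looked up directly from an image's filename. The sketch below shows this on a tiny stand-in dataframe. The filenames and digits are hypothetical, and only the `filename` and `digit` column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the contents of a label .csv file
toy_df = pd.DataFrame(
    {"filename": ["img_0.png", "img_1.png", "img_2.png"], "digit": [5, 3, 1]}
)

# After set_index(), rows are addressed by filename instead of by position
toy_df = toy_df.set_index("filename")
digit = toy_df.loc["img_1.png", "digit"]
print(digit)
```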
As we can see, the label of each image is located under the digit column. We are going to put the labels in a Series object and use the .value_counts() method to get the count of each digit.
labels_train_a_sr=pd.Series([df_train_a.loc[get_key(path)]['digit'] for path in paths_train_a])
labels_train_a_sr_vc=labels_train_a_sr.value_counts()
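The list comprehension above looks up one row per image. As an alternative sketch, the same labels can be fetched with a single `.loc` call on a list of keys. The helper name `labels_for_paths` is an assumption made for illustration, and it uses `os.path.basename` in place of `get_key()` so that the block is self-contained.

```python
import os

import pandas as pd


def labels_for_paths(df, paths, column="digit"):
    """Return the labels for `paths`, in the same order, as a Series.

    Assumes `df` is indexed by filename. A filename missing from the index
    raises a KeyError instead of being skipped silently.
    """
    keys = [os.path.basename(path) for path in paths]
    return df.loc[keys, column].reset_index(drop=True)
```

Calling `labels_for_paths(df_train_a, paths_train_a).value_counts()` should give the same counts as the two lines above.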
The pandas Series object has a .plot() method which is useful for plotting the contents of the series. We are going to use it to make a plot of the frequency of each digit.
plt.figure(figsize=(FIG_WIDTH,5))
labels_train_a_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Train A\nMean frequency of digits per class: {}, Standard Deviation: {:.4f} '.format(labels_train_a_sr_vc.mean(),labels_train_a_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
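Since the same bar chart is drawn again for the other datasets, the plotting steps can be wrapped in a function. This is a sketch only: the name `plot_digit_frequency` and its parameters are assumptions, its defaults mirror the fontsize constants declared at the top of the notebook, and unlike the cell above it sorts the bars by digit instead of by count.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_digit_frequency(value_counts, name, fig_width=16,
                         fs_title=17, fs_axis_label=14, fs_ticks=12):
    """Draw a bar chart of digit counts and return the matplotlib Axes."""
    counts = value_counts.sort_index()  # assumption: order the bars by digit
    fig, ax = plt.subplots(figsize=(fig_width, 5))
    counts.plot(kind="bar", ax=ax)
    ax.tick_params(axis="x", labelrotation=0, labelsize=fs_ticks)
    ax.tick_params(axis="y", labelsize=fs_ticks)
    ax.set_xlabel("Digits", fontsize=fs_axis_label)
    ax.set_ylabel("Frequency", fontsize=fs_axis_label)
    ax.set_title(
        "{}\nMean frequency of digits per class: {}, Standard Deviation: {:.4f}".format(
            name, counts.mean(), counts.std()),
        fontsize=fs_title)
    return ax


# Toy usage with made-up labels; in the notebook this would be labels_train_a_sr_vc
ax = plot_digit_frequency(pd.Series([0, 1, 1, 2, 2, 2]).value_counts(), "Toy data")
```

Passing the dataset's value counts explicitly as an argument also makes it harder to plot one dataset's counts under another dataset's title.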
Let's see the class distribution statistics in the test set.
df_test_a=pd.read_csv(path_label_test_a)
df_test_a=df_test_a.set_index('filename')
df_test_a.head()
labels_test_a_sr=pd.Series([df_test_a.loc[get_key(path)]['digit'] for path in paths_test_a])
labels_test_a_sr_vc=labels_test_a_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_test_a_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Test A\nMean frequency of digits per class: {}, Standard Deviation: {:.4f} '.format(labels_test_a_sr_vc.mean(),labels_test_a_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
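The balance shown in the bar charts can also be summarized numerically, which makes the train and test sets easier to compare side by side. The following is a sketch under assumptions: the helper `class_balance_summary` and its column names are illustrative and not part of the original notebook.

```python
import pandas as pd


def class_balance_summary(labels_by_dataset):
    """Summarize per-class counts for each dataset.

    `labels_by_dataset` maps a dataset name to an iterable of labels,
    for example {'train A': labels_train_a_sr, 'test A': labels_test_a_sr}.
    """
    rows = {}
    for name, labels in labels_by_dataset.items():
        counts = pd.Series(labels).value_counts()
        rows[name] = {
            "n_images": int(counts.sum()),
            "n_classes": int(counts.size),
            "mean": counts.mean(),
            "std": counts.std(),
            "min": int(counts.min()),
            "max": int(counts.max()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

A standard deviation that is small relative to the mean, together with `min` and `max` close to each other, indicates that the classes are roughly balanced.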
The digit classes are well balanced in both the train and test sets. We are going to repeat the same steps for datasets B and E.
paths=np.random.choice(paths_train_b,size=40)
fig=imshow_group(paths)
fig.suptitle('Samples from {} training images in dataset B'.format(len(paths_train_b)), fontsize=FS_TITLE)
plt.show()