In this notebook, we are going to do some basic exploration of the dataset. We will look at sample images from each dataset and count the number of image samples in them. As the images are obtained from multiple sources, the quality of the segmentation and the dimensions of the images vary a lot. We are going to use the data analysis toolkit pandas to perform our exploration. If you are not familiar with this library, check out its 10 Minutes to pandas short introduction.
# Importing necessary libraries
import numpy as np
import os
import glob
import cv2
import matplotlib.pyplot as plt
import pandas as pd
# Declare font size and figure size variables which will be used while plotting the data
FS_AXIS_LABEL=14
FS_TITLE=17
FS_TICKS=12
FIG_WIDTH=16
While writing the code, the files and folders were organized in the following way: the code folder contains this Jupyter notebook, and the Final_DB folder holds the raw image datasets.
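Schematically (folder names as used throughout this notebook):

project/
├── code/        <- contains this notebook
└── Final_DB/    <- raw image datasets and label .csv files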
project_dir='..'
Let's see what is in the Final_DB folder.
os.listdir(os.path.join(project_dir,'Final_DB'))
# Notice that I am using the os.path.join() function to create the filepaths instead of writing them down explicitly with a
# filepath separator ('\\' for Windows, '/' for Linux). This allows us to run this notebook in both Windows and Linux
# environments without manually changing the filepath separator.
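# For example, the same call yields the right separator on either OS
# (a quick illustration, not part of the original analysis):
print(os.path.join(project_dir,'Final_DB'))   # '../Final_DB' on Linux, '..\\Final_DB' on Windows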
Our dataset has images from sources A, B, and E. The datasets are already split into train and test sets. The label of each image is saved in the corresponding .csv file. Let's see some samples of the images.
All the images have the .png extension. We are going to get all the filepaths that have the .png extension by using the glob.glob() function.
paths_train_a=glob.glob(os.path.join(project_dir,'Final_DB','training-a','*.png'))
paths_train_b=glob.glob(os.path.join(project_dir,'Final_DB','training-b','*.png'))
paths_train_e=glob.glob(os.path.join(project_dir,'Final_DB','training-e','*.png'))
paths_test_a=glob.glob(os.path.join(project_dir,'Final_DB','testing-a','*.png'))
paths_test_b=glob.glob(os.path.join(project_dir,'Final_DB','testing-b','*.png'))
paths_test_e=glob.glob(os.path.join(project_dir,'Final_DB','testing-e','*.png'))
path_label_train_a=os.path.join(project_dir,'Final_DB','training-a.csv')
path_label_train_b=os.path.join(project_dir,'Final_DB','training-b.csv')
path_label_train_e=os.path.join(project_dir,'Final_DB','training-e.csv')
path_label_test_a=os.path.join(project_dir,'Final_DB','testing-a.csv')
path_label_test_b=os.path.join(project_dir,'Final_DB','testing-b.csv')
path_label_test_e=os.path.join(project_dir,'Final_DB','testing-e.csv')
def get_img(path,mode=cv2.IMREAD_GRAYSCALE):
    # read image (if no read mode is defined, the image is read in grayscale)
    return cv2.imread(path,mode)
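# cv2.imread() returns None when a file cannot be read, so a quick sanity check
# on one sample catches broken filepaths early (a minimal sketch, assuming
# paths_train_a is non-empty):
sample_img=get_img(paths_train_a[0])
assert sample_img is not None, 'Could not read the sample image'
print(sample_img.shape,sample_img.dtype)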
def imshow_group(paths,n_per_row=10):
    # plot multiple digits in one figure; by default 10 images are plotted per row
    n_sample=len(paths)
    j=int(np.ceil(n_sample/n_per_row)) # number of rows; plt.subplot() expects an integer
    fig=plt.figure(figsize=(20,2*j))
    for i, path in enumerate(paths):
        img=get_img(path)
        plt.subplot(j,n_per_row,i+1)
        plt.imshow(img,cmap='gray')
        plt.title(img.shape)
        plt.axis('off')
    return fig
def get_key(path):
    # separate the key from the filepath of an image
    return path.split(sep=os.sep)[-1]
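# A minimal example of get_key(); the filename 'a00001.png' is made up for illustration:
print(get_key(os.path.join('Final_DB','training-a','a00001.png')))   # -> 'a00001.png'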
We are going to randomly choose a few image filepaths from training set A, load them in grayscale, and plot them.
paths=np.random.choice(paths_train_a,size=40)
fig=imshow_group(paths)
fig.suptitle('Samples from {} training images in dataset A'.format(len(paths_train_a)), fontsize=FS_TITLE)
plt.show()
The digits do not fill up the entire image and most of them are not centered.
Next, we are going to check whether the images in dataset A have different shapes. To do this, we are going to put the image shapes in a pandas series object and use its .value_counts() method to obtain the counts of its unique values.
shapes_train_a_sr=pd.Series([get_img(path).shape for path in paths_train_a])
shapes_train_a_sr.value_counts()
All the training images have a fixed shape of 180 × 180. Let's check the test set as well.
shapes_test_a_sr=pd.Series([get_img(path).shape for path in paths_test_a])
shapes_test_a_sr.value_counts()
Let's observe the frequency of each digit in dataset A. We will read the .csv file which contains the labels of dataset A, using the read_csv() function from the pandas library, which returns the contents of the .csv file as a dataframe.
df_train_a=pd.read_csv(path_label_train_a)
df_train_a.head() # Observe first five rows
Next, we are going to replace the numerical index of the dataframe with the filename column, which will give us more convenient access to the dataframe elements.
df_train_a=df_train_a.set_index('filename')
df_train_a.head() # Observe first five rows
As we can see, the label of each digit is located under the digit column. We are going to put the labels in a series object and use the .value_counts() method to get the count of each digit.
labels_train_a_sr=pd.Series([df_train_a.loc[get_key(path)]['digit'] for path in paths_train_a])
labels_train_a_sr_vc=labels_train_a_sr.value_counts()
The pandas series object has a .plot() method which is useful for plotting the contents of the series. We are going to use it to plot the frequency of each digit.
plt.figure(figsize=(FIG_WIDTH,5))
labels_train_a_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Train A\nMean frequency of digits per class: {}, Standard Deviation: {:.4f} '.format(labels_train_a_sr_vc.mean(),labels_train_a_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
Let's see the class distribution statistics in the test set.
df_test_a=pd.read_csv(path_label_test_a)
df_test_a=df_test_a.set_index('filename')
df_test_a.head()
labels_test_a_sr=pd.Series([df_test_a.loc[get_key(path)]['digit'] for path in paths_test_a])
labels_test_a_sr_vc=labels_test_a_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_test_a_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Test A\nMean frequency of digits per class: {}, Standard Deviation: {:.4f} '.format(labels_test_a_sr_vc.mean(),labels_test_a_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
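As a quick numeric check of the balance (a minimal sketch), the coefficient of variation (standard deviation divided by mean) of the per-class counts should be small for a balanced dataset:

print('Train A coefficient of variation: {:.4f}'.format(labels_train_a_sr_vc.std()/labels_train_a_sr_vc.mean()))
print('Test A coefficient of variation: {:.4f}'.format(labels_test_a_sr_vc.std()/labels_test_a_sr_vc.mean()))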
The digit classes are well balanced in both the train and test sets. We are going to repeat the same steps for datasets B and E.
paths=np.random.choice(paths_train_b,size=40)
fig=imshow_group(paths)
fig.suptitle('Samples from {} training images in dataset B'.format(len(paths_train_b)), fontsize=FS_TITLE)
plt.show()
Similar to dataset A, most of the digits in B do not fill the entire image, and most of them are not centered.
shapes_train_b_sr=pd.Series([get_img(path).shape for path in paths_train_b])
shapes_train_b_sr_vc=shapes_train_b_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
shapes_train_b_sr_vc.plot(kind='bar')
plt.xlabel('Image Shapes', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.xticks(fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.title('Train B\nNo. of unique shapes: {}'.format(shapes_train_b_sr_vc.count()),fontsize=FS_TITLE)
plt.show()
shapes_test_b_sr=pd.Series([get_img(path).shape for path in paths_test_b])
shapes_test_b_sr_vc=shapes_test_b_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
shapes_test_b_sr_vc.plot(kind='bar')
plt.xlabel('Image Shapes', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.xticks(fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.title('Test B\nNo. of unique shapes: {}'.format(shapes_test_b_sr_vc.count()),fontsize=FS_TITLE)
plt.show()
The image shapes in dataset B vary a lot.
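To quantify this variation (a minimal sketch, reusing the shapes series computed above):

heights_b=shapes_train_b_sr.map(lambda s: s[0])
widths_b=shapes_train_b_sr.map(lambda s: s[1])
print('Height range: {}-{}, width range: {}-{}'.format(heights_b.min(),heights_b.max(),widths_b.min(),widths_b.max()))
print('Aspect ratio (height/width) range: {:.2f}-{:.2f}'.format((heights_b/widths_b).min(),(heights_b/widths_b).max()))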
df_train_b=pd.read_csv(path_label_train_b)
df_train_b=df_train_b.set_index('filename')
df_train_b.head()
labels_train_b_sr=pd.Series([df_train_b.loc[get_key(path)]['digit'] for path in paths_train_b])
labels_train_b_sr_vc=labels_train_b_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_train_b_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Train B\nMean frequency of digits per class: {}, Standard Deviation: {:.4f}'.format(labels_train_b_sr_vc.mean(),labels_train_b_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
df_test_b=pd.read_csv(path_label_test_b)
df_test_b=df_test_b.set_index('filename')
df_test_b.head()
labels_test_b_sr=pd.Series([df_test_b.loc[get_key(path)]['digit'] for path in paths_test_b])
labels_test_b_sr_vc=labels_test_b_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_test_b_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Test B\nMean frequency of digits per class: {}, Standard Deviation: {:.4f}'.format(labels_test_b_sr_vc.mean(),labels_test_b_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
paths=np.random.choice(paths_train_e,40)
fig=imshow_group(paths)
fig.suptitle('Samples from {} training images in dataset E'.format(len(paths_train_e)), fontsize=FS_TITLE)
plt.show()
The images are cropped well and have minimal non-digit area.
shapes_train_e_sr=pd.Series([get_img(path).shape for path in paths_train_e])
shapes_train_e_sr.nunique()
There are 1515 unique shapes in training dataset E.
shapes_train_e_sr_vc=shapes_train_e_sr.value_counts()
Let's plot the 50 most frequently occurring shapes.
plt.figure(figsize=(FIG_WIDTH,5))
shapes_train_e_sr_vc.iloc[:50].plot(kind='bar')
plt.xticks(fontsize=10)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Image Shapes', fontsize=FS_AXIS_LABEL)
plt.ylabel('Count', fontsize=FS_AXIS_LABEL)
plt.title('Train E\nNo. of unique shapes: {}\nPlot of the 50 most frequently occurring shapes'.format(shapes_train_e_sr_vc.count()),fontsize=FS_TITLE)
plt.show()
shapes_test_e_sr=pd.Series([get_img(path).shape for path in paths_test_e])
shapes_test_e_sr.nunique()
There are 555 unique shapes in test dataset E.
shapes_test_e_sr_vc=shapes_test_e_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
shapes_test_e_sr_vc.iloc[:50].plot(kind='bar')
plt.xticks(fontsize=10)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Image Shapes', fontsize=FS_AXIS_LABEL)
plt.ylabel('Count', fontsize=FS_AXIS_LABEL)
plt.title('Test E\nNo. of unique shapes: {}\nPlot of the 50 most frequently occurring shapes'.format(shapes_test_e_sr_vc.count()),
fontsize=FS_TITLE)
plt.show()
df_train_e=pd.read_csv(path_label_train_e)
df_train_e=df_train_e.set_index('filename')
df_train_e.head()
labels_train_e_sr=pd.Series([df_train_e.loc[get_key(path)]['digit'] for path in paths_train_e])
labels_train_e_sr_vc=labels_train_e_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_train_e_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Train E\nMean frequency of digits per class: {}, Standard Deviation: {:.4f}'.format(labels_train_e_sr_vc.mean(),labels_train_e_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
df_test_e=pd.read_csv(path_label_test_e)
df_test_e=df_test_e.set_index('filename')
df_test_e.head()
labels_test_e_sr=pd.Series([df_test_e.loc[get_key(path)]['digit'] for path in paths_test_e])
labels_test_e_sr_vc=labels_test_e_sr.value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
labels_test_e_sr_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Digits', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Test E\nMean frequency of digits per class: {}, Standard Deviation: {:.4f}'.format(labels_test_e_sr_vc.mean(),labels_test_e_sr_vc.std()),
fontsize=FS_TITLE)
plt.show()
The classes in each dataset are well balanced. The images in datasets A and B should be further processed to extract a more focused crop of the digit. Although the shapes in datasets B and E vary a lot, they have a square or nearly square aspect ratio. Therefore, if resizing to a fixed shape is necessary (e.g., feeding the images into a neural network with a fixed input size), it should not create much distortion in the digit shape; a quick sketch of such a resize follows the summary table. Below are summary statistics of the datasets.
df_summary=pd.DataFrame(data={
'Avg. samples per digit (Train)':[labels_train_a_sr_vc.mean(),labels_train_b_sr_vc.mean(),labels_train_e_sr_vc.mean()],
'Avg. samples per digit (Test)':[labels_test_a_sr_vc.mean(),labels_test_b_sr_vc.mean(),labels_test_e_sr_vc.mean()],
'Train Samples':[labels_train_a_sr_vc.sum(),labels_train_b_sr_vc.sum(),labels_train_e_sr_vc.sum()],
'Test Samples':[labels_test_a_sr_vc.sum(),labels_test_b_sr_vc.sum(),labels_test_e_sr_vc.sum()],
'Total':[labels_train_a_sr_vc.sum()+labels_test_a_sr_vc.sum(),
labels_train_b_sr_vc.sum()+labels_test_b_sr_vc.sum(),
labels_train_e_sr_vc.sum()+labels_test_e_sr_vc.sum()]},
index=['A','B','E'],
columns=['Avg. samples per digit (Train)',
'Avg. samples per digit (Test)',
'Train Samples',
'Test Samples',
'Total']
)
df_summary
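As a quick illustration of the resizing remark above (a minimal sketch; the 32 × 32 target size is an arbitrary choice, not a recommendation):

img=get_img(paths_train_b[0])        # an arbitrary sample from dataset B
img_resized=cv2.resize(img,(32,32))  # note: cv2.resize() takes (width, height)
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plt.imshow(img,cmap='gray')
plt.title('Original {}'.format(img.shape))
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(img_resized,cmap='gray')
plt.title('Resized {}'.format(img_resized.shape))
plt.axis('off')
plt.show()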
There is additional information in the .csv file of dataset E. We are going to focus on that in the next notebook.