As we have seen in the previuos notebook, there are some additional information in the label file of dataset E. We are going to explore that in this notebook.
The digits from dataset E are extracted from the BanglaLekha-Isolated
dataset which contains Bangla handwritten numerals, basic characters and compound characters.
The BanglaLekha-Isolated
dataset was collected from Dhaka and Comilla. The age group of the subjects ranged from 6 years to 28 years with a high density between the ages of 16–20. Among the subjects, 59.4% were males while the remaining 40.6% were females.
More details can be found in the dataset description here.
# Importing necessary libraries
import numpy as np
import os
import glob
import cv2
import matplotlib.pyplot as plt
import pandas as pd
# Declare fontsize variables which will be used while plotting the data
FS_AXIS_LABEL=14
FS_TITLE=17
FS_TICKS=12
FIG_WIDTH=16
# Setup paths
project_dir='..'
path_label_train_e=os.path.join(project_dir,'Final_DB','training-e.csv')
path_label_test_e=os.path.join(project_dir,'Final_DB','testing-e.csv')
# Let's get the label files in a dataframe
df_train_e=pd.read_csv(path_label_train_e)
df_test_e=pd.read_csv(path_label_test_e)
df_train_e.head()
The information in the districtid and institutionid column is similar. Here, 1
stands for Dhaka
and 2
stands for Comilla
. For the gender
column 0
denotes Male
and 1
denotes Female
.
The age
columns contains the age information of the participants. We are going to use the .value_counts()
method to get the number of occurrence of each age. The .sort_index()
method sorts the index (age) of the resulting Series so that we can plot the age values in ascending order. This will give us a better understanding of the distribution of the participant's ages.
train_age_vc=df_train_e['age'].value_counts().sort_index()
#barplot
plt.figure(figsize=(FIG_WIDTH,5))
plt.suptitle('Age Distribution (Train)',fontsize=FS_TITLE)
plt.subplot(1,2,1)
train_age_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Age', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
#boxplot
plt.subplot(1,2,2)
bp_dict=plt.boxplot(df_train_e['age'])
plt.xticks([0],[''],fontsize=FS_TICKS)
plt.ylabel('Age', fontsize=FS_AXIS_LABEL)
for line in bp_dict['medians']:
# get position data for median line (2nd quartile line)
x, y = line.get_xydata()[1] # terminal point of median line
plt.text(x, y, '{:.0f}'.format(y), horizontalalignment='left')
for line in bp_dict['boxes']:
# get position data for 1st quartile line
x, y = line.get_xydata()[0]
plt.text(x,y, '{:.0f}'.format(y), horizontalalignment='right', verticalalignment='top')
# get position data for 3rd quartile line
x, y = line.get_xydata()[3]
plt.text(x,y, '{:.0f}'.format(y), horizontalalignment='right', verticalalignment='top')
plt.show()
test_age_vc=df_test_e['age'].value_counts().sort_index()
#barplot
plt.figure(figsize=(FIG_WIDTH,5))
plt.suptitle('Age Distribution (Test)',fontsize=FS_TITLE)
plt.subplot(1,2,1)
test_age_vc.plot(kind='bar')
plt.xticks(rotation='horizontal',fontsize=FS_TICKS)
plt.yticks(fontsize=FS_TICKS)
plt.xlabel('Age', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
#boxplot
plt.subplot(1,2,2)
bp_dict=plt.boxplot(df_test_e['age'])
plt.xticks([0],[''],fontsize=FS_TICKS)
plt.ylabel('Age', fontsize=FS_AXIS_LABEL)
for line in bp_dict['medians']:
# get position data for median line (2nd quartile line)
x, y = line.get_xydata()[1] # terminal point of median line
plt.text(x, y, '{:.0f}'.format(y), horizontalalignment='left')
for line in bp_dict['boxes']:
# get position data for 1st quartile line
x, y = line.get_xydata()[0]
plt.text(x,y, '{:.0f}'.format(y), horizontalalignment='right', verticalalignment='top')
# get position data for 3rd quartile line
x, y = line.get_xydata()[3]
plt.text(x,y, '{:.0f}'.format(y), horizontalalignment='right', verticalalignment='top')
plt.show()
There are some differences between the distribution in the test and train set (e.g., 17, 19, 22 year old participants). The interquartile range (the range within which the innermost 50% of the data resides) for both train and test set is between 16 to 21 years.
train_district_vc=df_train_e['districtid'].value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
train_district_vc.plot(kind='bar')
plt.yticks(fontsize=FS_TICKS)
xlabels=['Dhaka ({:.2f}%)'.format(train_district_vc.loc[1]/train_district_vc.sum()*100),
'Comilla ({:.2f})%'.format(train_district_vc.loc[2]/train_district_vc.sum()*100)]
plt.xticks([0,1],xlabels,rotation='horizontal',fontsize=FS_TICKS)
plt.xlabel('District', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('District Distribution (Train)',fontsize=FS_TITLE)
plt.show()
test_district_vc=df_test_e['districtid'].value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
test_district_vc.plot(kind='bar')
plt.yticks(fontsize=FS_TICKS)
xlabels=['Dhaka ({:.2f}%)'.format(test_district_vc.loc[1]/test_district_vc.sum()*100),
'Comilla ({:.2f})%'.format(test_district_vc.loc[2]/test_district_vc.sum()*100)]
plt.xticks([0,1],xlabels,rotation='horizontal',fontsize=FS_TICKS)
plt.xlabel('District', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('District Distribution (Test)',fontsize=FS_TITLE)
plt.show()
There are less digits from Dhaka than Comilla. The ratio is slightly higher in the test dataset.
train_gender_vc=df_train_e['gender'].value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
train_gender_vc.plot(kind='bar')
plt.yticks(fontsize=FS_TICKS)
xlabels=['Male ({:.2f}%)'.format(train_gender_vc.loc[0]/train_gender_vc.sum()*100),
'Female ({:.2f})%'.format(train_gender_vc.loc[1]/train_gender_vc.sum()*100)]
plt.xticks([0,1],xlabels,rotation='horizontal',fontsize=FS_TICKS)
plt.xlabel('Gender', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Gender Distribution (Train)',fontsize=FS_TITLE)
plt.show()
test_gender_vc=df_test_e['gender'].value_counts()
plt.figure(figsize=(FIG_WIDTH,5))
test_gender_vc.plot(kind='bar')
plt.yticks(fontsize=FS_TICKS)
xlabels=['Male ({:.2f}%)'.format(test_gender_vc.loc[0]/test_gender_vc.sum()*100),
'Female ({:.2f})%'.format(test_gender_vc.loc[1]/test_gender_vc.sum()*100)]
plt.xticks([0,1],xlabels,rotation='horizontal',fontsize=FS_TICKS)
plt.xlabel('Gender', fontsize=FS_AXIS_LABEL)
plt.ylabel('Frequency', fontsize=FS_AXIS_LABEL)
plt.title('Gender Distribution (Test)',fontsize=FS_TITLE)
plt.show()
There are more digits from male participants. The male to female ratio is similar in both test and train set.
Since we have age, district and gender information of the participants, the digits could be differenciated based on them. We have to bear in mind that the dataset is imbalanced (with respect to age/distric/gender) which would have significant effect on the classifier model.