Welcome to the Machine Learning Repository of Bengali.Ai
Bengali.Ai has a curated list of datasets open sourced for the research community.
To submit your own dataset, please visit the dataset upload page. We are all human and prone to error, so please contact us if you find errors or mislabeled data in a dataset.

Bengali Text to Speech Dataset
Download Dataset
About the dataset
This data set contains high-quality, multi-speaker transcribed audio data for Bengali. The data set consists of wave files and a TSV file. There are two zip files, one for each locale, and each contains the file line_index.tsv and the wave files. Each entry in line_index.tsv has a FileID and the transcription. The data set has been manually quality checked, but there might still be errors. This data set was collected by Google. See the LICENSE file for license information.
Copyright 2015, 2016, 2017, 2018 Google, Inc.
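As a rough illustration of how line_index.tsv might be read, the sketch below maps each FileID to its transcription. It assumes the file is tab-separated with no header row, that the first column is the FileID and the last column is the transcription; the function name read_line_index and the sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io


def read_line_index(tsv_file):
    """Return {file_id: transcription} from a line_index.tsv-style stream.

    Assumption: tab-separated, no header, FileID in the first column and
    the transcription in the last column.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    index = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        index[row[0].strip()] = row[-1].strip()
    return index


# Made-up sample rows standing in for the real file.
sample = io.StringIO("utt_0001\tআমি ভাত খাই\nutt_0002\tতুমি কেমন আছো\n")
line_index = read_line_index(sample)
```

For the real file, the same function could be called on a handle opened with encoding="utf-8" and newline="" so that the Bengali text is decoded correctly.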

Bengali Automatic Speech Recognition Dataset
Download Dataset
About the dataset
Bangla Automatic Speech Recognition (ASR) dataset with 196k utterances.
The data set consists of wave files and a TSV file. The file utt_spk_text.tsv contains a FileID, an anonymized UserID, and the transcription of the audio in each file. The data set has been manually quality checked, but there might still be errors. See the LICENSE file for license information.
Copyright 2016, 2017, 2018 Google, Inc.
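Because utt_spk_text.tsv carries an anonymized UserID per utterance, one plausible first step is grouping utterances by speaker, for example to build speaker-disjoint train and test splits. The sketch below assumes the file is tab-separated with no header and that the columns appear in the order listed above (FileID, UserID, transcription); the function name group_by_speaker and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_speaker(tsv_file):
    """Return {user_id: [(file_id, transcription), ...]}.

    Assumption: tab-separated, no header, exactly three columns in the
    order FileID, UserID, transcription.
    """
    by_speaker = defaultdict(list)
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        file_id, user_id, text = (field.strip() for field in row)
        by_speaker[user_id].append((file_id, text))
    return dict(by_speaker)


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "f001\tspk_a\tআমি ভাত খাই\n"
    "f002\tspk_b\tতুমি কেমন আছো\n"
    "f003\tspk_a\tধন্যবাদ\n"
)
speakers = group_by_speaker(sample)
```

Keeping all utterances from one UserID on the same side of a split avoids evaluating a recognizer on voices it has already heard during training.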

Numta Handwritten Bengali Digits
Download Dataset
Documentation PDF
The dataset is a compilation of six datasets that were gathered from different sources and at different times. However, each of them was checked rigorously under the same evaluation criterion, so that every digit is legible to at least one human being without any prior knowledge.
UPDATE (14th August 2018): The initial release of the NumtaDB dataset was used for the Bengali.AI Computer Vision Challenge. It was found that the testing set contained some illegible and ambiguous digits. These digits have been replaced by legible digits of the same label. The new testing digits, along with the old legible ones, can be downloaded here:
Download Revised Testing Set
To check your results on the (revised) testing set, click here.
Disclaimer: Dataset-e is an abridged and curated version of BanglaLekha-Isolated and was not collected by Bengali.AI volunteers.