# Introduction

Kashmiri, or Koshur, is a Dardic language of the Indo-Aryan group, spoken by over seven million Kashmiris [Wikipedia]. Many accents are spoken across Kashmir, some major and some minor. This diversity enriches the language and adds to its distinctive sounds and variations. The aim of this research is to classify these different accents. Although many accents exist in the language, for this research we classify the prominent accents of Kupwara, Srinagar, Islamabad, Shopian, and Bandipora. The proposed approach is based on Convolutional Neural Networks (CNNs) trained on images of features extracted from the audio files, namely Mel spectrograms and MFCCs. Three sizes of MFCC feature sets are extracted: 13, 24, and 36 coefficients. We obtained strong results on our dataset (up to 99% validation accuracy).

Accent classification refers to the problem of inferring the native language of a speaker from his or her accented speech. Identifying idiosyncratic differences in speech production is important for improving the robustness of existing speech analysis systems. For example, automatic speech recognition (ASR) systems exhibit lower performance when evaluated on foreign-accented speech; by developing pre-processing algorithms that identify the accent, such systems can be modified to customize the recognition algorithm to the particular accent [1] [2]. Beyond ASR, accent identification is also useful for forensic speaker profiling, where it helps identify a speaker's regional origin and ethnicity, and in applications involving targeted marketing [3] [4].

In this paper we propose a method for classifying five accents directly from the speech acoustics. Prior work has applied a variety of statistical classifiers to this task. For example, Deshpande et al. used GMMs based on formant-frequency features to discriminate between standard American English and Indian-accented English [6]. Chen et al. explored the effect of the number of GMM components on classification performance [7]. Tang and Ghorbani compared the performance of HMMs with Support Vector Machines (SVMs) for accent classification [8]. Kumpf and King proposed linear discriminant analysis (LDA) for identifying three accents in Australian English [9]. Artificial neural networks, especially Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and CNNs, have been widely used in state-of-the-art speech and image processing systems [10] [11] [12] [13]; however, in the area of accent identification, there are only a few studies evaluating the performance of neural networks [14] [15]. Nonetheless, in the related area of language identification (LID), neural networks have been investigated exhaustively [16] [17] [18]. A recent paper [19] used spectrograms for accent classification and speaker recognition and achieved an accuracy of 92%. Inspired by that work, we also use Mel spectrograms and MFCCs in our research.

The rest of the paper is organized as follows: in Section II we discuss the collection and construction of the dataset; in Section III we present the proposed system and describe in detail the features used; in Section IV we describe the experiments and present our results; and in Section V we conclude.

# II.
# Dataset

a) Collection of Dataset

Data is central to every machine learning and deep learning project. For our research we required audio recordings of people speaking a set of sentences that we chose. These sentences captured, to a reasonable extent, the range of accent variation in spoken Kashmiri. In total, 20 sentences were chosen, and people were recorded speaking them in their native accents of the Kashmiri language. The data was collected from 5 districts of Kashmir, and the files were saved with the 'ogg' extension, which was converted to 'wav' format during preprocessing. We obtained almost 100 voice samples from each area, giving 500 voice samples in total, spoken by different people.

b) Making of Dataset

Starting from these audio files, we decided to extract MFCCs and Mel spectrograms, so our final dataset consisted of images of MFCCs and Mel spectrograms. Since deep learning models require a large amount of data, we had to augment the data to increase the size of the dataset. Many good augmentation techniques exist for images and audio, but because our images were plots of extracted features, we could not use standard image augmentations such as distortion and rotation [reference]. Instead, we used SpecAugment [20], a technique that produces augmented spectrograms directly by performing two operations on the Mel spectrogram images: 1) frequency masking, where a band of frequencies is masked out, and 2) time masking, where a span of time frames is masked out. Even after augmenting the Mel spectrogram images, the data was not sufficient, so we also augmented the audio files themselves by varying their speed, pitch, and amplitude, which introduced some variability into the initial set of audio files (a code sketch of both augmentation steps appears at the end of Section III). After these augmentations we had a sufficiently large dataset for deep learning.

# III. # Proposed System

a) Architecture

We used a CNN-based architecture with the ReLU activation function for the internal nodes and a softmax output layer that produces a probability distribution over the output classes. CNN models [19] show state-of-the-art performance on image data; since our approach extracts features from the audio, plots them as images, and feeds those images to the model, we chose a CNN architecture. Our model has six convolution layers and six max-pooling layers, with five dense layers and a flatten layer (a sketch is given at the end of this section). [image of model] Accuracy and loss varied with the feature used and the learning rate of the model. We chose different learning rates for the different input features; our models were trained with learning rates between 0.001 and 0.0001.

b) Features Used

Many features have been used in audio-processing research. To keep our research simple, we settled on two features: MFCCs and Mel spectrograms. MFCCs have been found to perform well for audio classification [21], and Mel spectrograms and spectrograms have shown similar performance in many cases [19]. Different numbers of MFCC coefficients can be used, 13 being the most common choice; the appropriate number depends on the problem at hand. We experimented with various numbers of coefficients and finally settled on 13, 24, and 36.
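As a rough illustration of the augmentation pipeline described in Section II(b), the sketch below applies SpecAugment-style frequency and time masking to a Mel spectrogram, and speed, pitch, and amplitude changes to the raw waveform using librosa. The mask counts, mask widths, fill value, and augmentation factors are illustrative assumptions, not values reported in this paper.

```python
import numpy as np
import librosa

def spec_augment(mel_db, n_freq_masks=2, n_time_masks=2,
                 max_freq_width=8, max_time_width=16, rng=None):
    """SpecAugment-style masking on a Mel spectrogram of shape
    (n_mels, n_frames). Masked regions are filled with the mean."""
    if rng is None:
        rng = np.random.default_rng()
    out = mel_db.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()
    for _ in range(n_freq_masks):                     # frequency masking
        w = int(rng.integers(1, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - w)))
        out[f0:f0 + w, :] = fill
    for _ in range(n_time_masks):                     # time masking
        w = int(rng.integers(1, max_time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = fill
    return out

def augment_audio(y, sr, speed=1.1, pitch_steps=2, gain=1.2):
    """Waveform-level augmentation: change speed, pitch and amplitude."""
    y = librosa.effects.time_stretch(y, rate=speed)                 # speed
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)  # pitch
    return y * gain                                                 # amplitude
```

Note that SpecAugment as published masks log Mel spectrograms with zeros; filling with the mean is a common variant, and either choice fits the masking operations described above.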
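The paper specifies the layer counts for the architecture of Section III(a) (six convolution layers, six max-pooling layers, a flatten layer, and five dense layers) but not the filter counts, kernel sizes, dense-layer widths, input resolution, optimizer, or loss. The following Keras sketch fills those gaps with illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 3), num_classes=5, lr=1e-3):
    """Six Conv2D + MaxPooling2D blocks, a Flatten layer, and five
    Dense layers; ReLU internally, softmax over the accent classes."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (16, 32, 64, 64, 128, 128):   # six conv/pool pairs (assumed filter counts)
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    for units in (256, 128, 64, 32):             # four hidden dense layers (assumed widths)
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))  # fifth dense layer: output
    # The paper trains with learning rates between 0.001 and 0.0001;
    # Adam and sparse categorical cross-entropy are our assumptions.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

For the single-channel features used in Experiment 2 (Section IV), the same sketch would apply with an input shape such as `(n_mels, n_frames, 1)`.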
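The feature extraction itself can be sketched with librosa [22], which the paper reports using. The native-sampling-rate handling, the 128 Mel bands, and the axis-free image export are our assumptions:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def extract_features(path, n_mfcc=13, n_mels=128):
    """Compute MFCCs (13, 24 or 36 coefficients) and a log-scaled
    Mel spectrogram from a .wav file."""
    y, sr = librosa.load(path, sr=None)            # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels
    return mfcc, mel_db

def save_feature_image(feature, sr, out_path):
    """Plot a feature matrix and save it as an image for the CNN."""
    fig, ax = plt.subplots()
    librosa.display.specshow(feature, sr=sr, ax=ax)
    ax.set_axis_off()                              # the CNN needs pixels, not axes
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```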
These features were extracted and plotted as images, and the resulting images were given as input to our model.

Mel Spectrograms - A Mel spectrogram is a spectrogram whose frequency axis is converted to the Mel scale: plotting the spectrogram of an audio file on the Mel scale yields the Mel spectrogram. These spectrograms were plotted as images, in the same way as the MFCCs, and given as input to the model. All of these feature-extraction operations were done using the librosa library [22], which makes working with audio very easy.

# IV. # Experimental Setup And Results

Different experiments were performed on different features, and different learning rates were set during the training of the models. The features were stored in two ways: in one, images of the features were generated; in the other, the features were extracted, a dimension was added, and the features were stored in JSON format without generating any images. Below we show the results of the various experiments.

a) Experiment 1

This experiment used images of Mel spectrograms and MFCCs as input to the CNN with 3 color channels. The images were generated from the audio files, saved, and later loaded back into the model. The models were trained on these images and evaluated on the validation and testing sets.

i. Mel Spectrograms

![Fig. 1 / Fig. 2: Plot of metrics on the training data](image-2.png)

The figures above show the metrics graphically; from the graphs we can conclude that the model performs very well on our data.

![Fig. 3: Metric scores on the testing data](image-3.png)

The results in Figure 3 were calculated on the testing data, and we can conclude that our model performed much better than expected.

ii. MFCCs

The MFCC features were extracted from the audio files and plotted as images, which were saved and loaded at training time. Three sizes of coefficient sets (13, 24, and 36) were extracted, and the same model was trained on the images generated from the audio files, using the training data and validated on the validation data. The accuracies and losses with respect to the epochs were plotted for each setting, and Table 1 shows our results.

Table 1: Validation loss and accuracy for each feature

| Feature | Validation Loss | Validation Accuracy |
| --- | --- | --- |
| Mel Spectrograms | 0.0392 | 0.9848 |
| MFCC 13 | 0.04 | 0.98 |
| MFCC 24 | 0.031 | 0.99 |
| MFCC 36 | 0.06 | 0.97 |

From the table, we can see that 24 coefficients performed slightly better than the others on the validation data. With images as input to the model, the Mel spectrogram and 24-coefficient MFCC features performed better than the other features.

b) Experiment 2

This experiment used JSON files of extracted features given to the CNN with 1 color channel. The features were extracted and saved in JSON files; no images were generated for this data. The features were then loaded back and the model was trained on them. Table 2 shows the testing accuracies and testing losses for the various features extracted from the audio files.

Table 2: Testing loss and accuracy for each feature

| Feature | Testing Loss | Testing Accuracy |
| --- | --- | --- |
| MFCC 13 | 0.086 | 0.87 |
| MFCC 24 | 0.110 | 0.87 |
| MFCC 36 | 0.507 | 0.865 |

c) Experiment 3

This experiment was done by splitting the audio files into chunks of 2 seconds. The Mel spectrogram features were extracted from the split audio files and saved both as images and as JSON files. Table 3 shows the accuracies and losses calculated on the validation data.

Table 3: Validation loss and accuracy by feature storage type

| Type of Feature | Validation Loss | Validation Accuracy |
| --- | --- | --- |
| Images | 0.0331 | 0.98 |
| JSON files | 0.069 | 0.97 |

# V. # Conclusion

This research paper proposes a solution to accent classification for the Kashmiri language using convolutional neural networks. The solution is based on deep learning techniques, using CNNs that adapt to multi-dimensional data.
CNNs solve this problem using a supervised approach: they first undergo a training stage in which they are fed labelled data, from which they learn the relationships in the data and attain their learning capability. At a later stage they are presented with unseen data from the same domain and are able to make remarkable inferences from it by utilizing that learned capability. The classification results reported above are strong, and the accuracy is remarkable.

From our research we conclude that, for the models and data we used, Mel spectrograms performed better than MFCCs, and that images with 3 color channels performed better than features saved with an extended dimension. Overall, our model performed very well on accent classification for the Kashmiri language with five output classes.

# VI. # Future Improvements

There is much room for improvement in this field of research. Since this is, to our knowledge, the first such research for this language (we could not find any other work related to Kashmiri), the scope for improvement is vast. We propose the following enhancements:

* Collection of more data for efficient model training:
  a. The dataset can be increased in size.
  b. The dataset can be built so that it captures the maximum range of features and variations present in the language.
* The model can be made more complex and sophisticated, so that it can handle more data without underfitting.
* More efficient models can be built to capture most of the features relevant to accent classification.
* The classification error can be reduced, making it possible to classify a wider range of the language.
* Different architectures and techniques can be explored to make the overall application as fruitful as possible.
* The number of classification classes can be increased beyond 5 accents or regions.
# References

* L. Kat and P. Fung, "Fast accent identification and accented speech recognition," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Phoenix, AZ, USA, 1999, vol. 1.
* C. Huang, T. Chen, and E. Chang, "Accent issues in large vocabulary continuous speech recognition," International Journal of Speech Technology, vol. 7, no. 2-3, 2004.
* D. C. Tanner and M. E. Tanner, Forensic Aspects of Speech Patterns: Voice Prints, Speaker Profiling, Lie and Intoxication Detection. Lawyers & Judges Publishing Company, 2004.
* F. Biadsy, J. B. Hirschberg, and D. P. Ellis, "Dialect and accent recognition using phonetic-segmentation supervectors," 2011.
* S. Deshpande, S. Chikkerur, and V. Govindaraju, "Accent classification in speech," in Proc. Fourth IEEE Workshop on Automatic Identification Advanced Technologies, Buffalo, NY, USA, 2005.
* T. Chen, C. Huang, E. Chang, and J. Wang, "Automatic accent identification using Gaussian mixture models," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Italy, 2001.
* H. Tang and A. A. Ghorbani, "Accent classification using support vector machine and hidden Markov model," in Advances in Artificial Intelligence. Springer, 2003.
* K. Kumpf and R. W. King, "Foreign speaker accent classification using phoneme-dependent accent discrimination models and comparisons with human perception benchmarks," in Proc. EuroSpeech, vol. 4, 1997.
* G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
* H. Zen and H. Sak, "Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015.
* Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, 2014.
* Y. Jiao, M. Tu, V. Berisha, and J. Liss, "Online speaking rate estimation using recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016.
* M. V. Chan, X. Feng, J. A. Heinen, and R. J. Niederjohn, "Classification of speech accents with neural networks," in Proc. IEEE International Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1994, vol. 7, pp. 4483-4486.
* S. Rabiee and S. Setayeshi, "Persian accents identification using an adaptive neural network," in Proc. Second International Workshop on Education Technology and Computer Science, Wuhan, China, 2010.
* G. Montavon, "Deep learning for spoken language identification," in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Whistler, BC, Canada, 2009.
* R. A. Cole, J. W. Inouye, Y. K. Muthusamy, and M. Gopalakrishnan, "Language identification with neural networks: a feasibility study," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1989.
* I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014.
* Y. Zeng, H. Mao, D. Peng, and Z. Yi, "Spectrogram based multi-task audio classification," Multimedia Tools and Applications, vol. 78, no. 3, Feb. 2019.
* D. S. Park et al., "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech 2019, Sep. 2019.
* H. Zhao, X. Huang, W. Liu, and L. Yang, "Environmental sound classification based on feature fusion," MATEC Web of Conferences, vol. 173, 2018, doi: 10.1051/matecconf/201817303059.
* B. McFee et al., "librosa: Audio and music signal analysis in Python," in Proc. 14th Python in Science Conference, Austin, Texas, 2015.