# I. Introduction

In paper [1], the authors used linear regression and polynomial regression to predict the number of fatalities. These two algorithms were applied to find the best-fit line that estimates the average relationship between the two variables, and they depend on the variation and dispersion of the data. The best-fit line divides the data into two parts so that the distances of the data values on either side of the line balance. The authors also used the root mean square error (RMSE) to estimate the accuracy of prediction. RMSE is a metric for the error of a regression model: it is the square root of the mean of the squared distances between the predicted values and the observed values, so it measures how tightly the observed values concentrate around the predictions. Many kinds of data can be processed as expressed in Fig 1, and the exactness depends on the distribution of the data.
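
As a rough illustration of the best-fit line and RMSE described above, the following plain-Python sketch fits a line by ordinary least squares and scores it. The function names and the toy numbers are hypothetical and are not taken from paper [1].

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Square root of the mean squared difference between two series."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: day index and cumulative fatalities.
days = [1, 2, 3, 4, 5]
deaths = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_line(days, deaths)
fitted = [slope * d + intercept for d in days]
error = rmse(deaths, fitted)
```

A small `error` means the observed values lie close to the fitted line; it does not say that any single prediction is exact.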

In paper [2], the authors predicted the outbreak of Covid-19 in Ethiopia by comparing the Support Vector Machine (SVM) model and the Polynomial Regression (PR) model in the scikit-learn library. The paper showed that SVM performs better than PR, based on the performance graphs and on the Mean Square Error (MSE) and Mean Absolute Error (MAE) metrics [3-7, 9]. As with the evaluation in paper [1], the results depend on the distribution of the data, and the evaluation relies only on mean values: if the data are dense around the predictions, the mean of the values will be close to the mean of the predictions. This calculation usually yields approximate values instead of exact values.
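
To make the two metrics named above concrete, here is a minimal sketch of MSE and MAE in plain Python. The two "models" and their numbers are hypothetical and only show how the metrics react to errors; they are not results from paper [2].

```python
def mean_squared_error(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observations and two hypothetical sets of predictions.
observed = [10, 12, 15, 20]
model_a = [11, 12, 14, 19]  # small errors on most points
model_b = [10, 12, 15, 24]  # exact on three points, one large miss

mse_a = mean_squared_error(observed, model_a)
mae_a = mean_absolute_error(observed, model_a)
mse_b = mean_squared_error(observed, model_b)
mae_b = mean_absolute_error(observed, model_b)
```

In this toy case `model_b` predicts three of the four values exactly, yet both of its averaged errors are larger than those of `model_a`, which illustrates why mean-based metrics describe approximate closeness rather than exact hits.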

In this paper, we considered the unformed data with the information in Fig 1. We calculated the correlation between the attributes of the data and applied an accuracy metric to evaluate the exact values.


# Authors

King Mongkut's University of Technology North Bangkok, Thailand. E-mails: maleerat.m@itd.kmutnb.ac.th, minh.tuan@itd.kmutnb.ac.th, http://kmutnb.ac.th/

We analyzed the unformed data by calculating the correlation shown in Table 1. We defined a very strong positive correlation when values are greater than or equal to 0.8, a strong positive correlation when values are greater than or equal to 0.6 and smaller than 0.8, and a weak positive correlation when values are greater than or equal to 0.4 and smaller than 0.6. We omitted no correlation (values between -0.4 and 0.4), weak negative correlation (values smaller than or equal to -0.4 and greater than -0.6), strong negative correlation (values greater than -0.8 and smaller than or equal to -0.6), and very strong negative correlation (values smaller than or equal to -0.8). We tried several models and chose the accuracy metric to count the true predictions and their percentage. With this metric, we could evaluate the exact number of correct predictions and depict the records related to each prediction.

PySpark, the Python API of Apache Spark, belongs to the Hadoop ecosystem and makes data analysis powerful and easy. With its powerful libraries, PySpark provides a framework for direct and indirect processing and a graph environment, with ease of use and short analysis times on big data. PySpark supports many components with many kinds of functions, such as Spark SQL, DataFrame, Streaming, MLlib, and Spark Core. PySpark can handle big data and takes less time to analyze classification problems. Table 2 shows details of the components and functions in the PySpark library.

The analysis steps need not follow the order of these components; the data can first be put into the right form before the components and functions are applied (Fig 1). Features are extracted from the data and transformed into the right form by choosing basic statistics. After that, we can identify the kind of problem, such as a classification, regression, or clustering problem. Finally, we applied evaluation metrics to estimate the models (Equations 1-4).
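
The correlation bands defined above can be sketched in plain Python as follows. This is an illustration under assumptions: the Pearson coefficient is assumed as the correlation measure, the label "weak negative" and the upper edge of the no-correlation band are reconstructed by symmetry with the positive bands, and the toy series are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_band(r):
    """Map a correlation value to the band names used in the text."""
    if r >= 0.8:
        return "very strong positive"
    if r >= 0.6:
        return "strong positive"
    if r >= 0.4:
        return "weak positive"
    if r > -0.4:
        return "no correlation"  # assumed band: -0.4 < r < 0.4
    if r > -0.6:
        return "weak negative"  # assumed label, by symmetry
    if r > -0.8:
        return "strong negative"
    return "very strong negative"


# Hypothetical toy series standing in for two attributes.
total_cases = [100, 180, 250, 400, 520]
total_deaths = [2, 3, 5, 9, 11]

r = pearson(total_cases, total_deaths)
band = correlation_band(r)
```

Only attributes that fall in the "very strong positive" band would be kept as candidate features under this scheme.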

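$$\text{Accuracy} = \frac{\sum_{i=1}^{n} T_{iV}}{\sum_{i=1}^{n} T_{iV} + \sum_{j=1}^{m} F_{jV}} \tag{1}$$

$$\text{Precision} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jP}} \tag{2}$$

$$\text{Recall} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jN}} \tag{3}$$

$$F_1\text{-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

where $T_{iP}$ is the true positive at class $i$, $F_{jP}$ is the false positive at class $j$, and $F_{jN}$ is the false negative at class $j$.


# II. Literature Review
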
Nowadays, machine learning is becoming an essential part of computer science. Apache Spark is a powerful open-source engine for data analysis that can run R, Python, Java, and Scala, and PySpark is its Python interface. PySpark is free for users and easy to use. It provides two strong machine learning packages, Spark MLlib and Spark ML, which can handle big data and analyze it in a very short time [17]. The data analysis process can follow the steps shown in Fig 2. We summarized the algorithms used in the PySpark library in detail in Table 2.


# III. Experiments

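In this paper, we used data downloaded on June 10th, 2022 from the website https://ourworldindata.org/covid-deaths, which is updated every day (Table 2). The data consist of 59 attributes in total. We chose the attribute with the greatest correlation value in the set of very strong correlation values and combined it with location to build the features, and total deaths is chosen as the label. The raw data chosen comprise about 208,111 instances and are cleaned by keeping specific character attributes.

In Equation 1, $n$ and $m$ are the numbers of classes, $T_{iV}$ is a true value of prediction at class $i$, and $F_{jV}$ is a false value of the label at class $j$.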
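
Equations 1-4 sum per-class counts before dividing. A minimal plain-Python sketch of that calculation is given below; the function names and the per-class counts are hypothetical and are chosen only to exercise the formulas, not to reproduce the experiments.

```python
def accuracy(true_values, false_values):
    """Equation 1: summed true values over all summed values."""
    correct = sum(true_values)
    return correct / (correct + sum(false_values))


def precision(true_pos, false_pos):
    """Equation 2: summed true positives over predicted positives."""
    tp = sum(true_pos)
    return tp / (tp + sum(false_pos))


def recall(true_pos, false_neg):
    """Equation 3: summed true positives over actual positives."""
    tp = sum(true_pos)
    return tp / (tp + sum(false_neg))


def f1_score(p, r):
    """Equation 4: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical per-class counts for three classes.
true_pos = [30, 20, 10]
false_pos = [5, 10, 5]
false_neg = [10, 5, 15]

p = precision(true_pos, false_pos)  # 60 / 80
r = recall(true_pos, false_neg)     # 60 / 90
f1 = f1_score(p, r)
acc = accuracy([30, 20, 10], [10, 5, 15])  # 60 / 90
```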


| MLlib Sections | Features |
|---|---|
| Data types | Local vector: a vector with integer-typed, zero-based indices and double-typed values. The data can be stored in dense or sparse form. |
| | Labeled point: a local vector, dense or sparse, that carries a label and is used by supervised machine learning algorithms. Labels are 0 and 1, or class indices starting from 0 (0, 1, 2, ...). |
| | Distributed |

As Fig 1 shows, we need to put the data into the right format by using the PySpark libraries. The selected columns are divided into two parts: one part for features and another for labels. We applied StringIndexer to convert the categorical column into indexed values, applied OneHotEncoder to produce a binary vector, and then applied VectorAssembler to combine it with the total cases column into the features column used for prediction. We also applied StringIndexer to turn total deaths into the label column that is the prediction target (see Table 2).

Besides the accuracy metric, which evaluates the ratio of correct targets to total targets, we also evaluated Precision, Recall, and F1-score, which are of great importance in the medical field. Precision measures how many of the predicted positive cases are truly positive, while Recall measures how many of the actual positive cases are correctly identified; both help to decide the right method of treatment. F1-score, calculated as the harmonic mean of Recall and Precision (Equation 4), balances the two. In the medical field, these metrics are used to decide whether to prioritize Recall or Precision for a given patient's situation.
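
The preprocessing steps described above (index a categorical column, one-hot encode it, and assemble it with a numeric column) can be illustrated without a Spark cluster. The sketch below is plain Python that imitates the idea of StringIndexer, OneHotEncoder, and VectorAssembler; it is not the PySpark API, and the helper names, the tie-breaking rule, and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: i for i, value in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot(index, size):
    """Binary vector with a single 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def assemble(*parts):
    """Concatenate numbers and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


# Hypothetical rows: (location, total_cases, total_deaths).
rows = [
    ("Thailand", 100, 2),
    ("Vietnam", 250, 5),
    ("Thailand", 180, 3),
    ("Ethiopia", 90, 2),
]

location_idx, location_map = string_index([row[0] for row in rows])
label_idx, label_map = string_index([str(row[2]) for row in rows])

features = [
    assemble(one_hot(i, len(location_map)), row[1])
    for i, row in zip(location_idx, rows)
]
```

Each entry of `features` is the one-hot location vector followed by the total cases value, and `label_idx` holds the indexed target. Note that this sketch keeps every category in the one-hot vector, whereas PySpark's OneHotEncoder drops the last category by default.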

For comparison, we also analyzed the data with deep learning models [8-13] such as LSTM and GRU, but obtained worse prediction results: training took too much time (5,435 s/step), and the accuracy was 0.138 for the first step and 0.1384 for the second step. The model had a total of 202,878,594 parameters and the batch size was 1,318. PySpark showed better performance, with the best accuracy and the least time to evaluate.


# IV. Results

In this paper, we tried the models in PySpark and chose the models that could analyze the data. After trying the models in Spark MLlib and Spark ML, we obtained the results in Table 3. The results showed that Naïve Bayes has the best performance in predicting fatalities, with an accuracy of 0.813, followed by the Decision Tree model with an accuracy of 0.621. Table 4 shows some example prediction results with the models.
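
As a quick arithmetic check of Equation 4, the F1-scores of two models can be recomputed from the precision and recall values reported for Table 3. The dictionary layout and variable names below are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (Equation 4)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-score) as listed for Table 3.
reported = {
    "Naïve Bayes": (0.571, 0.381, 0.457),
    "Decision Tree": (0.824, 0.013, 0.026),
}

recomputed = {
    model: round(f1_score(p, r), 3)
    for model, (p, r, _) in reported.items()
}
```

For both models the recomputed value agrees with the reported F1-score to three decimal places.
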
![The world has reached the heart-rending day when fatalities passed 4 million people, while the crisis has become a race between vaccination and new dangerous variants. Prediction is another way to control the Covid-19 situation and to propose new methods to face the new stage of the devastating coronavirus [18-21].](image-2.png "")
![The State-of-the-Art Machine Learning in Prediction Covid-19 Fatality Cases. Global Journal of Computer Science and Technology, Volume XXII, Issue I, Version I](image-3.png "")
![Fig. 1: Steps to Process Data](image-4.png "Fig. 1 :")


Table 3

| Models | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Naïve Bayes | 0.813 | 0.571 | 0.381 | 0.457 |
| Random Forest | 0.139 | 0.632 | 0.003 | 0.005 |
| Decision Tree | 0.621 | 0.824 | 0.013 | 0.026 |

Table 4

Label | Naïve Bayes | Random Forest | Decision Trees

8.054320.03231623.0161838.028311243.046341


# V. Conclusion

			© 2022 Global Journals
		
		
* Manpinder Singh, Saiba Dalmia. Prediction of number of fatalities due to Covid-19 using Machine Learning. IEEE 17th India Council International Conference (INDICON), 2020.
* Ahmed Sirage Zeynu. Analysis and forcasting the outbreak of Covid-19 in Ethiopia using machine learning.
* Tamer Sh. Mazen. A novel machine learning based model for COVID-19 prediction. International Journal of Advanced Computer Science and Applications, 2020.
* Roseline Oluwaseun Ogundokun, Joseph Bamidele Awotunde. Machine learning prediction for Covid 19 pandemic in India. 2020.
* Sudhir Bhandari, Ajit Singh Shaktawat, Amit Tak, Bhoopendra Patel, Jyotsna Shukla, Sanjay Singhal, Kapil Gupta, Jitendra Gupta, Shivankan Kakkar, Amitabh Dube. Logistic Regression Analysis to Predict Mortality Risk in COVID 19 Patients from Routine Hematologic Parameters. Ibnosina Journal of Medicine and Biomedical Sciences, 2020.
* Arjun S. Yadaw, Yan-Chak Li, Sonali Bose, Ravi Iyengar, Supinda Bunyavanich, Gaurav Pandey. Clinical predictors of COVID-19 mortality: development and validation of a clinical prediction model. Lancet Digit Health, 2020.
* Shreshth Tuli, Shikhar Tuli, Rakesh Tuli, Sukhpal Singh Gill. Predicting the growth and trend of Covid-19 pandemic using machine learning and cloud computing. Elsevier public health emergency collection, 2020.
* Mohammad Behdad Jamshidi, Ali Lalbakhsh, Jakub Talla, Zdeněk Peroutka, Farimah Hadjilooei, Pedram Lalbakhsh, Morteza Jamshidi, Luigi La Spada, Mirhamed Mirmozafari, Mojgan Dehghani, Asal Sabet, Saeed Roshani, Sobhan Roshani, Nima Bayat-Makou, Bahare Mohamadzade, Zahra Malek, Alireza Jamshidi, Sarah Kiani, Hamed Hashemi-Dezaki, Wahab Mohyuddin. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. 2020.
* Sina F. Ardabili, Amir Mosavi, Pedram Ghamisi, Filip Ferdinand, Annamaria R. Varkonyi-Koczy, Uwe Reuter, Timon Rabczuk, Peter M. Atkinson. COVID-19 Outbreak Prediction with Machine Learning. MDPI, 2020.
* Mohammadreza Nemati, Jamal Ansary, Nazafarin Nemati. Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data. CellPress, 2020.
* A. F. M. Batista, J. L. Miraglia, T. H. R. Donato, A. D. P. Chiavegatto Filho. COVID-19 diagnosis prediction in emergency care patients: a machine learning approach. CSH, 2020.
* Hongwei Zhao, Naveed N. Merchant, Alyssa McNulty, Tiffany A. Radcliff, Murray J. Cote, Rebecca S. B. Fischer, Huiyan Sang, Marcia G. Ory. COVID-19: Short term prediction model using daily incidence data. Plos One Collection, 2021.
* Rajan Gupta, Gaurav Pandey, Poonam Chaudhary, Saibal K. Pal. Machine Learning Models for Government to Predict COVID-19 Outbreak. ACM Journal, 2020.
* Anthony Kelly, Marc Anthony Johnson. Investigating the Statistical Assumptions of Naïve Bayes Classifiers. 55th Annual Conference on Information Sciences and Systems (CISS), 2021.
* B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), 2019.
* B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering, 2019.
* Tuan Nguyen M., Meesad P., Nguyen Ha H. English-Vietnamese Machine Translation Using Deep Learning. Recent Advances in Information and Communication Technology, 2021. 10.1007/978-3-030-79757-710
* Maliyaem M., Tuan Nguyen M., Lockhart D., Muenthong S. A Study of Using Machine Learning in Predicting COVID-19 Cases. Cloud Computing and Data Science, 2022.
* Tuan Nguyen M., Phayung Meesad. A Study of Predicting the Sincerity of a Question Asked Using Machine Learning. 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), 2021.
* Tuan Nguyen M. Machine Learning Performance on Predicting Banking Term Deposit. International Conference on Enterprise Information Systems (ICEIS), 2022.
* Siti Nurhidayah Sharin, Mohamad Khairil Radzali, Muhamad Shirwan Abdullah-Sani. A network analysis and support vector regression approaches for visualizing and predicting the COVID-19 outbreak in Malaysia. ScienceDirect, 2022.