# I. Introduction
In paper [1], the authors used linear regression and polynomial regression to predict the number of fatalities. These two algorithms were applied to find the best-fit line that estimates the average relationship between the two variables, and both depend on the variation and dispersion of the data. The best-fit line passes through the data so that the deviations of the points above and below it balance out. The authors also used root mean square error (RMSE) to estimate the accuracy of the prediction. RMSE is a metric for the error of regression algorithms: it is the square root of the mean of the squared distances between the observed values and the predicted values, so it measures the variation and the concentration of the values around the fitted line. Many kinds of data could be expressed as in Fig 1, and the exactness depends on the distribution of the data.
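As an illustration of the best-fit line and RMSE discussed above, the following minimal sketch fits a straight line by ordinary least squares and computes the RMSE of its predictions. The toy data and the names `fit_line`, `rmse`, `days`, and `fatalities` are hypothetical and are not taken from paper [1].

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


# Hypothetical toy data, for illustration only.
days = [1, 2, 3, 4, 5]
fatalities = [2, 4, 5, 4, 5]

slope, intercept = fit_line(days, fatalities)
predictions = [slope * d + intercept for d in days]
residuals = [f - p for f, p in zip(fatalities, predictions)]
error = rmse(fatalities, predictions)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, RMSE={error:.3f}")
```

The residuals of a least-squares line sum to zero, which is the sense in which the line balances the points above and below it; RMSE then summarizes how far the points lie from the line on average.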
In paper [2], the authors predicted the outbreak of Covid-19 in Ethiopia by comparing the Support Vector Machine (SVM) model and the Polynomial Regression (PR) model in the scikit-learn library. The paper showed that SVM performs better than PR, based on the performance graphs and on the Mean Square Error (MSE) and Mean Absolute Error (MAE) metrics [3-7, 9]. As with the evaluation in paper [1], the results depend on the distribution of the data, and the evaluation relies only on mean values: if the data is dense around the prediction, the mean of the values will be close to the mean of the prediction. This calculation usually yields approximate values instead of exact values.
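To make the point about mean-based metrics concrete, the sketch below compares two hypothetical prediction sets that share the same MAE but differ in how many values they predict exactly. All data and function names here are illustrative assumptions, not results from paper [2].

```python
def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def exact_match_accuracy(actual, predicted):
    """Share of predictions that equal the actual value exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, for illustration only.
actual = [10, 20, 30, 40]
close_everywhere = [11, 19, 31, 39]  # every prediction is off by 1
exact_mostly = [10, 20, 30, 44]      # three exact predictions, one off by 4

for name, predicted in [("close_everywhere", close_everywhere),
                        ("exact_mostly", exact_mostly)]:
    print(name,
          mean_absolute_error(actual, predicted),
          mean_squared_error(actual, predicted),
          exact_match_accuracy(actual, predicted))
```

Both prediction sets have the same MAE, yet only the second one predicts any value exactly, which is the kind of difference an accuracy metric exposes.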
Authors: King Mongkut's University of Technology North Bangkok, Thailand. E-mails: maleerat.m@itd.kmutnb.ac.th, minh.tuan@itd.kmutnb.ac.th, http://kmutnb.ac.th/

In this paper, we considered the unformed data with the information in Fig 1. We calculated the correlation between the attributes of the data and applied an accuracy metric to evaluate the exact values. We analyzed the unformed data by calculating the correlation shown in Table 1.

We defined a very strong positive correlation as values greater than or equal to 0.8, a strong positive correlation as values greater than or equal to 0.6 and smaller than 0.8, and a weak positive correlation as values greater than or equal to 0.4 and smaller than 0.6. We omitted no correlation (values in the interval from -0.4 to 0.4), weak negative correlation (values smaller than or equal to -0.4 and greater than -0.6), strong negative correlation (values smaller than or equal to -0.6 and greater than -0.8), and very strong negative correlation (values smaller than or equal to -0.8).

We tried several models and chose the accuracy metric to count the correct predictions and their percentage. With this metric, we could evaluate exactly the number of correct predictions and show the records related to each prediction.

PySpark, the Python API of Apache Spark, is part of the Hadoop ecosystem and has become a strong and easy tool for analyzing data. With its powerful libraries, PySpark supplies a structure for direct and indirect processing and an easy-to-use graph environment, and it analyzes big data in a short time. PySpark supports many components, such as Spark SQL, DataFrame, Streaming, MLlib, and Spark Core. PySpark can handle big data and takes less time to analyze classification problems. Table 2 shows details of the sections and functions in the PySpark library.

The steps for analyzing data need not follow the order of the sections; the data can be formed before the sections and functions are applied (Fig 1). Features are extracted from the data and transformed into the right form by choosing basic statistics. After that, we can determine the kind of problem: classification, regression, or clustering.
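The correlation levels defined in this section can be sketched as a small classifier over the Pearson correlation coefficient. The paper does not state which correlation coefficient was used, so Pearson is an assumption here, as are the function names and the toy values; the boundaries for the "no correlation" and "weak negative" ranges follow the symmetric reading of the thresholds.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_category(r):
    """Map a correlation value to the levels used in this paper."""
    if r >= 0.8:
        return "very strong positive"
    if r >= 0.6:
        return "strong positive"
    if r >= 0.4:
        return "weak positive"
    if r > -0.4:
        return "no correlation"
    if r > -0.6:
        return "weak negative"
    if r > -0.8:
        return "strong negative"
    return "very strong negative"


# Hypothetical attribute values, for illustration only.
total_cases = [100, 200, 300, 400]
total_deaths = [2, 4, 6, 9]

r = pearson(total_cases, total_deaths)
print(round(r, 4), correlation_category(r))
```

Attributes whose category is "very strong positive" would be the candidates for building features under the selection rule described in the paper.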
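Finally, we applied evaluation metrics to evaluate the models (Equations 1-4).

$$\text{Accuracy} = \frac{\sum_{i=1}^{n} T_{iV}}{\sum_{i=1}^{n} T_{iV} + \sum_{j=1}^{m} F_{jV}} \tag{1}$$

$$\text{Precision} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jP}} \tag{2}$$

$$\text{Recall} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jN}} \tag{3}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Where $T_{iP}$ is the true positive at class $i$, $F_{jP}$ is the false positive at class $j$, and $F_{jN}$ is the false negative at class $j$.

# II. Literature Review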
Nowadays, machine learning is becoming an essential part of computer science. PySpark is a strong tool for analyzing data, built on the open-source Apache Spark libraries, which also support R, Java, and Scala. PySpark is free for users and easy to use. PySpark provides two strong machine learning packages, Spark MLlib and Spark ML, which can handle big data and analyze it in a very short time [17]. The processing steps for analyzing the data can follow those shown in Fig 2. We summarize the algorithms used in the PySpark library in Table 2.
# III. Experiments
In this paper, we used data downloaded on June 10th, 2022 from the website https://ourworldindata.org/covid-deaths, which is updated every day (Table 2). The data consists of 59 attributes in total, from which we chose the attributes used for prediction.

In Equation 1, $n$ and $m$ are the numbers of classes, $T_{iV}$ is a true value of prediction at class $i$, and $F_{jV}$ is a false value of label at class $j$.
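Equations 1-4 can be computed directly from per-class counts. The sketch below is one way to do so under the assumption that the counts are already available as lists with one entry per class; the function names and the count values are hypothetical. The last lines recompute the F1-score from the precision and recall reported for Naïve Bayes and Decision Tree in Table 3.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(true_values, false_values):
    """Equation 1: correct predictions over all predictions."""
    return ratio(sum(true_values), sum(true_values) + sum(false_values))


def precision(true_pos, false_pos):
    """Equation 2: true positives over all predicted positives."""
    return ratio(sum(true_pos), sum(true_pos) + sum(false_pos))


def recall(true_pos, false_neg):
    """Equation 3: true positives over all actual positives."""
    return ratio(sum(true_pos), sum(true_pos) + sum(false_neg))


def f1_score(p, r):
    """Equation 4: harmonic mean of precision and recall."""
    return ratio(2 * p * r, p + r)


# Hypothetical per-class counts, for illustration only.
true_pos = [40, 30]
false_pos = [10, 20]
false_neg = [20, 5]

p = precision(true_pos, false_pos)
r = recall(true_pos, false_neg)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))

# Recomputing F1 from the precision and recall reported in Table 3.
f1_naive_bayes = round(f1_score(0.571, 0.381), 3)
f1_decision_tree = round(f1_score(0.824, 0.013), 3)
print(f1_naive_bayes, f1_decision_tree)
```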
| MLlib Sections | Features | Description |
|---|---|---|
| Data types | Local vector | A vector with integer-typed, zero-based indices and double-typed values. The data can be stored densely or sparsely. |
| Data types | Labeled point | A local vector, dense or sparse, with a label attached, used by supervised machine learning algorithms. Labels are 0 and 1, or start from 0 (0, 1, 2, ...). |
| Data types | Distributed | |

As Fig 1 shows, we need to put the data in the right format by using the PySpark libraries. The selected columns are divided into two parts: one for features and another for labels. We applied StringIndexer to convert the selected column into an indexed column, applied OneHotEncoder to build a binary vector, and then applied VectorAssembler to combine it with the total cases column into the feature column used for prediction. We also applied StringIndexer to turn total deaths into the label column that is the prediction target (see Table 2).

Besides the accuracy metric, which evaluates the ratio of correct targets to total targets, we also evaluated Precision, Recall, and F1-score, which are of great importance in the medical field. Precision measures the share of predicted positive cases that are truly positive, while Recall measures the share of actual positive cases that are correctly identified, which helps decide the right method for treatment. The F1-score, calculated as the harmonic mean of Precision and Recall, balances the two. In the medical field, it helps decide whether to prioritize Recall or Precision for a given patient situation.
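The preprocessing described in this section (StringIndexer, then OneHotEncoder, then VectorAssembler) can be illustrated without a Spark cluster. The plain-Python sketch below only mimics what these three transformers do conceptually; it is not the PySpark API, and the helper names, the column roles, and the rows are hypothetical. PySpark's StringIndexer orders labels by descending frequency by default, which the sketch follows, while PySpark's OneHotEncoder omits the last category by default and the sketch keeps every category for readability.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: index for index, value in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


def one_hot(index, size):
    """Binary vector with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def assemble(*parts):
    """Concatenate vectors and scalars into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


# Hypothetical rows: (location, total_cases, total_deaths).
rows = [("Thailand", 120, 3), ("Vietnam", 200, 5), ("Thailand", 150, 3)]

location_index, location_map = string_index([row[0] for row in rows])
features = [
    assemble(one_hot(index, len(location_map)), row[1])
    for index, row in zip(location_index, rows)
]
labels, label_map = string_index([str(row[2]) for row in rows])

for feature, label in zip(features, labels):
    print(feature, label)
```

Each printed row pairs an assembled feature vector with its indexed label, which is the shape of input a classifier expects.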
We also compared against deep learning [8-13], analyzing the data with models such as LSTM and GRU, but obtained worse prediction results: the time cost was too high (5,435 s/step), and the accuracy was 0.138 for the first step and 0.1384 for the second step. These models had a total of 202,878,594 parameters, and the batch size was 1,318. PySpark showed better performance, with the best accuracy and the least evaluation time.
# IV. Results
In this paper, we tried the models in PySpark and chose the models that could analyze the data. After trying the models in Spark MLlib and Spark ML, we obtained the results in Table 3. The results showed that Naïve Bayes has the best performance in predicting fatalities, with an accuracy of 0.813, followed by the Decision Tree model with an accuracy of 0.621. Table 4 shows some example prediction results from the models.
The world has reached the heart-rending day when fatalities passed 4 million people, while the crisis has become a race between vaccination and new dangerous variants. Prediction is another way to control the Covid-19 situation and to propose new methods for facing the new stage of the devastating coronavirus [18-21].
![The State-of-the-Art Machine Learning in Prediction Covid-19 Fatality Cases. Global Journal of Computer Science and Technology, Volume XXII, Issue I, Version I](image-3.png "")
![Fig. 1: Steps to Process Data](image-4.png "Fig. 1")

We also chose the attribute with the greatest correlation values in the set of very strong correlation values for building the features, combined with location, and total deaths is chosen as the label. The raw data comprises about 208,111 instances and is cleaned by keeping specific characteristic attributes.
Table 3

| Models | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Naïve Bayes | 0.813 | 0.571 | 0.381 | 0.457 |
| Random Forest | 0.139 | 0.632 | 0.003 | 0.005 |
| Decision Tree | 0.621 | 0.824 | 0.013 | 0.026 |

Table 4 (columns: Label, Naïve Bayes, Random Forest, Decision Trees)

8.054320.03231623.0161838.028311243.046341

# V. Conclusion
© 2022 Global Journals
[1] Manpinder Singh, Saiba Dalmia. Prediction of number of fatalities due to Covid-19 using Machine Learning. IEEE 17th India Council International Conference (INDICON), 2020.

[2] Ahmed Sirage Zeynu. Analysis and forecasting the outbreak of Covid-19 in Ethiopia using machine learning.

[3] Tamer Sh. Mazen. A novel machine learning based model for COVID-19 prediction. International Journal of Advanced Computer Science and Applications, 2020.

[4] Roseline Oluwaseun Ogundokun, Joseph Bamidele Awotunde. Machine learning prediction for Covid 19 pandemic in India. 2020.

[5] Sudhir Bhandari, Ajit Singh Shaktawat, Amit Tak, Bhoopendra Patel, Jyotsna Shukla, Sanjay Singhal, Kapil Gupta, Jitendra Gupta, Shivankan Kakkar, Amitabh Dube. Logistic Regression Analysis to Predict Mortality Risk in COVID 19 Patients from Routine Hematologic Parameters. Ibnosina Journal of Medicine and Biomedical Sciences, 2020.

[6] Arjun S. Yadaw, Yan-Chak Li, Sonali Bose, Ravi Iyengar, Supinda Bunyavanich, Gaurav Pandey. Clinical predictors of COVID-19 mortality: development and validation of a clinical prediction model. Lancet Digit Health, 2020.

[7] Shreshth Tuli, Shikhar Tuli, Rakesh Tuli, Sukhpal Singh Gill. Predicting the growth and trend of Covid-19 pandemic using machine learning and cloud computing. Elsevier Public Health Emergency Collection, 2020.

[8] Mohammad Behdad Jamshidi, Ali Lalbakhsh, Jakub Talla, Zdeněk Peroutka, Farimah Hadjilooei, Pedram Lalbakhsh, Morteza Jamshidi, Luigi La Spada, Mirhamed Mirmozafari, Mojgan Dehghani, Asal Sabet, Saeed Roshani, Sobhan Roshani, Nima Bayat-Makou, Bahare Mohamadzade, Zahra Malek, Alireza Jamshidi, Sarah Kiani, Hamed Hashemi-Dezaki, Wahab Mohyuddin. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. 2020.

[9] Sina F. Ardabili, Amir Mosavi, Pedram Ghamisi, Filip Ferdinand, Annamaria R. Varkonyi-Koczy, Uwe Reuter, Timon Rabczuk, Peter M. Atkinson. COVID-19 Outbreak Prediction with Machine Learning. MDPI, 2020.

[10] Mohammadreza Nemati, Jamal Ansary, Nazafarin Nemat. Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data. CellPress, 2020.

[11] A. F. M. Batista, J. L. Miraglia, T. H. R. Donato, A. D. P. Chiavegatto Filho. COVID-19 diagnosis prediction in emergency care patients: a machine learning approach. CSH, 2020.

[12] Hongwei Zhao, Naveed N. Merchant, Alyssa McNulty, Tiffany A. Radcliff, Murray J. Cote, Rebecca Fischer, Huiyan Sang, Marcia G. Ory. COVID-19: Short term prediction model using daily incidence data. PLOS ONE Collection, 2021.

[13] Rajan Gupta, Gaurav Pandey, Poonam Chaudhary, Saibal K. Pal. Machine Learning Models for Government to Predict COVID-19 Outbreak. ACM Journal, 2020.

[14] Anthony Kelly, Marc Anthony Johnson. Investigating the Statistical Assumptions of Naïve Bayes Classifiers. 55th Annual Conference on Information Sciences and Systems (CISS), 2021.

[15] B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), 2019.

[16] B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering, 2019.

[17] Tuan Nguyen M., Meesad P., Nguyen Ha H. English-Vietnamese Machine Translation Using Deep Learning. Recent Advances in Information and Communication Technology, 2021. doi: 10.1007/978-3-030-79757-7_10

[18] Maliyaem M., Tuan Nguyen M., Lockhart D., Muenthong S. A Study of Using Machine Learning in Predicting COVID-19 Cases. Cloud Computing and Data Science, 2022.

[19] Tuan Nguyen M., Phayung Meesad. A Study of Predicting the Sincerity of a Question Asked Using Machine Learning. 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), 2021.

[20] Tuan Nguyen M. Machine Learning Performance on Predicting Banking Term Deposit. International Conference on Enterprise Information Systems (ICEIS), 2022.

[21] Siti Nurhidayah Sharin, Mohamad Khairil Radzali, Muhamad Shirwan Abdullah-Sani. A network analysis and support vector regression approaches for visualizing and predicting the COVID-19 outbreak in Malaysia. ScienceDirect, 2022.