Authors: King Mongkut's University of Technology North Bangkok, Thailand. E-mails: maleerat.m@itd.kmutnb.ac.th, minh.tuan@itd.kmutnb.ac.th, http://kmutnb.ac.th/

# I. Introduction

In paper [1], the authors used linear regression and polynomial regression to predict the number of fatalities. These two algorithms were applied to find the best-fit line that estimates the average relationship between the two variables. Both algorithms depend on the variation and dispersion of the data. The best-fit line passes through the data so that the squared distances between the data values and the line are as small as possible. They also used the root mean square error (RMSE) to estimate the accuracy of prediction. RMSE is a metric for the error of regression algorithms: it is the square root of the mean of the squared distances between the predicted and the observed values, so it measures how tightly the values concentrate around the fitted line. Many kinds of data can be processed as in Fig. 1; the exactness depends on the distribution of the data.

In paper [2], the authors predicted the outbreak of Covid-19 in Ethiopia by comparing the Support Vector Machine (SVM) model and the Polynomial Regression (PR) model in the scikit-learn library. The paper showed that SVM performs better than PR, based on performance graphs and on the metrics Mean Square Error (MSE) and Mean Absolute Error (MAE) [3–7, 9]. As with the evaluation in paper [1], the results depend on the distribution of the data, and the evaluation relies only on the mean of the values: if the data are dense around the prediction, the mean of the values will be close to the mean of the prediction. This calculation usually yields approximate values instead of exact values.

In this paper, we considered the unformed data with the information in Fig. 1. We calculated the correlation between the attributes of the data and applied an accuracy metric to evaluate the exact values. We [...] unformed data by calculating the correlation shown in Table 1. We defined a very strong positive correlation when values are greater than or equal to 0.8, a strong positive correlation when values are greater than or equal to 0.6 and smaller than 0.8, and a weak positive correlation when values are greater than or equal to 0.4 and smaller than 0.6. We omitted no correlation (values in the interval from -0.4 to 0.4), weak negative correlation (values smaller than or equal to -0.4 and greater than -0.6), strong negative correlation (values greater than -0.8 and smaller than or equal to -0.6), and very strong negative correlation (values smaller than or equal to -0.8). We tried several models and chose the accuracy metric to calculate the number and the percentage of true predictions. With this metric, we could count exactly the number of correct predictions and show the records related to each prediction.

PySpark is the Python API of Apache Spark, which belongs to the Hadoop ecosystem and makes analyzing data powerful and easy. With its libraries, PySpark supplies a structure for direct and indirect processing and a graph environment that is easy to use and analyzes big data in a short time. PySpark supports many components with many kinds of functions, such as Spark SQL, DataFrame, Streaming, MLlib, and Spark Core. PySpark can handle big data and needs less time to analyze classification problems. Table 2 shows details of the components and functions in the PySpark library. The analysis steps need not follow the order of these components; the data can first be put into the right form before the components and functions are applied (Fig. 1). Features are extracted from the data and transformed into the right form, guided by basic statistics. After that, we can determine the kind of problem, such as classification, regression, or clustering. Finally, we applied evaluation metrics to assess the models (Equations 1–4).
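The correlation bands described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the paper does not say which correlation coefficient was used, so the Pearson coefficient is assumed here; the function names `pearson` and `correlation_band` and the sample numbers are hypothetical; and the weak negative band is taken by symmetry with the weak positive one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (assumed measure)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_band(r):
    """Map a correlation value to the bands defined in the text."""
    if r >= 0.8:
        return "very strong positive"
    if r >= 0.6:
        return "strong positive"
    if r >= 0.4:
        return "weak positive"
    if r > -0.4:
        return "no correlation"
    if r > -0.6:
        return "weak negative"  # assumed by symmetry with the positive bands
    if r > -0.8:
        return "strong negative"
    return "very strong negative"


# Hypothetical values, not taken from the paper's data set.
total_cases = [10, 20, 30, 40, 50]
total_deaths = [1, 2, 2, 4, 5]
r = pearson(total_cases, total_deaths)
print(round(r, 3), correlation_band(r))
```

An attribute whose coefficient against the target falls in the very strong positive band would then be a candidate feature, which matches the selection rule stated in the text.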
$$\text{Accuracy} = \frac{\sum_{i=1}^{n} T_{iV}}{\sum_{i=1}^{n} T_{iV} + \sum_{j=1}^{m} F_{jV}} \tag{1}$$

where $n$ and $m$ are numbers of classes, $T_{iV}$ is a true value of prediction at class $i$, and $F_{jV}$ is a false value of label at class $j$.

$$\text{Precision} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jP}} \tag{2}$$

$$\text{Recall} = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jN}} \tag{3}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

where $T_{iP}$ is the true positive at class $i$, $F_{jP}$ is the false positive at class $j$, and $F_{jN}$ is the false negative at class $j$.

# II. Literature Review

Nowadays, machine learning is becoming an essential part of computer science. PySpark is a strong tool for analyzing data with open-source libraries; the underlying Spark engine can be used from R, Python, Java, and Scala. PySpark is free for users and easy to use. It offers two strong libraries, the Spark MLlib and Spark ML packages, which can handle big data and analyze it in a very short time [17]. The processing for analyzing data can follow the steps shown in Fig. 2. We summarized the algorithms used in the PySpark library in detail in Table 2.

# III. Experiments

In this paper, we used data downloaded on June 10th, 2022 from the website https://ourworldindata.org/covid-deaths, which is updated every day (Table 2). The data consist of 59 attributes in total. We chose the attribute with the greatest correlation value in the set of very strong correlation values for building features, combined with location, and total deaths was chosen as the label. The raw data chosen comprise about 208,111 instances and were cleaned by keeping only specific attributes.

Table 2 (fragment): MLlib data types

| MLlib Sections | Item | Features |
| --- | --- | --- |
| Data types | Local vector | A vector with integer, zero-based indices and double values. The data can be stored densely or sparsely. |
| Data types | Labeled point | A local vector with a label, used in supervised machine learning algorithms. Labels are 0 and 1, or start from 0 (0, 1, 2, ...). The data can be stored in dense or sparse form. |

As Fig. 1 shows, we need to process the data into the right format by using PySpark libraries. The selected columns are divided into two parts: one part for features and another for labels.
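The split into a features part and a label part can be illustrated without Spark. The sketch below is plain Python, not the PySpark API: it only mimics, in simplified form, the idea of indexing a string column, one-hot encoding it, and appending a numeric column to form a feature vector. The helper names (`fit_string_index`, `one_hot`, `assemble`) and the three sample rows are hypothetical, and Spark's own transformers have options and defaults that this sketch does not reproduce.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a binary vector of length `size` with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def assemble(location, total_cases, location_index):
    """Concatenate the one-hot location vector with the numeric total-cases value."""
    encoded = one_hot(location_index[location], len(location_index))
    return encoded + [float(total_cases)]


# Hypothetical rows: (location, total cases, total deaths).
rows = [("Thailand", 120, 3), ("Vietnam", 80, 1), ("Thailand", 200, 3)]

location_index = fit_string_index(loc for loc, _, _ in rows)
label_index = fit_string_index(str(deaths) for _, _, deaths in rows)

features = [assemble(loc, cases, location_index) for loc, cases, _ in rows]
labels = [label_index[str(deaths)] for _, _, deaths in rows]

print(features)
print(labels)
```

Treating total deaths as a string to be indexed turns every distinct death count into its own class, which is what makes the task a classification problem rather than a regression problem.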
We applied StringIndexer to convert the categorical column into label indices and applied OneHotEncoder to produce a binary vector; after that, we applied VectorAssembler to combine it with the total cases column into the features column for prediction. We also applied StringIndexer to turn total deaths into a label column as the prediction target (see Table 2).

Besides the accuracy metric, which evaluates the ratio of correct targets to total targets, we also evaluated Precision, Recall, and F1-score, which are of great importance in medicine. Precision is the share of predicted positive cases that are truly positive, while Recall is the share of truly positive cases that are detected; both help to decide the right method for treatment. F1-score, calculated as the harmonic mean of Recall and Precision, summarizes the balance between the two. In medicine, whether Recall or Precision is given priority depends on the patients' situation.

Compared to deep learning, we also analyzed the data with deep learning models [8–13] such as LSTM and GRU, but obtained worse results: the training time was too long (5,435 s/step), and the accuracy was 0.138 for the first step and 0.1384 for the second step. The models had a total of 202,878,594 parameters, and the batch size was 1,318. PySpark showed better performance, with the best accuracy and the least time to evaluate.

# IV. Results

In this paper, we tried the models in PySpark and chose the models that could analyze the data. After trying the models in Spark.MLlib and Spark.ML, we got the results in Table 3. The results showed that Naïve Bayes has the best performance in predicting fatalities, with an accuracy of 0.813. It was followed by the Decision Tree model, with an accuracy of 0.621. Table 4 shows some example prediction results with the models.
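As a minimal sketch of how Equations (1)–(4) can be evaluated, the functions below take per-class counts and sum them over the classes. The function names and the example counts passed to `accuracy` are hypothetical; the Precision and Recall values passed to `f1_score` are the Naïve Bayes and Decision Tree values reported in Table 3.

```python
def accuracy(true_counts, false_counts):
    """Equation (1): correct predictions over all predictions, summed over classes."""
    correct = sum(true_counts)
    wrong = sum(false_counts)
    return correct / (correct + wrong)


def precision(true_positives, false_positives):
    """Equation (2): true positives over all predicted positives."""
    tp = sum(true_positives)
    return tp / (tp + sum(false_positives))


def recall(true_positives, false_negatives):
    """Equation (3): true positives over all actual positives."""
    tp = sum(true_positives)
    return tp / (tp + sum(false_negatives))


def f1_score(p, r):
    """Equation (4): harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r)


# Hypothetical per-class counts for two classes.
print(accuracy([8, 5], [2, 1]))            # 0.8125

# Precision and Recall as reported in Table 3.
print(round(f1_score(0.571, 0.381), 3))    # Naive Bayes: 0.457
print(round(f1_score(0.824, 0.013), 3))    # Decision Tree: 0.026
```

Both computed F1 values agree with the F1-Score column of Table 3. Because the harmonic mean is dominated by the smaller of its two inputs, the Decision Tree's high Precision (0.824) cannot compensate for its very low Recall (0.013).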
![The world has come to the heart-rending day when fatalities passed 4 million people, while the crisis is becoming a race between vaccination and new dangerous variants. Prediction is another way to control the Covid-19 situation and to propose new methods for facing the new stage of the devastating coronavirus [18–21].](image-2.png "")

![The State-of-the-Art Machine Learning in Prediction Covid-19 Fatality Cases. Global Journal of Computer Science and Technology, Volume XXII, Issue I, Version I](image-3.png "")

![Fig. 1: Steps to Process Data](image-4.png "Fig. 1 :")

Table 3: Results of the models

| Models | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Naïve Bayes | 0.813 | 0.571 | 0.381 | 0.457 |
| Random Forest | 0.139 | 0.632 | 0.003 | 0.005 |
| Decision Tree | 0.621 | 0.824 | 0.013 | 0.026 |

Table 4: Example prediction results. Columns: Label, Naïve Bayes, Random Forest, Decision Trees. Cell values as extracted (column boundaries not recoverable): 8.054320.03231623.0161838.028311243.046341

# V. Conclusion

© 2022 Global Journals

# References

* Manpinder Singh, Saiba Dalmia. Prediction of number of fatalities due to Covid-19 using Machine Learning. IEEE 17th India Council International Conference (INDICON), 2020.
* Ahmed Sirage Zeynu. Analysis and forecasting the outbreak of Covid-19 in Ethiopia using machine learning.
* Tamer Sh. Mazen. A novel machine learning based model for COVID-19 prediction. International Journal of Advanced Computer Science and Applications, 2020.
* Roseline Oluwaseun Ogundokun, Joseph Bamidele Awotunde. Machine learning prediction for Covid 19 pandemic in India. 2020.
* Sudhir Bhandari, Ajit Singh Shaktawat, Amit Tak, Bhoopendra Patel, Jyotsna Shukla, Sanjay Singhal, Kapil Gupta, Jitendra Gupta, Shivankan Kakkar, Amitabh Dube. Logistic Regression Analysis to Predict Mortality Risk in COVID 19 Patients from Routine Hematologic Parameters. Ibnosina Journal of Medicine and Biomedical Sciences, 2020.
* Arjun S. Yadaw, Yan-Chak Li, Sonali Bose, Ravi Iyengar, Supinda Bunyavanich, Gaurav Pandey. Clinical predictors of COVID-19 mortality: development and validation of a clinical prediction model. Lancet Digit Health, 2020.
* Shreshth Tuli, Shikhar Tuli, Rakesh Tuli, Sukhpal Singh Gill. Predicting the growth and trend of Covid-19 pandemic using machine learning and cloud computing. Elsevier Public Health Emergency Collection, 2020.
* Mohammad Behdad Jamshidi, Ali Lalbakhsh, Jakub Talla, Zdeněk Peroutka, Farimah Hadjilooei, Pedram Lalbakhsh, Morteza Jamshidi, Luigi La Spada, Mirhamed Mirmozafari, Mojgan Dehghani, Asal Sabet, Saeed Roshani, Sobhan Roshani, Nima Bayat-Makou, Bahare Mohamadzade, Zahra Malek, Alireza Jamshidi, Sarah Kiani, Hamed Hashemi-Dezaki, Wahab Mohyuddin. Deep Learning Approaches for Diagnosis and Treatment. 2020.
* Sina F. Ardabili, Amir Mosavi, Pedram Ghamisi, Filip Ferdinand, Annamaria R. Varkonyi-Koczy, Uwe Reuter, Timon Rabczuk, Peter M. Atkinson. COVID-19 Outbreak Prediction with Machine Learning. MDPI, 2020.
* Mohammadreza Nemati, Jamal Ansary, Nazafarin Nemat. Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data. CellPress, 2020.
* A. F. M. Batista, J. L. Miraglia, T. H. R. Donato, A. D. P. Chiavegatto Filho. COVID-19 diagnosis prediction in emergency care patients: a machine learning approach. CSH, 2020.
* Hongwei Zhao, Naveed N. Merchant, Alyssa McNulty, Tiffany A. Radcliff, Murray J. Cote, Rebecca Fischer, Huiyan Sang, Marcia G. Ory. COVID-19: Short term prediction model using daily incidence data. Plos One Collection, 2021.
* Rajan Gupta, Gaurav Pandey, Poonam Chaudhary, Saibal K. Pal. Machine Learning Models for Government to Predict COVID-19 Outbreak. ACM Journal, 2020.
* Anthony Kelly, Marc Anthony Johnson. Investigating the Statistical Assumptions of Naïve Bayes Classifiers. 55th Annual Conference on Information Sciences and Systems (CISS), 2021.
* B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), 2019.
* B. S. Sharmila, Rohini Nagapadma. Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering, 2019.
* Tuan Nguyen M., Meesad P., Nguyen Ha H. English-Vietnamese Machine Translation Using Deep Learning. Recent Advances in Information and Communication Technology, 10.1007/978-3-030-79757-710, 2021.
* Maliyaem M., Tuan Nguyen M., Lockhart D., Muenthong S. A Study of Using Machine Learning in Predicting COVID-19 Cases. Cloud Computing and Data Science, 2022.
* Tuan Nguyen M., Phayung Meesad. A Study of Predicting the Sincerity of a Question Asked Using Machine Learning. 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), 2021.
* Tuan Nguyen M. Machine Learning Performance on Predicting Banking Term Deposit. International Conference on Enterprise Information Systems (ICEIS), 2022.
* Siti Nurhidayah Sharin, Mohamad Khairil Radzali, Muhamad Shirwan Abdullah-Sani. A network analysis and support vector regression approaches for visualizing and predicting the COVID-19 outbreak in Malaysia. ScienceDirect, 2022.