In paper [1], the authors used linear regression and polynomial regression to predict fatalities. The two algorithms were applied to find the best-fit line estimating the average relationship between the two variables; both depend on the variation and dispersion of the data. The best-fit line divides the data into two parts, balancing the distances of the data values from the line. They also used root mean square error (RMSE) to estimate the accuracy of the prediction. RMSE is a metric for the error of a regression analysis: it is the square root of the mean of the squared differences between predicted and observed values, and it measures the variation and concentration of the values around the fitted line. Many kinds of data can be represented as in Fig 1; the exactness of the estimate depends on the distribution of the data.
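The RMSE used in [1] can be sketched in a few lines of plain Python; the data here is a toy series, not the paper's fatality data:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    residuals = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(residuals) / len(residuals))

# Toy example: observed fatalities vs. a regression line's estimates.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 16.0, 18.0]
print(rmse(actual, predicted))  # squared errors 1, 1, 1, 4 -> sqrt(1.75) ≈ 1.323
```

Because the residuals are squared before averaging, a few widely dispersed points dominate the score, which is why the metric reflects the spread of the data around the fit.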
In paper [2], the authors predicted the Covid-19 outbreak in Ethiopia by comparing a Support Vector Machine (SVM) model and a Polynomial Regression (PR) model from the Scikit-Learn library. The paper showed that SVM achieves better performance than PR, based on evaluating graph performance and the Mean Square Error (MSE) and Mean Absolute Error (MAE) metrics [3-7, 9]. As with the evaluation in paper [1], the results depend on the distribution of the data: these metrics are built on the mean of the values, so if the data are dense around the prediction, the mean of the values will be close to the mean of the prediction. Such a calculation usually yields approximate rather than exact values.
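A comparison of this kind can be sketched with Scikit-Learn, the library named in [2]; the cumulative-case curve below is synthetic, and the kernel and degree choices are our assumptions, so the winner on this toy data need not match the paper's result on the Ethiopian data:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Synthetic cumulative-case curve standing in for the outbreak data.
days = np.arange(60).reshape(-1, 1)
cases = 5 * days.ravel() ** 2 + np.random.RandomState(0).normal(0, 50, 60)

# Train on the first 50 days, evaluate the forecast on the last 10.
train_X, test_X = days[:50], days[50:]
train_y, test_y = cases[:50], cases[50:]

models = {
    "SVM": SVR(kernel="rbf", C=1e4),
    "PR": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
for name, model in models.items():
    pred = model.fit(train_X, train_y).predict(test_X)
    print(name, "MSE:", mean_squared_error(test_y, pred),
          "MAE:", mean_absolute_error(test_y, pred))
```

Both MSE and MAE average the residuals over all test points, which is exactly the mean-based behaviour discussed above: dense data near the fitted curve pulls the scores down regardless of how individual points are predicted.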
In this paper, we considered the unformed data with the information in Fig 1. We calculated the correlation between the attributes of the data and applied the accuracy metric to evaluate the exact values.

(King Mongkut's University of Technology North Bangkok, Thailand. E-mails: [email protected], [email protected], http://kmutnb.ac.th/)

We formed the unformed data by calculating the correlation shown in Table 1. We defined a very strong positive correlation when the values are greater than or equal to 0.8; a strong positive correlation when the values are greater than or equal to 0.6 and smaller than 0.8; and a weak positive correlation when the values are greater than or equal to 0.4 and smaller than 0.6. We omitted no correlation (values in the interval -0.4 to 0.4), weak negative correlation (values smaller than or equal to -0.4 and greater than -0.6), strong negative correlation (values smaller than or equal to -0.6 and greater than -0.8), and very strong negative correlation (values smaller than or equal to -0.8). We tried several models and chose the accuracy metric to count the true predictions and the percentage of correct prediction. With this metric, we could evaluate the number of correct predictions exactly and depict the records related to the prediction.

PySpark, one of the branches of the Hadoop ecosystem, has become a strong and easy tool for analyzing data. With its powerful libraries, PySpark supplies structures for direct and indirect processing and a graph environment, with ease of use and short analysis times on big data. PySpark provides many modules with many kinds of functions, such as Spark SQL, DataFrame, Streaming, MLlib, and Spark Core, and it can handle big data with less time cost when analyzing classification problems. Table 2 shows the details of the modules and functions in the PySpark library. The analysis steps need not follow these modules in order; the data can be formed before the modules and functions are applied (Fig 1). Features are extracted from the data and transformed into the right form for the model using basic statistics. After that, we can confirm the kind of problem: classification, regression, or clustering.
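The correlation bands used for feature selection can be written as a small helper; the function name is ours, and only the three positive bands are kept, with everything else treated as omitted:

```python
def correlation_strength(r):
    """Classify a correlation coefficient r using the paper's thresholds.

    Only the three positive bands are kept for feature selection; the
    no-correlation interval and all negative bands are omitted."""
    if r >= 0.8:
        return "very strong positive"
    if r >= 0.6:
        return "strong positive"
    if r >= 0.4:
        return "weak positive"
    return "omitted"

print(correlation_strength(0.85))  # very strong positive
print(correlation_strength(0.45))  # weak positive
print(correlation_strength(-0.7))  # omitted
```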
Finally, we applied evaluation metrics to estimate the models (Equations 1-4):

Accuracy = \frac{\sum_{i=1}^{n} T_{iV}}{\sum_{i=1}^{n} T_{iV} + \sum_{j=1}^{m} F_{jV}} \quad (1)

Precision = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{i=1}^{n} F_{iP}} \quad (2)

Recall = \frac{\sum_{i=1}^{n} T_{iP}}{\sum_{i=1}^{n} T_{iP} + \sum_{j=1}^{m} F_{jN}} \quad (3)

F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (4)

where n and m are the numbers of classes; T_{iV} is a true value of prediction at class i and F_{jV} is a false value of the label at class j; T_{iP} is the true positive at class i, F_{iP} is the false positive at class i, and F_{jN} is the false negative at class j.

II. Literature Review

Nowadays, machine learning is becoming an essential part of computer science. PySpark is a strong application for analyzing data with open-source libraries, where we can run R, Python, Java, and Scala. PySpark is free for users and easy to use. It supports two strong libraries, the Spark MLlib and Spark ML packages, which can handle big data and analyze it in a very short time [17]. The processing for analyzing the data can follow Fig 2. We summarize the algorithms used in the PySpark library in Table 2.
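Equations 1-4 can be sketched in plain Python under the assumption of one label per record; note that with single-label multi-class data the micro-averaged sums make Precision and Recall coincide with Accuracy (every error is one false positive and one false negative), so evaluators that average per class will generally report different values:

```python
from collections import Counter

def multiclass_metrics(labels, preds):
    """Micro-averaged metrics in the spirit of Equations 1-4:
    sums of per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, preds):
        if y == p:
            tp[y] += 1
        else:
            fp[p] += 1  # predicted class p, but the label was y
            fn[y] += 1  # class y was missed
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    accuracy = TP / len(labels)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# One misclassification out of four records: all four metrics come out 0.75.
print(multiclass_metrics([0, 1, 2, 1], [0, 1, 1, 1]))
```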
In this paper, we used data downloaded on June 10th, 2022 from the website https://ourworldindata.org/covid-deaths, which is updated every day (Table 2). The data consist of 59 attributes in total, and we chose the attributes with the strongest correlations (Table 1).
Data types
A local vector has integer-typed, 0-based indices and double-typed values. The data can be distributed densely or sparsely.
A labeled point is a kind of local vector used with supervised machine learning algorithms when the data is labeled. Labels are 0 and 1 for binary problems, or start from 0, 1, 2, ... for multiclass problems. The data can be established in a dense or sparse distribution. As Fig 1 shows, we need to process the data into the right format using the PySpark libraries. The selected columns are divided into two parts: one for features and one for labels. We applied StringIndexer to index the categorical column, applied OneHotEncoder to establish a binary vector, and then applied VectorAssembler to combine it with the total cases column into the features column for prediction. We also applied StringIndexer to turn total deaths into the label column as the prediction target (see Table 2). Besides the accuracy metric, which evaluates the ratio of correct targets to total targets, we considered evaluating by Precision, Recall, and F1-score, which occupy a very important place in the medical field. Precision confirms how many of the predicted positive cases are truly positive, while Recall confirms how many of the actual positive cases are found, which helps decide the right method for curing. The F1-score, calculated as the harmonic mean of Recall and Precision, balances the two. In the medical branch, it is used to decide whether Recall or Precision should take priority, as appropriate to the patients' situation.
For comparison with deep learning, we also analyzed the data with deep learning models [8-13] such as LSTM and GRU, but obtained worse prediction results: the time cost is very high (5,435 s/step), and the accuracy is 0.138 for the first step and 0.1384 for the second step. The model for this data has a total of 202,878,594 parameters, with a batch size of 1,318. PySpark showed better performance, with the best accuracy and the least evaluation time.
In this paper, we tried the models in PySpark and chose the models that could analyze the data. After trying the models in Spark MLlib and Spark ML, we obtained the results in Table 3. The results show that Naïve Bayes has the best performance in predicting fatalities, with an accuracy of 0.813, followed by the Decision Tree model with an accuracy of 0.621. Table 4 shows some example prediction results with the models.
Models        | Accuracy | Precision | Recall | F1-Score
Naïve Bayes   | 0.813    | 0.571     | 0.381  | 0.457
Random Forest | 0.139    | 0.632     | 0.003  | 0.005
Decision Tree | 0.621    | 0.824     | 0.013  | 0.026
Label | Naïve Bayes | Random Forest | Decision Trees
8.0   | 5           | 4             | 3
20.0  | 32          | 3             | 16
23.0  | 16          | 1             | 8
38.0  | 28          | 31            | 12
43.0  | 46          | 3             | 41
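As a rough illustration of how far the Table 4 predictions sit from their labels, one can compute the mean absolute error over just these five rows (our calculation on this small sample, not a figure reported in the paper):

```python
labels = [8.0, 20.0, 23.0, 38.0, 43.0]
predictions = {
    "Naive Bayes":   [5, 32, 16, 28, 46],
    "Random Forest": [4, 3, 1, 31, 3],
    "Decision Tree": [3, 16, 8, 12, 41],
}

def mae(actual, pred):
    """Mean absolute error between the label column and one model's column."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

for name, pred in predictions.items():
    print(name, mae(labels, pred))  # Naive Bayes 7.0, Random Forest 18.0, Decision Tree 10.4
```

On these rows Naïve Bayes stays closest to the labels, consistent with its best accuracy in Table 3.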
V. Conclusion
Intrusion Detection System using Naive Bayes algorithm. IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), 2019.
COVID-19: Short term prediction model using daily incidence data. PLOS ONE Collection, 2021.
Prediction of number of fatalities due to Covid-19 using Machine Learning. IEEE 17th India Council International Conference (INDICON), 2020.
Investigating the Statistical Assumptions of Naïve Bayes Classifiers. 55th Annual Conference on Information Sciences and Systems (CISS), 2021.
Machine Learning Models for Government to Predict COVID-19 Outbreak. ACM Journal, 2020.
Logistic Regression Analysis to Predict Mortality Risk in COVID-19 Patients from Routine Hematologic Parameters. Ibnosina Journal of Medicine and Biomedical Sciences, 2020.
A novel machine learning based model for COVID-19 prediction. International Journal of Advanced Computer Science and Applications, 2020.
A Study of Predicting the Sincerity of a Question Asked Using Machine Learning. 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), 2021.