# Introduction
In an era of technological disruption, the demand for software has accelerated. Software systems are part of our society, play an important role in shaping it, and modern society is becoming increasingly reliant on complex software systems. It is therefore critical to build reliable and trustworthy systems in a cost-effective and timely manner. The presence of defective modules in a software product drives up development and maintenance expenses and leads to customer dissatisfaction, and the need for quality assurance remains one of the biggest challenges in today's software development environment. Software bug prediction is therefore an important task that helps developers locate bugs more efficiently.
Software bug prediction is an imperative task in the Software Development Life Cycle (SDLC), as it pertains to the overall success of software. One method in this direction is to use machine learning (ML) to predict defects in software; applying such methods early in the SDLC enhances product quality and lowers the cost of software maintenance. Many researchers have applied different theories and methodologies to software bug prediction, and two things are clear from the literature. First, no single prediction approach dominates (Lessmann et al., 2008); second, the use of varied datasets, data pre-processing, validation schemes, and performance statistics makes it challenging to compare the many reported prediction outcomes (Myrtveit et al., 2005). Depending on dataset availability, two ML settings are common. In the supervised approach, a software defect prediction model is built from a training dataset and then evaluated on a testing dataset. In the unsupervised approach, the defect prediction model is built directly from the available, unlabeled data, without a separate training phase.
Clustering algorithms are commonly used when the lack of labeled training data is a constraint. Cluster analysis groups objects into clusters based on their similarity to create a structured representation of the data (Jain and Dubes, 1998). As pointed out by Kaur (2010), K-means clustering is one of the best-known instances of unsupervised learning. Clustering is beneficial because it makes relevant information easier and faster to locate, and among existing clustering approaches K-means is clearly the most popular (Gayathri et al., 2015). The initial centroids, which are generated randomly each time the algorithm runs, have a significant impact on K-means performance: the algorithm frequently falls into local optima that produce poor clustering results. Obtaining a globally optimal clustering involves an exhaustive, time-consuming search over all possible partitions. A heuristic alternative is to use an optimization algorithm to search toward the global optimum in each iteration.
Our unsupervised approach uses k-means to divide the unlabeled dataset into non-overlapping defective and non-defective clusters for bug prediction. The goal of this research is to verify the efficacy of the hybrids and to quantify the quality of results produced by each hybrid clustering model. In this study, we combine the k-means clustering algorithm, an unsupervised algorithm, with different NIAs including the Genetic Algorithm (GA), Bat Algorithm (BA), Particle Swarm Optimization (PSO), Coral Reefs Optimization (CRO), Cuckoo Search Optimization (CSO), Ant Colony Optimization (ACO), the Firefly Algorithm (FA) and the Grey Wolf Optimizer (GWO) for software bug prediction. The rest of this paper is organized as follows. Section 2 discusses related work in software bug prediction. Section 3 gives an overview of the algorithms used. Section 4 describes the proposed method, Section 5 the dataset and data processing, and Section 6 the evaluation methodology. Results are discussed in Section 7, practical implications in Section 8, and conclusions and future work in Section 9.
# II.
# Related Works
K-means clustering is a well-known partitional clustering algorithm that has been used in a variety of applications, and several variations of K-means have been proposed in the literature to improve its performance on the general clustering problem. Fong et al. (2012) studied the integration of bio-inspired optimization methods into K-means clustering for software bug prediction in order to assess clustering performance. The main optimization algorithms tested were the Firefly, Cuckoo Search, Bat, Wolf and Ant Colony Optimization (ACO) algorithms. Results show that these combinations achieved better accuracy than ordinary k-means while accelerating the search process and avoiding local optima. Zhong et al. (2004) compared the k-means algorithm with the neural-gas algorithm; neural gas outperformed k-means in terms of mean square error. However, their method requires a software engineering expert to judge whether the software is appropriate.
Annisa et al. (2020) proposed an improved version of the k-means algorithm for software bug prediction that locates the initial centroids of k-means and determines the number of clusters present. Because it produces better accuracy than plain K-means, the method could also be useful for clustering other data types. Seliya and Khoshgoftaar (2007) proposed a K-means approach for software failure prediction; their method iteratively labels clusters as fault-prone or not, using expert domain knowledge as a constraint.
A quad-tree-based k-means algorithm was proposed by Bishnu and Bhattacherjee (2012) and compared with several clustering algorithms; its error rates are comparable to those of k-means, Linear Discriminant Analysis and Naive Bayes. Catal et al. (2009) used the x-means clustering algorithm to create faulty and non-faulty clusters based on software metrics: lines of code, cyclomatic complexity, and operand and operator counts. If a module's metric values exceed the threshold, the software entity is predicted to be defective, and vice versa. Almayyan (2021) used datasets from the NASA repository with three clustering algorithms, Farthest First, X-means and Self-Organizing Map, and compared feature selection based on the Bat, Cuckoo, Grey Wolf Optimizer (GWO), and Particle Swarm Optimization (PSO) algorithms. Farthest First clustering was found to be effective in predicting software faultiness, and Bat and Cuckoo proved useful compared with all the other metaheuristic algorithms.
Though several researchers have sought to merge K-means clustering with nature-inspired algorithms (NIAs), their efforts have been restricted to algorithms with very similar group movements, such as the Firefly, Artificial Bee Colony (ABC), and Particle Swarm Optimization (PSO) algorithms (Jensi and Jiji, 2015). In addition, only a few bio-inspired optimization methods integrated with K-means appear in previous studies. Only 7 of the 28 NIAs hybridized with K-means (Genetic Algorithm, Particle Swarm Optimization, Bat Algorithm, Artificial Bee Colony, Differential Evolution, Harmony Search, and Symbiotic Organism Search) dedicated their hybridization to solving automatic clustering problems, accounting for 20.6 percent of the total (Ikotun et al., 2021). In general, the rate of publication on K-means hybridization with specific NIAs is minimal. More research is needed to see whether other combinations can improve on the existing hybridization algorithms, which suggests that combining K-means with these other NIAs to solve automatic clustering problems should be investigated.
The purpose of this research is to investigate the mechanics of incorporating certain NIAs into the K-means clustering algorithm. The optimization function improves on the existing best solution by progressively exploring new solutions from unvisited regions of the search space. When a new solution is found to be better than the present one, the searching agents replace the current solution and continue searching until the stopping criteria are fulfilled.
# III.
# Methodology
# a) K-means Clustering Algorithm
The K-means clustering algorithm is a partitioned clustering technique that divides a dataset into k number of clusters using a certain fitness measure. Due to the large amount of data objects in real-world datasets, distributing data items into appropriate clusters to obtain an ideal cluster outcome is computationally expensive and time-consuming (Ikotun et al.2021).
Given a dataset $X = \{x_i\}$, $i = 1, 2, \ldots, n$, of $d$-dimensional data points of size $n$, $X$ is partitioned into $k$ clusters such that
$$J(c_k) = \sum_{x_i \in c_k} \|x_i - \mu_k\|^2 \tag{1}$$
with the objective of minimizing the sum of squared errors over all $k$ clusters, that is, minimizing
$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \|x_i - \mu_k\|^2 \tag{2}$$
When assigning $N$ objects to $K$ clusters, the clustering algorithm implicitly searches over a very large number of possible partitions, which can be expressed as:
$$S(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{K-i} \binom{K}{i} i^N \tag{3}$$
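The within-cluster sum-of-squares objective of Eq. (2) is straightforward to evaluate in code. The following sketch (with hypothetical toy data, not any dataset from this study) computes $J(C)$ for a given assignment:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Eq. (2): sum of squared distances of each point to its assigned centroid."""
    diffs = X - centroids[labels]   # per-point offset from its own centroid
    return float(np.sum(diffs ** 2))

# Toy data: two tight groups on a single feature.
X = np.array([[0.0], [0.2], [4.0], [4.2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.1], [4.1]])
print(kmeans_objective(X, labels, centroids))  # 4 points, each 0.1 away: ~0.04
```

A lower value of this objective indicates tighter, more coherent clusters, which is exactly what the hybrid algorithms later in the paper try to minimize.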
# b) Nature-Inspired Algorithms (NIAs)
Nature-inspired computation has gained popularity over the previous two decades and has been used in practically every field of science and engineering (Yang et al., 2013). NIAs are global optimization strategies for solving difficult real-world problems (Okwu et al., 2020), and they have successfully provided near-optimal solutions to automatic clustering problems in a reasonable amount of time (Hruschka et al., 2009). A nature-inspired metaheuristic uses a population to explore the search space, ensuring a higher probability of finding optimal cluster partitions (Nanda and Panda, 2014). Combining K-means with NIAs for automatic clustering has been found to improve algorithm performance in cluster analysis: in most circumstances, automatic determination of the cluster number aids the selection of near-optimal starting centroids rather than the usual random selection (Zhou et al., 2017).
# c) Combination of k-means with Nature-Inspired Algorithms (NIAs)
Clustering with NIAs reduces to assigning combinations of centroids to the searching agents, allowing them to find the best answer heuristically. Though the specifics of the heuristic search vary with the chosen nature-inspired optimization technique, the initialization stage and the finishing step, where the quality of the discovered solution is evaluated as a stopping condition, are comparable across algorithms.
In the initialization construct, $S$ is defined as the solution space containing a finite number of solutions $x_i$, where $i$ is the solution's index. Regardless of the type of bio-inspired optimization method used, the search agents represent the solutions $x$, each of which holds a set of centroids. Typically, a large population of $N$ searching agents is used to collaboratively search for the best feasible cluster configuration (expressed by the locations of the optimal centroids). $K$ is the number of clusters to be formed, generally a user-defined figure, and $D$ is the dimension of the search space, i.e. the number of attributes a data point possesses.
To find the optimal configuration of centroids, let $cen_{j,v}$ be the centroid value for the $j$-th cluster and the $v$-th attribute. The centroid location is obtained with the following formula:
$$cen_{j,v} = \frac{\sum_{i=1}^{S} w_{i,j}\, x_{i,v}}{\sum_{i=1}^{S} w_{i,j}}, \qquad j = 1, \ldots, K, \quad v = 1, \ldots, D \tag{4}$$
In our formulation, $cen_{j,v}$ is a two-dimensional matrix with $K \times D$ entries that contains all of the cluster centers. The fitness of a candidate set of centroids is then:
$$F(cen) = \sum_{j=1}^{K} \sum_{i=1}^{S} w_{i,j} \sum_{v=1}^{D} (x_{i,v} - cen_{j,v})^2 \tag{5}$$
The calculation loops $K \times D$ times in total, evaluating every attribute $v$ of each $x$ in every cluster to compute the distance between each $x$ and its centroid.
Cluster centers can be initialized from data points. For example, in a clustering task with two clusters over three-dimensional data, each solution must encode $2 \times 3 = 6$ variables, and the $i$-th solution may be written as $x_i = (i, [x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}, x_{i,5}, x_{i,6}])$. The clustering (assignment) step can then be formulated as:
$$clmat_{i} = \min_{1 \le k \le K} \{\|x_i - cen_k\|\} \tag{6}$$
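Eqs. (4)-(6) can be sketched directly with NumPy. The membership matrix `W` below is an assumed one-hot encoding of cluster assignments (one row per point, one column per cluster); the data are hypothetical:

```python
import numpy as np

def update_centroids(X, W):
    """Eq. (4): weighted mean of the points assigned to each cluster.
    W is an (n, k) one-hot membership matrix, X an (n, d) data matrix."""
    return (W.T @ X) / W.sum(axis=0)[:, None]

def fitness(X, W, cen):
    """Eq. (5): within-cluster sum of squared attribute differences."""
    return float(sum(np.sum((X[W[:, j] == 1] - cen[j]) ** 2)
                     for j in range(cen.shape[0])))

def assign(X, cen):
    """Eq. (6): index of the nearest centroid for each data point."""
    d = np.linalg.norm(X[:, None, :] - cen[None, :, :], axis=2)
    return d.argmin(axis=1)

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
W = np.array([[1, 0], [1, 0], [0, 1]])   # first two points in cluster 0
cen = update_centroids(X, W)
print(cen)                  # [[ 1.  0.] [10. 10.]]
print(fitness(X, W, cen))   # 2.0
print(assign(X, cen))       # [0 0 1]
```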
Sets of functional parameters must be defined in order to execute the bio-inspired optimization algorithms. Although some parameters are shared, the parameter set for each hybrid bio-inspired clustering algorithm is designed independently. The six models investigated are K-means with the Genetic Algorithm, the Bat algorithm, the Ant Colony algorithm, the Cuckoo Search algorithm, the Firefly algorithm and the Coral Reefs algorithm. The most significant variations lie in how the global exploration is carried out in each algorithm. The evaluation stage comes right after the exploration construct and checks whether the new solution is better than the current best one.
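The common structure shared by all the hybrids (agents holding flattened centroid vectors, an exploration move, and a keep-if-better evaluation stage) can be sketched as below. This is an illustrative skeleton only: the Gaussian perturbation stands in for whichever NIA move rule is plugged in, and the parameter names are assumptions, not values from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def sse(cen_flat, X, k):
    """Assign each point to its nearest centroid; return total squared error."""
    cen = cen_flat.reshape(k, X.shape[1])
    d = ((X[:, None, :] - cen[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).sum())

def hybrid_cluster(X, k, n_agents=10, iters=50, step=0.5):
    """Skeleton of the hybrid scheme: each agent holds a candidate set of
    K*D centroid coordinates; every iteration perturbs the agents
    (placeholder for a specific NIA move) and keeps only improving moves."""
    agents = X[rng.choice(len(X), (n_agents, k))].reshape(n_agents, -1)
    best = min(agents, key=lambda a: sse(a, X, k)).copy()
    best_f = sse(best, X, k)
    for _ in range(iters):
        for i in range(n_agents):
            cand = agents[i] + rng.normal(0.0, step, agents[i].shape)
            if sse(cand, X, k) < sse(agents[i], X, k):   # evaluation stage
                agents[i] = cand
            if sse(agents[i], X, k) < best_f:
                best, best_f = agents[i].copy(), sse(agents[i], X, k)
    return best.reshape(k, X.shape[1]), best_f
```

With two well-separated blobs and `k=2`, the returned fitness is far below what a single shared centroid could achieve, illustrating the keep-if-better loop the text describes.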
# d) Genetic Algorithm
Genetic Algorithms (GA) are randomized heuristic search algorithms based on natural selection and genetic principles (Goldberg, 1989). The genetic operators used in the combination of K-means and GA are selection, distance-based mutation, and the K-means operator. The parameters were set according to the study of Bouhmala et al. (2015). The starting population P(0) is chosen at random; each allele in the population can be assigned a cluster number drawn uniformly from the set {1, ..., K}.
The selection operator picks a chromosome from the preceding population at random, according to the distribution:
$$P(s_i) = \frac{F(s_i)}{\sum_{j=1}^{N} F(s_j)} \tag{7}$$
Solutions are thereby ranked by their probability of surviving into the next population; each solution in the population must be assigned a figure of merit, or fitness value:
$$F(s_w) = \begin{cases} g(s_w) & \text{if } g(s_w) \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
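Eqs. (7)-(8) together describe roulette-wheel selection with negative merit clamped to zero. A minimal sketch (the seed and toy scores are assumptions for illustration):

```python
import random

def fitness_values(scores):
    """Eq. (8): clamp any negative figure of merit to zero."""
    return [max(g, 0.0) for g in scores]

def select(population, scores, rng=random.Random(42)):
    """Eq. (7): pick a chromosome with probability proportional to fitness."""
    f = fitness_values(scores)
    total = sum(f)
    r = rng.uniform(0.0, total)      # spin the roulette wheel once
    acc = 0.0
    for chrom, fi in zip(population, f):
        acc += fi
        if r <= acc:
            return chrom
    return population[-1]

pop = ["s1", "s2", "s3"]
# s3 has negative merit, so its fitness clamps to 0 and it can never win.
print(select(pop, [1.0, 3.0, -2.0]))
```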
# e) Bat Algorithm (BA)
The bat algorithm (BA) is a heuristic optimization tool based on bat echolocation (Yang, 2010). The four basic parameters of BA are pulse frequency, pulse rate, velocity, and loudness. The parameters were set according to the study of Huang and Ma (2020).
The frequency, velocity, and position of each bat are initialized. The virtual bats' movement is described by updating their velocity and position with the equations below at each time step $t$, up to the iteration limit $T$.
$$f_i = f_{min} + (f_{max} - f_{min})\,\beta \tag{9}$$
$$v_i^{t+1} = v_i^{t} + (x_i^{t} - x_*)\, f_i \tag{10}$$
$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{11}$$
When the bat positions are updated, a random number is generated; if it is greater than the pulse emission rate, a new location is generated around the current best solutions, as shown in the equation below:
$$x_{new} = x_{old} + \epsilon\, A^t \tag{12}$$
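Eqs. (9)-(12) can be sketched as two small update functions. The frequency bounds and the loudness scale below are assumed illustration values, not the parameter settings of Huang and Ma (2020):

```python
import numpy as np

rng = np.random.default_rng(1)

def bat_step(x, v, x_best, f_min=0.0, f_max=2.0):
    """One BA move: Eqs. (9)-(11) for a single bat."""
    beta = rng.random()                    # random draw in [0, 1)
    f = f_min + (f_max - f_min) * beta     # Eq. (9): pulse frequency
    v_new = v + (x - x_best) * f           # Eq. (10): velocity update
    x_new = x + v_new                      # Eq. (11): position update
    return x_new, v_new

def local_walk(x_old, loudness_avg, eps_scale=0.01):
    """Eq. (12): random walk around a current best solution,
    scaled by the average loudness of the swarm."""
    eps = rng.uniform(-1.0, 1.0, size=x_old.shape)
    return x_old + eps * loudness_avg * eps_scale
```

The local walk is triggered only when the drawn random number exceeds the bat's pulse emission rate, as the text states.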
# f) Ant Colony Optimization (ACO)
The ACO heuristic was inspired by studies of ant foraging behavior in real colonies, which showed that ants can often find the shortest path between a food source and the nest (Zheng et al., 2003). The parameters were set according to the study of Tang et al. (2012).
When an ant moves from node $i$ to node $j$, the set of nodes available at the start can be written as $A = \{0, 1, \ldots, n-1\}$. The pheromone trail reflects the deposits accumulated by previous ants during migration and reveals the relative attractiveness of a path, while the exponent $\alpha$ weights its influence: the larger $\alpha$ is, the higher the probability that subsequent ants choose this path.
The probability of ant $k$ moving from $i$ to $j$ at time $t$ is computed with the following formula:
$$P_{ij}^{k}(t) = \frac{\tau_{ij}^{\alpha}(t)\, \eta_{ij}^{\beta}(t)}{\sum_{s \in A} \tau_{is}^{\alpha}(t)\, \eta_{is}^{\beta}(t)} \tag{13}$$
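The transition rule in Eq. (13) is just a weighted normalization over the candidate moves. A minimal sketch (the pheromone and visibility values below are hypothetical):

```python
def transition_probs(tau, eta, alpha=1.0, beta=2.0):
    """Eq. (13): probability of moving to each candidate node, proportional
    to pheromone tau[j]^alpha times heuristic visibility eta[j]^beta."""
    weights = [(t ** alpha) * (e ** beta) for t, e in zip(tau, eta)]
    total = sum(weights)
    return [w / total for w in weights]

# Two candidate edges with equal pheromone, but edge 0 is twice as "visible".
p = transition_probs(tau=[1.0, 1.0], eta=[2.0, 1.0])
print(p)  # [0.8, 0.2]: weights 2^2 = 4 vs 1^2 = 1, normalized
```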
# g) Firefly Algorithm (FA)
The Firefly algorithm is a powerful technique for solving constrained optimization and NP-hard problems (Apostolopoulos and Vlachos, 2011). The parameters were set according to the study of Tang et al. (2012).
The attractiveness of firefly $i$ to firefly $j$ is determined by the brightness of firefly $i$ and the distance $r_{ij}$ between the two fireflies, as shown below:
$$I(r) = \frac{I_s}{r^2} \tag{14}$$
Consider the case where there are $n$ fireflies and the solution for firefly $i$ is $x_i$. The brightness of firefly $i$ is linked to the objective function $f(x_i)$:
$$I_i = f(x_i) \tag{15}$$
Each firefly has an attractiveness value, and the less bright (less attractive) firefly is drawn toward, and moved to, the brighter one. The attractiveness value $\beta$ is relative, depending on the distance between fireflies. (In Eq. (13), $\tau$ is the pheromone, a constant that represents weight, $Nc$ is the iteration count with its initial setting, and $\beta$ is the expected heuristic factor, which expresses the importance of visibility relative to the other factors along the ant's path.)
$$\beta(r) = \beta_0\, e^{-\gamma r^2} \tag{16}$$
In Eqs. (10) and (11), $v_i^t$ and $x_i^t$ are the velocity and position at time $t$, $v_i^{t+1}$ and $x_i^{t+1}$ are those at time $t+1$, and $\beta$ in Eq. (9) is a random number between 0 and 1.
In Eq. (16), $\beta_0$ is the firefly attractiveness at $r = 0$ and $\gamma$ is the light absorption coefficient of the medium.
In Eq. (12), $\epsilon$ is a random number and $A^t$ represents the average loudness of all bats at time $t$.
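The brightness and attractiveness rules of Eqs. (14)-(16) can be sketched as follows. The movement step combining them is the standard FA update pulling a dimmer firefly toward a brighter one; its deterministic part is shown here as an assumption, since the text does not spell the full update out:

```python
import math

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Eq. (16): attractiveness decays with the squared distance r."""
    return beta0 * math.exp(-gamma * r ** 2)

def move_towards(xi, xj, beta0=1.0, gamma=1.0):
    """Standard FA move of a dimmer firefly at xi toward a brighter one at xj
    (the usual additional random term is omitted for clarity)."""
    r = math.dist(xi, xj)
    b = attractiveness(r, beta0, gamma)
    return [x + b * (y - x) for x, y in zip(xi, xj)]

print(attractiveness(0.0))  # 1.0: full attraction at zero distance
```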
# h) Cuckoo Search Optimization (CSO)
An initial population of $n$ nests is randomly generated at positions $X = \{x_1^0, x_2^0, \ldots, x_n^0\}$, and the objective values are evaluated to find the current global best $g_0$.
The new position is then obtained by performing a Lévy flight:
$$x_i^{(t+1)} = x_i^{(t)} + \alpha \otimes \text{Lévy}(\lambda) \tag{17}$$
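The Lévy($\lambda$) term in Eq. (17) is commonly realized with Mantegna's algorithm; the sketch below assumes that realization (the text does not say which one the study used) and uses an assumed exponent of 1.5:

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(7)

def levy_step(dim, lam=1.5):
    """One Levy-distributed step via Mantegna's algorithm, a common way
    to realize the Levy(lambda) term in Eq. (17)."""
    sigma = (gamma(1 + lam) * sin(pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, dim)   # heavy-tailed numerator
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / lam)

def cuckoo_move(x, alpha=1.0):
    """Eq. (17): propose a new nest position by a Levy flight around x."""
    return x + alpha * levy_step(len(x))
```

The heavy-tailed steps give cuckoo search its mix of many small local moves and occasional long jumps, which is what helps it escape local optima.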
# i) Coral Reefs Optimization Algorithm (CRO)
CRO is another nature-inspired algorithm, based on an artificial simulation of coral reef formation and reproduction (Sanz et al., 2014). To our knowledge, the CRO algorithm has never been applied to software bug detection. At each iteration step of CRO, corals reproduce, producing new individuals. The parameters were set according to the study of Medeiros et al. (2015).
The CRO algorithm generates an N x M square grid, allocating a coral (or colony of corals) to each square (i, j), so that each square may represent an alternative solution to the problem. Coral formation is the second phase: after the three reproduction phases (broadcast spawning, brooding, and larvae settling), the entire collection of corals in the reef is ranked according to its level of healthiness.
# j) Particle Swarm Optimization (PSO)
The behavior of particles in a swarm is the central concept of PSO. Each particle has its own location in a multidimensional space and communicates with the others, using social and cognitive information to move about the space. When the algorithm halts, the best solution found is returned (Koohi and Groza, 2014). The parameters were set according to the study of Rana et al. (2010).
The inertia weight balances the algorithm's local and global search abilities, defining the proportional contribution of the previous velocity to the current one:
$$v_i^{k+1} = w\, v_i^{k} + c_1\, rand\, (p_{best_i} - x_i^{k}) + c_2\, rand\, (g_{best} - x_i^{k}) \tag{18}$$
$$x_i^{k+1} = x_i^{k} + v_i^{k+1} \tag{19}$$
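Eqs. (18)-(19) can be sketched as a single step function; the inertia and acceleration coefficients below are common textbook defaults, not the settings of Rana et al. (2010):

```python
import random

rng = random.Random(3)

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """Eqs. (18)-(19): the new velocity blends inertia, the cognitive pull
    toward the particle's own best, and the social pull toward the swarm best."""
    v_new = [w * vi
             + c1 * rng.random() * (pb - xi)   # cognitive term
             + c2 * rng.random() * (gb - xi)   # social term
             for xi, vi, pb, gb in zip(x, v, p_best, g_best)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new
```

Starting from rest at the origin with both bests at 1.0, the particle always moves toward 1.0, never past 3.0 (the maximum combined pull with these coefficients).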
# k) Grey Wolf Optimizer (GWO)
The Grey Wolf Optimizer (GWO) is a simple, population-based, flexible, and derivative-free metaheuristic optimization method that intelligently avoids stagnating in local optima of the search space. It simulates the social behavior of grey wolves in their hierarchical leadership and hunting movement (Mirjalili et al., 2013). The grey wolves' leadership and hunting mechanism motivates a metaheuristic with three steps: searching for prey, encircling prey, and attacking prey.
During GWO operation, the positions of the wolves are continuously updated with the appropriate mathematical formulas (Hou et al., 2022). The parameters were set according to the study of Wang et al. (2019).
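The text does not reproduce the GWO update formulas, so the sketch below uses the standard position update from Mirjalili et al. (2013): each candidate is pulled toward the three best wolves (alpha, beta, delta), with the coefficient `a` decaying from 2 to 0 over the iterations.

```python
import numpy as np

rng = np.random.default_rng(5)

def gwo_step(x, alpha, beta, delta, a):
    """Standard GWO move: average of the positions suggested by the three
    leading wolves; a controls the exploration/exploitation balance."""
    new = np.zeros_like(x)
    for leader in (alpha, beta, delta):
        r1, r2 = rng.random(len(x)), rng.random(len(x))
        A = 2 * a * r1 - a          # shrinks toward 0 as a decays
        C = 2 * r2
        D = np.abs(C * leader - x)  # distance to this leader
        new += leader - A * D
    return new / 3.0
```

With `a = 0` (the final iteration), the update collapses exactly onto the mean of the three leaders, which is the "attacking prey" phase.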
IV.
# Proposed Method
The purpose of clustering is to discover a proper set of centroids, using the metaheuristic of the nature-inspired method as a guide. The metaheuristic moves the centroids progressively in each phase, with the goal of finding the best grouping; ideally, the data points within each cluster end up closest to their own centroid. During the search, the centroids move around the search space, following the swarming pattern of the nature-inspired optimization method, until no further progress is observed: the search halts when no relocation yields a better result. Alongside the success of nature-inspired metaheuristics on automatic clustering problems, combining two or more metaheuristics for the same objective has been found to improve clustering performance. According to Nanda and Panda (2014), hybrid algorithms are superior to separate algorithms in terms of robustness, effectiveness, and accuracy.
V.
# Dataset and Data Processing
The dataset was collected from the online PROMISE repository; the AR1, AR3, AR4, AR5, AR6, KC1, KC2, JM1, CM1, PC1 and PC5 datasets were used. Following Shepperd et al. (2013), data cleaning is mandatory before using any of these datasets: we noted a large class imbalance between the faulty and non-faulty classes, and all data inconsistencies, missing values and null values were removed. Each selected dataset represents a NASA software system described by various metrics and is made up of a number of software modules and attributes. Modules with defects are classified as fault-prone, whereas those without defects are classified as non-fault-prone. For training, the entire dataset is used except for the last (output) column, and only columns containing numerical values were considered.

To address the limitations of the K-means clustering approach in generating globally optimal clusters, the proposed method uses the k-means algorithm together with a range of NIAs for software bug prediction. Adding an exploration function to k-means allows the combined model to improve: the exploration function refines the existing solution by examining regions beyond its immediate vicinity, and if a new solution better than the current best is discovered, the search agents move toward it. Exploration continues until the stopping criteria are met. Nature-inspired algorithms are metaheuristics, meaning they explore the combinatorial search space heuristically rather than exhaustively. The integration methods represent each search agent as a combination of centroid locations, and the agents then explore the search space for the best solution.

(In Eq. (17), $\alpha > 0$ denotes the step size, which should be scaled to the problem; in most circumstances $\alpha = 1$ can be used.)

VI.
# Evaluation
# a) Experimental Setup
The main goal of this research is to demonstrate the utility of the k-means algorithm combined with different NIAs, which we accomplished using TensorFlow to train the models. TensorFlow is an open-source machine learning platform for building and deploying prediction models. Google Colab was used to run the experiments, allowing the code to run with no configuration and free GPU access. Each experiment is repeated 10 times per dataset to obtain the average CPU time and objective function / best fitness values.
The clustering results of the new hybrid clustering algorithms are compared to plain K-means, which serves as the benchmarking reference. The full dataset is used for training, and clusters are formed over the entire set of data until convergence. The quality of the final clustering result is determined by each cluster's integrity, as represented by the objective function's final fitness value.
The hardware configuration used for all experiments in this study is as follows: Intel Core i7-6500U CPU @ 2.50 GHz, Windows 10, 64-bit operating system, x64-based processor, 8 GB DDR4 RAM, and an SSD hard disk.
# b) Performance Evaluation Measures
In order to assess the effectiveness of combining the k-means algorithm with the optimization algorithms for software bug prediction, the evaluation metrics accuracy and F-measure were calculated, as shown in Equation (20):
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$
where TP = true positives, TN = true negatives, FN = false negatives and FP = false positives.
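Eq. (20), together with the F-measure discussed next, can be computed directly from the four confusion-matrix counts. A minimal sketch with hypothetical counts:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (20): fraction of modules classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy(40, 50, 5, 5))   # 0.9
print(f_measure(40, 5, 5))      # ~0.889 (precision and sensitivity both 40/45)
```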
In addition, the F-measure, an external metric used to determine the accuracy of the clustering findings, is computed.
The F-measure, the harmonic mean of precision and sensitivity, is calculated as follows:

$$F = \frac{2 \times P \times Sensitivity}{P + Sensitivity} \tag{21}$$

where P refers to precision, and sensitivity is calculated from the proportion of non-defective modules that were accurately categorized.

From Table 2, in which K-means clustering is optimized with the various NIAs, all of the proposed hybrids perform better than the traditional standalone k-means algorithm. K-means takes the shortest computation time in every test, likely because it stops early in local optima (Table 3); this is also evident from the accuracies obtained. The NIAs speed up the movement of the cluster centroids and illustrate that partitioning clustering methods can be linked with a natural search process to escape local optima.

Second, plain K-means was combined with the robust GA, which shows adequate prediction accuracy on all datasets. Although GA may converge to the global optimum thanks to mutation, it faces computational challenges. Combining k-means with the Bat algorithm yields similar accuracy: this hybrid improves the convergence speed of BA and makes k-means independent of the initial centers. Next, K-means is combined with PSO: the PSO method starts the process because of its fast convergence, and the K-means algorithm then refines PSO's outcome toward near-optimal solutions. The hybridization of these two methods is effective in both efficiency and precision, and the PSO algorithm can generate good initial cluster centroids for K-Means.
# VII.
# Results and Discussions
# VIII.
# Practical Implications
Metaheuristic algorithms have proven to be effective optimizers. This research found that each hybrid K-means/nature-inspired optimization model outperformed the standalone K-means algorithm in terms of accuracy and F1 score. Given the intrinsic limitations of the K-means design and the virtues of nature-inspired optimization techniques, it is feasible to integrate them so that they complement each other and function together. The successful integration of these algorithms gives reason to believe that more advanced optimization-based mining techniques can be developed. This study can serve as a roadmap for researchers who want to incorporate other newly emerging NIAs into improved clustering methods in the field of software bug detection.
# IX.
# Conclusion and Future Works
Prediction of defect-prone software modules is an important goal in software engineering. The traditional clustering algorithm usually gets trapped in local optima; nature-inspired methods provide an alternative technique for solving clustering problems by virtue of their searching capabilities. This study's main contribution is combining the clustering algorithm with different NIAs for software bug detection. To the authors' knowledge, only the PSO, Cuckoo, Bat, and GWO (Grey Wolf Optimizer) algorithms had previously been applied with clustering algorithms for software bug detection (Almayyan, 2021). The results improve significantly when clustering algorithms are combined with bio-inspired optimization methods, most notably for the hybrid of k-means clustering with the Coral Reefs algorithm, which achieves an accuracy of 96%. For future work, this study can be replicated with other related datasets for the analysis of bug prediction in software.
Table 1

| Dataset | Modules | Defective modules | Software metrics (attributes) |
|---|---|---|---|
| AR1 | 121 | 9 | 29 |
| AR3 | 63 | 8 | 29 |
| AR4 | 107 | 20 | 29 |
| AR5 | 36 | 8 | 29 |
| AR6 | 101 | 15 | 29 |
| KC1 | 2109 | 1783 | 22 |
| KC2 | 522 | 107 | 21 |
| JM1 | 7782 | 1672 | 21 |
| CM1 | 327 | 42 | 37 |
| PC1 | 705 | 61 | 37 |
| PC5 | 1711 | 471 | 38 |
Table 2

| Algorithm | AR1 | AR3 | AR4 | AR5 | AR6 | KC1 | KC2 | JM1 | CM1 | PC1 | PC5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k-Means | 88.90 | 88.00 | 89.01 | 88.85 | 88.43 | 89.10 | 89.00 | 88.80 | 89.00 | 89.19 | 89.99 |
| K-Means + GA | 90.50 | 90.58 | 91.28 | 91.55 | 90.11 | 90.00 | 90.54 | 90.53 | 91.25 | 90.00 | 90.05 |
| K-Means + BAT | 90.00 | 91.59 | 91.00 | 92.34 | 92.00 | 92.98 | 91.34 | 90.00 | 91.25 | 92.56 | 92.00 |
| K-Means + PSO | 92.50 | 92.65 | 92.87 | 93.01 | 93.00 | 92.99 | 94.10 | 92.67 | 92.89 | 93.10 | 93.58 |
| K-Means + Coral Reefs | 94.00 | 94.54 | 94.56 | 94.87 | 94.00 | 95.96 | 95.66 | 96.88 | 95.01 | 95.04 | 95.54 |
| K-Means + Cuckoo | 94.50 | 94.58 | 94.58 | 94.00 | 94.56 | 95.45 | 95.88 | 95.67 | 95.44 | 94.56 | 94.78 |
| K-Means + ACO | 94.00 | 93.56 | 93.50 | 94.10 | 93.78 | 93.03 | 93.56 | 93.44 | 93.89 | 94.01 | 94.52 |
| K-Means + Firefly | 92.56 | 92.67 | 93.00 | 93.44 | 93.02 | 93.56 | 94.78 | 93.67 | 94.88 | 94.34 | 94.54 |
| K-Means + GWO | 90.09 | 92.47 | 94.65 | 93.22 | 92.00 | 92.60 | 93.00 | 92.50 | 94.50 | 94.12 | 94.13 |
Table 4

| Algorithm | AR1 | AR3 | AR4 | AR5 | AR6 | KC1 | KC2 | JM1 | CM1 | PC1 | PC5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k-Means | 0.66 | 0.79 | 0.82 | 0.80 | 0.75 | 0.81 | 0.80 | 0.81 | 0.82 | 0.82 | 0.80 |
| K-Means + GA | 0.84 | 0.83 | 0.83 | 0.80 | 0.83 | 0.84 | 0.84 | 0.85 | 0.82 | 0.81 | 0.85 |
| K-Means + BAT | 0.83 | 0.81 | 0.83 | 0.86 | 0.86 | 0.86 | 0.85 | 0.85 | 0.85 | 0.87 | 0.85 |
| K-Means + PSO | 0.85 | 0.85 | 0.87 | 0.87 | 0.86 | 0.85 | 0.87 | 0.85 | 0.87 | 0.87 | 0.87 |
| K-Means + Coral Reefs | 0.86 | 0.86 | 0.86 | 0.85 | 0.86 | 0.86 | 0.87 | 0.88 | 0.86 | 0.87 | 0.88 |
| K-Means + Cuckoo | 0.89 | 0.85 | 0.88 | 0.89 | 0.86 | 0.89 | 0.86 | 0.89 | 0.89 | 0.87 | 0.88 |
| K-Means + ACO | 0.84 | 0.83 | 0.86 | 0.85 | 0.84 | 0.86 | 0.85 | 0.85 | 0.86 | 0.85 | 0.86 |
| K-Means + Firefly | 0.86 | 0.85 | 0.83 | 0.87 | 0.87 | 0.85 | 0.83 | 0.85 | 0.86 | 0.88 | 0.85 |
| K-Means + GWO | 0.82 | 0.82 | 0.81 | 0.86 | 0.84 | 0.83 | 0.79 | 0.85 | 0.84 | 0.84 | 0.85 |
Towards Optimized K means Clustering using Nature-Inspired Algorithms for Software Bug Prediction © 2023 Global Journals
## Acknowledgments
This study received no formal support from public, private, or not-for-profit funding organizations.
Furthermore, K-means and the Coral Reefs algorithm are combined. The results for this combination are quite promising, showing that using the CRO method for a clustering application can produce better results than hybrid genetic algorithms, the most commonly used clustering optimization technique. To the best of our knowledge, CRO had not previously been used with clustering for software bug detection. The hybrid of k-means with the Cuckoo Search algorithm likewise shows significant accuracy: Cuckoo Search provides a robust initialization, whereas K-means constructs solutions faster. K-means is also combined with the Ant Colony Optimization algorithm, whose learning mechanism is based on a defined pheromone parameter that eliminates undesirable K-means solutions. The suggested method improves K-means by making it less reliant on starting parameters such as randomly picked initial cluster centers, resulting in a more stable algorithm. K-means with Firefly also produces accuracy close to CRO and Cuckoo Search, because fireflies with high similarity are dispersed, giving the whole swarm a more diverse distribution in the search space. K-means with GWO has also shown rapid convergence: K-means strongly shapes the GWO population by separating it into two clusters, and because GWO often operates with three leading wolves in the search space, K-means is advantageous for GWO. It can thus be concluded that K-means combined with GWO increases GWO's effectiveness to some extent.
High clustering accuracy and efficiency were obtained from the hybrids with the Coral Reefs and Cuckoo Search algorithms. Hybrid clustering with the Coral Reefs algorithm has never before been applied to software bug detection, and it has indeed shown promising results: it locates cluster centroids without causing premature convergence. The evaluation results add evidence that NIAs can speed up the process and avoid local optima. Because fewer iterations are then required to achieve the best cluster outcome, selecting the number of clusters in advance improves the hybridized clustering method's convergence speed. The computational time for each algorithm is shown in Table 3; less computational time was noted when K-means was integrated with the Coral Reefs and Cuckoo Search algorithms, respectively. For statistical performance, the F1 score was calculated for all the algorithms, as shown in Table 4. Again, the F1 score shows that K-means with Coral Reefs delivered dependable and significant performance that can be used to predict software defects. When a good validity measure is applied, most metaheuristic algorithms can automatically divide datasets into an appropriate number of clusters, according to Agbaje et al. (2019).
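For reference, the F1 score reported in Table 4 is the harmonic mean of precision and recall on the defect-prone class. A minimal, dependency-free computation (the convention that label `1` marks a defective module is our assumption, not specified above):

```python
def f1_score(y_true, y_pred):
    # Treat label 1 as "defective" (the positive class).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean balances the two; defined as 0.0 when both are zero.
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

For example, 2 true positives with 1 false positive and 1 false negative give precision = recall = 2/3, so F1 = 2/3 as well.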
## References

* MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.
* Tang, R., Fong, S., Yang, X.-S., & Deb, S. (2012, August). Integrating nature-inspired optimization algorithms to K-means clustering. University of Macau.
* Annisa, R., Rosiyadi, D., & Riana, D. (2020, November). Improved point center algorithm for K-means clustering to increase software defect prediction. International Journal of Advances in Intelligent Informatics, 6(3).
* Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34.
* Myrtveit, I., Stensrud, E., & Shepperd, M. (2005). Reliability and validity in comparative studies of software prediction models. IEEE Transactions on Software Engineering, 31(5).
* Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice-Hall, Inc.
* Gayathri, R., Cauveri, A., Kanagapriya, R., Nivetha, V., Tamizhselvi, P., & Kumar, K. P. (2015, March). A novel approach for clustering based on Bayesian network. In Proceedings of the 2015 International Conference on Advanced Research in Computer Science Engineering & Technology, 60. ACM.
* Zhong, S., Khoshgoftaar, T. M., & Seliya, N. (2004). Unsupervised learning for expert-based software quality estimation. In Proceedings of the Eighth IEEE International Conference on High Assurance Systems Engineering (HASE). doi:10.1109/HASE.2004.1281739
* Bishnu, P. S., & Bhattacherjee, V. (2012). Software fault prediction using quad tree-based K-means clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 24(6). doi:10.1109/TKDE.2011.163
* Catal, C., Sevim, U., & Diri, B. (2009). Software fault prediction of unlabeled program modules.
* Yan, X., Zhu, Y., Zou, W., & Wang, L. (2012). A new approach for data clustering using hybrid artificial bee colony algorithm. Neurocomputing, 97.
* Kao, Y.-T., Zahara, E., & Kao, I.-W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3).
* Almayyan, W. (2021, January). Towards predicting software defects with clustering techniques. International Journal of Artificial Intelligence and Applications (IJAIA), 12(1).
* Shepperd, M., Song, Q., Sun, Z., & Mair, C. (2013). Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39.
* Tóth, Z., Gyimesi, P., & Ferenc, R. (2016). A public bug database of GitHub projects and its application in bug prediction. In Computational Science and Its Applications -- ICCSA 2016. Cham: Springer International Publishing.
* Fong, S., Deb, S., Yang, X.-S., & Zhuang, Y. (2014). Towards enhancement of performance of K-means clustering using nature-inspired optimization algorithms. Computational Intelligence and Metaheuristic Algorithms with Applications. doi:10.1155/2014/564829
* Kaur, D., & Kaur, A. (2010). Fault prediction using K-Canberra means clustering. CNC 2010, in press.
* Ikotun, A. M., Almutari, M. S., & Ezugwu, A. E. (2021). K-means-based nature-inspired metaheuristic algorithms for automatic data clustering problems: Recent advances and future directions. Appl. Sci., 11, 11246. doi:10.3390/app112311246
* Okwu, M. O., & Tartibu, L. K. (2020). Metaheuristic Optimization: Nature-Inspired Algorithms, Swarm and Computational Intelligence, Theory and Applications, 927. Springer Nature: Berlin/Heidelberg, Germany.
* Hruschka, E., Campello, R. J. G. B., Freitas, A. A., & de Carvalho, A. (2009). A survey of evolutionary algorithms for clustering. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., 39.
* Nanda, S. J., & Panda, G. (2014). A survey on nature inspired metaheuristic algorithms for partitional clustering. Swarm Evol. Comput., 16.
* Zhou, X., Gu, J., Shen, S., Ma, H., Miao, F., Zhang, H., & Gong, H. (2017). An automatic K-means clustering algorithm of GPS data combining a novel niche genetic algorithm with noise and density. ISPRS Int. J. Geo-Inf., 6, 392.
* Mousa, A. A., El-Shorbagy, M. A., & Abd El-Wahed, W. F. (2012). Local search based hybrid particle swarm optimization for multiobjective optimization. Swarm and Evolutionary Computation, 3.
* Bouhmala, N., Viken, A., & Lønnum, J. B. (2015, April). Enhanced genetic algorithm with K-means for the clustering problem. International Journal of Modeling and Optimization, 5(2).
* Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley.
* Yang, X.-S. (2010). A new metaheuristic bat-inspired algorithm. In Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), Studies in Computational Intelligence, 284. Springer.
* Huang, J., & Ma, Y. (2020). Bat algorithm based on an integration strategy and Gaussian distribution. doi:10.1155/2020/9495281
* Zheng, H., Zheng, Z., & Xiang, Y. (2003). The application of ant colony system to image texture classification. In Proceedings of the 2nd International Conference on Machine Learning and Cybernetics, Xi'an, China, 3.
* Tang, R., Fong, S., Yang, X., & Deb, S. (2012). Integrating nature-inspired optimization algorithms to K-means clustering. In Seventh International Conference on Digital Information Management (ICDIM 2012). doi:10.1109/ICDIM.2012.6360145
* Apostolopoulos, T., & Vlachos, A. (2011). Application of the firefly algorithm for solving the economic emissions load dispatch problem. International Journal of Combinatorics, 2011. doi:10.1155/2011/523806
* Yang, X.-S., & Deb, S. (2009, December). Cuckoo search via Lévy flights. In Proceedings of the World Congress on Nature & Biologically Inspired Computing (NaBIC '09), Coimbatore, India.
* Medeiros, I. G., Xavier-Junior, J. C., & Canuto, A. M. P. (2015, July). Applying the coral reefs optimization algorithm to clustering problems. Conference paper. doi:10.1109/IJCNN.2015.7280845
* Salcedo-Sanz, S., Del Ser, J., Gil-López, S., Landa-Torres, I., & Portilla-Figueras, J. A. (2014). The coral reefs optimization algorithm: A novel metaheuristic for efficiently solving optimization problems. The Scientific World Journal, 2014. Hindawi Publishing Corporation.
* Koohi, I., & Groza, V. Z. (2014). Optimizing particle swarm optimization algorithm. In 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE). doi:10.1109/CCECE.2014.6901057
* Jensi, R., & Wiselin Jiji, G. (2015, April). Hybrid data clustering approach using K-means and flower pollination algorithm. Advanced Computational Intelligence: An International Journal (ACII), 2(2).
* Agbaje, M. B., Ezugwu, A. E., & Els, R. (2019). Automatic data clustering using hybrid firefly particle swarm optimization algorithm. IEEE Access, 7.
* Yang, X.-S. (2013). Swarm Intelligence and Bio-Inspired Computation: Theory and Applications. Elsevier Science Publishers B.V., Amsterdam, The Netherlands.
* Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Adv. Eng. Softw., 69. doi:10.1016/j.advengsoft.2013.12.007
* Hou, Y., Gao, H., Wang, Z., & Du, C. (2022). Improved grey wolf optimization algorithm and application. Sensors, 22(10), 3810. doi:10.3390/s22103810
* Wang, J.-S., & Li, S.-X. (2019). An improved grey wolf optimizer based on differential evolution and elimination mechanism. Sci. Rep., 9, 7181. doi:10.1038/s41598-019-43546-3
* Rana, S., Jasola, S., & Kumar, R. (2010). A hybrid sequential approach for data clustering using K-means and particle swarm optimization algorithm. International Journal of Engineering, Science and Technology, 2(6).