# Introduction

Nowadays, statistical analysis is performed on data to examine it a little further using mathematical terms. But merely describing the data is not sufficient when it comes to statistical analysis. This is where predictive analysis comes in, which is a part of inferential statistics.

Here we try to infer an outcome by analyzing patterns in previous data in order to predict for the next dataset. When it comes to prediction, the first buzzword that comes up is machine learning. Machine learning combines statistical analysis and computer science for the purpose of prediction. It also introduces a self-learning process from particular data, and this learning reduces the gap between computing and statistics. Prediction over a large amount of data is possible with human interaction, as a human brain can analyze a situation from various aspects. Here the algorithms divide into two groups: supervised (used for labeled data) and unsupervised (used for data with no labels to learn from). As the name itself says, the machine will learn, but the question is how: the answer is by using data. In general, we learn by making mistakes, so in machine learning these mistakes are the data given to the machine to learn from. But learning alone is not sufficient for a model; we also need to test whether what the machine has learned is accurate. This requires accuracy testing, which we measure by creating a confusion matrix.

Before building any model in machine learning, we first need to collect the data, after which some preprocessing is required. Feature extraction is essential to know which features are vital for building our model. After obtaining the features, we can build the model using different algorithms, depending on the problem statement. Once the model is built, we need to check its accuracy. Here we walk through the whole process carried out in model building. The algorithms used include SVM, K-means, Decision Tree, Random Forest, and Linear and Logistic Regression; from statistics we use standard deviation, variance analysis, mean usability, displacement calculation and so on. All the concepts are executed in Python, and the code is implemented in Jupyter Notebook.


# a) The Need for Classification and Regression

Both classification and regression are frequently used data mining techniques. Regression comes into view when we need to predict a dependent variable (one that relies upon other attributes) that is related to other data. For example, in our Titanic data, whether a passenger survived depends in part on the class in which the passenger was traveling as well as the cabin they occupied. Since predicting which person survived depends on all these attributes, we use a regression technique to predict it.

As the name itself suggests, classification is all about categorizing data based on conditions.

The Support Vector Machine algorithm can give high accuracy when the data set is small and has few missing values.

Pandas: A widely used, open-source library for data analysis that is easy to understand and easy to use for data manipulation.

NumPy: Used for scientific computing with Python.

Matplotlib: A plotting library for Python and its numerical mathematics extension NumPy, primarily used for plotting graphs.


# II. Method

Linear and logistic regression [3] are both used for prediction, but it is important to know how they differ. The following attributes distinguish the two regression algorithms.

Outcome after regression: In linear regression the result is continuous, whereas logistic regression yields a limited number of possible values.

Dependent variable: Logistic regression is used for instances such as true/false, yes/no or 0/1, which are categorical in nature, whereas linear regression is used for a continuous variable such as a number, weight or height. [4]

Fig. 1: Linear and logistic regression


# Equation:

Linear regression gives a linear equation of the form $Y = aX + b$, i.e., an equation of degree 1, whereas logistic regression gives a curved (S-shaped) association of the form $Y = \frac{e^{X}}{1 + e^{X}} = \frac{1}{1 + e^{-X}}$.
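As a rough illustration of the two forms above, the following sketch evaluates a straight line and the logistic curve at a few points. The coefficients `a = 2` and `b = 1` and the function names are illustrative assumptions, not values fitted to the Titanic data.

```python
import math

def linear(x, a=2.0, b=1.0):
    """Degree-1 equation Y = aX + b (a and b are illustrative values)."""
    return a * x + b

def logistic(x):
    """Logistic curve Y = 1 / (1 + e^-X), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# The line grows without bound, while the logistic curve flattens near 0 and 1.
for x in (-4, 0, 4):
    print(x, linear(x), round(logistic(x), 3))
```

Because the logistic output stays between 0 and 1, it can be read as a probability and thresholded (commonly at 0.5) to obtain a yes/no prediction.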


# Minimization of error:

Linear regression (LR) uses the ordinary least squares method, which minimizes the sum of squared errors, whereas logistic regression [5] is usually fitted by maximum likelihood estimation, which minimizes the log loss rather than a squared error.
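The two error measures can be compared numerically. The sketch below computes the mean squared error, the quantity ordinary least squares minimizes, and the log loss that logistic regression is commonly fitted with. The function names and the small input lists are hypothetical, chosen only for illustration.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels given predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical values: continuous targets for MSE, 0/1 labels for log loss.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(log_loss([1, 0], [0.9, 0.1]))
```

Confident, correct probabilities give a log loss near zero, while confident, wrong ones are penalized heavily.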


# III. Support Vectors

These are the vectors (with magnitude and direction) nearest to the hyper-plane, which support the classification. [2]

Hyper-plane: A plane is usually thought of in low dimensions; its generalization to more dimensions is called a hyper-plane. Because support vectors can be drawn in more than two dimensions, the data is split by a hyper-plane [2].
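To make the idea concrete, the sketch below takes a fixed, hypothetical hyper-plane $w \cdot x + b = 0$ in two dimensions, reports which side each point falls on, and finds the point nearest to it. In a trained SVM the values of `w` and `b` are learned from data and the nearest points act as support vectors; here all values are assumptions chosen for illustration.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from point to the hyper-plane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

# Hypothetical hyper-plane x1 + x2 - 4 = 0 and a few 2-D points.
w, b = (1.0, 1.0), -4.0
points = [(1.0, 1.0), (0.0, 3.0), (3.0, 3.0), (5.0, 2.0)]

distances = [signed_distance(p, w, b) for p in points]
# The sign of the distance tells which side of the hyper-plane a point is on.
sides = [1 if d > 0 else -1 for d in distances]
# The point closest to the hyper-plane plays the role of a support vector.
nearest = min(points, key=lambda p: abs(signed_distance(p, w, b)))

print(sides, nearest)
```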


# IV. Code and Explanation

Step 1 - Irrespective of the regression or classification algorithm, we first need to import libraries such as Pandas, NumPy, Matplotlib and Seaborn, and from Scikit-learn the linear regression, logistic regression and SVM modules.

Step 2 - Load the data from CSV files. The data was taken from the Kaggle Titanic competition [10], from which the train and test data sets were obtained for regression.

Step 3 - Select the required columns in X (mostly the independent variables) and take the dependent column in Y. Here passenger survival is the dependent variable, so it is placed in Y.

Step 4 - Clean the data and fill null values to prepare the data.

Step 5 - To find out which columns have more influence on the output column (i.e., whose values are related to it), plot graphs using a regression plot. [9]

Step 6 - Split the data set into train and test sets using Scikit-learn (a free machine learning library for Python).

Step 7 - Fill all the null values using mean or dummy values.

Step 8 - Finally, call the chosen regression or classification function, whether linear, logistic, SVM, KNN or Decision Tree. [7]

Step 9 - Calculate the accuracy of all the algorithms and print it.

Step 10 - By importing the confusion matrix, calculate precision and recall and plot the graph.
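The steps above can be sketched end to end as follows. This is a minimal sketch, not the paper's notebook: it replaces the Kaggle CSV with a small synthetic table (the column names mimic the Titanic data, but the values and the survival rule are invented), uses only logistic regression, and assumes pandas, NumPy and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2 (stand-in): synthetic data instead of pd.read_csv on the Kaggle file.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.normal(30, 12, size=n).clip(1, 80),
})
# Invented rule so that the label depends on the features.
score = (df["Sex"] == "female") * 2.0 - (df["Pclass"] - 1) * 0.8 + rng.normal(0, 1, n)
df["Survived"] = (score > 0).astype(int)
df.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan  # simulate missing ages

# Steps 4 and 7: fill nulls with the mean and encode the categorical column.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Step 3: independent columns in X, dependent column in y.
X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]

# Step 6: split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 8 and 9: fit one model and report accuracy on the held-out test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```

Measuring accuracy on the held-out test split, rather than on the training rows, gives a fairer estimate of how the model behaves on unseen passengers.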


# Result Analysis

Using the above code, we have already calculated the accuracy of each algorithm. Now, using the confusion matrix (Fig. 7), we count how many predictions are correct and derive precision and recall from them, where TP = true positives, FP = false positives and FN = false negatives.
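Precision and recall follow directly from the confusion-matrix counts: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes them by hand for a small, made-up set of labels; the helper name `confusion_counts` and the label lists are hypothetical and unrelated to the paper's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # share of predicted survivors who actually survived
recall = tp / (tp + fn)     # share of actual survivors the model found
print(tp, fp, fn, tn, precision, recall)
```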

As per our result, we got a precision of 0.812101910828 and a recall of 0.745614035088. In other words, about 81% of the passengers predicted as survivors actually survived, and about 75% of the actual survivors were identified. From the results, both Random Forest and Decision Tree give high accuracy.


# Conclusion

Here we have studied the basics of machine learning and the linear regression, logistic regression, SVM, KNN, Decision Tree and Random Forest algorithms. We executed the code in Python and evaluated the output using a confusion matrix and a precision-recall curve. In the end, we found that the Random Forest and Decision Tree models give the highest accuracy, 92.82%, using modules from scikit-learn. The objective was to understand all these algorithms and to execute the code and compute their accuracy. We also built a confusion matrix for result analysis and obtained the precision and recall values.
![Fig. 2: Linear regression with nearest data [3]](image-2.png "Fig. 2")
![Fig. 3: SVM](image-3.png "Fig. 3")

In the above example we saw the set of blue and red dots separated, but in the next picture the splitting is done via a hyper-plane to segregate the data set into two different clusters.

Way to find the right hyper-plane: The distance between the nearest data point and the hyper-plane is known as the margin. The larger the margin, the better the chance of correct segregation. [5]

Decision Trees: A decision tree is a sequence of decisions arranged in a tree-like structure. It involves Yes or No type answers. In our data set, a passenger either survives or dies.

Random Forest: A random forest is a combination of decision trees.
![Fig. 4: Gender wise survival representation](image-4.png "Fig. 4")

```python
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

# train_df, test_df, data and titles are assumed to be defined in earlier notebook cells.

# Embarked seems to be correlated with survival
sns.barplot(x='Pclass', y='Survived', data=train_df)

for dataset in data:
    # extract titles
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # fill NaN with 0 to be safe
    dataset['Title'] = dataset['Title'].fillna(0)

# Let's take a last look at the training set, before we start training the models.
train_df.head(5)
```

b) Building Machine Learning Models

```python
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)
# accuracy measured on the training data itself
random_forest.score(X_train, Y_train)
```
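![Fig. 5: Precision and Recall graph](image-5.png "Fig. 5")

```python
import matplotlib.pyplot as plt

# precision and recall are assumed to be computed in an earlier notebook cell.
def plot_precision_vs_recall(precision, recall):
    # assumed: the plotting call appears to be missing from the extracted listing
    plt.plot(precision, recall)
    plt.ylabel("recall", fontsize=19)

plot_precision_vs_recall(precision, recall)
plt.show()
```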
![Fig. 7: Confusion Matrix Example](image-6.png "Fig. 7")

From the predicted values, $\text{Precision} = TP / (TP + FP)$ and $\text{Recall} = TP / (TP + FN)$, with TP, FP and FN as defined in the Result Analysis section.
![Fig. 8: Algorithm and Percentage of accuracy](image-7.png "Fig. 8")
© 2018 Global Journals

Accuracy Analysis of Continuance by using Classification and Regression Algorithms in Python
		
		
* Dina Ahmed Mohamed Ghandour and May Alawi Mohamed Abdalla. The Tragedy of Titanic: A Logistic Regression Analysis.
* Kavitha S. A Comparative Analysis on Linear Regression and Support Vector Regression. Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam.
* Hyeoun-Ae Park. An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain. College of Nursing and System Biomedical Informatics National Core Research Center, Seoul National University, Seoul, Korea.
* S. C. Bagley, H. White, B. A. Golomb. Logistic regression in the medical literature: Standards for use and reporting, with particular attention to one medical domain. Journal of Clinical Epidemiology, 54(10), 2001.
* V. Bewick, L. Cheek, J. Ball. Statistics review 13: Receiver operating characteristic curves. Critical Care, 8(6), 508-512, 2004. doi:10.1186/cc3000
* J. T. Austin, R. A. Yaffee, D. E. Hinkle. Logistic regression for research in higher education. Handbook of Theory and Research, 8, 379-410, 1992.
* S. C. Bagley, H. White, B. A. Golomb. Logistic regression in the medical literature: Standards for use and reporting, with particular attention to one medical domain. Journal of Clinical Epidemiology, 54(10), 2001.
* Tryambak Chatterjee. Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine Learning Algorithms. Department of Management Studies, NIT Trichy, Tiruchirappalli, Tamilnadu, India.
* GE. Flight Quest Challenge. Kaggle.com. Accessed: 2-Jun-2017.
* Titanic: Machine Learning from Disaster. Accessed: 2-Jun-2017.
* Kaggle. Data Science Community. Accessed: 2-Jun-2017.