Cross-validation and train/test split (model_selection): randomly partition the data into training and test sets. By default, 25 percent of the data is assigned to the test set.
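The default split can be checked with a minimal sketch, assuming scikit-learn's train_test_split from sklearn.model_selection. The 100-row array is made up for illustration and is not data from these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, binary labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# With no test_size given, 25 percent of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # (75, 2) (25, 2)
```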
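how far it was from the decision boundary.

The most common metrics are accuracy, precision, recall, F1 measure, true negatives, false positives and false negatives.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc

1. Confusion matrix: a table of the true positives, true negatives, false positives and false negatives.

    cm = confusion_matrix(y_test, y_pred)

2. Accuracy: measures the fraction of the classifier's predictions that are correct. LogisticRegression.score() also returns the accuracy.

    accuracy_score(y_true, y_pred)

3. Precision: the fraction of positive predictions that are correct. In a cancer example, this is the share of cases predicted to be cancer that really are cancer.

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')

4. Recall: the fraction of truly positive instances that the classifier finds. In a cancer example, this is the share of real cancer cases that are detected.

    recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')

5. Precision/recall trade-off and F1 score: the F1 score is the harmonic mean of precision and recall, so it summarizes both in one number.

    f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')

6. ROC and AUC: the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). AUC is the area under the ROC curve.

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    predictions = classifier.predict_proba(X_test)
    false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)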
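To make the arithmetic behind these metrics concrete, here is a small worked sketch. The label vectors are hypothetical, chosen only so that the counts are easy to verify by hand. The hand-computed values can be compared with scikit-learn's precision_score, recall_score and f1_score.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 4 real positives, 6 real negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # 3 / (3 + 1) = 0.75
recall = tp / (tp + fn)                             # 3 / (3 + 1) = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10 = 0.8

print(precision, recall, f1, accuracy)
```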
Principal components analysis (PCA)
array([[1.369e+01, 3.260e+00, 2.540e+00, 2.000e+01, 1.070e+02, 1.830e+00, 5.600e-01, 5.000e-01, 8.000e-01, 5.880e+00, 9.600e-01, 1.820e+00, 6.800e+02], [1.269e+01, 1.530e+00, 2.260e+00, 2.070e+01, 8.000e+01, 1.380e+00, 1.460e+00, 5.800e-01, 1.620e+00, 3.050e+00, 9.600e-01, 2.060e+00, 4.950e+02], [1.162e+01, 1.990e+00, 2.280e+00, 1.800e+01, 9.800e+01, 3.020e+00, 2.260e+00, 1.700e-01, 1.350e+00, 3.250e+00, 1.160e+00, 2.960e+00, 3.450e+02]])
array([[ 0.87668336, 0.79842885, 0.64412971, 0.12974277, 0.48853231, -0.70326216, -1.42846826, 1.0724566 , -1.36820277, 0.35193216, 0.0290166 , -1.06412236, -0.2059076 ], [-0.36659076, -0.7581304 , -0.39779858, 0.33380024, -1.41302392, -1.44153145, -0.5029981 , 1.70109989, 0.02366802, -0.84114577, 0.0290166 , -0.73083231, -0.81704676], [-1.69689407, -0.34424759, -0.32337513, -0.45327855, -0.14531976, 1.24904997, 0.31964204, -1.52069698, -0.4346309 , -0.75682931, 0.90197362, 0.51900537, -1.31256499]])
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=0, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Logistic regression is a supervised learning model used to predict the probability of a target variable. The dependent variable has two classes, binary coded as either 1 or 0, where 1 stands for Yes and 0 stands for No.
It is one of the simplest algorithms in machine learning. It predicts P(Y=1) as a function of X, and it can be used for various classification problems such as diabetes detection, cancer detection, and spam detection.
Logistic regression with a binary target variable is termed binary logistic regression. More generally, the target variable can take two or more categories, each of which can be predicted. Logistic regression can be further classified into the following categories:
Binary logistic regression is one of the simpler logistic regression models, in which the dependent variable takes one of two values, either 1 or 0. It models the relationship between multiple predictor (independent) variables and a binary dependent variable in order to find the best-fitting model. It estimates the probability of an event by fitting the data to the logit function. A linear function of the predictors is fed as input to another function, the logistic (sigmoid) function, which is mathematically given as:

    P(Y=1) = g(z) = 1 / (1 + e^(-z)),  where z = b0 + b1*x1 + ... + bn*xn

Here x1 ... xn are the predictors and b0 ... bn are the fitted coefficients (symbols chosen for this note).
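As a rough sketch of how the linear function feeds the logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)), the snippet below uses made-up coefficients and a single made-up observation. The names intercept, weights and x are illustrative assumptions, not values from a fitted model.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients and one observation, for illustration only.
intercept = -1.0
weights = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = intercept + weights @ x   # linear function of the predictors
p = sigmoid(z)                # estimated P(Y=1) for this observation
label = int(p >= 0.5)         # threshold at 0.5 to get the class, 1 or 0

print(round(float(z), 3), round(float(p), 3), label)
```

With these numbers z is about 0.1, so the probability is slightly above 0.5 and the predicted class is 1.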
We will see how logistic regression manages to separate categories and predict the outcome. For this, we will use a dataset containing information about the users of a social network, such as User ID, Age, Gender, and Estimated Salary. The social network has many clients who can place ads on it. One of these clients, a car company, has launched an SUV at a ridiculously low price.
We are trying to see which users of the social network are going to buy the SUV on the basis of their age and estimated salary. So, our matrix of features will be Age and Estimated Salary. We want to find how these two variables relate to whether a user purchases the SUV or not.
We will now split the dataset into a training set and a test set. As we have 400 observations, a good split is 300 observations in the training set and the remaining 100 observations in the test set. We will then apply feature scaling, as we want accurate predictions of which users are actually going to buy the SUV.
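The split and scaling step might look like the sketch below. Since the social network dataset itself is not included here, the code builds a synthetic 400-row stand-in with age and salary columns; the purchase rule used to generate y is invented purely so that both classes are present.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 400-row dataset: columns are age and salary.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15_000, 150_001, size=400)
X = np.column_stack([age, salary]).astype(float)
# Made-up purchase rule, only so that y contains both classes.
y = ((age > 40) | (salary > 90_000)).astype(int)

# 300 observations for training, 100 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training set only, then reuse it on the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step.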
Now that our data is pre-processed, we are ready to build our logistic regression model and fit it to the training set. For this, we first import from the linear_model library, because logistic regression is a linear classifier. Since we are working in 2D, our two categories of users will be separated by a straight line.
A new variable, classifier, is created as a LogisticRegression object by calling the LogisticRegression class. We only set the random_state parameter, so that the results are reproducible. We then fit the classifier object to the training set using the fit() method, so that the classifier can learn the correlation between X_train and Y_train.
After learning the correlations, the classifier is able to make predictions for new observations. To test its predictive power, we use the test set. A new variable y_pred is introduced to hold the vector of predictions. We call the predict() method of the LogisticRegression class and pass it X_test as the argument.
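A minimal sketch of the fit and predict steps follows. It uses make_classification to generate a synthetic two-feature dataset in place of the real age and salary data, so the numbers it produces are not those of the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two features (age, estimated salary).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Only random_state is set, so that repeated runs give the same results.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# Vector of predictions, one per test observation.
y_pred = classifier.predict(X_test)
print(y_pred[:10])
```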
Now we will evaluate whether our logistic regression model learned the correlations in the training set correctly, by seeing how it predicts on new data, the test set. We will make a confusion matrix, which contains the correct predictions as well as the incorrect predictions made by our model.
For that, we import a function from the sklearn.metrics library. A new variable cm is then created, to which we pass two parameters: Y_test, the vector of real values telling whether each user really bought the car (yes/no), and Y_pred, the vector of predictions.
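The confusion matrix step can be sketched as follows, again on synthetic data from make_classification rather than the real dataset, so the counts will differ from the tutorial's. The labels argument is passed explicitly so the matrix is always 2 by 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, then the same fit/predict steps as before.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
classifier = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are the real classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

correct = cm[0, 0] + cm[1, 1]    # diagonal: correct predictions
incorrect = cm[0, 1] + cm[1, 0]  # off-diagonal: incorrect predictions

print(cm)
print(correct, incorrect)
```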
Next, we will visualize our result in a graph that clearly shows the decision boundary of the classifier and the decision regions. The graph lets us see the region where the logistic regression model predicts Yes, meaning the user is going to purchase the SUV, and the region where it predicts No, meaning the user will not purchase it.
To visualize the training set results, we first import the ListedColormap class to color all the data points. Then we create the local variables X_set and y_set to stand in for X_train and Y_train. The np.meshgrid command helps us create a grid of all the pixel points. For the age axis, the grid starts at the minimum age value minus 1, as we do not want our points to get squeezed against the edge, and ends at the maximum value plus 1; this gives the range of pixels we want to include in the frame. We do the same for the salary. We take a resolution of 0.01. We then use contour() to draw the contour between the two prediction regions. After that, we use the predict() method of the logistic regression classifier to predict whether each pixel point belongs to class 0 or class 1. If a pixel point belongs to 0, it is colored red; if it belongs to 1, it is colored green.
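The grid-and-predict idea can be sketched as below, under a few assumptions: the data is synthetic, the filled variant contourf() is used for the colored regions, and the plotting is wrapped in a hypothetical helper, plot_regions(), so that the grid computation runs even without a display.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set (age, estimated salary).
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_set = StandardScaler().fit_transform(X)
y_set = y
classifier = LogisticRegression(random_state=0).fit(X_set, y_set)

# Grid of "pixel" points covering the data, padded by 1 on each side,
# with a resolution of 0.01.
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))

# Predict class 0 or 1 for every pixel, then reshape back to the grid.
grid = np.column_stack([X1.ravel(), X2.ravel()])
Z = classifier.predict(grid).reshape(X1.shape)


def plot_regions():
    """Draw the two prediction regions and the observations (needs matplotlib)."""
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap

    cmap = ListedColormap(["red", "green"])
    plt.contourf(X1, X2, Z, alpha=0.5, cmap=cmap)  # red = 0, green = 1
    plt.scatter(X_set[:, 0], X_set[:, 1], c=y_set, cmap=cmap, edgecolors="k")
    plt.xlabel("Age (scaled)")
    plt.ylabel("Estimated Salary (scaled)")
    plt.show()
```

Calling plot_regions() in an interactive session draws the figure.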
From the graph, we can see some red points and some green points. All these points are observations from the training set, i.e. the users of the social network who were selected for the training set. Each of these users is characterized by age on the X-axis and estimated salary on the Y-axis.
The red points are the training set observations for which the dependent variable, purchased, is zero, meaning the users who did not buy the SUV. The green points are those for which purchased is equal to one, meaning the users who actually bought the SUV.
So, the goal here is to classify the right users into the right category: we are trying to build a classifier that successfully separates the users into the correct categories, which are represented by the prediction regions. By prediction regions, we mean the red region and the green region. For each user in the red region, the classifier predicts that the user did not buy the SUV, and for each user in the green region, it predicts that the user bought the SUV. The two regions are separated by a straight line, which is called the prediction boundary.
Here the prediction boundary is a straight line, which means that our logistic regression classifier is a linear classifier. Because the classifier is linear, the prediction boundary in two dimensions will always be a straight line. Similarly, in three dimensions, the prediction boundary would be a flat plane separating two spaces. As this is the training set, the plot shows that our classifier successfully learned how to make predictions based on this information.
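Why the boundary is a straight line can be checked numerically. The sketch below assumes scikit-learn's coef_ and intercept_ attributes and made-up 2D data. It solves w1*x1 + w2*x2 + b = 0 for x2 and confirms that points on that line get a linear score of zero, that is, a predicted probability of 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2D data: class 1 roughly where x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression(random_state=0).fit(X, y)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# The boundary is where the linear score is zero:
#   w1*x1 + w2*x2 + b = 0   =>   x2 = -(w1*x1 + b) / w2
x1 = np.linspace(-2.0, 2.0, 5)
x2 = -(w1 * x1 + b) / w2
on_line = np.column_stack([x1, x2])

scores = clf.decision_function(on_line)   # approximately 0 on the boundary
probs = clf.predict_proba(on_line)[:, 1]  # approximately 0.5 on the boundary

print(scores)
print(probs)
```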
From the output image, it can be seen that the classifier predicts really well: the red points are in the red region, and only a few green points are in the red region, which is acceptable and not a big issue. These are the 11 incorrect predictions that we saw in the confusion matrix, and they can be counted here too by counting the red and green points that lie in the opposite regions. In the red region, red points indicate the people who did not buy the SUV, and in the green region, green points indicate the people who bought the SUV.
You do not have to build the classifier from scratch. Building classifiers is complex and requires knowledge of several areas, such as statistics, probability theory, and optimization techniques. Several pre-built libraries are available that offer fully tested and very efficient implementations of these classifiers. We will use one such pre-built model from sklearn.
Once the classifier is created, you feed your training data into it so that it can tune its internal parameters and be ready to make predictions on your future data. To tune the classifier, we run the following statement:
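In scikit-learn, this tuning step is typically a single fit() call. A minimal sketch, assuming hypothetical training arrays named X_train and y_train filled with made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 200 rows, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] - X_train[:, 1] > 0).astype(int)

classifier = LogisticRegression(random_state=0)
print(hasattr(classifier, "coef_"))  # False: nothing has been learned yet

# fit() tunes the internal parameters (coefficients and intercept).
classifier.fit(X_train, y_train)
print(classifier.coef_.shape, classifier.intercept_.shape)  # (1, 3) (1,)
```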
Using the sampled data, we create the ANN classification model. Note that the output layer has one neuron here because this is a binary classification problem. If there were multiple classes, we would need one neuron per class: for 5 classes, for example, the output layer would have 5 neurons, each giving the probability of its class, and whichever class has the highest probability becomes the final answer.
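The output-layer sizes can be illustrated with a small sketch. It swaps in scikit-learn's MLPClassifier, which is not necessarily the library used for the model described here, and trains on synthetic data. MLPClassifier sizes its output layer automatically, and the fitted attributes n_outputs_ and out_activation_ show the result.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Binary problem: the network ends in a single output neuron.
Xb, yb = make_classification(n_samples=300, n_features=8, n_informative=4,
                             random_state=0)
binary_net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=300,
                           random_state=0).fit(Xb, yb)

# 5-class problem: the network ends in 5 output neurons, one per class.
Xm, ym = make_classification(n_samples=500, n_features=8, n_informative=6,
                             n_classes=5, n_clusters_per_class=1,
                             random_state=0)
multi_net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=300,
                          random_state=0).fit(Xm, ym)

print(binary_net.n_outputs_, binary_net.out_activation_)  # 1 logistic
print(multi_net.n_outputs_, multi_net.out_activation_)    # 5 softmax

# One probability per class; the class with the highest one is the answer.
proba = multi_net.predict_proba(Xm[:1])
predicted = multi_net.classes_[proba.argmax()]
print(proba.round(3), predicted)
```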
How many neurons should you choose? How many hidden layers should you choose? This varies from dataset to dataset; you need to check the testing accuracy and decide which combination works best. This is why tuning an ANN is a difficult task: there are so many parameters and configurations that can be changed.
There is no rule of thumb that can tell you the number of layers or neurons at first look at the data. You need to try different parameters and choose the combination that produces the highest accuracy.
Just keep in mind that the bigger the network, the more computationally intensive it is, and hence the more time it takes to run. So always try to find the best accuracy with the minimum number of layers and neurons.
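One simple way to try several configurations is a loop over candidate architectures. The sketch below again uses scikit-learn's MLPClassifier on synthetic data as a stand-in. The candidate list is an arbitrary assumption, ordered from smallest to largest so that ties in accuracy go to the smaller network.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical candidates, ordered from the smallest network to the largest.
candidates = [(5,), (10,), (10, 5), (20, 10)]

results = {}
for layers in candidates:
    net = MLPClassifier(hidden_layer_sizes=layers, max_iter=500,
                        random_state=0)
    net.fit(X_train, y_train)
    results[layers] = net.score(X_test, y_test)  # testing accuracy

# max() returns the first best entry, so the smallest network wins ties.
best = max(results, key=results.get)
print(results)
print("chosen architecture:", best)
```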
Even when you use the same hyperparameters, the result will be slightly different for each run of the ANN. This happens because the first step in training an ANN is the random initialization of weights. So every time you run the code, different values get assigned to each neuron as weights and bias, and hence the final outcome also differs slightly.
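The effect of random initialization can be seen by training the same network with different seeds. This sketch assumes scikit-learn's MLPClassifier and synthetic data; first_layer_weights is a hypothetical helper written for this illustration. Fixing random_state makes a run repeatable, while changing it changes the learned weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)


def first_layer_weights(seed):
    """Train the same small network; only the random seed changes."""
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=200,
                        random_state=seed)
    net.fit(X, y)
    return net.coefs_[0]


# Same seed: same initial weights, so the trained weights match.
same = np.allclose(first_layer_weights(1), first_layer_weights(1))
# Different seed: different initial weights, so the results differ.
different = np.allclose(first_layer_weights(1), first_layer_weights(2))

print(same, different)  # True False
```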
Deep ANNs work great when you have a good amount of data available for learning. For small datasets with fewer than 50K records, I recommend using supervised ML models like Random Forest, AdaBoost, XGBoost, etc.