
build your first text classifier in python with logistic regression | kavita ganesan

Text classification is the automatic process of predicting one or more categories given a piece of text. For example, predicting if an email is legit or spammy. Thanks to Gmail's spam classifier, I don't see or hear from spammy emails!

Other than spam detection, text classifiers can be used to determine sentiment in social media texts, predict categories of news articles, parse and segment unstructured documents, flag the highly talked about fake news articles and more.

Text classifiers work by leveraging signals in the text to guess the most appropriate classification. For example, in a sentiment classification task, occurrences of certain words or phrases, like "slow", "problem", "wouldn't" and "not", can bias the classifier to predict negative sentiment.

The nice thing about text classification is that you have a range of options in terms of what approaches you could use, from unsupervised rule-based approaches to supervised approaches such as Naive Bayes, SVMs, CRFs and Deep Learning.

In this article, we are going to learn how to build and evaluate a text classifier using logistic regression on a news categorization problem. The problem, while not extremely hard, is not as straightforward as making a binary prediction (yes/no, spam/ham).

Here's the full source code with the accompanying dataset for this tutorial. Note that this is a fairly long tutorial and I would suggest that you break it down into several sessions so that you completely grasp the concepts.

The dataset that we will be using for this tutorial is from Kaggle. It contains news articles from Huffington Post (HuffPost) from 2014-2018, as seen below. This dataset has roughly 125,000 articles and 31 different categories.

Without the actual content of the article itself, the data that we have for learning is actually pretty sparse, a problem you may encounter in the real world. But let's see if we can still learn from it reasonably well. We will not use the author field because we want to test the classifier on articles from a different news organization, specifically from CNN.

In this tutorial, we will use the Logistic Regression algorithm to implement the classifier. In my experience, I have found Logistic Regression to be very effective on text data and the underlying algorithm is also fairly easy to understand. More importantly, in the NLP world, it's generally accepted that Logistic Regression is a great starter algorithm for text-related classification.

Features are attributes (signals) that help the model learn. This can be specific words from the text itself (e.g. all words, top occurring terms, adjectives) or additional information inferred based on the original text (e.g. parts-of-speech, contains specific phrase patterns, syntactic tree structure).

For this task, we have text fields that are fairly sparse to learn from. Therefore, we will try to use all words from several of the text fields. This includes the description, headline and tokens from the url. More advanced feature representations are something you should try as an exercise.

Not all words are equally important to a particular document / category. For example, while words like murder, knife and abduction are important to a crime related document, words like news and reporter may not be quite as important.

In this tutorial, we will be experimenting with 3 feature weighting approaches. The most basic form of feature weighting is binary weighting, where the weight is 1 if a word is present in a document and 0 if it is absent.

There are of course many other methods for feature weighting. The approaches that we will experiment with in this tutorial are the most common ones and are usually sufficient for most classification tasks.
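As a rough sketch of how these weighting schemes can be configured with scikit-learn (the exact settings in the tutorial's code may differ; binary, raw counts and tf-idf are the three schemes assumed here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A couple of made-up headlines just to make the snippet runnable.
train_texts = ["florida shooting survivors rally for gun control",
               "stock markets slide as tech shares tumble"]

# 1. Binary weighting: 1 if the word appears in the document, 0 otherwise.
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(train_texts)

# 2. Count weighting: how many times the word appears in the document.
count_vectorizer = CountVectorizer(binary=False)
X_counts = count_vectorizer.fit_transform(train_texts)

# 3. TF-IDF weighting: counts down-weighted by how common the word is across documents.
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(train_texts)
```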

One of the most important components in developing a supervised text classifier is the ability to evaluate it. We need to understand if the model has learned sufficiently based on the examples that it saw in order to make correct predictions.

For this particular task, even though the HuffPost dataset lists one category per article, in reality, an article can actually belong to more than one category. For example, the article in Figure 4 could belong to COLLEGE (the primary category) or EDUCATION.

If the classifier predicts EDUCATION as its first guess instead of COLLEGE, that doesn't mean it's wrong. As this is bound to happen with various other categories, instead of looking only at the first predicted category, we will look at the top 3 predicted categories to compute (a) accuracy and (b) mean reciprocal rank (MRR).

Accuracy evaluates the fraction of correct predictions. In our case, it is the number of times the PRIMARY category appeared in the top 3 predicted categories divided by the total number of categorization tasks.

MRR = (1/|Q|) * Σ_{i=1}^{|Q|} 1/rank_{i}, where Q here refers to all the classification tasks in our test set and rank_{i} is the position of the correctly predicted category. The higher the rank of the correctly predicted category, the higher the MRR.

Since we are using the top 3 predictions, MRR gives us a sense of where the PRIMARY category sits in the ranks. If the rank of the PRIMARY category is on average 2, then the MRR would be ~0.5, and at 3 it would be ~0.33. We want to get the PRIMARY category higher up in the ranks.
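Here is a minimal sketch of how these two metrics can be computed from the top 3 predictions; the function and variable names are mine, not the tutorial's:

```python
def top3_accuracy_and_mrr(true_labels, ranked_predictions):
    """true_labels: list of PRIMARY categories.
    ranked_predictions: list of lists, each holding the top 3 predicted
    categories in descending order of confidence."""
    hits, reciprocal_ranks = 0, []
    for true_label, top3 in zip(true_labels, ranked_predictions):
        if true_label in top3:
            hits += 1
            reciprocal_ranks.append(1.0 / (top3.index(true_label) + 1))
        else:
            reciprocal_ranks.append(0.0)  # outside the top 3 contributes nothing
    accuracy = hits / len(true_labels)
    mrr = sum(reciprocal_ranks) / len(true_labels)
    return accuracy, mrr

# Example: PRIMARY category at rank 1, at rank 2, and outside the top 3.
print(top3_accuracy_and_mrr(
    ["SPORTS", "POLITICS", "COMEDY"],
    [["SPORTS", "COMEDY", "POLITICS"],
     ["WORLD NEWS", "POLITICS", "SPORTS"],
     ["TECH", "SCIENCE", "BUSINESS"]]))
# -> accuracy ~0.67, MRR 0.5
```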

Next, we will be creating different variations of the text we will use to train the classifier. This is to see how adding more content to each field helps with the classification task. Notice that we create a field using only the description, description + headline, and description + headline + url (tokenized).
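Assuming the articles are loaded into a pandas DataFrame with short_description, headline and link columns (the file name below is a placeholder), the three field variations could be built roughly like this:

```python
import re
import pandas as pd

df = pd.read_json("News_Category_Dataset.json", lines=True)  # placeholder file name

def tokenize_url(url):
    # Crude tokenization: split the last path segment on non-alphanumeric characters.
    last_segment = url.strip("/").split("/")[-1]
    return " ".join(re.split(r"[^a-zA-Z0-9]+", last_segment))

df["text_desc"] = df["short_description"]
df["text_desc_headline"] = df["short_description"] + " " + df["headline"]
df["text_desc_headline_url"] = (df["short_description"] + " " + df["headline"]
                                + " " + df["link"].apply(tokenize_url))
```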

Earlier, we talked about feature representation and different feature weighting schemes. The `extract_features()` function above is where we extract the different types of features based on the weighting schemes.

First, note that `cv.fit_transform(...)` from the above code snippet creates a vocabulary based on the training set. Next, `cv.transform()` takes in any text (test or unseen texts) and transforms it according to the vocabulary of the training set, limiting the words by the specified count restrictions (`min_df`, `max_df`) and removing stop words if specified. It returns a term-document matrix where each column represents a word in the vocabulary and each row represents a document in the dataset. The values can either be binary or counts. The same concept also applies to `tfidf_vectorizer.fit_transform(...)` and `tfidf_vectorizer.transform()`.
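A sketch of what an `extract_features()` helper along these lines might look like (the parameter values are illustrative; the tutorial's actual code may use different restrictions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def extract_features(train_texts, test_texts, feature_rep="binary"):
    """Fit a vectorizer on the training texts only, then transform both train and test."""
    if feature_rep == "binary":
        cv = CountVectorizer(binary=True, max_df=0.95, stop_words="english")
    elif feature_rep == "counts":
        cv = CountVectorizer(binary=False, max_df=0.95, stop_words="english")
    else:  # "tfidf"
        cv = TfidfVectorizer(max_df=0.95, stop_words="english")
    X_train = cv.fit_transform(train_texts)  # builds the vocabulary from the training set
    X_test = cv.transform(test_texts)        # reuses that vocabulary for unseen texts
    return cv, X_train, X_test
```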

The code below shows how we start the training process. When you instantiate the LogisticRegression module, you can vary the `solver`, the `penalty`, the `C` value and also specify how it should handle the multi-class classification problem (one-vs-all or multinomial). By default, a one-vs-all approach is used and that's what we're using below:

In the one-vs-all approach that we are using here, a binary classification problem is fit for each of our 31 labels. Since we are selecting the top 3 categories predicted by the classifier (see below), we will leverage the estimated probabilities instead of the binary predictions. Behind the scenes, we are actually collecting the probability of each news category being positive.
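A hedged sketch of the training and top-3 selection steps (the solver and C values are illustrative choices; liblinear fits one binary problem per label, i.e. one-vs-rest, and X_train, X_test, y_train are assumed to come from `extract_features()` and the label column):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One-vs-rest logistic regression over the 31 categories.
model = LogisticRegression(solver="liblinear", C=1.0)
model.fit(X_train, y_train)

# Probability of every category for each test article ...
probabilities = model.predict_proba(X_test)
# ... then keep the 3 most probable categories, best first.
top3_indices = np.argsort(probabilities, axis=1)[:, -3:][:, ::-1]
top3_labels = model.classes_[top3_indices]
```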

You can see that the accuracy is 0.59 and MRR is 0.48. This means that only about 59% of the PRIMARY categories are appearing within the top 3 predicted labels. The MRR also tells us that the rank of the PRIMARY category is between positions 2 and 3. Let's see if we can do better. Let's try a different feature weighting scheme.

This second model uses tf-idf weighting instead of binary weighting on the same description field. You can see that the accuracy is 0.63 and MRR is 0.51, a slight improvement. This is a good indicator that tf-idf weighting works better than binary weighting for this particular task.

How else can we improve our classifier? Remember, we are only using the description field and it is fairly sparse. What if we used the description, headline and tokenized URL? Would this help? Let's try it.

Now, look! As you can see in Figure 8, the accuracy is 0.87 and MRR is 0.75, a significant jump. Now we have about 87% of the primary categories appearing within the top 3 predicted categories. In addition, more of the PRIMARY categories are appearing at position 1. This is good news!

Overall, not bad, huh? The predicted categories make a lot of sense. Note that in the above predictions, we used the headline text. To further improve the predictions, we can enrich the text with the URL tokens and description.

Once we have fully developed the model, we want to use it later on unseen documents. Doing this is actually straightforward with sklearn. First, we have to save the transformer, to later encode/vectorize any unseen document. Next, we also need to save the trained model so that it can make predictions using the weight vectors. Here's how you do it:
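One way to do this is with Python's pickle module (joblib works just as well); the file names here are placeholders:

```python
import pickle

# Persist the fitted vectorizer (transformer) and the trained model ...
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("logistic_regression_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and load them later to classify an unseen document.
with open("tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("logistic_regression_model.pkl", "rb") as f:
    model = pickle.load(f)

unseen = ["white house unveils new budget proposal amid shutdown fears"]
features = vectorizer.transform(unseen)
print(model.predict_proba(features))
```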

Here's the full source code with the accompanying dataset for this tutorial. I hope this article has given you the confidence to implement your very own high-accuracy text classifier. Keep in mind that text classification is an art as much as it is a science. Your creativity when it comes to text preprocessing, evaluation and feature representation will determine the success of your classifier. A one-size-fits-all approach is rare. What works for this news categorization task may very well be inadequate for something like bug detection in source code.

Right now, we are at 87% accuracy. How can we improve the accuracy further? What else would you try? Leave a comment below with what you tried, and how well it worked. Aim for a 90-95% accuracy and let us all know what worked!

knn r, k-nearest neighbor classifier implementation in r programming from scratch

In the introduction to the k-nearest-neighbor algorithm article, we learned the core concepts of the knn algorithm. We also learned about applications of the knn algorithm for solving real-world problems.

In this post, we will be implementing K-Nearest Neighbor Algorithm on a dummy data set using R programming language from scratch. Along the way, we will implement a prediction model to predict classes for data.

Implementing the K-Nearest Neighbor algorithm in R from scratch will help us apply the concepts of the knn algorithm, as we are going to implement each and every component of the knn algorithm ourselves, along with the other pieces, such as how to use the dataset and how to find the accuracy of our implemented model.

Our objective is to program a Knn classifier in the R programming language without using any machine learning package. We have two classes, g (good) and b (bad); this is the response of the radar from the ionosphere. The classifier should be capable of predicting the g or b class for new records, based on the training data.

This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values ranging from -1 to +1, i.e., [-1, +1]. The last (6th) attribute is a categorical variable with the values g (good) or b (bad), according to the definition summarized above. This is a binary classification task.

For any programmatic implementation on the dataset, we first need to import it. Using read.csv(), we import the dataset into the knn.df dataframe. Since the dataset has no header, we use header = FALSE. The sep parameter defines the literal that separates values in our document. knn.df is a dataframe. A dataframe is a table or 2-D array in which each column contains measurements on one variable, and each row contains one record.

Before Train & Test data split, we need to distribute it randomly. In R, we can use sample() method. It helps to randomize all the records of dataframe. Please use set.seed(2), seed() method is used to produce reproducible results. In the next line we are passing sample() method inside dataframe. This is to randomize all 30 records of knn.df. Now, we are ready for a split. For dividing train, test data we are splitting them in 70:30 ratio i.e., 70% of data will be considered as train set & 30% as thetest set.

Euclidean Distance

euclideanDist <- function(a, b){
  # sum the squared differences over every attribute except the last (the class label)
  d = 0
  for(i in c(1:(length(a)-1))){
    d = d + (a[[i]] - b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}

This function is the core part of this tutorial. We are writing a function knn_predict. It takes 3 arguments: test data, train data and the value of K. It loops over all the records of the test data and the train data, and it returns the predicted class labels for the test data.

KNN algorithm accuracy print: In this code snippet we join all our functions together. We call the knn_predict function with the train and test dataframes that we split earlier and a K value of 5. We append the prediction vector as the 7th column of our test dataframe and then, using the accuracy() function, we print the accuracy of our KNN model.

Yes, we can perform regression with knn as well. In knn regression, we average the values of the K neighbors as the predicted value. However, using knn for regression is usually not the optimal option; it is generally better to go with dedicated regression algorithms.

Sorry to say, the dataset we used in the article is dummy data that we created; the idea is that you can apply this code to any dataset. Feel free to use the code on other datasets and let me know if you face any issues.

The dataset we have used in the article is a dummy dataset; the main intention is that the same model-building workflow can be applied to any other dataset. I hope you can use the same model-building framework for other datasets.


naive bayes classifier from scratch in python

We can use probability to make predictions in machine learning. Perhaps the most widely used example is called the Naive Bayes algorithm. Not only is it straightforward to understand, but it also achieves surprisingly good results on a wide range of problems.

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make their calculations tractable.

This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

It is a multiclass classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class.

We'll see how these statistics are used in the calculation of probabilities in a few steps. The two statistics we require from a given dataset are the mean and the standard deviation (average deviation from the mean).

You can see that we square the difference between the mean and a given value, calculate the average squared difference from the mean, then take the square root to return the units back to their original value.

Below is a small function named standard_deviation() that calculates the standard deviation of a list of numbers. You will notice that it also calculates the mean. It might be more efficient to calculate the mean of a list of numbers once and pass it to the standard_deviation() function as a parameter. You can explore this optimization later if you're interested.
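A small sketch of these two helpers on plain Python lists (it mirrors the description above, though the tutorial's exact code may differ slightly):

```python
from math import sqrt

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def standard_deviation(numbers):
    # average squared difference from the mean, then square root (sample variance, N-1)
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

print(mean([1.0, 2.0, 3.0, 4.0, 5.0]))                # 3.0
print(standard_deviation([1.0, 2.0, 3.0, 4.0, 5.0]))  # ~1.58
```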

We can do that by gathering all of the values for each column into a list and calculating the mean and standard deviation on that list. Once calculated, we can gather the statistics together into a list or tuple of statistics. Then, repeat this operation for each column in the dataset and return a list of tuples of statistics.

The first trick is the use of the zip() function that will aggregate elements from each provided argument. We pass in the dataset to the zip() function with the * operator that separates the dataset (that is a list of lists) into separate lists for each row. The zip() function then iterates over each element of each row and returns a column from the dataset as a list of numbers. A clever little trick.

We then calculate the mean, standard deviation and count of rows in each column. A tuple is created from these 3 numbers and a list of these tuples is stored. We then remove the statistics for the class variable as we will not need these statistics.
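Putting the zip() trick together with the two helpers above, a summarize_dataset() sketch could look like this (the tiny dataset here is made up purely for illustration):

```python
def summarize_dataset(dataset):
    """Return (mean, standard deviation, count) for each column except the last (the class)."""
    # zip(*dataset) flips the list of rows into a sequence of columns
    summaries = [(mean(column), standard_deviation(column), len(column))
                 for column in zip(*dataset)]
    del summaries[-1]  # we do not need statistics for the class variable
    return summaries

toy_dataset = [[3.39, 2.33, 0], [3.11, 1.78, 0], [7.42, 4.69, 1], [5.74, 3.53, 1]]
print(summarize_dataset(toy_dataset))
```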

Below is a function named summarize_by_class() that implements this operation. The dataset is first split by class, then statistics are calculated on each subset. The results in the form of a list of tuples of statistics are then stored in a dictionary by their class value.
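Sketched out, with a small separate_by_class() helper assumed to exist alongside it:

```python
def separate_by_class(dataset):
    """Group the rows of the dataset by their class value (assumed to be the last column)."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated

def summarize_by_class(dataset):
    """Calculate (mean, stdev, count) per column for each class separately."""
    separated = separate_by_class(dataset)
    return {class_value: summarize_dataset(rows)
            for class_value, rows in separated.items()}
```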

Running this example calculates the statistics for each input variable and prints them organized by class value. Interpreting the results, we can see that the X1 values for rows for class 0 have a mean value of 2.7420144012.

A Gaussian distribution can be summarized using only two numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate the probability of a given value. This piece of math is called a Gaussian Probability Distribution Function (or Gaussian PDF) and can be calculated as:

f(x) = (1 / (sqrt(2 * pi) * stdev)) * exp(-((x - mean)^2 / (2 * stdev^2)))
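Translated directly into code, the Gaussian PDF looks like this (the printed values match the ones discussed next):

```python
from math import exp, pi, sqrt

def calculate_probability(x, mean, stdev):
    """Gaussian probability density of x for a given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

print(calculate_probability(1.0, 1.0, 1.0))  # ~0.3989, the top of the bell curve
print(calculate_probability(0.0, 1.0, 1.0))  # ~0.2420, one standard deviation below the mean
print(calculate_probability(2.0, 1.0, 1.0))  # ~0.2420, one standard deviation above the mean
```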

Running it prints the probability of some input values. You can see that when the value is 1 and the mean and standard deviation is 1 our input is the most likely (top of the bell curve) and has the probability of 0.39.

We can see that when we keep the statistics the same and change the x value to 1 standard deviation either side of the mean value (2 and 0 or the same distance either side of the bell curve) the probabilities of those input values are the same at 0.24.

Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate probabilities that it belongs to the second class, and so on for all the classes.

This means that the result is no longer strictly a probability of the data belonging to a class. The value is still maximized, meaning that the calculation for the class that results in the largest value is taken as the prediction. This is a common implementation simplification as we are often more interested in the class prediction rather than the probability.

The input variables are treated separately, giving the technique its name, "naive". For the above example, where we have 2 input variables, the probability that a row belongs to the first class (class 0) can be calculated as:

P(class = 0 | X1, X2) = P(X1 | class = 0) * P(X2 | class = 0) * P(class = 0)

Now you can see why we need to separate the data by class value. The Gaussian Probability Density function in the previous step is how we calculate the probability of a real value like X1 and the statistics we prepared are used in this calculation.

First the total number of training records is calculated from the counts stored in the summary statistics. This is used in the calculation of the probability of a given class or P(class) as the ratio of rows with a given class of all rows in the training data.

Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and that class. Probabilities are multiplied together as they are accumulated.
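Putting those two pieces together, a sketch of the class-probability calculation (building on the summarize_by_class() and calculate_probability() sketches above) could be:

```python
def calculate_class_probabilities(summaries, row):
    """Score each class for a given row: P(class) * product of P(x_i | class)."""
    total_rows = sum(class_summaries[0][2] for class_summaries in summaries.values())
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        # P(class): fraction of training rows that belong to this class
        probabilities[class_value] = class_summaries[0][2] / float(total_rows)
        for i, (col_mean, col_stdev, _count) in enumerate(class_summaries):
            # multiply in P(x_i | class) using the Gaussian PDF
            probabilities[class_value] *= calculate_probability(row[i], col_mean, col_stdev)
    return probabilities
```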

We can see that the probability of the first row belonging to the 0 class (0.0503) is higher than the probability of it belonging to the 1 class (0.0001). We would therefore correctly conclude that it belongs to the 0 class.

The first step is to load the dataset and convert the loaded data to numbers that we can use with the mean and standard deviation calculations. For this we will use the helper function load_csv() to load the file, str_column_to_float() to convert string numbers to floats and str_column_to_int() to convert the class column to integer values.

We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 150/5=30 records will be in each fold. We will use the helper functions evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions.

Another new function named naive_bayes() was developed to manage the application of the Naive Bayes algorithm, first learning the statistics from a training dataset and using them to make predictions for a test dataset.
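A sketch of how such a naive_bayes() function, together with a predict() helper, could be assembled from the pieces above (the names and structure are my guess at the tutorial's layout):

```python
def predict(summaries, row):
    """Choose the class whose accumulated probability is largest for this row."""
    probabilities = calculate_class_probabilities(summaries, row)
    return max(probabilities, key=probabilities.get)

def naive_bayes(train, test):
    """Learn per-class statistics from the training set, then predict each test row."""
    summaries = summarize_by_class(train)
    return [predict(summaries, row) for row in test]
```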

We also might like to know the class label (string) for a prediction. We can update the str_column_to_int() function to print the mapping of string class names to integers so we can interpret the prediction by the model.

Then a new observation is defined (in this case I took a row from the dataset), and a predicted label is calculated. In this case our observation is predicted as belonging to class 2 which we know is Iris-setosa.

how to build gaussian naive bayes classifier from scratch using pandas, numpy, & python | evidencen

I am not going to bog you down with the naive Bayes theorem and the different types of naive Bayes. If you want more information about the naive Bayes theorem, I suggest you check out this Wikipedia page: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

I chose to implement Gaussian naive Bayes, as opposed to the other naive Bayes algorithms, because I felt that the Gaussian naive Bayes mathematical equation was a bit easier to understand and implement.

To start off, it is better to use an existing example. I am going to build this project using example data from Wikipedia, which has already worked out an example. So, as we write this code, we compare our answers to the answers from the Wikipedia page to verify that we are doing the right calculations.

I don't want to regurgitate the Gaussian naive Bayes equation explanation here because I feel that Wikipedia does a better job of explaining it than I do. So, if you REALLY need to understand the equation and mathematics of Gaussian naive Bayes before diving into how to code it up, I highly suggest you visit this link: https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Examples

This is the result of the above code. NOTE: y_person = y, which holds the 2 different classes we have. We could potentially have more than 2 classes for classification, but in this example we only have 2 classes.

The mean for male height is 5.85 and the variance for male height is 0.03, etc. These two statistics are combined in order to calculate the probability of someone being male given the test height data we receive. This probability has to be calculated for each class and each feature, using the mean and variance of that class and feature.

Now that we have the mean-variance pairs in a list, the next step is to separate them by class. We know that the first 3 items in the list belong to the first class, and the next 3 items belong to the second class. We are going to use that information to separate our list into the different classes.

So we will have 6 probability calculations. Then we combine (multiply) the probabilities of being male with the male prior, and the probabilities of being female with the female prior, in order to determine which class the sample data belongs to.

If I just print out the for loop above before calculating the probabilities, it becomes clear why I did the split into the various classes before calculating the probabilities. Short answer: it just makes it easier to pair up the right x_test_value with the correct mean_variance pair. Here are the results below.
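To tie the steps of this article together, here is a rough pandas/numpy sketch of the whole flow, using the person-classification table from the Wikipedia page the article follows (heights, weights and foot sizes for four males and four females, and the test sample of height 6, weight 130, foot size 8); treat the exact numbers as quoted from Wikipedia rather than from this article's own data:

```python
import numpy as np
import pandas as pd

# The worked example from the Wikipedia page the article references.
df = pd.DataFrame({
    "height":    [6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.42, 5.75],
    "weight":    [180, 190, 170, 165, 100, 150, 130, 150],
    "foot_size": [12, 11, 12, 10, 6, 8, 7, 9],
    "person":    ["male"] * 4 + ["female"] * 4,
})

# Per-class mean and variance for every feature (pandas uses the sample variance, ddof=1).
means = df.groupby("person").mean()
variances = df.groupby("person").var()
priors = df["person"].value_counts(normalize=True)

def gaussian_likelihood(x, mean, var):
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# The test sample from the Wikipedia example.
sample = {"height": 6.0, "weight": 130.0, "foot_size": 8.0}

scores = {}
for cls in means.index:
    score = priors[cls]  # multiply the prior by each per-feature likelihood
    for feature, value in sample.items():
        score *= gaussian_likelihood(value, means.loc[cls, feature], variances.loc[cls, feature])
    scores[cls] = score

print(scores)                       # posterior numerators for each class
print(max(scores, key=scores.get))  # -> 'female' for this sample
```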


understanding naive bayes classifier from scratch

Naive Bayes classifiers belong to a family of probabilistic classifiers that are built upon the Bayes theorem. In naive Bayes classifiers, the number of model parameters increases linearly with the number of features. Moreover, they are trained by evaluating a closed-form expression, i.e., a mathematical expression that can be evaluated in a finite number of steps and has one definite solution. This means that naive Bayes classifiers train in linear time, compared to the quadratic or cubic time of other iterative, approximation-based approaches. These two factors make naive Bayes classifiers highly scalable. In this article, we'll go through the Bayes theorem, make some assumptions and then implement a naive Bayes classifier from scratch.

Bayes theorem is one of the most important formulas in all of probability. It's an essential tool for scientific discovery and for creating AI systems; it has also been used to find century-old treasures. It is formulated as

P(H|E) = P(E|H) * P(H) / P(E)

where H is a hypothesis and E is the observed evidence.

Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure and a passion for detail.

Given the above description, do you think Steve is more likely to be a librarian or a farmer? The majority of people immediately conclude that Steve must be a librarian since he fits their idea of a librarian. However, when we look at the whole picture, we see that there are twenty times as many farmers as librarians (in the United States). Most people aren't aware of this statistic and hence can't make an accurate prediction, and that's okay. Also, that's beside the point of this article. However, if you want to learn why we act irrationally and make assumptions like this, I wholeheartedly recommend reading Kahneman's Thinking, Fast and Slow.

Back to Bayes theorem. To model this puzzle more accurately, let's start by creating a representative sample of 420 people: 20 librarians and 400 farmers. And let's say your intuition is that roughly 50% of librarians would fit that description, and 10% of farmers would. So the probability of a random person fitting this description being a librarian becomes 0.2 (10/50). So even if you think a librarian is five times as likely as a farmer to fit this description, that's not enough to overcome the fact that there are way more farmers.

This new evidence doesn't necessarily overrule your past belief but rather updates it. And this is precisely what the Bayes theorem models. The first relevant number is the probability that your belief holds true before considering the new evidence. Using the ratio of librarians to the whole sample (20 out of 420), this comes out to be roughly 1/21 in our example. This is known as the prior, P(H). In addition to this, we need to consider the proportion of librarians that fit this description: the probability that we would see the evidence given that the hypothesis is true, P(E|H). In the context of the Bayes theorem, this value is called the likelihood. This represents a limited view of your initial hypothesis.

Similarly, we need to consider how much of the farmers' side of the sample space makes up the evidence: the probability of seeing the evidence given that your belief does not hold true, P(E|¬H). Using these notations, the accurate probability of your belief being right given the evidence, P(H|E), also called the posterior probability, can be formulated as:

P(H|E) = P(E|H) * P(H) / (P(E|H) * P(H) + P(E|¬H) * P(¬H))

This is the original Bayes theorem that we started with, only with the denominator P(E) expanded over the two competing hypotheses. I hope this illustrated the core point of Bayes theorem: it represents a changing belief system, not just a bunch of independent probabilities.

The naive Bayes classifier is called "naive" because it makes the assumption that all features are independent of each other. Another assumption it makes is that the values of the features are normally (Gaussian) distributed. Using these assumptions, the original Bayes theorem is modified and transformed into a simpler form that is relevant for solving learning problems. We start with

P(y | x_1, ..., x_n) = P(x_1, ..., x_n | y) * P(y) / P(x_1, ..., x_n)

which, under the independence assumption, reduces the likelihood term to a product of per-feature terms P(x_i | y).

Create a function that calculates the prior probability, P(H), mean and variance of each class. The mean and variance are later used to calculate the likelihood, P(E|H), using the Gaussian distribution.
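A minimal sketch of such a function, plus how the stored statistics would later be combined into a prediction, assuming the training data arrives as a NumPy feature matrix X and a label vector y (the names are illustrative, not a fixed API):

```python
import numpy as np

def fit(X, y):
    """Compute the prior, per-feature mean and per-feature variance for each class."""
    stats = {}
    for c in np.unique(y):
        X_c = X[y == c]
        stats[c] = {
            "prior": X_c.shape[0] / X.shape[0],  # P(H)
            "mean": X_c.mean(axis=0),
            "var": X_c.var(axis=0),              # population variance (ddof=0)
        }
    return stats

def gaussian_likelihood(x, mean, var):
    # P(E|H) under the Gaussian assumption, evaluated per feature
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(stats, x):
    """Pick the class with the largest prior * product of per-feature likelihoods."""
    scores = {c: s["prior"] * np.prod(gaussian_likelihood(x, s["mean"], s["var"]))
              for c, s in stats.items()}
    return max(scores, key=scores.get)
```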