How to build a Gaussian naive Bayes classifier from scratch using pandas, NumPy, & Python
Here is the GitHub repo for this project.
I am not going to bog you down with the naive Bayes theorem and its different variants. If you want more information about naive Bayes, I suggest you check out this Wikipedia page: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
I chose to implement Gaussian naive Bayes, as opposed to the other naive Bayes variants, because I felt its mathematical equation was a bit easier to understand and implement.
To start off, it is better to use an existing example, so I am going to build this project using the example data from Wikipedia. Wikipedia has already worked out the calculations, so as we write this code we can compare our answers to the answers from the Wikipedia page and verify that we are doing the right calculations.
The example calculation we will use to verify our code and calculations can be found here https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Examples
I don’t want to regurgitate the Gaussian naive Bayes equation explanation here because I feel like Wikipedia does a better job of explaining it than I do. So, if you REALLY need to understand the equation and mathematics of Gaussian naive Bayes before diving into how to code it up, I highly suggest you visit this link https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Examples
I am also screen-shotting and adding the pages from Wikipedia below to help you understand the basics of how the Gaussian naive Bayes equation classifier is supposed to work.
Here are the images from Wikipedia showing the example calculations.



Step 1: Pick a DataFrame for testing the code.
The data below corresponds to the data used in the Wikipedia example above. So, the answers we get below should be the same thing as the answers they got above.
import pandas as pd
import numpy as np

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})

# I converted "male" to 0 and "female" to 1 in y, which is our target.
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# to be used in sklearn, due to how sklearn wants the data
sample = pd.DataFrame({
    'height': [6],
    'weight': [130],
    'foot': [8]
})

# same data as the sample above, but to be used in our custom-built
# classifier due to the data structure our class is expecting
x_test = [6, 130, 8]
This is what the above datasets look like.


Step 2: Create a model for training our data.
You have to realize that the “training” process of this algorithm is simply obtaining the “mean and variance” of each feature in the data frame.
Each feature is broken up by class, then we obtain the mean and variance of each feature by class. Look at the code below for explanation.
x = x_values

# To group a dataset by class, you can simply use the pandas groupby function.
# To group our x by y, we do: x.groupby(by=y)
# Not only can we group with pandas groupby, it is also the easiest way to get
# the mean and variance of each feature by group.
# To get the mean of each feature by group:      x.groupby(by=y).mean()
# To get the variance of each feature by group:  x.groupby(by=y).var()

# Putting all of this together in our fit function, this is what we get.
def fit(x, y):
    mean = x.groupby(by=y).mean()
    var = x.groupby(by=y).var()
    return mean, var

# I returned mean and var so that we can see what the results look like.
# When we put everything together in a class below, there will be no need
# to return mean and var.

# Now we can call this fit function and see if it works.
mean2, var2 = fit(x_values, y)
This is the result of the above code. NOTE: y_person in the output is just y, i.e. the two different classes we have. We could potentially have more than two classes for classification, but in this example we only have two.
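For reference, these are roughly the numbers you should see (my own calculation from the data above; they match the worked values in the Wikipedia example):

# mean2 (mean of each feature per class; 0 = male, 1 = female)
#         height   weight    foot
# 0       5.8550   176.25   11.25
# 1       5.4175   132.50    7.50
#
# var2 (sample variance of each feature per class)
#         height     weight     foot
# 0       0.035033   122.9167   0.916667
# 1       0.097225   558.3333   1.666667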


Step 3: Pull mean and variance from their individual lists and pair them up.
The next step, and really one of the biggest challenges, is combining the mean and variance for each feature and class. Here is what I mean.
The mean for male height is 5.855 and the variance for male height is about 0.035, etc. Those two values will be paired together in order to calculate the probability of someone being male given the test height we receive. This probability has to be calculated for each class and each feature, using the mean and variance of that class and feature.
To extract and pair up the means and variances by class, we convert our DataFrames into arrays and then run a for loop through them.
The code below will explain everything.
# converting the mean values into an array
m = np.array(mean2)

# converting the variance values into an array
v = np.array(var2)


Now that we have our means and variances in arrays, let's loop through them, pull out the mean and variance for each feature and class, and pair them together.
# looping through the mean and variance arrays to pull out
# the individual values for each feature
for i in range(len(m)):
    m_row = m[i]
    v_row = v[i]
    for a, b in enumerate(m_row):
        mean = b
        var = v_row[a]
        print(f'mean: {mean}, var: {var}')
Here is the result of the above code

Now we are going to repeat the code above, but store the pairs in a list that we can loop through again.
mean_var = []
for i in range(len(m)):
    m_row = m[i]
    v_row = v[i]
    for a, b in enumerate(m_row):
        mean = b
        var = v_row[a]
        mean_var.append([mean, var])
Here is the result of the above code. Same thing as the previous picture.
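For reference, mean_var should now hold six [mean, variance] pairs: the three male features followed by the three female features. The values below are approximate and are my own, derived from the data above.

# mean_var (approximately):
# [[5.855,   0.035033],   # male height
#  [176.25,  122.9167],   # male weight
#  [11.25,   0.916667],   # male foot
#  [5.4175,  0.097225],   # female height
#  [132.5,   558.3333],   # female weight
#  [7.5,     1.666667]]   # female foot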

Step 4: Separate the mean_variance pairs by class
Now that we have the mean_variance pairs in a list, the next step is to separate them by class. We know that the first 3 items in the list belong to the first class and the last 3 items belong to the second class. We are going to use that information to split our list by class.
# First, in order to split our list based on the number of classes,
# we have to turn it into an array.
mean_var2 = np.array(mean_var)

# Next, we calculate how many classes we have by counting
# the number of unique values in y.
n_class = len(np.unique(y))
# >>> n_class is 2, i.e. we have 2 unique classes

# Now, to separate the mean_variance pairs by class, we use numpy's vsplit.
# mean_var2 is what we are splitting, and n_class is how many splits to make.
s = np.vsplit(mean_var2, n_class)
Here is the result of the above code. With this, we have our mean_variance pairs separated by class, which will make it easier to know which probability calculation belongs to which class and feature.

Step 5: Build the base probability calculation formula using code
Now that we have the mean and variance paired up and separated by class, it's time to build our BASE Gaussian naive Bayes calculation using the appropriate formula.
This is the formula used to calculate the probability of an x value given a class.
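In standard notation (my own rendering of the same equation the screenshots show), the Gaussian likelihood is

$$ p(x = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left(-\frac{(v - \mu_k)^2}{2\sigma_k^2}\right) $$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of that feature computed from the training rows belonging to class $C_k$.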

Here is what the formula above means

Here is how sklearn describes this formula. Same formula, slightly different symbols.

Now, let’s translate this formula into code to calculate probabilities.
def gnb(x_val, x_mean, x_var):
    # x_val  : the x value being used for the computation
    # x_mean : mean of the feature for this class
    # x_var  : variance of the feature for this class

    # pi
    pi = np.pi

    # first part of the equation:
    # 1 divided by the square root of (2 * pi * x_variance)
    equation_1 = 1 / (np.sqrt(2 * pi * x_var))

    # second part of the equation:
    # denominator of the exponent
    denom = 2 * x_var

    # numerator of the exponent
    numerator = (x_val - x_mean) ** 2

    # the exponential term
    expo = np.exp(-(numerator / denom))

    prob = equation_1 * expo
    return prob
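As a quick sanity check (these numbers are mine, computed from the data above rather than taken from a screenshot), plugging in the male height statistics should reproduce the density reported in the Wikipedia example:

# p(height = 6 | male), using mean ~5.855 and variance ~0.035033
gnb(6, 5.855, 0.035033)
# ~1.5789, the same value shown in the Wikipedia example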
Step 6: Calculate probabilities using our test x values
Now that we have all the pieces together, we can actually use the results of our fit function (i.e., the means and variances) to make predictions on new x values.
Let's predict the probability for each feature given some test x values. Say you have this brand new data:
height = 6
weight = 130
foot size = 8
Does this data describe a male or a female? We have to run the data through our model and get the probability of each feature for each class.
At the end, we will have probabilities of the following
- height being male
- height being female
- weight being male
- weight being female
- foot being male
- foot being female
So we will have six probability calculations.
Then we will multiply the probabilities of being male together with the prior,
and multiply the probabilities of being female together with the prior,
in order to determine which class the sample data belongs to.
We will discuss what "prior" means later.
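Written out, the score for each class is just the prior times the product of the per-feature likelihoods (this is the standard naive Bayes rule, stated here for clarity):

$$ \text{score}(\text{male}) = P(\text{male}) \cdot p(\text{height}\mid\text{male}) \cdot p(\text{weight}\mid\text{male}) \cdot p(\text{foot}\mid\text{male}) $$

and likewise for female; whichever class has the larger score is the prediction.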
With that said, this is how to calculate the probabilities.
# xs = test x values
xs = [6, 130, 8]

# do the probability calculations for each of the available classes
prob = []
for i in range(n_class):
    # mean_variance pairs for the i-th class
    class_one = s[i]
    for j in range(len(class_one)):
        # mean and variance of the j-th feature for this class
        class_one_x_mean = class_one[j][0]
        class_one_x_var = class_one[j][1]
        # pull the x value that corresponds to the same feature
        # as this mean_variance pair
        x_value = xs[j]
        # now calculate the probability for this feature and class
        prob.append([gnb(x_value, class_one_x_mean, class_one_x_var)])
Here is the result of the above code. The first item in the list is the probability for male & height, then male & weight, then male & foot. The last three items are female height, weight, and foot.
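Roughly, these are the values you should see (approximate, and consistent with the likelihoods in the Wikipedia example):

# prob (approximately):
# [[1.5789],       # p(height | male)
#  [5.9881e-06],   # p(weight | male)
#  [1.3112e-03],   # p(foot   | male)
#  [2.2346e-01],   # p(height | female)
#  [1.6789e-02],   # p(weight | female)
#  [2.8669e-01]]   # p(foot   | female)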

SIDE NOTE: you might be wondering why I split the pairs into classes before calculating probabilities. Why didn't I just calculate the probabilities right after pairing each mean with its variance?
Well, the code and image below will show you our final data structure before calculating probabilities.
xs = [6, 130, 8]

# do this for the number of classes available
for i in range(n_class):
    # mean_variance pairs for the i-th class
    class_one = s[i]
    for j in range(len(class_one)):
        class_one_x_mean = class_one[j][0]
        class_one_x_var = class_one[j][1]
        x_value = xs[j]
        print([class_one_x_mean, class_one_x_var, x_value])
If you print out the contents of the for loop above before calculating any probabilities, it becomes clear why I split into classes first. Short answer: it just makes it easier to pair up the right "x_test_value" with the correct "mean_variance" pair. Here are the results below.

Step 7: Divide the probabilities into their separate classes again
As you can see, we have six probabilities; now we need to split them up again by the number of classes. This separation makes it clear which probabilities belong to which class.
# turn our probabilities into an array before we split
b = np.array(prob)

# split the probabilities into the various classes
k = np.vsplit(b, n_class)
This is the result of the split. The first array refers to the first class, and the second array refers to the second class.

Step 8: Use the prior and probabilities to calculate the final probability of the sample data being male or female
What exactly is PRIOR?
Prior = the prior probability distribution. Basically: before seeing any data, what do we believe the probability of each class is?
In this case, since we have only 2 classes, the prior is 50/50. If we had 3 classes, it would be 1/3 ≈ 33%; you get the idea.
Now, we are going to calculate the final probability of each class to see which class our sample data most likely belongs to.
# calculate the prior
prior = 1 / n_class

final_probabilities = []
for i in k:
    class_prob = np.prod(i) * prior
    final_probabilities.append(class_prob)
These are the results of the above code. The final probabilities of each class are below.
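You should get values close to these (approximate; the female value matches the 0.000537... number used in Step 9 below, and both agree with the Wikipedia example). The absolute numbers are tiny because these are unnormalized density products, but only their relative size matters.

# final_probabilities (approximately):
# [6.1984e-09,    # male
#  5.3778e-04]    # female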

Step 9: Figure out the highest probability and relate it back to its class
Now that we have our probabilities for each class, let’s get the maximum probability.
# calculating the maximum probability
maximum_prob = max(final_probabilities)
# >>> 0.0005377909183630018

# getting the index that corresponds to the maximum probability
prob_index = final_probabilities.index(maximum_prob)

# using the index above to map back to the class it represents
# (note: we index into the unique class labels, not into y itself)
classes = np.unique(y)
prediction = classes[prob_index]
# >>> 1
Since our maximum probability is the second number, which belongs to the second class, we can conclude that the sample data we received probably belongs to a FEMALE.
There you have it. A step by step walk through of how to build a Gaussian naive Bayes classifier. Let me know what you think in the comment section BELOW.
Step 10: All the code above put together in a class
If we take everything above and put it into a class, this is what it would look like
import pandas as pd
import numpy as np

class gnb:

    def __init__(self, prior=None, n_class=None, mean=None,
                 variance=None, classes=None):
        # prior assumption of probability
        self.prior = prior
        # how many unique classes
        self.n_class = n_class
        # mean of the x values
        self.mean = mean
        # variance of the x values
        self.variance = variance
        # the unique classes present
        self.classes = classes

    def fit(self, x, y):
        # get the mean and variance of the x values, grouped by class
        self.x = x
        self.y = y
        self.mean = np.array(x.groupby(by=y).mean())
        self.variance = np.array(x.groupby(by=y).var())
        self.n_class = len(np.unique(y))
        self.classes = np.unique(y)
        self.prior = 1 / self.n_class
        return self

    def mean_var(self):
        # mean and variance from the training data
        m = np.array(self.mean)
        v = np.array(self.variance)
        # pull and combine the corresponding mean and variance pairs
        pairs = []
        for i in range(len(m)):
            m_row = m[i]
            v_row = v[i]
            for a, b in enumerate(m_row):
                mean = b
                var = v_row[a]
                pairs.append([mean, var])
        return pairs

    def split(self):
        # split the mean_variance pairs into the various classes
        spt = np.vsplit(np.array(self.mean_var()), self.n_class)
        return spt

    def gnb_base(self, x_val, x_mean, x_var):
        # the base formula for the prediction probabilities
        # x_val  : the x value being used for the computation
        # x_mean : mean of the feature for this class
        # x_var  : variance of the feature for this class
        pi = np.pi
        # first part of the equation:
        # 1 divided by the square root of (2 * pi * x_variance)
        equation_1 = 1 / (np.sqrt(2 * pi * x_var))
        # second part of the equation:
        # denominator of the exponent
        denom = 2 * x_var
        # numerator of the exponent
        numerator = (x_val - x_mean) ** 2
        # the exponential term
        expo = np.exp(-(numerator / denom))
        prob = equation_1 * expo
        return prob

    def predict(self, X):
        self.X = X
        # the mean_variance pairs, split into the various classes
        split_class = self.split()
        # calculate the probabilities using the base formula above
        prob = []
        for i in range(self.n_class):
            # mean_variance pairs for the i-th class
            class_one = split_class[i]
            for j in range(len(class_one)):
                # mean and variance of the j-th feature for this class
                class_one_x_mean = class_one[j][0]
                class_one_x_var = class_one[j][1]
                x_value = X[j]
                # now calculate the probability for this feature and class
                prob.append([self.gnb_base(x_value, class_one_x_mean,
                                           class_one_x_var)])
        # turn prob into an array
        prob_array = np.array(prob)
        # split the probabilities into the various classes again
        prob_split = np.vsplit(prob_array, self.n_class)
        # calculate the final probability for each class
        final_probabilities = []
        for i in prob_split:
            class_prob = np.prod(i) * self.prior
            final_probabilities.append(class_prob)
        # determine the maximum probability
        maximum_prob = max(final_probabilities)
        # get the index that corresponds to the maximum probability
        prob_index = final_probabilities.index(maximum_prob)
        # use that index to get the class that corresponds
        # to the maximum probability
        prediction = self.classes[prob_index]
        return prediction
And to demonstrate how this would work, we will use the dummy data created above.
import pandas as pd
import numpy as np

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})

y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# to be used in sklearn, due to how sklearn wants the data
sample = pd.DataFrame({
    'height': [6],
    'weight': [130],
    'foot': [8]
})

# same data as the sample above, but to be used in our custom-built
# classifier due to the data structure our class is expecting
x_test = [6, 130, 8]
Then this is the part where we instantiate the class we created above.
# instantiate the classifier (named "model" so we don't shadow the class name)
model = gnb()

# fit the classifier to the x values and the target
model.fit(x_values, y)

# given these x values, let's make a prediction
model.predict([6, 130, 8])
# Output: 1
When I run the same training and test data through sklearn's GaussianNB classifier, I also get "1" as the prediction.
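If you want to reproduce that check yourself, a minimal sketch (assuming scikit-learn is installed) looks roughly like this:

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(x_values, y)        # x_values and y are the training data defined above
print(clf.predict(sample))  # the one-row sample DataFrame -> array([1])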
Here is the link to the GitHub repo for this project.
Let me know what you think in the comment section below.