How to build a Gaussian naive Bayes classifier from scratch using pandas, NumPy, and Python

Here is the github repo for this project

I am not going to bog you down with the theory behind naive Bayes and its different variants. If you want more information about naive Bayes classifiers, I suggest you check out this Wikipedia page https://en.wikipedia.org/wiki/Naive_Bayes_classifier

I chose to implement Gaussian naive Bayes as opposed to the other naive Bayes algorithms because I felt like the Gaussian naive Bayes equation was a bit easier to understand and implement.

To start off, it is better to use an existing example. I am going to build this project using example data from Wikipedia. Wikipedia already worked out an example. So, as we are writing this code, we are comparing our answers to the answers from the Wikipedia page to verify that we are doing the right calculations.

The example calculation we will use to verify our code and calculations can be found here https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Examples

I don’t want to regurgitate the Gaussian naive Bayes equation explanation here because I feel like Wikipedia does a better job of explaining it than I do. So, if you REALLY need to understand the equation and mathematics of Gaussian naive Bayes before diving into how to code it up, I highly suggest you visit this link https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Examples

I am also screen-shotting and adding the pages from Wikipedia below to help you understand the basics of how the Gaussian naive Bayes equation classifier is supposed to work.

Here are the images from wikipedia showing example calculations.

Step 1: Pick a DataFrame for testing the code.

The data below corresponds to the data used in the Wikipedia example above. So, the answers we get below should be the same thing as the answers they got above.

import pandas as pd
import numpy as np

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180,190,170,165,100,150,130,150],
    'foot': [12,11,12,10,6,8,7,9]
})

# I converted "male" to 0, and "female" to 1 in the y which is our target.
y = pd.Series([0,0,0,0,1,1,1,1])

# to be used in sklearn due to how sklearn wants the data
sample = pd.DataFrame({
    'height': [6],
    'weight': [130],
    'foot': [8]
})

# same data as sample above, but as a plain list, which is the
# data structure our custom-built classifier expects.
x_test = [6, 130, 8]

This is what the above datasets look like.

Step 2: Create a model for training our data.

You have to realize that the “training” process of this algorithm is simply obtaining the “mean and variance” of each feature in the data frame.

Each feature is broken up by class, then we obtain the mean and variance of each feature by class. Look at the code below for explanation.

x = x_values
# to group a dataset by class, you can simply use 
# pandas groupby function

# to group our x by y, we simply do
x.groupby(by=y)

# not only can we group using pandas groupby,
# the easiest way to get the mean and variance of each
# feature by group is to use the pandas groupby function

# to get the mean of each feature by group, we do
x.groupby(by=y).mean()

# to get the variance of each feature by group, we do
# (pandas returns the sample variance, ddof=1, by default)
x.groupby(by=y).var()

# putting all of this together in our fit function, 
# this is what we got. 

def fit(x, y):
  mean = x.groupby(by=y).mean()

  var = x.groupby(by=y).var()

  return mean, var

# I returned mean, and var so that we can see 
# what the results look like. When we put everything together 
# in a class below, then no need to return mean, var.

# now, we can call this fit method and see if it works. 
mean2, var2 = fit(x_values, y)

This is the result of the above code. NOTE: y_person = y, which is the 2 different classes we have. We could potentially have more than 2 classes for classification. But in this example, we only have 2 classes.
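As a quick sanity check, the male height statistics can be compared against the figures quoted in the Wikipedia example (mean of about 5.855, variance of about 0.035). This is only a sketch: it rebuilds the same data so that it runs on its own, and the names mean_by_class and var_by_class and the tolerance are my own choices.

```python
import numpy as np
import pandas as pd

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# same calculations as the fit function above
mean_by_class = x_values.groupby(by=y).mean()
var_by_class = x_values.groupby(by=y).var()  # sample variance (ddof=1)

# class 0 is "male"
male_height_mean = mean_by_class.loc[0, 'height']
male_height_var = var_by_class.loc[0, 'height']
print(male_height_mean, male_height_var)

# compare with the rounded values from the Wikipedia example
assert np.isclose(male_height_mean, 5.855)
assert np.isclose(male_height_var, 0.035033, atol=1e-5)
```

If either assert fails, the data or the grouping is off, and it is worth fixing that before moving on.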

Step 3: Pull mean and variance from their individual lists and pair them up.

The next step, and really one of the biggest challenges, is combining the mean and variance for each feature and class. This is what I mean.

The mean for male height is about 5.855, the variance for male height is about 0.035, and so on. Those 2 values are combined in order to calculate how likely the test height is for a male. This calculation has to be done for each class and each feature, using the mean and variance of that class and feature.

To extract and combine the mean and variance by class, we convert our DataFrames into NumPy arrays and then run a for loop through them.

The code below will explain everything.

# converting the mean values into an array
m = np.array(mean2)

# converting the variance values into an array
v = np.array(var2)

Now that we have our mean and variance in arrays, let’s loop through them and pull out the mean and variance for each feature and each class and pair them together.

# looping through the mean and variance to pull out 
# individual values for each feature. 

for i in range(len(m)):
  m_row = m[i]
  v_row = v[i]
  for a, b in enumerate(m_row):
    mean = b
    var = v_row[a]
    print(f'mean: {mean}, var: {var}')

Here is the result of the above code

Now, we are going to repeat the code above, but put the results in a list that we can loop through again.

mean_var = []
#len_row = []
for i in range(len(m)):
  m_row = m[i]
  #len_row.append(len(m_row))
  v_row = v[i]
  for a, b in enumerate(m_row):
    mean = b
    var = v_row[a]
    mean_var.append([mean, var])

Here is the result of the above code. Same thing as the previous picture.

Step 4: separate mean_variance pair by class

Now that we have the mean_variance pairs in a list, the next step is to separate them by class. We know that the first 3 items in the list belong to the first class, and the next 3 items belong to the second class. We are going to use that information to separate our list into different classes.

# in order to split our list based on the number of classes,
# we first have to turn our list into an array.
mean_var2 = np.array(mean_var)

# next, we need to calculate how many classes we have by
# getting the number of unique values in y
n_class = len(np.unique(y))
# n_class is 2, because we have 2 unique classes

# now to separate mean variance by class, we use numpy vsplit
s = np.vsplit(mean_var2, n_class)
# mean_var2 is what we are splitting, and n_class is how many
# splits should be done

Here is the result of the above code. With these, we have our mean_variance pair separated by classes. This will make it easier to know which probability calculation belong to which class and feature
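Steps 3 and 4 can also be done in one shot with NumPy. The sketch below is an alternative, not the approach used in the rest of this post: it stacks the mean and variance arrays along a new last axis, so the result already has one block per class. The name paired is made up for illustration.

```python
import numpy as np
import pandas as pd

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# mean and variance per class, as arrays of shape (n_class, n_features)
m = x_values.groupby(by=y).mean().to_numpy()
v = x_values.groupby(by=y).var().to_numpy()

# shape (n_class, n_features, 2): the last axis holds [mean, variance]
paired = np.stack([m, v], axis=-1)

print(paired.shape)
print(paired[0])  # mean_variance pairs for class 0 (male)
```

Here paired[0] should hold the same values as the first array returned by np.vsplit, and paired[1] the same values as the second.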

Step 5: Build the base probability calculation formula using code

Now that we have the mean and variance paired up and separated by classes, it is time to build our BASE Gaussian naive Bayes algorithm using the appropriate formula.

This is the formula to calculate the probability of y given these x values.

Here is what the formula above means

Here is how sklearn describes this formula. Same formula, slightly different symbols.
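Written out, the Gaussian likelihood used in the Wikipedia example is the following, where \mu_k and \sigma_k^2 are the mean and variance of a feature within class C_k, and v is the observed value of that feature:

```latex
p(x = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(v - \mu_k)^2}{2\sigma_k^2}\right)
```

The fraction in front is the "first part" of the equation in the code, and the exponential is the "second part".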

Now, let’s translate this formula into code to calculate probabilities.

def gnb(x_val, x_mean, x_var):
  # x_val: the x value that is being used for computation
  # x_mean: mean of that feature for one class
  # x_var: variance of that feature for the same class

  # first part of the equation
  # 1 divided by the sqrt of 2 * pi * variance
  equation_1 = 1/(np.sqrt(2 * np.pi * x_var))

  # second part of equation implementation
  # denominator of equation
  denom = 2 * x_var

  # numerator calculation
  numerator = (x_val - x_mean) ** 2
  # the exponent
  expo = np.exp(-(numerator/denom))

  # Gaussian probability density of x_val
  prob = equation_1 * expo

  return prob
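To convince ourselves the formula is coded correctly, we can check one value against the Wikipedia example, which reports roughly 1.5789 for a height of 6 under the male class. The sketch below is standalone: gaussian_density is a made-up name for a one-line version of the same formula, and the mean and variance are recomputed from the four male heights.

```python
import numpy as np

def gaussian_density(x_val, x_mean, x_var):
  # same formula as gnb above, written as a single expression
  return np.exp(-((x_val - x_mean) ** 2) / (2 * x_var)) / np.sqrt(2 * np.pi * x_var)

# male heights from the training data
male_heights = np.array([6, 5.92, 5.58, 5.92])
mean = male_heights.mean()
var = male_heights.var(ddof=1)  # sample variance, like pandas .var()

p = gaussian_density(6, mean, var)
print(p)  # roughly 1.5789, matching the Wikipedia example
```

Note that the result is larger than 1. That is fine, because the formula returns a probability density rather than a probability.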

Step 6: calculating probabilities using our test x_values.

Now that we have all the pieces together, we can go ahead and actually use the results of our fit method (aka, mean & variance) to do predictions on new x_values.

Suppose we receive this brand new data, and we want the probability of each feature value under each class:

height = 6
weight = 130
foot size = 8

Does this data classify a male or a female? We have to run our data through our model and get the probability of each feature for each class.

At the end, we will have probabilities of the following

  • height being male
  • height being female
  • weight being male
  • weight being female
  • foot being male
  • foot being female

So then, we will have 6 probability calculations. Then we will multiply the 3 male probabilities together with the prior, and multiply the 3 female probabilities together with the prior, in order to determine which class the sample data belongs to.

We will discuss what “prior” means later.
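In symbols, the score computed for each class C_k is the prior times the product of the three per-feature likelihoods, and the class with the larger score wins. This is the same quantity the Wikipedia example calls the posterior numerator:

```latex
\text{score}(C_k) = P(C_k) \cdot p(\text{height} \mid C_k) \cdot p(\text{weight} \mid C_k) \cdot p(\text{foot size} \mid C_k)
```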

With that said, this is how to calculate the probabilities.

# xs = test x values

xs = [6, 130, 8]

# do the probability calculations for the number of classes available

# probabilities
prob = []
for i in range(n_class):
  # mean_variance pairs for the current class
  class_one = s[i]
  for j in range(len(class_one)):
    # mean and variance of feature j for this class
    class_one_x_mean = class_one[j][0]
    class_one_x_var = class_one[j][1]
    # pull the x value that corresponds to the same index
    # as the mean and variance.
    x_value = xs[j]
    # now calculate the probability of this feature for this class.
    prob.append([gnb(x_value, class_one_x_mean, class_one_x_var)])

Here is the result of the above code. The first item in the list is the probability of male & height, then male & weight, then male & foot. The last three items refer to female height, weight, foot.

SIDE NOTE: you might be thinking why I did the split into classes before calculating probabilities. Why didn’t I just calculate probabilities right after pairing mean with variance?

Well, the code and image below will show you our final data structure before calculating probabilities.

xs = [6, 130, 8]
# do this for the number of classes available

for i in range(n_class):
  # mean_variance pairs for the current class
  class_one = s[i]
  for j in range(len(class_one)):
    # mean and variance of feature j for this class
    class_one_x_mean = class_one[j][0]
    class_one_x_var = class_one[j][1]
    x_value = xs[j]

    print([class_one_x_mean, class_one_x_var, x_value])

If I just print out the “for loop” above before calculating the probabilities, then it makes sense why I did the split into various classes before calculating probabilities. Short Answer: It just makes it easier to pair up the right “x_test_value” with the correct “mean_variance” pair. Here are the results below.

Step 7: divide the probabilities into their separate classes again.

As you can see, we have 6 probabilities, now we need to divide them up again into number of classes. This separation will help with properly identifying which probability belongs to which class.

# turn our probabilities into an array before we split
b = np.array(prob)

# split the probabilities into the various classes
k = np.vsplit(b, n_class)

This is the result of the split. The first array refers to the first class, and the second array refers to the second class.

Step 8. Use the prior and probabilities to calculate the final probability of the sample data being male or female.

What exactly is PRIOR?

Prior = the prior probability of each class. So, basically, before looking at any feature data, what do we believe is the probability of each class?

In this case, since we have only 2 classes and we treat them as equally likely, the prior is 50/50, or 0.5 for each class. If we had 3 classes, the prior would be 1/3 ≈ 33%, you get the idea.
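I am using equal priors here. Another common choice is to estimate the prior from how often each class appears in the training data. For our balanced data, both approaches give 0.5 per class. A small sketch of that alternative (class_priors is a name I made up for illustration):

```python
import pandas as pd

y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# fraction of training rows in each class, ordered by class label
class_priors = y.value_counts(normalize=True).sort_index()
print(class_priors)
```

With unbalanced classes the two approaches would give different priors, so it is worth deciding which assumption fits your data.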

Now, we are going to calculate the final probability of each class to see which class our sample data most likely belongs to.

# calculate prior
prior = 1/n_class

final_probabilities = []
for i in k:
  class_prob = np.prod(i) * prior
  final_probabilities.append(class_prob)

These are the results of the above code. The final probabilities of each class are below.
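These numbers are not true probabilities yet, since they do not add up to 1. If you want that, one option is to divide each score by the total. The sketch below recomputes the scores in a vectorized way (this is my own compact variant under the same equal-prior assumption, not the loop used in this post) and then normalizes them.

```python
import numpy as np
import pandas as pd

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
xs = np.array([6, 130, 8])

# mean and variance per class, shape (n_class, n_features)
m = x_values.groupby(by=y).mean().to_numpy()
v = x_values.groupby(by=y).var().to_numpy()

# per-feature Gaussian densities, shape (n_class, n_features)
densities = np.exp(-((xs - m) ** 2) / (2 * v)) / np.sqrt(2 * np.pi * v)

# equal prior for every class, as in this post
prior = 1 / len(m)
scores = densities.prod(axis=1) * prior

# dividing by the sum turns the scores into values that add up to 1
posteriors = scores / scores.sum()
print(scores)
print(posteriors)
```

The ranking of the classes does not change after normalizing, so the prediction is the same either way.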

Step 9. Figuring out highest probability and relating it back to the class.

Now that we have our probabilities for each class, let’s get the maximum probability.

# calculating maximum probability.
maximum_prob = max(final_probabilities)
# Output: 0.0005377909183630018

# getting the index that corresponds to the maximum probability
prob_index = final_probabilities.index(maximum_prob)

# using the index above to look up the predicted class label.
# np.unique(y) lists the classes in the same sorted order that
# groupby used, so index 1 maps to class 1.
prediction = np.unique(y)[prob_index]
# Output: 1

Since our maximum probability is the second number, which belongs to the second class, we can conclude that the sample data we received probably belongs to a FEMALE.

There you have it. A step by step walk through of how to build a Gaussian naive Bayes classifier. Let me know what you think in the comment section BELOW.

Step 10: All the code above put together in a class

If we take everything above and put it into a class, this is what it would look like

import pandas as pd
import numpy as np
class gnb:
  def __init__(self, prior=None, n_class=None, 
               mean=None, variance = None, classes=None):
    # prior assumption of probability
    self.prior = prior
    # how many unique classes
    self.n_class = n_class
    # mean of x values
    self.mean = mean
    # variance of x values
    self.variance = variance
    # the unique classes present
    self.classes = classes

  # get the mean and variance of the x values
  def fit(self, x, y):
    # get the mean and variance of the x values
    self.x = x
    self.y = y
    self.mean = np.array(x.groupby(by=y).mean())
    self.variance = np.array(x.groupby(by=y).var())
    self.n_class = len(np.unique(y))
    self.classes = np.unique(y)
    self.prior = 1/self.n_class
    return self

  def mean_var(self):
    # mean and variance from the training data
    m = np.array(self.mean)
    v = np.array(self.variance)

    # pull and combine the corresponding mean and variance.
    # a local list is used here, because assigning to self.mean_var
    # would overwrite this method on the instance.
    pairs = []
    for i in range(len(m)):
      m_row = m[i]
      v_row = v[i]
      for a, b in enumerate(m_row):
        mean = b
        var = v_row[a]
        pairs.append([mean, var])

    return pairs

  def split(self):
    spt = np.vsplit(np.array(self.mean_var()), self.n_class)
    return spt

  def gnb_base(self, x_val, x_mean, x_var):
    # define the base formula for prediction probabilities
    # the x value that is being used for computation
    self.x_val = x_val
    # x mean value
    self.x_mean = x_mean
    # variance of the x value in question
    self.x_var = x_var

    # first part of the equation
    # 1 divided by the sqrt of 2 * pi * x_variance
    equation_1 = 1/(np.sqrt(2 * np.pi * x_var))

    # second part of equation implementation
    # denominator of equation
    denom = 2 * x_var

    # numerator calculation
    numerator = (x_val - x_mean) ** 2
    # the exponent
    expo = np.exp(-(numerator/denom))
    prob = equation_1 * expo

    return prob

  def predict(self, X):
    self.X = X
    # calculate the probabilities using base formula above

    # the mean and variance that have been split into
    # the various classes.
    split_class = self.split()
    prob = []
    for i in range(self.n_class):
      # mean_variance pairs for the current class
      class_one = split_class[i]
      for j in range(len(class_one)):
        # mean and variance of feature j for this class
        class_one_x_mean = class_one[j][0]
        class_one_x_var = class_one[j][1]
        x_value = X[j]
        # now calculate the probability of this feature for this class.
        prob.append([self.gnb_base(x_value, class_one_x_mean,
                                   class_one_x_var)])

    # everything below runs once, after the loop over all classes

    # turn prob into an array
    prob_array = np.array(prob)

    # split the probability into various classes again
    prob_split = np.vsplit(prob_array, self.n_class)

    # calculate the final probabilities
    final_probabilities = []
    for class_probs in prob_split:
      class_prob = np.prod(class_probs) * self.prior
      final_probabilities.append(class_prob)

    # determining the maximum probability
    maximum_prob = max(final_probabilities)

    # getting the index that corresponds to maximum probability
    prob_index = final_probabilities.index(maximum_prob)

    # using the index of the maximum probability to get
    # the class that corresponds to the maximum probability
    prediction = self.classes[prob_index]

    return prediction

And to demonstrate how this would work, we will use the dummy data created above.

import pandas as pd
import numpy as np

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180,190,170,165,100,150,130,150],
    'foot': [12,11,12,10,6,8,7,9]
})

y = pd.Series([0,0,0,0,1,1,1,1])

# to be used in sklearn due to how sklearn wants the data
sample = pd.DataFrame({
    'height': [6],
    'weight': [130],
    'foot': [8]
})

# same data as sample above, but as a plain list, which is the
# data structure our custom-built classifier expects.
x_test = [6, 130, 8]

Then this is the part where we instantiate the class we created above.

# use a different name for the instance so that the
# gnb class itself is not overwritten
model = gnb()

# fit the class to the x_values and target
model.fit(x_values, y)

# given these X values, let's do a prediction

model.predict([6, 130, 8])

Output: 1

When I run the same training and test data through sklearn's GaussianNB classifier, we also get “1” as the prediction.
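For anyone who wants to repeat that comparison, here is a minimal sketch using sklearn's GaussianNB with its default settings. One caveat, as far as I understand it: sklearn estimates the variance a little differently from pandas (population variance plus a tiny smoothing term), so the intermediate numbers can differ slightly even though the predicted class is the same here.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

x_values = pd.DataFrame({
    'height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'foot': [12, 11, 12, 10, 6, 8, 7, 9]
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# sklearn wants a 2D structure with the same columns as the training data
sample = pd.DataFrame({
    'height': [6],
    'weight': [130],
    'foot': [8]
})

clf = GaussianNB()
clf.fit(x_values, y)

sk_prediction = clf.predict(sample)
print(sk_prediction)
```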

Here is the link to the github repo for this project.

Let me know what you think in the comment section below.
