The Naive Bayes Classifier

“Like all magnificent things, it’s very simple.”  ― Natalie Babbitt

In my last post, I described the basic framework used to derive most ML algorithms. In this post, I’m going to dive deeper into that idea with a concrete example: the Naive Bayes classifier.

But before that, let’s be clear about what we are trying to achieve. We want to build a mathematical model of a given dataset. Why would we want to do that? Well, if we have a model that explains the observed data, then we can use the same model to predict unobserved outcomes as well. Many different models could “fit” (explain) the data, but their accuracies, i.e. the correctness of their predictions, can vary. Therefore, every model can be viewed as a hypothesis that gets evaluated by its ability to predict.

Describing the Data

Let’s assume we have a dataset D with three variables (x1, x2, x3), where x3 is the class label. To distinguish this target variable, we will call it Y. Now let X be the set of remaining variables (x1, x2), which are simply the features that describe the data.

Deriving the Model

First things first: Naive Bayes is a generative model. This simply means that we will be modelling the joint distribution p(X,Y). Consider the following form of Bayes’ theorem:

p(Y|X)=\dfrac{p(X,Y)}{p(X)}

The denominator is effectively a constant, as it does not depend on Y. Therefore, maximising p(Y|X) over Y is equivalent to maximising p(X,Y) over Y. Why do we care? Well, if we know how to calculate p(Y|X), we can pick the class with the maximum probability, and that becomes our final prediction.
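
In other words, if we write \hat{Y} for the predicted class, the whole decision rule can be summarised as:

\hat{Y}=\arg\max_{Y}\ p(Y|X)=\arg\max_{Y}\ \dfrac{p(X,Y)}{p(X)}=\arg\max_{Y}\ p(X,Y)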

Expanding the numerator using the product rule of probability:

p(X,Y)=p(x1,x2|Y)\cdot p(Y)

Naive Bayes assumes that the features are conditionally independent given the class: once we know Y, knowing x1 tells us nothing more about x2. This means that

p(x1,x2|Y)=p(x1|Y)\cdot p(x2|Y)

Therefore,

p(X,Y)=p(x1|Y)\cdot p(x2|Y)\cdot p(Y)

Generalising the above to n features,

p(X,Y)=p(x1|Y)\cdot p(x2|Y)\cdots p(xn|Y)\cdot p(Y)

That’s it. Now that we know how to compute the joint distribution, all we need to do is calculate p(X,Y) for every possible value of Y and choose the value with the highest likelihood.
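
In symbols, using \hat{Y} for the predicted class as before, the final decision rule is:

\hat{Y}=\arg\max_{Y}\ p(x1|Y)\cdot p(x2|Y)\cdots p(xn|Y)\cdot p(Y)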

In practice, Naive Bayes is often used as a baseline approach, and in spite of its crude feature-independence assumption it usually gives fairly decent accuracy.

Now let’s implement this model with a fun example!

The code is also available here.

# We are going to use Kaggle's playground data as an example. 
# The data files can be downloaded from https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo/data
import pandas as pd
import scipy.stats as stats
# Load the training data. To simplify this example, we will ignore the 'color' feature.
training_data=pd.read_csv("train.csv")
training_data=training_data[['bone_length','rotting_flesh','hair_length','has_soul','type']]
training_data.head(n=5)
bone_length rotting_flesh hair_length has_soul type
0 0.354512 0.350839 0.465761 0.781142 Ghoul
1 0.575560 0.425868 0.531401 0.439899 Goblin
2 0.467875 0.354330 0.811616 0.791225 Ghoul
3 0.776652 0.508723 0.636766 0.884464 Ghoul
4 0.566117 0.875862 0.418594 0.636438 Ghost
# Load the test data and ignore color feature
test_data=pd.read_csv("test.csv")
test_data=test_data[['id','bone_length','rotting_flesh','hair_length','has_soul']]
test_data.head(n=5)
id bone_length rotting_flesh hair_length has_soul
0 3 0.471774 0.387937 0.706087 0.698537
1 6 0.427332 0.645024 0.565558 0.451462
2 9 0.549602 0.491931 0.660387 0.449809
3 10 0.638095 0.682867 0.471409 0.356924
4 13 0.361762 0.583997 0.377256 0.276364
# Now we write a function to generate p(X|Y) for our Naive Bayes model 
# Note that we are using a Normal Distribution as the features are continuous
# The parameters (mean and standard deviation) of this distribution are 
# estimated from the training data
def p_x_y(test_x,train_given_y,features):
    # Estimate each feature's mean and standard deviation from the
    # training rows of the given class (numeric feature columns only)
    mean=train_given_y[features].mean()
    std=train_given_y[features].std()
    # p(X|Y) is the product of the per-feature Normal densities p(xi|Y)
    densities=[stats.norm.pdf(test_x[f],mean[f],std[f]) for f in features]
    p=1.0
    for d in densities:
        p=p*d
    return p
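# Aside: multiplying many small densities can underflow to zero in floating
# point. A common alternative (just a sketch; not used in the rest of this
# post) is to add log-densities instead, which leaves the argmax unchanged.
def log_p_x_y(test_x,train_given_y,features):
    mean=train_given_y[features].mean()
    std=train_given_y[features].std()
    return sum(stats.norm.logpdf(test_x[f],mean[f],std[f]) for f in features)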
# Calculate p(X|Y) for every label for each test row
# Note: we only compute the likelihood p(X|Y) and skip the prior p(Y),
# which amounts to assuming the three labels are equally likely a priori
features=['bone_length','rotting_flesh','hair_length','has_soul']
index_probs=pd.DataFrame(columns=['index','label','num_prob'])
i=0
for index,row in test_data.iterrows():
    for label in ['Ghoul','Goblin','Ghost']:
        p=p_x_y(row[features],training_data[training_data['type']==label],features)  
        index_probs.loc[i]=[row['id'],label,p]
        i+=1
# For each id, keep the label with the maximum p(X|Y)
# (merging on 'index' and 'num_prob' keeps only the row whose probability
# equals the per-id maximum, i.e. an argmax via merge)
max_prob=index_probs.groupby('index')['num_prob'].max().reset_index()
final=index_probs.merge(max_prob)
final=final[['index','label']]
final.columns=['id','type']
final['id']=final['id'].astype(int)
final.head(n=10)
id type
0 3 Ghoul
1 6 Goblin
2 9 Ghoul
3 10 Ghost
4 13 Ghost
5 14 Ghost
6 15 Ghoul
7 16 Ghoul
8 17 Goblin
9 18 Ghoul
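
Before submitting, a quick sanity check is to run the same procedure on the training rows and see how often the predicted type matches the true one. The sketch below simply reuses the p_x_y function from above; keep in mind that the resulting accuracy is optimistic, since the distribution parameters were estimated from these very rows.

# Rough sanity check: accuracy on the training data itself (optimistic,
# because the parameters were fit on these same rows)
correct=0
for _,row in training_data.iterrows():
    probs={label:p_x_y(row[features],training_data[training_data['type']==label],features)
           for label in ['Ghoul','Goblin','Ghost']}
    if max(probs,key=probs.get)==row['type']:
        correct+=1
print(correct/len(training_data))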

Posting this solution on Kaggle resulted in a score of 0.7429, which is not bad for a model this simple. I encourage you to keep improving this score, either through feature engineering or by using the other models that we will see in future posts.

I hope this post was helpful in getting you started with a simple baseline model. In the next posts, we shall look at models that do not assume feature independence and are therefore more complex.