The Perceptron – Part I

In my last post, I discussed the simple yet useful Naive Bayes Classifier. In this post, we will cover the fundamental building block of an artificial neural network, called the perceptron. We will first cover the single-layer perceptron and then move on to the multilayer perceptron.

Single Layer Perceptron 

[Figure: a biological neuron and its perceptron analogue (courtesy Andrej Karpathy)]

As the above image shows (courtesy Andrej Karpathy), the perceptron takes its inspiration from a biological neuron. Without going into details of the analogy, let’s understand how the perceptron operates. There are three input signals above: x0, x1, x2. For every input, there is a corresponding weight: w0, w1, w2. The cell body combines all these signals and adds a bias b. Finally, we apply an activation function to the output to control when the perceptron should be activated. Hence, the entire perceptron computes f\left(\sum_i w_i x_i + b\right), where f is an activation function of our choice. There are several activation functions that can be used (check this blog) for different purposes. We will be using the sigmoid: f(z)=\dfrac{1}{1+\exp(-z)} for the sake of this post. As we know, the sigmoid is a non-linear function. However, note that without hidden layers we cannot perform non-linear classification, as illustrated here; we will be able to do so once we build a multilayer perceptron later on. For now, the sigmoid will act more like a squashing function that brings the output into our desired 0-1 range.
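To make the operation concrete, here is a minimal sketch of a single perceptron forward pass in plain NumPy (the inputs, weights, and bias below are made-up values for illustration):

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(x, w, b):
    # f(sum_i w_i * x_i + b) with f = sigmoid
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input signals x0, x1, x2
w = np.array([0.4, 0.7, -0.2])   # corresponding weights w0, w1, w2
b = 0.1                          # bias
print(perceptron_output(x, w, b))  # a value between 0 and 1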

Loss Function and Gradient Descent

Our goal is to minimise the error, or loss, between the target and predicted classes. There are several ways to capture this loss, as summarised here, each with its own trade-offs. We will be using the commonly used MSE (Mean Squared Error) or L2 loss. As the name suggests, MSE is given by error=\dfrac{1}{n} \sum_{i=1}^{n} (y_{i_{actual}}-y_{i_{predicted}})^2 where n is the number of samples in the dataset.
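As a quick sanity check, this is the same computation in NumPy (with made-up target and prediction vectors):

import numpy as np

def mse(y_actual, y_predicted):
    # Mean of squared differences over the n samples
    return np.mean((y_actual - y_predicted) ** 2)

y_actual = np.array([1.0, 0.0, 1.0])
y_predicted = np.array([0.9, 0.2, 0.6])
print(mse(y_actual, y_predicted))  # 0.07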

Let’s understand how we can minimise the above loss function. One common way to perform this optimisation is using Gradient Descent. There are different variations of Gradient Descent as this paper describes. We will stick to the basic form of Gradient Descent, which gives us the update rule:

W=W - \alpha \dfrac{\partial error}{\partial W}

Inserting the value of the error term above and differentiating, we get:

\dfrac{\partial\, error}{\partial W} = -\dfrac{2}{n} \sum_i (y_{i_{actual}}-y_{i_{predicted}}) \dfrac{\partial y_{i_{predicted}}}{\partial W}

Here y_{i_{predicted}} is nothing but our perceptron output, which is a function of our input X. Therefore, the update rule reduces to simply:

W = W + \alpha \sum_i (y_{i_{actual}}-y_{i_{predicted}})\, x_i

Note that we have safely ignored the constant from above. Also note the change in sign: the negative sign from the derivative turns the subtraction in the update rule into an addition.
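In code, one pass of this update looks roughly like the following NumPy sketch (names and the vectorised form are illustrative; the TensorFlow implementation below follows the same logic):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y_actual, W, b, alpha=0.01):
    # Forward pass: perceptron output for every sample
    y_predicted = sigmoid(X.dot(W) + b)
    err = y_actual - y_predicted
    # Update rule from above: W = W + alpha * sum_i (y_actual_i - y_predicted_i) * x_i
    W = W + alpha * X.T.dot(err)
    b = b + alpha * err.sum()
    return W, b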

Implementation

I recommend reading this blog, which is a good guide for implementing a perceptron. We will use a modified version of the same for our continued example of the toy Kaggle problem. We will be using TensorFlow for our implementation; however, it goes without saying that the above constructs remain the same irrespective of the tool used. The code below is also available here.

In [1]:
# We are going to use Kaggle's playground data as an example. 
# The data files can be downloaded from https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo/data
In [2]:
import tensorflow as tf
import pandas as pd
In [3]:
# Load the training data. To simplify this example we will be ignoring the color feature from the data.
In [4]:
training_data=pd.read_csv("../Data/GGG/train.csv")
training_data=training_data[['bone_length','rotting_flesh','hair_length','has_soul','type']]
training_data.head(n=5)
Out[4]:
bone_length rotting_flesh hair_length has_soul type
0 0.354512 0.350839 0.465761 0.781142 Ghoul
1 0.575560 0.425868 0.531401 0.439899 Goblin
2 0.467875 0.354330 0.811616 0.791225 Ghoul
3 0.776652 0.508723 0.636766 0.884464 Ghoul
4 0.566117 0.875862 0.418594 0.636438 Ghost
In [5]:
# Load the test data and ignore color feature
In [6]:
test_data=pd.read_csv("../Data/GGG/test.csv")
test_data=test_data[['id','bone_length','rotting_flesh','hair_length','has_soul']]
test_data.head(n=5)
Out[6]:
id bone_length rotting_flesh hair_length has_soul
0 3 0.471774 0.387937 0.706087 0.698537
1 6 0.427332 0.645024 0.565558 0.451462
2 9 0.549602 0.491931 0.660387 0.449809
3 10 0.638095 0.682867 0.471409 0.356924
4 13 0.361762 0.583997 0.377256 0.276364
In [7]:
# We are going to use preprocessing module from sklearn, which is simple to work with
from sklearn import preprocessing
import numpy as np
In [8]:
# Separate the features and target
x=training_data.drop('type',axis=1).values
(n,num_features)=x.shape
y=training_data['type'].values
In [9]:
# Since we have three categorical labels, we will use LabelEncoder and OneHotEncoder to get into proper format
le = preprocessing.LabelEncoder()
y=le.fit_transform(y)
onehot=preprocessing.OneHotEncoder(sparse=False)
y=y.reshape(n,1)
y=onehot.fit_transform(y)
In [10]:
# Validate the shape of target
y.shape
Out[10]:
(371, 3)
In [11]:
# Create placeholders in tensorflow
X=tf.placeholder(tf.float32,shape=[None,num_features])
Y=tf.placeholder(tf.float32,shape=[None,3])
# Create variables, note the shape 
W = tf.Variable(tf.zeros([num_features, 3]), tf.float32)
B = tf.Variable(tf.zeros([1, 3]), tf.float32)
In [12]:
# Set a learning rate (alpha)
learning_rate=0.01
In [13]:
# This is the core weight update logic. Note that we are using softmax since we have three possible labels
Y_pred= tf.nn.softmax(tf.add(tf.matmul(X,W),B))
err=Y - tf.to_float(Y_pred)
deltaW = tf.matmul(tf.transpose(X), err) 
deltaB = tf.reduce_sum(err, 0) 
W_ = W + learning_rate * deltaW
B_ = B + learning_rate * deltaB
step = tf.group(W.assign(W_), B.assign(B_)) 
In [14]:
# Train the perceptron
num_iter=10000
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
 
for k in range(num_iter):
    sess.run(step, feed_dict={X: x, Y: y})
    
W = sess.run(W)
b = sess.run(B)
In [15]:
# Predict for test set
ids=test_data['id'].values
x_test=test_data.drop('id',axis=1).values
(n,num_features)=x_test.shape
X_test = tf.placeholder(tf.float32,shape=[None,num_features])
preds  = sess.run(tf.argmax(tf.nn.softmax(tf.add(tf.matmul(X_test,W),b)),axis=1),feed_dict={X_test:x_test})
# Get the actual type back from LabelEncoder
preds_trans=le.inverse_transform(preds)
In [16]:
# Write result to dataframe in required format
result=pd.DataFrame(ids.reshape(len(ids),1))
result['type']=preds_trans.reshape(len(preds_trans),1)
result.columns=['id','type']
result.to_csv('../Data/GGG/perceptron_ggg.csv',index=False)
# Submitting the above file to Kaggle gave a score of 0.74291, similar to our score from Naive Bayes from last post

In the next part, we shall extend this construct by adding more layers to the perceptron.


The Naive Bayes Classifier

“Like all magnificent things, it’s very simple.”  ― Natalie Babbitt

In my last post, I described the basic framework that is used to derive most of the ML algorithms. In this post, I’m going to dive deeper into that idea with the concrete example of Naive Bayes Classifier.

But before that, let’s be clear about what we are trying to achieve. We want to build a mathematical model of a given data set. Why would we want to do that? Well, if we have a model that explains the observed data, then we can use the same model to predict unobserved outcomes as well. You could have various models that “fit” (explain) the data; however, their “accuracies”, or correctness of prediction, can vary. Therefore, every model can be viewed as a hypothesis that gets evaluated by its ability to predict.

Describing the Data

Let’s assume we have a dataset D with a few variables (x1,x2,x3) where x3 is the class label. In order to distinguish this target variable, we will call it Y. Now, let X be the set of variables (x1,x2) which are nothing but the features that describe the data.

Deriving the Model

First things first. Naive Bayes is a generative model. This simply means that we will be calculating the joint distribution p(X,Y). Consider the following equation for Bayes’ theorem:

p(Y|X)=\dfrac{p(X,Y)}{p(X)}

The denominator is effectively a constant as it does not depend on Y. Therefore, maximising p(Y|X) is equivalent to maximising p(X,Y). Why should we do this? Well, if we know how to calculate p(Y|X), we can predict the class with the maximum probability, and that will be our final predicted class.

Expanding the numerator:

p(X,Y)=p(x1,x2|Y).p(Y)

Naive Bayes assumes that the features are conditionally independent. This means that

p(x1,x2|Y)=p(x1|Y).p(x2|Y)

Therefore,

p(X,Y)=p(x1|Y).p(x2|Y).p(Y)

Generalising the above to n features,

p(X,Y)=p(x1|Y).p(x2|Y)...p(xn|Y).p(Y)

That’s it. Now that we know how to get the joint distribution, all we need to do is calculate p(X,Y) for every possible value of Y and choose the one with the highest likelihood.
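As a rough sketch of this prediction step (the prior and likelihood arguments below are hypothetical stand-ins; estimating them from data is shown in the implementation further down):

def predict_label(x, labels, prior, likelihood):
    # prior[label] = p(Y=label); likelihood(value, i, label) = p(x_i = value | Y=label)
    scores = {}
    for label in labels:
        p = prior[label]
        for i, value in enumerate(x):
            p *= likelihood(value, i, label)   # naive conditional independence assumption
        scores[label] = p                      # proportional to the joint p(X, Y)
    # Choose the label with the highest joint probability
    return max(scores, key=scores.get)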

Practically, Naive Bayes is often used as a baseline approach, and in spite of its crude feature-independence assumption it usually gives fairly decent accuracy.

Now let’s implement this model with a fun example!

The code is also available here.

# We are going to use Kaggle's playground data as an example. 
# The data files can be downloaded from https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo/data
import pandas as pd
import scipy.stats as stats
# Load the training data. To simplify this example we will be ignoring the color feature from the data.
training_data=pd.read_csv("train.csv")
training_data=training_data[['bone_length','rotting_flesh','hair_length','has_soul','type']]
training_data.head(n=5)
bone_length rotting_flesh hair_length has_soul type
0 0.354512 0.350839 0.465761 0.781142 Ghoul
1 0.575560 0.425868 0.531401 0.439899 Goblin
2 0.467875 0.354330 0.811616 0.791225 Ghoul
3 0.776652 0.508723 0.636766 0.884464 Ghoul
4 0.566117 0.875862 0.418594 0.636438 Ghost
# Load the test data and ignore color feature
test_data=pd.read_csv("test.csv")
test_data=test_data[['id','bone_length','rotting_flesh','hair_length','has_soul']]
test_data.head(n=5)
id bone_length rotting_flesh hair_length has_soul
0 3 0.471774 0.387937 0.706087 0.698537
1 6 0.427332 0.645024 0.565558 0.451462
2 9 0.549602 0.491931 0.660387 0.449809
3 10 0.638095 0.682867 0.471409 0.356924
4 13 0.361762 0.583997 0.377256 0.276364
# Now we write a function to generate p(X|Y) for our Naive Bayes model 
# Note that we are using a Normal Distribution as the features are continuous
# The parameters (mean and standard deviation) of this distribution are 
# estimated from the training data
def p_x_y(test_x,train_series_given_y,features):
    mean=train_series_given_y.mean()
    std=train_series_given_y.std()
    p_x_y=[stats.norm.pdf(test_x[f],mean[f],std[f]) for f in features]
    p=1.0
    for l in p_x_y:
        p=p*l   
    return p
# Calculate p_x_y for every label for every test data
features=['bone_length','rotting_flesh','hair_length','has_soul']
index_probs=pd.DataFrame(columns=['index','label','num_prob'])
i=0
for index,row in test_data.iterrows():
    for label in ['Ghoul','Goblin','Ghost']:
        p=p_x_y(row[features],training_data[training_data['type']==label],features)  
        index_probs.loc[i]=[row['id'],label,p]
        i+=1
# For each id, choose label with max p_x_y
max_prob=index_probs.groupby('index').max()['num_prob'].reset_index()
final=index_probs.merge(max_prob)
final=final[['index','label']]
final.columns=['id','type']
final['id']=final['id'].astype(int)
final.head(n=10)
id type
0 3 Ghoul
1 6 Goblin
2 9 Ghoul
3 10 Ghost
4 13 Ghost
5 14 Ghost
6 15 Ghoul
7 16 Ghoul
8 17 Goblin
9 18 Ghoul

Posting this solution on Kaggle resulted in a score of 0.7429, which is not bad for a model as simple as this. I encourage you to keep improving this score further using feature engineering or by using other models that we will see in future posts.

Hope this post was helpful in getting you started with a simple baseline model. In the next posts, we shall look at models that do not assume feature independence and are hence more complex in nature.

The Basic Machine Learning Framework

“Education is what remains after you have forgotten everything you learned in school.” – Albert Einstein

There are many, many Machine Learning (ML) algorithms out there, and it can be quite intimidating for a beginner to decide where to start. The key is to realise that though there are several techniques, they are all governed by the same fundamental framework. Many folks jump the gun and, instead of focussing on the basics, move directly to advanced topics, resulting in a weak foundation and understanding. I have seen folks applying advanced models to data without really understanding what the model is doing underneath. They tend to think of it as a black box: feed it data and get the output. Unfortunately, this is not how ML works in practice.

Real data is incredibly messy, and nothing like what they show in textbooks. To make ML algorithms work with real data, it is important to understand how they work, what tweaks are needed, and finally how to interpret the results. And for such an understanding, the basic framework of ML is critical. Once you understand the basics, it is easy to pick up any new algorithm quickly. Even if you don’t remember all the nitty-gritty details of an algorithm’s derivation, knowing the fundamental principles and assumptions behind it is enough to leverage it for your use case. In terms of the quote you read at the start of this post, the basic ML framework is what will stay with you even if you forget the gory mathematical details of several ML algorithms.

The Framework

First of all, let us understand that behind all the glittering posters of ML, there is Mathematics. You cannot run away from the math if you want to be a good Data Scientist. However, this does not mean you drop everything, pick up several textbooks, and dedicate a year to studying math. All we need to do is brush up on the relevant concepts from Linear Algebra, Statistics, and Calculus as and when they come up in different topics. I will cover the relevant topics of math in this blog, as required.

Second, it is important to understand the broad framework or taxonomy of all ML algorithms. I have drawn a very basic framework here.

[Figure: a basic framework/taxonomy of ML algorithms]

 

You don’t need to understand all of it in one go. Also, you will see certain models like HMM and LDA appearing in both Supervised and Unsupervised Learning. This is because they are generic algorithms and can be used in both settings. More on this in later posts when we dive deeper into the models. For now, we should understand, at a high level, what each of the boxes in the above diagram represents. For that, let’s take an example.

Let’s say you have historical data about customers from an online store. So each “sample” or data point is a customer. For every customer, you know certain attributes or “features” that describe them well. The data header might look something like the following:

customer_id, number_transactions, total_profit, number_returns, total_revenue

Unsupervised Learning: 

In the above example, let’s say we wanted to create segments of similar customers. There would be two ways to go about it. One is to make up certain heuristics (total_revenue>10000 and total_profit>1000 goes to segment 5, etc.). This approach usually works if you have certain hard rules for the segments. However, many times we don’t know what the rules should be, and there could be multiple ways of grouping the same customers. In such cases, unsupervised approaches like Clustering help. At a high level, a clustering algorithm like K-means will group the customers into n clusters, each cluster containing customers with similar features, as sketched below. The value of n is usually defined by business needs or chosen using heuristics.
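A minimal sketch of this with scikit-learn might look as follows (the file name and columns are the hypothetical ones from the header above):

import pandas as pd
from sklearn.cluster import KMeans

customers = pd.read_csv("customers.csv")  # hypothetical customer data
features = customers[['number_transactions', 'total_profit', 'number_returns', 'total_revenue']]

# Group customers into n=5 segments based on feature similarity
kmeans = KMeans(n_clusters=5, random_state=0)
customers['segment'] = kmeans.fit_predict(features)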

Supervised Learning:

In the same example, let’s say we had an additional “label” called is_fraud that marks whether a particular customer is fraudulent or not (0/1). These labels could come from a human or some other input (hence supervised). Now we want to build an intelligent system that analyses the features and predicts whether a customer is likely to be fraudulent. For such scenarios, supervised learning algorithms are widely used. The problem I just described is called classification, where you divide the data into fixed, predefined classes (labels). Now, let’s say that instead of a discrete variable (label), we had the amount of fraud that can be attributed to the customer. In this case, regression would be the right approach. Mathematically, if our target variable is discrete we use classification, and if it is continuous we use regression. A minimal classification sketch is shown below.
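Continuing the hypothetical example, a simple supervised classifier in scikit-learn might look like this (the file name, columns, and model choice are illustrative, not a recommendation):

import pandas as pd
from sklearn.linear_model import LogisticRegression

customers = pd.read_csv("customers_labeled.csv")   # hypothetical labelled data
X = customers[['number_transactions', 'total_profit', 'number_returns', 'total_revenue']]
y = customers['is_fraud']                          # 0/1 label provided by a human or another system

clf = LogisticRegression()
clf.fit(X, y)
predictions = clf.predict(X)   # in practice, evaluate on held-out data, not the training set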

Within classification, we have two broad types of models – generative and discriminative. Explaining these two is slightly more involved and we shall go over them in a future post. For now, think of generative models as those that can actually create an artificial sample of data (apart from being able to classify the samples), whereas discriminative models only aim to solve the problem of classification. Now you might think, what is the need to have discriminative models at all? This is because in several experiments, discriminative models (like SVM) have outperformed generative models for the task of classification.

Now that we know the basic framework of ML, it is important to analyse every model and algorithm that we know in its context. This will give a solid foundation to the way we think about and approach a real Data Science problem.

Data Processing – The Modern Way

“If I had eight hours to cut down a tree, I’d spend six hours sharpening the axe”

– Abraham Lincoln

We have come a long way from the times when digital data was sparse and inaccessible. Today, the amount of data added to the Internet is staggering. In 2014, YouTube users alone uploaded 72 hours of new video every single minute! At Myntra itself, millions of data points, including many kinds of user interactions, are produced every single day.

You can’t go to war with a knife in hand. You need the right tools for the right job. Without a doubt, modern data crunching technologies are required to help Data Scientists mine insights. Only, there is a bit of a problem. Even the Big Data space has become crowded and is exploding with newer tools, each one claiming to be superior to the earlier ones. Fortunately, we can be selective here and choose the weapons of our liking. I will briefly share my experience with a couple of platforms that I have recently experimented with: Dato and Apache Spark. Both of these promising technologies offer built-in Machine Learning capabilities, and more.

Turi (https://turi.com):  Earlier known as Dato/GraphLab, Turi is a scalable Machine Learning platform. It is a good option if you want to get started quickly and are willing to lose out on some flexibility. It is intuitive and straightforward to use, especially with Python. The layer of abstraction which I found to be most useful is the SFrame. SFrame can be thought of as a scaled version of the Pandas DataFrame, and is extremely powerful with joins and several other data transformations. It supports loading data from various formats including the DataFrame, Python dictionary, and JSON. However, be advised that Turi is a commercial offering. You can try out their trial version and check if it’s something you can effectively work with.

Apache Spark (http://spark.apache.org):  If I had to put my time and effort into one platform, this would be it. Yes, it comes with an initial learning curve and you would need to understand the architecture before achieving anything useful with it. But in my opinion, it’s worth it, for several reasons. For starters, it’s open source and has an active community. There are drivers providing seamless integration with many other technologies, including Amazon S3 and Apache Cassandra, which is where much of the existing data would lie for many folks. Most importantly, Spark provides a layer of abstraction called the RDD (Resilient Distributed Dataset), which rivals Turi’s SFrame in some sense. In fact, Turi allows you to load data into an SFrame from an existing RDD. The RDD abstraction allows several fundamental operations for processing data, as the small sketch below shows. Spark also provides data structures like LabeledPoint, which can help you leverage MLlib, the machine learning library for Spark.
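For a flavour of the RDD and LabeledPoint abstractions, here is a minimal PySpark sketch (it assumes an existing SparkContext named sc and a hypothetical CSV whose first column is the label):

from pyspark.mllib.regression import LabeledPoint

lines = sc.textFile("customers.csv")          # an RDD of raw text lines

def parse(line):
    parts = [float(p) for p in line.split(",")]
    return LabeledPoint(parts[0], parts[1:])  # label followed by the feature vector

points = lines.map(parse).cache()   # transformations are lazy; cache keeps parsed data in memory
print(points.count())               # an action like count() triggers the actual computation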

Spark Streaming is an extension that enables you to consume streaming data. At Myntra, we have leveraged Spark Streaming to consume real time user interactions. We were able to process and aggregate these interactions in near real time, thus engineering the features for our predictive model. The output of the entire process was easily stored into Cassandra. Thus, a single platform enabled us to build the feature engineering pipeline, build offline predictive models, as well as score new data coming in, near real time.

To conclude, as data keeps piling up every single minute, exploring technologies like Dato and Spark might soon be vital to stay afloat in the Big Data ocean.

What is Data Science?

Good question! In simple terms, Data Science is everything that can help solve problems using data. It is an interdisciplinary field requiring skills from various disciplines like Software Engineering, Data Mining, and Machine Learning. Depending on the job description, the weight given to one or more of these disciplines changes. Why is that? Each institution has its own requirements and, depending on its goals, gives more emphasis to one area over another. A research establishment might give more weight to developing core Machine Learning algorithms, while an industrial setting would tend to emphasise writing production-level code with an in-depth knowledge of state-of-the-art Machine Learning algorithms.

While everyone is making up their own definitions of Data Science, let me make up my own for the purpose of this blog. Data Science is a field that spans the complete pipeline: right from gathering and processing the raw data, to engineering features from this data, building predictive models, and finally powering business use-cases with the obtained models. That seems quite a task! While we are expected to have skills across the entire pipeline, very few would actually build deep expertise in every stage of it. Data Scientists usually develop deep expertise in one or more stages of this pipeline and collaborate with others who have expertise in the other stages. For example, you could build expertise in Feature Engineering/Machine Learning and collaborate with Data Engineers to build ETLs for the data. However, this might vary depending on the size of the organisation. In a startup setting, one is required to wear multiple hats, and it may not always be feasible to rely on a Data Engineer to give you precious data, which is the very backbone of this profession. So, it always helps to know a bit of everything, while building deep expertise in any one stage of the pipeline. For most Data Scientists that stage is usually Machine Learning, and it will be the primary focus of this blog, though at times I will write about relevant techniques for data processing as well.

Broadly, Data Science problems can be classified into three buckets –

Descriptive Analysis: This is also referred to as “backward looking” analysis. In this type of problem, we mainly try to understand historical data, or events that have already occurred. Example: Who are my best customers historically?

Predictive Analysis: This “forward looking” analysis is used to predict events in the future. Example: Who will be my best customers in the future?

Prescriptive Analysis: This is the most advanced form of analysis where we expect our algorithms to prescribe actions that would fix a problem. Example: What steps should I take to avoid customer churn?

In this blog, I shall discuss problems in each of these buckets and the techniques that are used to solve them.