The Perceptron – Part I

In my last post, I discussed the Naive Bayes classifier, a simple yet useful construct. In this post, we will cover the fundamental building block of an artificial neural network: the perceptron. We will first cover the single-layer perceptron and then move on to the multilayer perceptron.

Single Layer Perceptron 


As the above image shows (courtesy Andrej Karpathy), the perceptron takes its inspiration from a biological neuron. Without going into details of the analogy, let’s understand how the perceptron operates. There are three input signals above: x_0, x_1, x_2. For each input, there is a corresponding weight: w_0, w_1, w_2. The cell body essentially combines all these signals and adds a bias b. Finally, we apply an activation function to the output to control when the perceptron should be activated. Hence, the entire perceptron performs the operation f\left(\sum_i w_i x_i + b\right), where f is an activation function of our choice. There are several activation functions that can be used (check this blog) for different purposes; for the sake of this post, we will be using the sigmoid: f(z)=\dfrac{1}{1+\exp(-z)}. Although the sigmoid is a non-linear function, note that without hidden layers we cannot perform non-linear classification, as illustrated here. We will be able to perform non-linear classification once we build a multilayer perceptron later on. For now, the sigmoid acts more like a squashing function that brings the output into our desired 0–1 range.

Loss Function and Gradient Descent

Our goal is to minimise the error, or loss, between the target and predicted classes. There are several ways to capture this loss, as summarised here, each with its own trade-offs. We will be using the commonly used MSE (Mean Squared Error), or L2 loss. As the name suggests, MSE is given by error=\dfrac{1}{n} \sum_{i=1}^{n} (y_{i_{actual}}-y_{i_{predicted}})^2, where n is the number of samples in the dataset.
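To make the formula concrete, here is the MSE computed on a few illustrative values (not data from the actual problem):

```python
import numpy as np

# Illustrative actual and predicted values for n = 4 samples
y_actual = np.array([1.0, 0.0, 1.0, 1.0])
y_predicted = np.array([0.9, 0.2, 0.8, 0.6])

# error = (1/n) * sum_i (y_actual_i - y_predicted_i)^2
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)  # 0.0625
```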

Let’s understand how we can minimise the above loss function. One common way to perform this optimisation is using Gradient Descent. There are different variations of Gradient Descent as this paper describes. We will stick to the basic form of Gradient Descent, which gives us the update rule:

W=W - \alpha \dfrac{\partial error}{\partial W}

Inserting the value of the error term above and differentiating, we get:

\dfrac{\partial\, error}{\partial W} = -\dfrac{2}{n} \sum_{i=1}^{n} (y_{i_{actual}}-y_{i_{predicted}}) \dfrac{\partial y_{i_{predicted}}}{\partial W}

Here y_{i_{predicted}} is nothing but our perceptron output, which is a function of our input X. Therefore, the update rule reduces to simply:

W=W + \alpha \sum_{i=1}^{n} (y_{i_{actual}}-y_{i_{predicted}})\, x_i

Note that we have safely ignored the constant \dfrac{2}{n} from above. Also note the change in sign due to the derivative.


I recommend reading this blog, which is a good guide to implementing a perceptron. We will use a modified version of the same for our continued example of the toy Kaggle problem. We will be using TensorFlow for our implementation; it goes without saying, however, that the above constructs remain the same irrespective of the tool used. The code below is also available here.

In [1]:
# We are going to use Kaggle's playground data as an example. 
# The data files can be downloaded from
In [2]:
import tensorflow as tf
import pandas as pd
In [3]:
# Load the training data. To simplify this example we will be ignoring the color feature from the data.
In [4]:
bone_length rotting_flesh hair_length has_soul type
0 0.354512 0.350839 0.465761 0.781142 Ghoul
1 0.575560 0.425868 0.531401 0.439899 Goblin
2 0.467875 0.354330 0.811616 0.791225 Ghoul
3 0.776652 0.508723 0.636766 0.884464 Ghoul
4 0.566117 0.875862 0.418594 0.636438 Ghost
In [5]:
# Load the test data and ignore color feature
In [6]:
id bone_length rotting_flesh hair_length has_soul
0 3 0.471774 0.387937 0.706087 0.698537
1 6 0.427332 0.645024 0.565558 0.451462
2 9 0.549602 0.491931 0.660387 0.449809
3 10 0.638095 0.682867 0.471409 0.356924
4 13 0.361762 0.583997 0.377256 0.276364
In [7]:
# We are going to use preprocessing module from sklearn, which is simple to work with
from sklearn import preprocessing
import numpy as np
In [8]:
# Separate the features and target
In [9]:
# Since we have three categorical labels, we will use LabelEncoder and OneHotEncoder to get into proper format
le = preprocessing.LabelEncoder()
In [10]:
# Validate the shape of target
(371, 3)
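As a small, self-contained sketch of the label-encoding step (the labels below are illustrative, not the full dataset; the shape check mirrors the `(371, 3)` output above):

```python
from sklearn import preprocessing
import numpy as np

le = preprocessing.LabelEncoder()
labels = ['Ghoul', 'Goblin', 'Ghoul', 'Ghost']   # illustrative labels
int_labels = le.fit_transform(labels)            # classes are sorted: Ghost=0, Ghoul=1, Goblin=2
print(int_labels)                                # [1 2 1 0]

# One-hot encode the integer labels: one column per class
ohe = preprocessing.OneHotEncoder()
y_onehot = ohe.fit_transform(int_labels.reshape(-1, 1)).toarray()
print(y_onehot.shape)                            # (4, 3)

# inverse_transform recovers the original string labels later, at prediction time
print(le.inverse_transform(int_labels))
```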
In [11]:
# Create placeholders in tensorflow
X = tf.placeholder(tf.float32, shape=[None, num_features])
Y = tf.placeholder(tf.float32, shape=[None, 3])
# Create variables, note the shape
W = tf.Variable(tf.zeros([num_features, 3]), tf.float32)
B = tf.Variable(tf.zeros([1, 3]), tf.float32)
In [12]:
# Set a learning rate (alpha)
In [13]:
# This is the core weight update logic. Note that we are using softmax since we have three possible labels
Y_pred = tf.nn.softmax(tf.add(tf.matmul(X, W), B))
err = Y - tf.to_float(Y_pred)
deltaW = tf.matmul(tf.transpose(X), err)
deltaB = tf.reduce_sum(err, 0)
W_ = W + learning_rate * deltaW
B_ = B + learning_rate * deltaB
step = tf.group(W.assign(W_), B.assign(B_))
In [14]:
# Train the perceptron
sess = tf.Session()
init = tf.global_variables_initializer()
for k in range(num_iter):, feed_dict={X: x, Y: y})
W =
b =
In [15]:
# Predict for test set
X_test = tf.placeholder(tf.float32, shape=[None, num_features])
pred = tf.argmax(tf.nn.softmax(tf.add(tf.matmul(X_test, W), b)), axis=1)
preds =, feed_dict={X_test: x_test})
# Get the actual type back from LabelEncoder
test_labels = le.inverse_transform(preds)
In [16]:
# Write result to dataframe in required format
# Submitting the above file to Kaggle gave a score of 0.74291, similar to our score from Naive Bayes from last post

In the next part, we shall extend this construct by adding more layers to the perceptron.