The Basic Machine Learning Framework

“Education is what remains after you have forgotten everything you learned in school.” – Albert Einstein

There are many, many Machine Learning (ML) algorithms out there, and for a beginner it can be quite intimidating to figure out where to start. The key is to realise that though there are several techniques, they are all governed by the same fundamental framework. Many folks jump the gun and, instead of focusing on the basics, move directly to advanced topics, ending up with a weak foundation and understanding. I have seen folks apply advanced models to data without really understanding what is happening underneath. They tend to treat the model as a black box: feed it data and get the output. Unfortunately, this is not how ML works in practice.

Real data is incredibly messy, and nothing like what they show in textbooks. To make ML algorithms work with real data, it is important to understand how they work, what tweaks are needed, and finally how to interpret the results. And for such an understanding, the basic framework of ML is critical. Once you understand the basics, it is easy to pick up any new algorithm quickly. Even if you don't remember all the nitty-gritty details of an algorithm's derivation, as long as you know the fundamental principles and assumptions behind it, you can leverage it for your use case. In terms of the quote at the start of this post, the basic ML framework is what will stay with you even after you forget the gory mathematical details of individual algorithms.

The Framework

First of all, let us understand that behind all the glittering posters of ML, there is Mathematics. You cannot run away from the math if you want to be a good Data Scientist. However, this does not mean you drop everything, pick up several textbooks and dedicate a year to studying math. All we need to do is brush up on the relevant concepts from Linear Algebra, Statistics and Calculus as and when they come up in different topics. I will cover the relevant math in this blog, as required.

Second, it is important to understand the broad framework or taxonomy of all ML algorithms. I have drawn a very basic framework here.

[Figure: ML_Framework_3.png – a basic framework/taxonomy of ML algorithms]

You don’t need to understand all of it in one go. Also, you will see certain models like HMM and LDA appearing in both Supervised and Unsupervised Learning. This is because they are generic algorithms and can be used in both settings. More on this in later posts when we dive deeper into the models. For now, we should understand, at a high level, what each of the boxes in the above diagram is. For that, let’s take an example.

Let’s say you have historical data about customers from an online store. So each “sample” or data point is a customer. For every customer, you know certain attributes or “features” that describe them well. The data header might look something like the following:

customer_id, number_transactions, total_profit, number_returns, total_revenue

Unsupervised Learning: 

In the above example, let’s say we wanted to create segments of similar customers. There are two ways to go about it. One is to make up certain heuristics (e.g. total_revenue > 10000 and total_profit > 1000 goes to segment 5). This approach usually works if you have certain hard rules for the segments. However, many times we don’t know what the rules should be, and there could be multiple ways of grouping the same customers. In such cases, unsupervised approaches like clustering help. At a high level, clustering algorithms like K-means will group the customers into n clusters, each cluster containing customers with similar features. The number of clusters n is usually defined by business needs or chosen using heuristics.
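To make this concrete, here is a minimal sketch of K-means segmentation on the customer data described above. It uses scikit-learn and a hypothetical customers.csv file purely for illustration; the framework itself does not prescribe any particular library.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical file with the header shown earlier:
# customer_id, number_transactions, total_profit, number_returns, total_revenue
customers = pd.read_csv("customers.csv")
features = customers[["number_transactions", "total_profit",
                      "number_returns", "total_revenue"]]

# Scale the features so that large-valued columns (like total_revenue)
# do not dominate the distance computation.
scaled = StandardScaler().fit_transform(features)

# n_clusters is chosen arbitrarily here; in practice it comes from
# business needs or heuristics such as the elbow method.
kmeans = KMeans(n_clusters=5, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)

# Inspect the average feature values per segment.
print(customers.groupby("segment").mean(numeric_only=True))
```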

Supervised Learning:

In the same example, let’s say we had an additional “label” called is_fraud that marks whether a particular customer is fraudulent or not (0/1). These labels could come from a human or some other input (hence “supervised”). Now we want to build an intelligent system that analyses the features and predicts whether a customer is likely to be fraudulent. For such scenarios, supervised learning algorithms are widely used. The problem I just described is called classification, where you divide the data into fixed, predefined classes (labels). Now, let’s say that instead of a discrete variable (label), we had the amount of fraud that can be attributed to the customer. In this case, regression would be the right approach. Mathematically, if our target variable is discrete, we use classification, and if it is continuous, we use regression.
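Here is a rough sketch of what that classification setting looks like in code. Again, scikit-learn, the file name, and the choice of logistic regression are illustrative assumptions, not a prescription.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical file: the customer features plus an is_fraud (0/1) label.
customers = pd.read_csv("labelled_customers.csv")
X = customers[["number_transactions", "total_profit",
               "number_returns", "total_revenue"]]
y = customers["is_fraud"]  # discrete target -> classification

# Hold out a test set to check how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# If the target were continuous (say, fraud_amount), we would swap in a
# regressor such as sklearn.linear_model.LinearRegression instead.
```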

Within classification, we have two broad types of models – generative and discriminative. Explaining these two is slightly more involved and we shall go over them in a future post. For now, think of generative models as those that can actually create an artificial sample of data (apart from being able to classify samples), whereas discriminative models only aim to solve the problem of classification. You might wonder, then, why we need discriminative models at all. The reason is that in several experiments, discriminative models (like SVMs) have outperformed generative models on the task of classification.

Now that we know the basic framework of ML, it is important to analyse every model and algorithm that we know in its context. This will give a solid foundation to the way we think about and approach a real Data Science problem.

Data Processing – The Modern Way

“If I had eight hours to cut down a tree, I’d spend six hours sharpening the axe”

– Abraham Lincoln

We have come a long way from the times when digital data was sparse and inaccessible. Today, the amount of data added to the Internet is staggering. In 2014, YouTube users alone uploaded 72 hours of new video every single minute! At Myntra itself, millions of data points are produced every single day, many of them user interactions.

You can’t go to war with a knife in hand. You need the right tools for the right job. Without a doubt, modern data crunching technologies are required to help Data Scientists mine insights. There is a bit of a problem, though. Even the Big Data space is crowded and exploding with newer tools, each one claiming to be superior to the earlier ones. Fortunately, we can be selective here and choose the weapons of our liking. I will briefly share my experience with a couple of platforms that I have recently experimented with: Turi (formerly Dato) and Apache Spark. Both of these promising technologies offer built-in Machine Learning capabilities, and more.

Turi (https://turi.com):  Earlier known as Dato/GraphLab, Turi is a scalable Machine Learning platform. It is a good option if you want to get started quickly and are willing to trade away some flexibility. It is intuitive and straightforward to use, especially with Python. The layer of abstraction I found most useful is the SFrame. An SFrame can be thought of as a scaled-out version of the Pandas DataFrame, and is extremely powerful for joins and several other data transformations. It supports loading data from various sources, including Pandas DataFrames, Python dictionaries, and JSON. However, be advised that Turi is a commercial offering. You can try out their trial version and check whether it’s something you can work with effectively.
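Below is a tiny sketch of what working with SFrames looks like. I am assuming the open-source turicreate package here; the SFrame API has shipped under a few names over the years (GraphLab Create, Dato, Turi), so the exact import may differ for the commercial product.

```python
import turicreate as tc

# SFrames can be constructed from Python dicts, Pandas DataFrames,
# CSV/JSON files, and more.
customers = tc.SFrame({
    "customer_id": [1, 2, 3],
    "total_revenue": [12000.0, 800.0, 4500.0],
})
returns = tc.SFrame({
    "customer_id": [1, 3],
    "number_returns": [2, 7],
})

# Joins and other transformations work much like Pandas, but scale to
# datasets that do not fit in memory.
joined = customers.join(returns, on="customer_id", how="left")
print(joined)
```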

Apache Spark (http://spark.apache.org):  If I had to put my time and effort into one platform, this would be it. Yes, it comes with an initial learning curve and you need to understand the architecture before achieving anything useful with it. But in my opinion, it’s worth it, for several reasons. For starters, it’s open source and has an active community. There are drivers providing seamless integration with many other technologies, including Amazon S3 and Apache Cassandra, which is where much of the existing data lies for many folks. Most importantly, Spark provides a layer of abstraction called the RDD (Resilient Distributed Dataset), which rivals Turi’s SFrame in some sense. In fact, Turi allows you to load data into an SFrame from an existing RDD. The RDD abstraction supports several fundamental operations for processing data. Spark also provides data structures like LabeledPoint, which help you leverage MLlib, the machine learning library for Spark.
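Here is a minimal PySpark sketch of the RDD to LabeledPoint to MLlib flow just described. The input path and CSV layout (label first, features after it) are assumptions for illustration only.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="fraud-model-sketch")

# Assumed line format:
# is_fraud,number_transactions,total_profit,number_returns,total_revenue
def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

# Hypothetical path; could equally be S3, local disk, etc.
data = sc.textFile("hdfs:///data/customers.csv").map(parse)
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Train a classifier from MLlib on the RDD of LabeledPoints.
model = LogisticRegressionWithLBFGS.train(train)

# Evaluate: fraction of test points whose label matches the prediction.
accuracy = test.map(
    lambda p: float(model.predict(p.features) == p.label)).mean()
print("Test accuracy:", accuracy)
```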

Spark Streaming is an extension that lets you consume streaming data. At Myntra, we have leveraged Spark Streaming to consume real-time user interactions. We were able to process and aggregate these interactions in near real time, thus engineering the features for our predictive model. The output of the entire process was easily stored in Cassandra. Thus, a single platform enabled us to build the feature engineering pipeline, build offline predictive models, and score new incoming data in near real time.
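As a rough illustration of the streaming piece, here is a small Spark Streaming sketch. The socket source, port, and "user_id,event_type" line format are made up for this example; the actual sources and schema at Myntra were of course different, and the Cassandra write is omitted.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="interaction-stream-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Assume each incoming line looks like: "user_id,event_type"
lines = ssc.socketTextStream("localhost", 9999)
events = lines.map(lambda line: (line.split(",")[0], 1))

# Count interactions per user within each micro-batch -- a simple
# stand-in for near-real-time feature engineering.
counts = events.reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```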

To conclude, as data keeps piling up every single minute, exploring technologies like Turi and Spark might soon be vital to staying afloat in the Big Data ocean.

What is Data Science?

Good question! In simple terms, Data Science is everything that can help solve problems using data. It is an interdisciplinary field requiring skills from various disciplines like Software Engineering, Data Mining and Machine Learning. Depending on the job description, the weight given to one or more of these disciplines changes. Why is that? Each institution has its own requirements and, depending on its goals, places more emphasis on one area over another. A research establishment might give more weight to developing core Machine Learning algorithms, while an industrial setting would tend to emphasise writing production-level code alongside an in-depth knowledge of state-of-the-art Machine Learning algorithms.

While everyone is making up their own definition of Data Science, let me make up my own for the purpose of this blog. Data Science is a field that spans the complete pipeline – right from gathering and processing the raw data, to engineering features from this data, building predictive models, and finally powering business use-cases with the obtained models. That is quite a task! While we are expected to have skills across the entire pipeline, very few would actually build deep expertise in every stage of it. Data Scientists usually develop deep expertise in one or more stages of the pipeline and collaborate with others who have expertise in the rest. For example, you could build expertise in Feature Engineering/Machine Learning and collaborate with Data Engineers to build ETLs for the data. However, this might vary depending on the size of the organisation. In a startup setting, one is required to wear multiple hats, and it may not always be feasible to rely on a Data Engineer to give you precious data, which is the very backbone of this profession. So it always helps to know a bit of everything, while building deep expertise in one stage of the pipeline. For most Data Scientists that stage is Machine Learning, and it will be the primary focus of this blog, though at times I will also write about relevant data processing techniques.

Broadly, Data Science problems can be classified into three buckets –

Descriptive Analysis: This is also referred to as “backward-looking” analysis. In this type of problem, we mainly try to understand historical data, or events that have already occurred. Example: Who are my best customers historically?

Predictive Analysis: This “forward-looking” analysis is used to predict events in the future. Example: Who will be my best customers in the future?

Prescriptive Analysis: This is the most advanced form of analysis where we expect our algorithms to prescribe actions that would fix a problem. Example: What steps should I take to avoid customer churn?

In this blog, I shall discuss problems in each of these buckets and the techniques that are used to solve them.