“Education is what remains after you have forgotten everything you learned in school.” – Albert Einstein
There are many, many Machine Learning (ML) algorithms out there, and it can be quite intimidating for a beginner to know where to start. The key is to realise that although there are many techniques, they are all governed by the same fundamental framework. Many folks jump the gun: instead of focussing on the basics, they move directly to advanced topics, which leaves them with a weak foundation. I have seen folks apply advanced models to data without really understanding what the models are doing underneath. They tend to think of a model as a black box: feed it data and get the output. Unfortunately, this is not how ML works in practice.
Real data is incredibly messy, and nothing like what textbooks show. To make ML algorithms work with real data, it is important to understand how they work, what tweaks they need, and finally how to interpret the results. For such an understanding, the basic framework of ML is critical. Once you understand the basics, it is much easier to pick up any new algorithm quickly. Even if you don’t remember all the nitty-gritty details of an algorithm’s derivation, as long as you know the fundamental principles and assumptions behind it, you are in a good position to leverage it for your use case. In terms of the quote at the start of this post, the basic ML framework is what will stay with you even after you forget the gory mathematical details of individual ML algorithms.
First of all, let us understand that behind all the glittering posters of ML, there is Mathematics. You cannot run away from the math if you want to be a good Data Scientist. However, this does not mean you should drop everything, pick up several textbooks and dedicate a year to studying math. All we need to do is brush up on the relevant concepts from Linear Algebra, Statistics and Calculus as and when they come up in different topics. I will cover the relevant math topics in this blog, as required.
Second, it is important to understand the broad framework or taxonomy of all ML algorithms. I have drawn a very basic framework here.
You don’t need to understand all of it in one go. Also, you will see certain models like HMM and LDA appearing under both Supervised and Unsupervised Learning. This is because they are generic algorithms that can be used in both settings. More on this in later posts when we dive deeper into the models. For now, we should understand at a high level what each of the boxes in the above diagram is. For that, let’s take an example.
Let’s say you have historical data about customers from an online store. So each “sample” or data point is a customer. For every customer, you know certain attributes or “features” that describe them well. The data header might look something like the following:
customer_id, number_transactions, total_profit, number_returns, total_revenue
In the above example, let’s say we wanted to create segments of similar customers. There are two ways to go about it. One is to make up certain heuristics (total_revenue>10000 and total_profit>1000 goes to segment 5, etc.). This approach usually works if you have hard rules for the segments. However, many times we don’t know what the rules should be, and there could be multiple ways of grouping the same customers. In such cases, unsupervised approaches like Clustering help. At a high level, clustering algorithms like K-means will group the customers into n clusters, each cluster containing customers with similar features. The n is usually defined by business needs or chosen using heuristics.
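To make the clustering idea concrete, here is a minimal sketch of K-means written in plain Python. The customer values are made up for illustration, only two of the features (total_revenue and total_profit) are used, and the helper names sq_dist and kmeans are my own rather than part of any library. In practice you would normally scale the features first and use a tested library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    # Start from n_clusters distinct samples picked at random.
    centroids = rng.sample(points, n_clusters)
    assignments = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Made-up (total_revenue, total_profit) pairs for six customers.
customers = [
    (12000, 1500), (11000, 1200), (13000, 1700),
    (800, 60), (950, 90), (700, 40),
]

assignments, centroids = kmeans(customers, n_clusters=2)
for customer, cluster in zip(customers, assignments):
    print(customer, "-> cluster", cluster)
```

Notice that no rule such as total_revenue>10000 appears anywhere: the groups emerge from the distances between customers. The cluster numbers themselves are arbitrary, so it is up to us to inspect each cluster and decide what it means for the business.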
In the same example, let’s say we had an additional “label” called is_fraud that marks whether a particular customer is fraudulent or not (0/1). These labels could come from a human or some other input (hence supervised). Now we want to build an intelligent system that analyses the features and predicts whether a customer is likely to be fraudulent. For such scenarios, supervised learning algorithms are widely used. The problem I just described is called classification, where you divide the data into fixed, predefined classes (labels). Now, let’s say that instead of a discrete variable (label), we had the amount of fraud that can be attributed to the customer. In this case, regression would be the right approach. Mathematically, if our target variable is discrete, we use classification, and if it is continuous, we use regression.
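The difference between the two is easiest to see when the same algorithm is used for both. Below is a small sketch using k-nearest neighbours, which I picked only because it fits in a few lines; it is not the only or necessarily the best choice. The data is made up, and fraud_amount is a hypothetical column name for the continuous target.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_targets(train_x, train_y, query, k):
    """Targets of the k training samples closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return [train_y[i] for i in order[:k]]


def knn_classify(train_x, train_y, query, k=3):
    # Discrete target: take a majority vote among the neighbours.
    votes = Counter(nearest_targets(train_x, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    # Continuous target: average the neighbours' values.
    neighbours = nearest_targets(train_x, train_y, query, k)
    return sum(neighbours) / len(neighbours)


# Made-up (number_transactions, number_returns) pairs for six customers.
features = [(40, 1), (35, 2), (50, 0), (5, 4), (6, 5), (4, 3)]
is_fraud = [0, 0, 0, 1, 1, 1]                          # discrete label
fraud_amount = [0.0, 0.0, 0.0, 250.0, 400.0, 100.0]    # hypothetical continuous target

new_customer = (5, 5)
predicted_label = knn_classify(features, is_fraud, new_customer)
predicted_amount = knn_regress(features, fraud_amount, new_customer)
print("is_fraud:", predicted_label)
print("fraud_amount:", predicted_amount)
```

The features and the neighbour search are identical in both calls. Only the target, and how the neighbours' targets are combined, changes: a vote for a discrete label, an average for a continuous value.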
Within classification, we have two broad types of models: generative and discriminative. Explaining these two is slightly more involved, and we shall go over them in a future post. For now, think of generative models as those that can actually create an artificial sample of data (apart from being able to classify samples), whereas discriminative models only aim to solve the problem of classification. Now you might ask why we need discriminative models at all. The reason is that in several experiments, discriminative models (like SVM) have outperformed generative models at the task of classification.
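As a rough illustration of the “can also create a sample” part, here is a toy generative classifier that fits one Gaussian per class to a single feature. The class name GaussianGenerative, the choice of number_returns as the feature, and the data are all made up for this sketch; real generative models are usually richer than this.

```python
import math
import random
from statistics import mean, pstdev


class GaussianGenerative:
    """Models p(x | y) as one 1-D Gaussian per class, plus class priors p(y)."""

    def fit(self, xs, ys):
        self.params = {}
        for label in set(ys):
            values = [x for x, y in zip(xs, ys) if y == label]
            prior = len(values) / len(xs)
            # Assumes each class has some spread, so the standard deviation is > 0.
            self.params[label] = (mean(values), pstdev(values), prior)
        return self

    def log_joint(self, x, label):
        # log p(x, y) up to a constant that is the same for every class.
        mu, sigma, prior = self.params[label]
        return math.log(prior) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

    def predict(self, x):
        # Classify by picking the class with the highest joint probability.
        return max(self.params, key=lambda label: self.log_joint(x, label))

    def sample(self, label, rng):
        # Generate an artificial feature value for the given class.
        mu, sigma, _ = self.params[label]
        return rng.gauss(mu, sigma)


# Made-up number_returns values with their is_fraud labels.
number_returns = [0.0, 1.0, 2.0, 1.0, 6.0, 7.0, 8.0, 7.0]
is_fraud = [0, 0, 0, 0, 1, 1, 1, 1]

model = GaussianGenerative().fit(number_returns, is_fraud)
rng = random.Random(0)
artificial_return_count = model.sample(1, rng)

print("predicted class for 1.5 returns:", model.predict(1.5))
print("predicted class for 6.5 returns:", model.predict(6.5))
print("artificial sample for a fraudulent customer:", artificial_return_count)
```

A discriminative model would expose only something like predict: it learns where the boundary between the classes lies without modelling how the features of each class are distributed, so it has nothing to sample from.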
Now that we know the basic framework of ML, it is important to analyse every model and algorithm we come across in its context. This will give a solid foundation to the way we think about and approach a real Data Science problem.