Good question! In simple terms, Data Science is everything that helps solve problems using data. It is an interdisciplinary field that draws on skills from disciplines such as Software Engineering, Data Mining and Machine Learning. The weight given to each of these disciplines changes with the job description. Why is that? Each institution has its own requirements and, depending on its goals, emphasises one area over another. A research establishment might give more weight to developing core Machine Learning algorithms, while an industrial setting would tend to emphasise writing production-level code backed by in-depth knowledge of state-of-the-art Machine Learning algorithms.
While everyone is making up their own definitions of Data Science, let me make up my own for the purpose of this blog. Data Science is a field that spans the complete pipeline: gathering and processing the raw data, engineering features from this data, building predictive models and finally powering business use-cases with the resulting models. That seems quite a task! While we are expected to have skills across the entire pipeline, very few people actually build deep expertise in every stage of it. Data Scientists usually develop deep expertise in one or more stages and collaborate with others who have expertise in the remaining stages. For example, you could build expertise in Feature Engineering and Machine Learning and collaborate with Data Engineers to build ETLs for the data. However, this varies with the size of the organisation. In a startup setting, one is required to wear multiple hats, and it may not always be feasible to rely on a Data Engineer to give you precious data, which is the very backbone of this profession. So it always helps to know a bit of everything while building deep expertise in one stage of the pipeline. For most Data Scientists that stage is Machine Learning, and it will be the primary focus of this blog, though at times I will write about relevant techniques for data processing as well.
Broadly, Data Science problems can be classified into three buckets –
Descriptive Analysis: This is also referred to as “backward looking” analysis. In this type of problem, we mainly try to understand historical data, or events that have already occurred. Example: Who are my best customers historically?
Predictive Analysis: This “forward looking” analysis is used to predict future events. Example: Who will be my best customers in the future?
Prescriptive Analysis: This is the most advanced form of analysis where we expect our algorithms to prescribe actions that would fix a problem. Example: What steps should I take to avoid customer churn?
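To make the difference between the first two buckets concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the toy `orders` data, the function names, and especially the "model", which is just a naive straight-line extrapolation and not a technique I would recommend for real forecasting. The point is only that the same data can give different answers to the backward-looking and forward-looking versions of the question.

```python
from collections import defaultdict

# Hypothetical toy data: (customer, month, amount spent). Not from any real source.
orders = [
    ("alice", 1, 120.0),
    ("alice", 2, 100.0),
    ("alice", 3, 80.0),
    ("bob", 1, 40.0),
    ("bob", 2, 90.0),
    ("bob", 3, 140.0),
    ("carol", 1, 60.0),
    ("carol", 2, 60.0),
    ("carol", 3, 60.0),
]


def total_spend(orders):
    """Descriptive: sum what each customer has already spent."""
    totals = defaultdict(float)
    for customer, _month, amount in orders:
        totals[customer] += amount
    return dict(totals)


def naive_next_month(orders):
    """Predictive (deliberately naive): extend each customer's average
    month-over-month change one step past their last observed month."""
    history = defaultdict(list)
    for customer, _month, amount in sorted(orders, key=lambda o: (o[0], o[1])):
        history[customer].append(amount)
    forecast = {}
    for customer, amounts in history.items():
        if len(amounts) < 2:
            # Not enough history for a trend; repeat the last value.
            forecast[customer] = amounts[-1]
        else:
            avg_change = (amounts[-1] - amounts[0]) / (len(amounts) - 1)
            forecast[customer] = amounts[-1] + avg_change
    return forecast


def best(scores):
    """Return the customer with the highest score."""
    return max(scores, key=scores.get)


historical = total_spend(orders)
predicted = naive_next_month(orders)
print("Best customer historically:", best(historical))
print("Best customer next month (naive forecast):", best(predicted))
```

On this toy data the customer who has spent the most so far is shrinking month over month, while another is growing quickly, so the descriptive and predictive answers disagree. A prescriptive analysis would go one step further and suggest what to do about it, which is beyond what a sketch this small can show.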
In this blog, I shall discuss problems in each of these buckets and the techniques that are used to solve them.