“If I had eight hours to cut down a tree, I’d spend six hours sharpening the axe”
– Abraham Lincoln
We have come a long way from the times when digital data was sparse and inaccessible. Today, the amount of data added to the Internet is staggering. In 2014, YouTube users alone uploaded 72 hours of new video every single minute! At Myntra itself, millions of data points are produced every single day, which include several user interactions.
You can’t go to war with a knife in hand. You need the right tools for the right job. Without a doubt, modern data crunching technologies are required to help Data Scientists mine insights. Only, there is a bit of a problem. Even the Big Data space has been crowded and exploding with newer tools; each one claiming to be superior than the earlier ones. Fortunately, we can be selective here and choose the weapons of our liking. I will briefly share my experience with a couple of platforms that I have recently experimented with: Dato and Apache Spark. Both of these promising technologies offer built-in Machine Learning capabilities, and more.
Turi (https://turi.com): Earlier known as Dato/GraphLab, Turi is a scalable Machine Learning platform. It is a good option if you want to get started quickly and are willing to lose out on some flexibility. It is intuitive and straightforward to use, especially with Python. The layer of abstraction which I found to be most useful is the SFrame. SFrame can be thought of as a scaled version of the Pandas DataFrame, and is extremely powerful with joins and several other data transformations. It supports loading data from various formats including the DataFrame, Python dictionary, and JSON. However, be advised that Turi is a commercial offering. You can try out their trial version and check if it’s something you can effectively work with.
Apache Spark (http://spark.apache.org): If I had to put my time and effort on one platform, this would be it. Yes, it comes with an initial learning curve and you would need to understand the architecture before achieving anything useful with it. But in my opinion, it’s worth it, for several reasons. For starters, it’s open source and has an active community. There are drivers providing seamless integration with many other technologies including Amazon S3 and Apache Cassandra, which is where much of the existing data would lie for many folks. Most importantly, Spark provides a layer of abstraction called the RDD (Resilient Distributed Datasets), which rivals Turi’s SFrame in some sense. In fact, Turi allows you to load data into an SFrame from an existing RDD. The RDD abstraction allows several fundamental operations for processing data. Spark also provides data structures like LabeledPoint, which can help you leverage MLib, the machine learning library for Spark.
Spark Streaming is an extension that enables you to consume streaming data. At Myntra, we have leveraged Spark Streaming to consume real time user interactions. We were able to process and aggregate these interactions in near real time, thus engineering the features for our predictive model. The output of the entire process was easily stored into Cassandra. Thus, a single platform enabled us to build the feature engineering pipeline, build offline predictive models, as well as score new data coming in, near real time.
To conclude, as data keeps piling up every single minute, exploring technologies like Dato and Spark might soon be vital to stay afloat in the Big Data ocean.