A survey on Big Data technologies 2016 by Matúš Cimerman

Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else doing it, so everyone claims they are doing it…

(Dan Ariely)

This blog is a summary of several survey articles about the current state of Big Data technologies, the list is included at the end of the article. At some places, my own observations and comments are included.

I have met two groups of people, one was saying: “We are doing Big Data!” and second one: “There is no Big Data (in Slovakia)”. So I came up with two fundamental questions. Why and how do you think you are doing Big Data? Why do you think there is no Big Data (in Slovakia)? I am not sure about the first question. But it is certain there is Big Data back in Slovakia. Just think of a telecommunication operator as an example.

Thinking of Big Data as a whole, an early years were driven by a set of large, mainly Internet, companies which were also creators of the core Big Data technologies. For example Google developed Hadoop (currently developed under Apache) framework for the distributed processing of large data sets across clusters of computers using simple programming models. The very best engineers from these companies went on they own and established their own Big Data startups.

This interesting and massive domain includes three core topics: statistics, machine learning and data mining. Each of these knowledge is an independent and extensive area which includes various research problems. A key thing to understand is that: Big Data is about assembling a set of technologies and processes together. You need to capture data, store them, clean them, query them, analyze and visualize them. Today, there is a thing about capturing and storing data. With massive amounts (e.g. TB and more per day) of data originated from data firehoses, storing all these data in raw form is a rising problem.

I recall what have some people predicted for today. They’ve predicted that last years were supposed to be the years of natural language processing and image processing or recognition (using traditional methods). Are they? Certainly not. Just take a look at venture capitals, Big Data startups received $6.64B in venture capital investment in 2015, 11% of total tech VC.

Speaking about technologies, then 2015 was year of Spark without a doubt. We have seen more than linear growth in several indicators of Spark usage. Spark in an open source framework whose core providing in-memory processing. Other exciting frameworks continue to gain more momentum such as Samza, Flink, Kudu, Mesos or Heron (not open-sourced yet) as a successor of Twitter’s Storm. In the world of databases there are also emerging technologies like Neo4j founded in LinkedIn, CockroachDB or InfluxDB. Most of these technologies are released as open-source by large Internet companies, the same like in beginning of Big Data era this millennium. Startups and rising companies often build their business on top of these technologies. You can see all the popular technologies for the year 2016, grouped in figure 1. Another full list of Big Data technologies by Matt Turck lists all the most important technologies to look into.

matt_turck_big_data_landscape_v11 (1)

Figure 1. Big Data landscape companies in 2016 (original available here http://mattturck.com/2016/02/01/big-data-landscape/)

Trend in the last few months is about focusing on artificial intelligence/machine learning, to build up better analyses and predictions using massive amounts of data. In business world, it is simply to deliver revenue or predictive insights on market against rivals. We can notice this trend even in the fastest being adopted Big Data framework: Spark where machine learning library was added recently. We are still in the early stage and evolving phase of the Big Data phenomena. Combination of AI/machine learning now emerging towards Big Data. This combination will drive innovation and research across various industries. From that perspective, opportunity hidden in Big Data is far beyond than we thought.

Last, we would like to propose some predictions for the year 2016 and Big Data. Many people say it will be year of Internet of Things and corresponding Big Data analytics. But, for example Gregory Piatetsky, President of KDNuggets (I personally recommend subscribing to their newsletter if you are interested in Big Data including all three domains in the top of article) says:

“2016 will be the year of deep learning. It will move from experimental to deployed technology in image recognition, language understanding, and exceed human performance in many areas.”

 

This article was written as review of the following articles: