The Day We Started to Outgrow Relational Databases

Look around you. Look closer. Pay more attention. What do you see? When I look around me I can see activity trackers, digital cameras, smart watches, interconnected devices, virtual reality gadgets, wearable technology, smart elevators, energy-saving light systems, intelligent traffic lights, smart cars that receive over-the-air updates and can gather data on your driving habits, intelligent houses, eco-friendly buildings and more. All of these generate massive amounts of data. But let’s hold that thought for a minute.

That is just what’s happening around you. What do you have in your pocket or in your hand right now? Most likely a smartphone. It is your portal to the digital world, and even though it has become second nature (pretty much everyone walks around with a phone in their hand nowadays), it is a relatively new phenomenon. It is highly likely that you use your smartphone constantly to check Facebook, Twitter or Instagram, or to search the web using Google, among a few other applications. This generates humongous amounts of data.

Let’s throw out a few numbers just to put it in perspective. Facebook has 1.6 billion users (yes, that is with a B), and millions of them log in every day to upload millions of pictures, add comments, like posts and perform many other actions. Every action has an impact, but as the amount of data grows, it gets harder to determine what that impact needs to be. Then we have Twitter, which may be a tad smaller but still holds plenty of data by any definition. The beauty of Twitter is not just the human interactions but what can be extracted from the data. And Google… well… what can I say? Try indexing the internet and then we can talk about it. Just do it, invite me over and I will buy you a coffee while you tell me how it went.

How much data do you think is generated daily by all these applications? And besides the applications, remember how I mentioned the many devices that also create mountains of data?
This means that besides human-generated data, we also have machine-generated data.

A Big Data World

The world has changed in unimaginable ways. Together, human- and machine-generated data bring us into an era of information explosion the likes of which the world has never seen. You are living through a digital revolution, and you can consider yourself lucky to be part of it.

Tweets, posts, likes, pictures and stats are very nice, but they are just the tip of the iceberg of what’s to come. Many other applications require analyzing those massive amounts of data to help reduce costs, detect fraud and serve many other use cases that drive innovation. Scenarios that help increase profits, decrease costs, innovate or stop the bad guys are nice. But let’s take it up a notch. Some people are trying to make a difference, like hospitals that are working to cure cancer by analyzing DNA records, comparing them and finding ways to save human lives. Imagine if one of the lives they saved was your son, your daughter, your wife or your parents.

The world has changed around us. We now live in a world of Big Data.

Getting Insights

But data by itself is just data and, as I mentioned, something needs to be done with it to get insights. How can this be achieved? Well, let’s first rewind a few years and think about how this was done before.

A while back, if you had a “massive” amount of data, you went to your preferred vendor and wrote them a huge check for an equivalently massive machine. Then you wrote another very big check to your favorite database provider for software to run on this machine, and built an application that could consume the data, process it and give you the answers you needed. You usually also had to limit the amount of data, as you were constrained by the limits of your big box, so you had to throw data away.

Outgrowing Relational Databases

But there were times when a big box was not enough.
For example, what if you wanted to index the entire internet? There was no box big enough for this. There were also other scalability constraints, such as performance. If 1 terabyte took 1 hour to process and you then added another terabyte, it would probably take about twice as long, or even longer. This was a problem that needed a whole new kind of solution.

It all started when Google published papers in the early 2000s explaining how it had solved this problem using the Google File System (GFS) and MapReduce. And then magic happened. Doug Cutting and Mike Cafarella read the papers, had to solve the same problem, got inspired and created Hadoop!

Welcome to Distributed Computing at Its Best

Hadoop took a different approach to solving Big Data problems. Instead of a big box, it relied on clusters of many smaller and far cheaper computers, also known as commodity hardware. The data is distributed among all these computers and then processed locally. In hindsight it sounds like common sense: instead of taking the data to where it is processed, Hadoop takes the computation to the data. Each node in the cluster holds its own portion of the data and does its computation locally.

Agreed, a big box is probably more reliable than a bunch of commodity servers, so a few of them might fail during processing. This is not a problem, because Hadoop stores data redundantly: if one server fails, a copy of its data is replicated on another server that can pick up the work.

And so we have distributed, resilient, dependable and efficient Big Data systems that are helping us change the world.

Data drives the modern world. But who drives the data? That is what Cloudera is here for. Ask Bigger Questions!
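
If you would like to see the “take the computation to the data” idea in miniature, here is a rough sketch of the MapReduce pattern as a word count in plain Python. It is only an illustration under some loud assumptions: the three lists stand in for data already spread across three nodes, the function names (`map_partition`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example, and nothing here uses the actual Hadoop API or handles replication and failures.

```python
from collections import defaultdict


def map_partition(lines):
    """Map step: runs where the partition lives and emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle step: group the values emitted by every node by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: add up the counts collected for each word."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(partitions):
    # On a real cluster each map would run in parallel on its own node;
    # here the "nodes" are simulated one after another in a single process.
    mapped = [map_partition(partition) for partition in partitions]
    return reduce_counts(shuffle(mapped))


# Hypothetical sample data, already split across three pretend nodes.
partitions = [
    ["big data big questions"],
    ["data drives the world"],
    ["ask bigger questions"],
]

counts = word_count(partitions)
print(counts["data"], counts["big"], counts["questions"])  # prints: 2 2 2
```

Notice that each map call only ever sees its own partition. Only the small (word, count) pairs travel between steps, which is why adding more nodes lets the same job handle more data.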

Oct 10, 2016