For a time, Apache Hadoop was almost synonymous with big data. There was a reason the pairing came very close to taking root, much as Band-Aid did for bandages and Google for search, before starting to wither on the vine. The Java-based, open-source framework for big data processing was revolutionary when it burst on the scene about 15 years ago. Its model of distributed data analytics and data storage, which allowed for parallel processing of large datasets by building out clusters of nodes on multiple machines, really birthed the entire notion of working with big data.
This article will look at the rise – and fall – of Hadoop and explore why other options are gaining traction, specifically the emerging popularity of container orchestration platforms to manage containerized, cloud-native applications.
The Rise: Why Hadoop Mattered
What is Hadoop, and why did it become synonymous with big data, at least for a time? By distributing data storage and analytics workloads across multiple nodes, Hadoop enabled parallel processing that was, for its time, much faster on large datasets. It promised high availability, lower cost and a safe way to manage data backups. It became the go-to, high-performance big data solution for two reasons. It could collect and store structured and unstructured data without having to convert the data into a single format. And once the required data was extracted, Hadoop could store the unprocessed data in perpetuity, in theory at low cost, because the distribution framework spread the stored data across multiple nodes running on commodity hardware.
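To make the idea of splitting work across nodes concrete, here is a minimal sketch of the map-and-reduce pattern that Hadoop popularized. It is not Hadoop code and does not use any Hadoop API: threads on one machine stand in for cluster nodes, and the function names (`map_partition`, `word_count`) and the sample input are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_partition(lines):
    """Map step: count words within one partition (one stand-in 'node')."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=3):
    """Split the input into partitions, count each in parallel, then merge."""
    # Deal the lines out round-robin, one slice per stand-in node.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(map_partition, partitions))
    # Reduce step: merge the per-partition counts into one result.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    print(word_count(["big data", "big cluster", "data data"]))
```

Each partition is counted independently, so adding partitions (or, in a real cluster, nodes) spreads the work without changing the final merged result.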
Hadoop remains popular today, but it is no longer a standard-bearer leading the big data charge, having been overtaken by advancements in technology and more flexible, lower-cost options. In addition, enterprise data volumes are increasing at levels unforeseen when Hadoop came along in the mid-2000s. Big data and data lakes became possible largely because Hadoop allowed for relatively inexpensive storage of large datasets. While there is a use for this information, it isn’t ideal for marketing and some of the more transactional aspects of data management, because stored data is static data. Fast-moving, data-driven enterprises – particularly those that rely on customer data to create differentiated experiences – require dynamic data.
In addition, a consolidation of Hadoop providers – Cloudera’s acquisition of Hortonworks, HPE’s acquisition of MapR and IBM abandoning its Hadoop distribution – is further evidence of the shrinking demand for Hadoop distributions and the loss of differentiation in Hadoop’s big data processing.
The Fall: Barriers to Success
Container management software that orchestrates containerized, cloud-native applications has been steadily gaining steam. Its popularity, driven mainly by Kubernetes, is chiefly responsible for Hadoop adoption plateauing. Because Kubernetes does not do storage, per se, it is somewhat of an apples-to-oranges comparison. But in supplying both processing power and operational elegance through containers, versus the computational power of Hadoop, Kubernetes scores a TKO.
While Hadoop revolutionized parallel processing of big data, the implicit promise of cost benefits and, to an extent, performance never really materialized because of the complexity of management. As for cost, Hadoop certainly made it possible to scale out to as many conventional processors as needed to guarantee high performance, but in a cloud environment those CPUs do not really offer any material cost savings. It is not uncommon for an enterprise to need hundreds of nodes, each using CPUs with 30 or more cores. That is an enormous, fixed expense.
Performance is a mixed picture, too. Hadoop unquestionably offers raw speed that simply would not be possible without parallel processing, but it can sometimes be slow and buggy. With massive datasets, the underlying code has a heavy lift to match queries against distributed files. And taking full advantage of Hadoop parallelism by programming in MapReduce, Pig, and Spark is notoriously difficult, generally requiring expertise well beyond the tools that people are more familiar with, such as SQL and basic BI graphing and querying. Enterprises needed to invest not only in the software but also in specialized staff to manage integrations.
Cost and complexity are the two main reasons why containerization will continue to grow as the favored option for big data processing. Hadoop limitations aside, we will now examine how Kubernetes, and container management software in general, can achieve the cost savings and processing power that failed to materialize with Hadoop.
The Up-and-Comer: Containerization Transforming Data Management
With distributed, parallel processing, clusters of nodes run on dedicated hardware, and more clusters mean more infrastructure. Containerized cloud-native apps are different. They are so-called because a container holds a “pre-deployed” instance of the software, built for Kubernetes or another container-orchestration tool, and the container inherits the traditional infrastructure. Security, controllability, single sign-on, interaction with different services and network configuration are all controlled by Kubernetes.
The chief financial advantage is lower fixed infrastructure costs, because containerized costs are incurred on a consumption basis. When another instance of the software is needed, an enterprise can easily spin up another container. Importantly, this can be done automatically, with no IT intervention. Kubernetes will spin containers up and down as needed based on the resource load, and the enterprise only pays for what it uses.
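The scaling decision described above can be sketched in a few lines. The rule below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The function name, the replica bounds and the sample numbers are illustrative assumptions, and the real autoscaler also applies tolerances and stabilization windows that are left out of this simplified version.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified Horizontal Pod Autoscaler rule (assumed bounds, no tolerance).

    current_metric and target_metric use the same unit, for example
    average CPU utilization in percent.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Load above target: scale out from 4 replicas.
    print(desired_replicas(4, current_metric=90, target_metric=60))
    # Load well below target: scale in.
    print(desired_replicas(4, current_metric=15, target_metric=60))
```

With four replicas running at 90 percent against a 60 percent target, the rule asks for six replicas; at 15 percent it drops to one. That is the "spin up and down as needed" behavior expressed as arithmetic.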
In theory, spinning nodes in a Hadoop cluster up or down as needed was conceivable, but in practice it was improbable. Scaling a cluster is easier said than done and doesn’t really align with how an enterprise uses a big data platform. If you’re running processing jobs 24/7, you can’t really spin down an entire cluster if you’re still running, say, 50 percent of the cluster at a given time. With containerization, that roadblock disappears. The enterprise can ramp up or down day-by-day or even hour-by-hour, matching the processing ebbs and flows of the business. This elastic scaling, together with built-in operations oversight and container- and microservices-based high availability, makes Kubernetes a much more governable, transparent, cost-effective source of processing power.
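A back-of-the-envelope comparison shows why this matters for cost. Every number below is hypothetical, chosen only to mirror the 50 percent example above: a cluster that needs 100 nodes for eight busy hours and 50 nodes for the rest of the day, at an invented price of 2 per node-hour. Real cloud pricing and scaling overheads would change the totals.

```python
def fixed_cluster_cost(peak_nodes, hours, rate_per_node_hour):
    """Cost of keeping a peak-sized cluster running for every hour."""
    return peak_nodes * hours * rate_per_node_hour


def elastic_cost(nodes_needed_per_hour, rate_per_node_hour):
    """Cost when capacity tracks demand hour by hour."""
    return sum(nodes_needed_per_hour) * rate_per_node_hour


# Hypothetical day: 100 nodes for 8 busy hours, 50 nodes for the other 16.
demand = [100] * 8 + [50] * 16
rate = 2  # hypothetical price per node-hour

fixed = fixed_cluster_cost(max(demand), len(demand), rate)
elastic = elastic_cost(demand, rate)

if __name__ == "__main__":
    print(fixed, elastic)
```

Under these assumptions the always-on cluster costs 4,800 for the day and the elastic one costs 3,200, a third less, simply because idle capacity is not billed.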
Containerization for the Win
Lastly, Hadoop, along with distributed data analytics and data storage in general, helped give rise to the concept of the data lake. Relatively low data storage costs compared to a traditional data warehouse popularized the data lake as a repository for analytics data and big data management. But too often, businesses just dumped data into a data lake without assessing its value or planning for its use and management. Data lakes have a habit of quickly becoming data swamps, creating enormous data governance issues.
With time-sensitive data, however, such as that required by a data-driven enterprise engaging with customers in real time, cloud-based big data storage services offered by AWS, GCP and Azure provide seamless integration with big datasets in containerized applications without the need for persistent data storage. That model, together with the ability to spin up and down as needed, makes working with container-management software far more palatable for dynamic organizations with sophisticated data needs. Chief among those needs is real-time data to meet consumers’ expectations for personalized, relevant experiences at the moment of interaction.