The Hadoop ecosystem is a body in motion. Just a few years ago, you might quickly but fairly describe Hadoop as “HDFS, MapReduce and some glue” — referring to the Hadoop Distributed File System, its associated software programming model and an emerging collection of APIs and utilities, which together were becoming synonymous with big data systems. What you knew then was true, but only for a spell.
If you journeyed last week to the San Jose Convention Center in California for Yahoo’s and Hortonworks’ Hadoop Summit 2013 to look for an introduction to the open source framework, you walked in at a time when things were changing. Hadoop 2.0 is getting closer, and with it come enhancements that could systematize an emerging style of data programming that falls under the banner of “Hadoop” but is more than just Hadoop itself.
While improvements are due for the Hadoop Distributed File System (HDFS) and for Hadoop ecosystem components such as the HBase database, Hive data warehouse and Knox security gateway, much of the attention now is being directed at Hadoop 2.0’s YARN component. The acronym humbly stands for “Yet Another Resource Negotiator.” The humility is deceptive because YARN allows you to swap out MapReduce if you choose, and promotes a type of interactivity different from the batch processing methods that brought Hadoop to prominence.
The Hadoop 2.0 HDFS implementation brings “some re-architecting of the system to remove some of the single points of failure,” said Colin White, president and founder of consultancy BI Research in Ashland, Ore. That is a start — but the real progress comes on the application programming interface (API) level. “What is quite a change is YARN,” White said. “It enables you to use other file systems and things like that. It allows you to add flexibility to the environment, which is something enterprise users had been complaining about.”
So your basic Hadoop definition goes by the boards. People are already talking about using IBM’s General Parallel File System, the Lustre file system for high-performance computing clusters, as well as other file systems with Hadoop. With YARN, MapReduce, too, becomes an option, not a defining element.
No longer is there a “core Hadoop,” said Merv Adrian, an analyst at Gartner Inc. in Stamford, Conn., though he added that the Apache open source development group would challenge that notion.
YARN and definitions aside, one could say that pluggable options for data architectures are the order of the day, and Hadoop is the means. Or, in Adrian’s words: “There is no level at which substitutes are not possible.”
Everyone under the Hadoop tent
“The Hadoop community is a center of gravity that is attracting innovative new uses,” Adrian said, while noting that Gartner itself recommends that users rely on commercially available versions of Hadoop software and employ their freely downloadable open source counterparts only for sandboxing and the like.
What has happened in recent years, in Adrian’s words, is “an explosion in data stores,” many of them the NoSQL kind. These stores challenged the “SQL-only” data model of the day, and Hadoop provides a big tent for the new movement.
The major reasons the various data stores came into being, Adrian said, were: First, the costs of the incumbent relational databases were too high for large-scale deployments; second, bureaucracy in the form of database schemas had too often become an encumbrance to invention; and third, relational data technology basically was not the right fit for Web applications.
Don’t call me late for dinner
What’s important is that what the Hadoop community is doing these days represents a major shift for data management. The strangely named litany of open source tools and APIs that are camped out in the Hadoop tent lets developers work with data in innovative ways that the old data regime just didn’t allow.