According to the Apache Software Foundation, the Hadoop project
“…develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”
Hadoop was created by Doug Cutting and Mike Cafarella, was later adopted by Apache, and is supported by a global community of contributors and users. Part of Hadoop’s appeal is that it can store and process very large amounts of data more cost-effectively than traditional databases or data warehouses. In addition, because Hadoop imposes no structure on data when it is loaded, organizations can incorporate new data quickly and flexibly without a master plan: data can simply be “dumped” into Hadoop and structured and analyzed later. In contrast, traditional databases require careful planning and documentation before the first record can be loaded.
Initial releases of Hadoop, starting in 2007, required users to rely heavily on MapReduce, a coding-intensive programming model for managing and manipulating data. Hadoop version 2.0, released in 2013, introduced the YARN architecture, short for “Yet Another Resource Negotiator.” Apache described it as “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.” YARN allows applications to run directly on a Hadoop cluster without going through MapReduce.
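To give a sense of why MapReduce is described as coding-intensive, here is a minimal sketch of the map, shuffle, and reduce steps for a word count, written in plain Python. It does not use the Hadoop API and runs on a single machine; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration only. In a real cluster, the framework would distribute the map and reduce calls across machines and perform the grouping step itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (hypothetical stand-in for the framework's shuffle)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data in clusters"]
    print(word_count(sample))
```

Even for a task this small, the logic has to be split into separate map and reduce functions connected by key-value pairs, which hints at how much code more complex analyses required under this model.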