What Makes Apache Hadoop Agile

Summary:

Agile software development emphasizes continuous delivery of valuable software, simplicity, working software, customer collaboration, responding to change, and welcoming changing requirements. In this article, we shall discuss how Apache Hadoop (one of the top open-source projects)-based software fulfills Agile software development principles. The two main components of Hadoop are MapReduce for data computation and Hadoop Distributed File System (HDFS) for data storage and data processing.

Individuals and Interactions, and Customer Collaboration

Individuals and customers are involved in every aspect of the design of a Hadoop cluster, such as:

Designing cluster size and capacity, including volume of data, and number of nodes
Choosing commodity hardware for node machines, racks, network switches
Designing network bandwidth
Choosing the type of workload
Choosing data compression methodology
Choosing a data retention policy

Early and Continuous Delivery of Valuable Software

Apache Hadoop is a highly available, fault-tolerant, distributed framework designed for the continuous delivery of software with negligible downtime. HDFS is designed for fast, concurrent access to multiple clients. HDFS provides parallel streaming access to tens of thousands of clients. Hadoop is a large-scale distributed processing system designed for a cluster consisting of hundreds or thousands of nodes, each with multi-processor CPU cores. Hadoop is designed to distribute and process large quantities of data across the nodes in the cluster. If a machine fails, the failed computation is automatically started on another machine. Hadoop is designed to process web-scale data in the order of hundreds of GB to 100s of TB, to several PB. Hadoop makes use of a distributed file system that breaks up the input data and distributes the data across several machines in a cluster and processes the distributed data in parallel. Processing data in parallel is much faster than processing data sequentially.

HDFS stores data reliably. In a large-scale distributed cluster, failure of some of the nodes is expected and failure of a few nodes does not cause the data to become unavailable. Block replication provides data durability. Each block of a file is replicated at least three times by default. If a single block were to be lost due to node failure, a block replica is used and the block is quickly re-replicated to restore the replication level.

With the HDFS Namenode High Availability feature, the addition of the Active NameNode-Standby NameNode configuration can be used to avoid a single point of failure (SPOF).

Simplicity

Apache Hadoop is designed specifically for large-scale computations of large datasets, such as:

Market trends data
Sales data
Consumer preferences data
Social media data
Search engine data

If your use case meets the requirements of distributed, large-scale computations, Hadoop is simple to use. Computation is distributed across a set of machines instead of using a single machine. If your requirements are data analysis that involves small datasets, and a large number of small files, with real-time or low latency tasks, Hadoop is not the framework to use. Hadoop is not a database, nor does Hadoop replace traditional database systems. Hadoop is designed to cover the scenario of storing and processing large-scale data, which most traditional systems do not support or do not support efficiently.

One of the simple design features of the HDFS is data locality. The computation code is moved to where the data is rather than moving the data to the computation. Data locality has a definite advantage as HDFS is designed for data-intensive computation, and it is much more efficient to move the computation to the data; the computation code being relatively small. Having to move fewer data utilizes less network bandwidth and as a result, increases the overall efficiency and throughput of the system.

Hadoop is designed for data coherency with an HDFS that is based on a write-once-read-many model. What that implies is that HDFS does not support changes to the stored data other than appending new data. A MapReduce application or a web search application could benefit from the write-once-read-many model for high throughput data access.

HDFS is designed for:

High throughput of data rather than low latency tasks
Batch processing rather than interactive user access
Large streaming, sequential reads of data rather than random seeks to arbitrary positions in files

Another simplicity-related feature is that HDFS is designed to support unstructured data. Data stored in HDFS does not need to conform to a predefined schema.

HDFS is highly accessible with support for multiple interfaces including Java and C/C++ with MapReduce for C (MR4C)—an open-source framework, HTTP browser, and command-line FS Shell.

Working Software

You may assume that such a large-scale, distributed, and highly available framework must make use of custom, specially designed hardware, but Hadoop relies on off-the-shelf reliable, commodity hardware with the priority being working software. Large-scale computation requires that the input data be distributed across multiple machines and processed in parallel on different machines. HDFS overcomes data loss with data replication. How does Hadoop manage to store large files? A file is broken into blocks and stored across a cluster with a single file having blocks on multiple nodes (machines). The blocks are replicated. As blocks are distributed and replicated it provides fault tolerance; the failure of a single node does not cause data loss. A single file in the HDFS can be larger than the size of a single disk on the underlying filesystem because a file is broken into blocks. A file’s blocks are stored over multiple disks in the cluster. Replicating and distributing a file over multiple nodes provides high availability of data for the data locality, which was discussed earlier.

Hadoop adapts well to the reliability demands of distributed systems that support large-scale computational models because Hadoop:

Supports partial failure rather than complete failure.
Supports data recoverability. If some components fail, their workload is taken up by the functioning machines.
Supports individual recoverability. A full cluster restart is not needed. If a node fails, it is able to restart and rejoin the cluster, or a replacement node is used.
Provides consistency with concurrent operations.

As HDFS is designed to be used on commodity hardware, it is portable across heterogeneous hardware platforms. HDFS’s software components are built using Java, a portable language with multi-platform support. HDFS network communication protocols are built over TCP/IP, the most common protocol.

Responding to Changing Requirements

Hadoop is highly scalable and responds to changing requirements quickly. Hadoop is designed for both unstructured and structured data, which makes it suitable for the web as the type of data generated on the web is also both structured and unstructured. The distributed system is able to take an increased load without performance degradation. An increase in the number of machines provides a proportionate increase in capacity. Hadoop scales linearly. A Hadoop cluster could scale from a 12-machine cluster to a cluster with hundreds and thousands of machines without any degradation in performance.

With such an Agile framework as Apache Hadoop, it is no wonder that Hadoop has more committers than any other Apache project.

Tags:

About the author

Deepak Vohra

Deepak is a Sun Certified Java Programmer and Web Component Developer, and has worked in the fields of XML, Java programming and Java EE for ten years. Deepak is the co-author of the Apress book Pro XML Development with Java Technology and was the technical reviewer for the O'Reilly book WebLogic: The Definitive Guide. Deepak was also the technical reviewer for the Course Technology PTR book Ruby Programming for the Absolute Beginner. Deepak is also the author of the Packt Publishing books JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE. Deepak is a Docker Mentor and has published 5 books on Docker and Kubernetes.

Apr 27	STAREAST Software Testing Conference in Orlando & Online
Jun 08	AI Con USA An Intelligence-Driven Future
Sep 21	STARWEST Software Testing Conference in Anaheim & Online

Nov 14	The Impact of Continuous Testing: How Organizations Transform Their Testing from Reactive to Innovative
Nov 21	Debunking the Top 5 Myths of AI in Testing
Dec 05	Unlocking the Future of Testing: How AI is Revolutionizing Software Quality
On Demand	Building Confidence in Your Automation
On Demand	Leveraging Open Source Tools for DevSecOps