Deploy Hadoop cluster (Data Lake stack)
First published: Wednesday, May 28, 2025 | Last updated: Wednesday, May 28, 2025
Deploy Hadoop cluster (Data Lake stack) using the SloopStash Kubernetes starter-kit.
What is Apache Hadoop?
Apache Hadoop is a scalable, highly distributed Big Data framework that transforms how we store and process large volumes of data. It is a robust ecosystem of components such as HDFS, YARN, and MapReduce, all working together seamlessly. The high-level architecture of a Hadoop cluster consists of master and slave nodes. The master nodes, known as name nodes, oversee the entire cluster operation and track filesystem metadata, while the slave nodes, known as data nodes, act as the dedicated workers that store, process, and manage the data. Together, they form a powerful data cluster capable of handling even the most complex datasets, making data processing manageable, efficient, and agile.
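To make this division of roles concrete, the sketch below lists the Java daemons you would typically find on each type of node. It is only an illustration and assumes the JDK's jps utility is available on the node; the exact set of daemons depends on which Hadoop services you run.
# On a master (name) node, the HDFS NameNode daemon (and typically the YARN ResourceManager) runs.
$ jps
# Illustrative output:
# 2145 NameNode
# 2398 ResourceManager
# On a worker (data) node, the HDFS DataNode daemon (and typically the YARN NodeManager) runs.
$ jps
# Illustrative output:
# 1874 DataNode
# 2011 NodeManager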
Deploy and manage Data Lake stack (Hadoop cluster) environments
Here, we deploy and run a single environment of an Apache Hadoop cluster that consists of one name node and three data nodes. Each node in the Hadoop cluster runs as an OCI container within the Kubernetes cluster. The deployment process is automated and orchestrated using Kubernetes YAML templates. These templates define the necessary Kubernetes resources, such as persistent-volumes, stateful-sets, pods, services, and more, for the Hadoop cluster nodes. You can find the Kubernetes YAML templates in the SloopStash Kubernetes starter-kit, which provides everything you need to quickly set up a functional Hadoop cluster for testing and development purposes.
Likewise, the SloopStash Kubernetes starter-kit allows you to deploy and manage multiple environments of the Apache Hadoop cluster by adjusting the relevant environment variables. This documentation also provides information on testing and verifying Hadoop cluster nodes that run as OCI containers within the Kubernetes cluster.
Each Kubernetes worker node must have at least 2 GB of RAM to avoid JVM memory pressure while running this 4-node Apache Hadoop cluster.
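The commands in this guide rely on the ENVIRONMENT and DATA_LAKE_HADOOP_VERSION shell variables, which envsubst substitutes into the Kubernetes YAML templates and which also form part of the namespace and config-map paths. Export them before you begin; the values below are placeholders only, so set them to match your target environment and the Hadoop version directory shipped with the starter-kit.
# Set the shell variables consumed by envsubst and the kubectl commands below.
# The values shown are examples; replace them with your own.
$ export ENVIRONMENT=stg
$ export DATA_LAKE_HADOOP_VERSION=3.3.6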
Kubernetes
# Switch to SloopStash Kubernetes starter-kit directory.
$ cd /opt/kickstart-kubernetes
# Create Kubernetes namespace.
$ kubectl create namespace sloopstash-${ENVIRONMENT}-data-lake-s1
# Create Kubernetes storage-class.
$ envsubst < storage-class/data-lake/hadoop/name.yml | kubectl apply -f -
$ envsubst < storage-class/data-lake/hadoop/data.yml | kubectl apply -f -
# Create Kubernetes persistent-volume.
$ envsubst < persistent-volume/data-lake/hadoop/name.yml | kubectl apply -f -
$ envsubst < persistent-volume/data-lake/hadoop/data.yml | kubectl apply -f -
# Create Kubernetes config-map.
$ kubectl create configmap hadoop-name \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/name/conf/ \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/name/script/ \
--from-file=supervisor-server=workload/supervisor/conf/server.conf \
-n sloopstash-${ENVIRONMENT}-data-lake-s1
$ kubectl create configmap hadoop-data \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/data/conf/ \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/data/script/ \
--from-file=supervisor-server=workload/supervisor/conf/server.conf \
-n sloopstash-${ENVIRONMENT}-data-lake-s1
# Create Kubernetes service.
$ kubectl apply -f service/data-lake/hadoop.yml -n sloopstash-${ENVIRONMENT}-data-lake-s1
# Create Kubernetes stateful-set.
$ envsubst < stateful-set/data-lake/hadoop/name.yml | kubectl apply -f - -n sloopstash-${ENVIRONMENT}-data-lake-s1
$ envsubst < stateful-set/data-lake/hadoop/data.yml | kubectl apply -f - -n sloopstash-${ENVIRONMENT}-data-lake-s1
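# Wait for the Hadoop pods to become ready before verifying the cluster.
# The stateful-set names below (hadoop-name, hadoop-data) are assumed from the pod
# names used later in this guide; adjust them if your templates use different names.
$ kubectl rollout status statefulset/hadoop-name -n sloopstash-${ENVIRONMENT}-data-lake-s1
$ kubectl rollout status statefulset/hadoop-data -n sloopstash-${ENVIRONMENT}-data-lake-s1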
# List Kubernetes resources.
$ kubectl get sc,pv,ns -o wide
# List resources under Kubernetes namespace.
$ kubectl get pvc,cm,sts,deploy,rs,ds,po,svc,ep,ing -o wide -n sloopstash-${ENVIRONMENT}-data-lake-s1
# Delete Kubernetes namespace.
$ kubectl delete namespace sloopstash-${ENVIRONMENT}-data-lake-s1
# Delete Kubernetes persistent-volume.
$ envsubst < persistent-volume/data-lake/hadoop/name.yml | kubectl delete -f -
$ envsubst < persistent-volume/data-lake/hadoop/data.yml | kubectl delete -f -
# Delete Kubernetes storage-class.
$ envsubst < storage-class/data-lake/hadoop/name.yml | kubectl delete -f -
$ envsubst < storage-class/data-lake/hadoop/data.yml | kubectl delete -f -
Hadoop
Verify Hadoop cluster
# Access Bash shell of existing OCI container running Hadoop name node 0.
$ kubectl exec -ti -n sloopstash-${ENVIRONMENT}-data-lake-s1 hadoop-name-0 -c main -- /bin/bash
# List Hadoop data nodes.
$ hdfs dfsadmin -report
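# The report should list three live data nodes for this cluster.
# Optionally, confirm the name node has left safe mode (extra check, not part of the starter-kit workflow).
$ hdfs dfsadmin -safemode get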
# Exit shell.
$ exit
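Optionally, you can also inspect the name node web UI from your workstation by port-forwarding to the name node pod. This is an illustrative extra check; it assumes Hadoop 3.x defaults, where the NameNode serves its web UI on port 9870, and that the container exposes that port.
# Forward the name node web UI port to your workstation.
$ kubectl port-forward -n sloopstash-${ENVIRONMENT}-data-lake-s1 hadoop-name-0 9870:9870
# Then browse http://localhost:9870 to view the HDFS overview and data node status.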
Write data to HDFS filesystem
# Access Bash shell of existing OCI container running Hadoop data node 0.
$ kubectl exec -ti -n sloopstash-${ENVIRONMENT}-data-lake-s1 hadoop-data-0 -c main -- /bin/bash
# Write data to HDFS filesystem.
$ hdfs dfs -mkdir -p /nginx/log/14-07-2024
$ touch access.log
$ echo "[14-07-2024 10:50:23] 14.1.1.1 app.crm.sloopstash.dv GET /dashboard HTTP/1.1 200 http://app.crm.sloopstash.dv/dashboard 950 - Mozilla Firefox - 0.034" > access.log
$ hdfs dfs -put -f access.log /nginx/log/14-07-2024
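# Optionally, inspect how HDFS distributed the file's blocks across the data nodes
# (illustrative extra check, not part of the starter-kit workflow).
$ hdfs fsck /nginx/log/14-07-2024/access.log -files -blocks -locations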
# Exit shell.
$ exit
Read data from HDFS filesystem
# Access Bash shell of existing OCI container running Hadoop data node 1.
$ kubectl exec -ti -n sloopstash-${ENVIRONMENT}-data-lake-s1 hadoop-data-1 -c main -- /bin/bash
# Read data from HDFS filesystem.
$ hdfs dfs -ls -R /
$ hdfs dfs -cat /nginx/log/14-07-2024/access.log
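# Optionally, remove the sample data once you have finished testing
# (illustrative clean-up step, not part of the starter-kit workflow).
$ hdfs dfs -rm -r -skipTrash /nginx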
# Exit shell.
$ exit