Deploy & run Apache Hadoop cluster as containers in Kubernetes cluster

First published: Tuesday, October 1, 2024 | Last updated: Tuesday, October 1, 2024

Learn to deploy and run Apache Hadoop cluster nodes as OCI containers inside a Kubernetes cluster using Kubernetes YAML templates.


This blog post provides a complete guide on deploying and running Apache Hadoop cluster nodes as OCI containers inside a Kubernetes cluster, a sophisticated container cluster and orchestrator, using Kubernetes YAML templates.

For a high-level understanding, Apache Hadoop is a Big Data framework used to store and process massive volumes of data. The Hadoop framework, or ecosystem, has multiple components such as HDFS, YARN, and MapReduce. Generally, a Hadoop cluster is composed of master and slave nodes, where the name nodes are the master nodes and the data nodes are the slave nodes.

To implement the solution described in this blog post, you’ll need a Kubernetes cluster with a control plane (master node) and a minimum of 2 worker nodes. You’ll also need the Kubernetes client (kubectl) installed on your local developer machine running a Windows, Mac, or Linux operating system.
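
You can run a quick sanity check to confirm the Kubernetes client is installed and that the cluster nodes are in the Ready state, for example:

# Check the Kubernetes client version.
$ kubectl version --client

# List the Kubernetes cluster nodes.
$ kubectl get nodes -o wide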

Here, we proceed on the assumption that you already have a Kubernetes cluster deployed and running in the cloud or on-premises. The only thing we need to get started is the Kubernetes client. If the Kubernetes client is not yet installed, you can easily set it up by following the articles below, which walk you through installing the Kubernetes client on your Windows, Mac, or Linux operating system, ensuring you’re ready to proceed with the solution.

  1. How to install and configure Git on Windows, Mac, and Linux?
  2. How to install Kubernetes client on Windows, Mac, and Linux?

Set up our Kubernetes starter-kit from GitHub

At SloopStash, we are proud to offer our own open-source Kubernetes starter-kit repository on GitHub. The repository provides robust support for orchestrating and automating OCI containers running popular full-stack, microservices, and Big Data workloads using Containerd, Docker, and Kubernetes. Additionally, we create, upload, and maintain our OCI/container images for popular workloads on Docker Hub. The Kubernetes starter-kit has been carefully curated to include all the tools and resources necessary for orchestrating, automating, and deploying popular workloads or services.

Here, we set up our Kubernetes starter-kit, which contains all the code, configuration, and supporting resources required for deploying and running the Apache Hadoop cluster nodes as OCI containers inside the Kubernetes cluster using Kubernetes YAML templates.

Windows

Below are the steps you can use to set up our Kubernetes starter-kit on the Windows operating system.

  1. Open the Git Bash terminal in administrator mode.
  2. Execute the commands below in the Git Bash terminal to set up our Kubernetes starter-kit.
# Download Kubernetes starter-kit from GitHub to local filesystem path.
$ git clone https://github.com/sloopstash/kickstart-kubernetes.git /opt/kickstart-kubernetes

Mac

Here are the instructions for setting up our Kubernetes starter-kit on the Mac operating system.

  1. Open the terminal.
  2. Execute the commands below in the terminal to set up our Kubernetes starter-kit.
# Download Kubernetes starter-kit from GitHub to local filesystem path.
$ sudo git clone https://github.com/sloopstash/kickstart-kubernetes.git /opt/kickstart-kubernetes

# Change ownership of Kubernetes starter-kit directory.
$ sudo chown -R $USER /opt/kickstart-kubernetes

Linux

You can use the following commands to set up our Kubernetes starter-kit on any Linux-based operating system.

  1. Open the terminal.
  2. Execute the commands below in the terminal to set up our Kubernetes starter-kit.
# Download Kubernetes starter-kit from GitHub to local filesystem path.
$ sudo git clone https://github.com/sloopstash/kickstart-kubernetes.git /opt/kickstart-kubernetes

# Change ownership of Kubernetes starter-kit directory.
$ sudo chown -R $USER:$USER /opt/kickstart-kubernetes
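
Regardless of the operating system, you can confirm the starter-kit was cloned correctly by inspecting the repository, for example:

# Verify the cloned Kubernetes starter-kit repository.
$ git -C /opt/kickstart-kubernetes log --oneline -1
$ ls /opt/kickstart-kubernetes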

Deploy and run Apache Hadoop cluster inside Kubernetes cluster

Here, we deploy and run a single environment of the Apache Hadoop cluster consisting of 1 name node and 3 data nodes. Each node in the Hadoop cluster runs as an OCI container inside the Kubernetes cluster. The deployment is automated and orchestrated through Kubernetes YAML templates, in which we define the Kubernetes resources such as persistent-volumes, stateful-sets, pods, and services required for the Hadoop cluster nodes. You can find these Kubernetes YAML templates in our Kubernetes starter-kit, giving you everything you need to quickly spin up a functional Hadoop cluster for testing and development purposes.

Each Kubernetes worker node must have at least 2 GB of RAM to avoid JVM memory pressure while running this 4-node Apache Hadoop cluster.
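
If you are unsure about the available memory, you can check the allocatable resources reported by each Kubernetes node, for example:

# Check allocatable memory on Kubernetes nodes.
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory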

# Store environment variables.
$ export ENVIRONMENT=stg

# Switch to Kubernetes starter-kit directory.
$ cd /opt/kickstart-kubernetes

# Store Kubernetes variables as environment variables.
$ source var/STG.env

# Add labels to Kubernetes node.
$ kubectl label nodes <KUBERNETES_MASTER_1> type=on-premise provider=host service=virtualbox region=local availability_zone=local-a
$ kubectl label nodes <KUBERNETES_WORKER_1> type=on-premise provider=host service=virtualbox region=local availability_zone=local-b node-role.kubernetes.io/worker=worker
$ kubectl label nodes <KUBERNETES_WORKER_2> type=on-premise provider=host service=virtualbox region=local availability_zone=local-c node-role.kubernetes.io/worker=worker
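
# Verify the labels applied to Kubernetes nodes.
$ kubectl get nodes --show-labels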

# Create Kubernetes namespace.
$ kubectl create namespace sloopstash-${ENVIRONMENT}-data-lake-s1

# Create directories for Kubernetes persistent-volumes on worker nodes.
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/name/0/data
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/name/0/log
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/name/0/tmp
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/0/data
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/0/log
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/0/tmp
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/1/data
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/1/log
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/1/tmp
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/2/data
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/2/log
$ sudo mkdir -p /mnt/sloopstash/${ENVIRONMENT}/data-lake/hadoop/data/2/tmp
$ sudo chmod -R ugo+rwx /mnt/sloopstash

# Create Kubernetes persistent-volume.
$ envsubst < persistent-volume/data-lake/hadoop/name.yml | kubectl apply -f -
$ envsubst < persistent-volume/data-lake/hadoop/data.yml | kubectl apply -f -
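
# Verify the Kubernetes persistent-volumes are created.
$ kubectl get pv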

# Create Kubernetes config-map.
$ kubectl create configmap hadoop-name \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/name/conf/ \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/name/script/ \
--from-file=supervisor-server=workload/supervisor/conf/server.conf \
-n sloopstash-${ENVIRONMENT}-data-lake-s1
$ kubectl create configmap hadoop-data \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/data/conf/ \
--from-file=workload/hadoop/${DATA_LAKE_HADOOP_VERSION}/data/script/ \
--from-file=supervisor-server=workload/supervisor/conf/server.conf \
-n sloopstash-${ENVIRONMENT}-data-lake-s1
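
# Verify the Kubernetes config-maps.
$ kubectl get configmaps -n sloopstash-${ENVIRONMENT}-data-lake-s1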

# Create Kubernetes service.
$ kubectl apply -f service/data-lake/hadoop.yml -n sloopstash-${ENVIRONMENT}-data-lake-s1

# Create Kubernetes stateful-set.
$ envsubst < stateful-set/data-lake/hadoop/name.yml | kubectl apply -f - -n sloopstash-${ENVIRONMENT}-data-lake-s1
$ envsubst < stateful-set/data-lake/hadoop/data.yml | kubectl apply -f - -n sloopstash-${ENVIRONMENT}-data-lake-s1
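
Once the stateful-sets are created, you can run a few quick checks to confirm the Hadoop cluster nodes are up. The first two commands simply list the Kubernetes resources in the namespace; the last one is a sketch that assumes the hdfs binary is available on the name node container's PATH, so replace <HADOOP_NAME_NODE_POD> with the actual pod name reported by kubectl.

# List pods, services, and persistent-volume-claims in the namespace.
$ kubectl get pods,services,pvc -n sloopstash-${ENVIRONMENT}-data-lake-s1

# Check the stateful-sets and their ready replica counts.
$ kubectl get statefulsets -n sloopstash-${ENVIRONMENT}-data-lake-s1

# Optionally query the HDFS cluster report from the name node container.
$ kubectl exec -it <HADOOP_NAME_NODE_POD> -n sloopstash-${ENVIRONMENT}-data-lake-s1 -- hdfs dfsadmin -report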

Similarly, our Kubernetes starter-kit enables you to deploy and manage multiple environments of the Apache Hadoop cluster by modifying the relevant environment variables. Detailed instructions for this process can be found in our Kubernetes starter-kit wiki on GitHub. The wiki also includes information on testing and verifying the Hadoop cluster nodes running as OCI containers in the Kubernetes cluster.