Deploy and run Apache Hadoop cluster as OCI containers using Docker

First published: Monday, September 16, 2024 | Last updated: Monday, September 16, 2024

Learn to deploy and run Apache Hadoop cluster nodes as OCI containers using the Docker container engine and the Docker compose YAML template.


This blog post provides a complete guide on deploying and running Apache Hadoop cluster nodes as OCI containers using the Docker container engine and the Docker compose YAML template.

At a high level, Apache Hadoop is a Big Data framework for storing and processing massive volumes of data. The Hadoop framework, or ecosystem, has multiple components such as HDFS, YARN, and MapReduce. Generally, a Hadoop cluster is composed of master and slave nodes, where the name nodes are the master nodes and the data nodes are the slave nodes.
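
To see this master/slave split in practice later, once the Hadoop cluster from this post is up and running, you can open a shell inside a cluster node and inspect its daemons. This is a minimal sketch; the container name below is a placeholder that depends on the Docker compose project you deploy.
# Open a shell inside a Hadoop cluster node (replace the placeholder container name).
$ sudo docker exec -it <hadoop-node-container> /bin/bash

# List the running Java daemons. A name node runs the NameNode daemon,
# while data nodes run the DataNode daemon.
$ jps

# From the name node, report the data nodes registered with HDFS.
$ hdfs dfsadmin -report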

To implement the solution described in this blog post, you’ll need a Linux machine with Docker tools properly installed on it.

If you already have a Linux host or VM, all you need to do is install the Docker container engine to get started. However, if Docker tools are not yet installed, you can easily set them up by following the article linked below. It will walk you through the steps of installing Docker tools on any Linux-based operating system, ensuring you’re ready to proceed with the solution.

  1. How to install Docker on Linux?
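
Once Docker tools are installed, you can quickly verify the setup in the terminal. The commands below are a minimal sanity check using standard Docker CLI subcommands.
# Check the Docker engine version.
$ sudo docker version

# Check the Docker compose plugin version.
$ sudo docker compose version

# Display system-wide information about the Docker engine.
$ sudo docker info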

If you don’t have a Linux host or VM, you’ll need to build a new Linux VM and install Docker tools in it. No need to worry, we have already automated the Linux VM build process for you. Simply follow the links below for step-by-step guidance on building a Linux VM, connecting to it via SSH, and installing Docker tools. These instructions will help you build a fully functional Linux VM with Docker tools installed on it, so you’re ready to apply the solution in this blog post.

  1. How to automate and deploy Linux VMs on Windows, Mac, and Linux?
  2. How to install Docker on Linux?

Set up our Docker starter-kit from GitHub

At SloopStash, we are proud to offer our own open-source Docker starter-kit repository on GitHub. This repository is designed to containerize and deploy popular full-stack applications, microservices, and Big Data workloads using Containerd, Docker, Docker compose, and Docker swarm. Additionally, we are committed to creating, uploading, and maintaining our OCI/container images for popular workloads on Docker Hub. Our comprehensive Docker starter-kit has been meticulously curated to include all the tools and resources necessary for containerizing and deploying popular workloads and services.

Here, we set up our Docker starter-kit, which contains all the code and configuration required for deploying and running the Apache Hadoop cluster nodes as OCI containers using the Docker container engine and Docker compose. You can use the following commands to set up our Docker starter-kit on any Linux-based operating system.

  1. Open the terminal.
  2. Execute the commands below in the terminal to set up our Docker starter-kit.
# Download Docker starter-kit from GitHub to local filesystem path.
$ sudo git clone https://github.com/sloopstash/kickstart-docker.git /opt/kickstart-docker

# Change ownership of Docker starter-kit directory.
$ sudo chown -R $USER:$USER /opt/kickstart-docker
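
After cloning, you can optionally confirm that the starter-kit is in place and that the Docker compose YAML template used later in this post exists at the expected path.
# Verify the cloned Git repository.
$ git -C /opt/kickstart-docker log --oneline -1

# Verify the Docker compose YAML template for the Hadoop cluster.
$ ls -l /opt/kickstart-docker/compose/data-lake/hadoop/main.yml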

Deploy and run Apache Hadoop cluster using Docker & Docker compose

Here, we deploy and run a single environment of the Apache Hadoop cluster, consisting of 1 name node and 3 data nodes. Each node in the Hadoop cluster runs as an OCI container using the Docker container engine. The deployment is automated and orchestrated through a Docker compose YAML template, which defines the Docker resources required for the Hadoop cluster nodes. You can find the Docker compose YAML template in our Docker starter-kit, providing you with everything you need to quickly spin up a functional Hadoop cluster for testing and development purposes.
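
If you would like to review the Docker resources before deploying, you can open the Docker compose YAML template directly; the path below is the same one referenced in the commands that follow.
# Inspect the Docker compose YAML template for the Hadoop cluster.
$ less /opt/kickstart-docker/compose/data-lake/hadoop/main.yml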

Your Linux machine must have at least 1.5 GB of RAM to avoid JVM memory pressure while running this 4-node Apache Hadoop cluster.
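
You can check whether your machine meets this requirement with a quick memory report.
# Display total, used, and available memory in human-readable units.
$ free -h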

# Store environment variables.
$ export ENVIRONMENT=dev

# Switch to Docker starter-kit directory.
$ cd /opt/kickstart-docker

# Provision OCI containers using Docker compose.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 up -d

# Stop OCI containers using Docker compose.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 down

# Restart OCI containers using Docker compose.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 restart
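
Once the containers are up, the following commands are a minimal sanity check; they reuse the same Docker compose flags as above, so the ENVIRONMENT variable must still be set in your shell.
# List the OCI containers managed by the Docker compose project.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 ps

# Follow logs of the OCI containers managed by the Docker compose project.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 logs -f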

Similarly, our Docker starter-kit enables you to deploy and manage multiple environments of the Apache Hadoop cluster by modifying the relevant environment variables. Detailed instructions for this process can be found in our Docker starter-kit wiki on GitHub. The wiki also includes information on testing and verifying the Hadoop cluster nodes running as OCI containers.
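
As a sketch of what that looks like, the commands below assume the starter-kit ships an environment file for another environment (for example, compose/STG.env, which is an assumption here); switching environments is then just a matter of changing the ENVIRONMENT variable and re-running the same Docker compose command under a distinct project name.
# Store environment variables (assumes a matching STG env file exists in the starter-kit).
$ export ENVIRONMENT=stg

# Provision OCI containers for the new environment using Docker compose.
$ sudo docker compose -f compose/data-lake/hadoop/main.yml --env-file compose/${ENVIRONMENT^^}.env -p sloopstash-${ENVIRONMENT}-data-lake-s1 up -d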