Hadoop cluster in Docker

We will build a Hadoop playground for learning and experimentation. Our goal is to get a Hadoop cluster with three nodes running, store some data on it, and learn a few things along the way :)

We will do all of this on a single dev machine using Docker.

Big data

We're talking about a large volume of data, most of it semi-structured or unstructured, and produced at high velocity. Traditional scale-up methods don't work well for such requirements: given the velocity and volume, a single machine will exhaust its computation and storage capacity sooner or later.

We will need to scale out.

Enter Hadoop and its distributed file system, HDFS. It's highly available, replicates data, and runs on commodity hardware. It partitions high-volume data into blocks; the blocks are stored on Datanodes, while the structure/namespace is maintained by a Namenode. Need more storage? Add a few more Datanodes!

For our experiment, we will try a setup with 1 Namenode and 2 Datanodes.

Docker!

Thanks to containers, we don't need a bunch of machines to create this environment. We will pull Hadoop images for Docker and set them up.

Secondly, since this is an experiment, we will need to tear down and scale nodes repeatedly. Containers provide a super fast way to set up and tear down such environments. Let's get started.

Here are some notes on installing Docker on Arch Linux. If you're on another operating system, feel free to skip this section.

Let’s install docker tools.

$ sudo pacman -S docker docker-compose

Add self to docker group.

$ sudo gpasswd -a arun docker
$ newgrp docker   # makes current shell aware of new group

I typically partition the drives so that the root partition stores only files/libraries related to the operating system. This makes recovery from catastrophic failures easy (with the exception of /etc, which is backed up regularly). We will move the Docker data directories outside the root partition with a systemd drop-in.

$ cat /etc/systemd/system/docker.service.d/docker-storage.conf
[Service]
ExecStart= 
ExecStart=/usr/bin/dockerd --data-root=/store/lib/docker -H fd://
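
The empty ExecStart= line clears the command from the stock unit before our version is added; without it systemd refuses to start a service with multiple ExecStart entries. systemd also caches unit files, so reload it after creating the drop-in:

$ sudo systemctl daemon-reload   # pick up the new drop-in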

Start the Docker service. I am not enabling it for autostart, to save some resources while Docker is not required.

$ systemctl start docker
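
If the Docker daemon was already running when you added the drop-in, restart it so the new data root takes effect. Either way, once the daemon is up, docker info should report the new location (the exact label in the output may differ slightly between Docker versions):

$ docker info | grep -i 'root dir'   # should point at /store/lib/docker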

Let's try the hello-world example to ensure Docker is working!

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
5b0f327be733: Pull complete 
Digest: sha256:07d5f7800dfe37b8c2196c7b1c524c33808ce2e0f74e7aa00e603295ca9a0972
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

All is well.

A Hadoop cluster

We’ll use the excellent docker images provided by uhopper. We can model our setup of 3 machines using docker-compose.

Here's how the compose file looks; we'll save it as hadoop-basic.yml. It's available on GitHub too.

version: "3"
services:
  namenode:
    image: uhopper/hadoop-namenode:2.8.1
    hostname: namenode
    container_name: namenode
    networks:
      - hadoop
    volumes:
      - namenode:/hadoop/dfs/name
    env_file:
      - ./hadoop.env
    environment:
      - CLUSTER_NAME=hadoop-cluster

  datanode1:
    image: uhopper/hadoop-datanode:2.8.1
    hostname: datanode1
    container_name: datanode1
    networks:
      - hadoop
    volumes:
      - datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: uhopper/hadoop-datanode:2.8.1
    hostname: datanode2
    container_name: datanode2
    networks:
      - hadoop
    volumes:
      - datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

networks:
  hadoop:

volumes:
  namenode:
  datanode1:
  datanode2:

# vim: sw=2
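
The compose file references a hadoop.env file shared by all three nodes. At a minimum it should point every node at the Namenode. The variable naming below follows the uhopper image convention (a file prefix such as CORE_CONF plus the property name with dots replaced by underscores); treat this as a minimal sketch and extend it as needed.

# hadoop.env - shared configuration for all nodes in the cluster
# Tell every node where the Namenode lives (maps to fs.defaultFS in core-site.xml)
CORE_CONF_fs_defaultFS=hdfs://namenode:8020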

Save the compose file as hadoop-basic.yml in a local directory, alongside the hadoop.env file. Let's start the cluster.

$ docker-compose -f hadoop-basic.yml up

Verify the cluster setup

Let's find out the IP address of the namenode.

$ docker inspect namenode | grep IPAddress
            "SecondaryIPAddresses": null,                                                                                                                      
            "IPAddress": "",                                                                                                                                   
                    "IPAddress": "172.18.0.2",

# open http://172.18.0.2:50070/explorer.html in your favorite browser

You should see a page with our cluster details. Feel free to play around and check the Datanodes' status.
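
We can also verify from the command line: hdfs dfsadmin -report prints the Namenode's view of the cluster, including the number of live Datanodes (if hdfs isn't on the PATH for docker exec, open a shell in the container first, as we do later).

$ docker exec namenode hdfs dfsadmin -report   # should list 2 live datanodes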

Configure the cluster

The default block size in Hadoop 2.x is 128M. However, for our experiment the data will likely be much smaller than that. It's a good idea to reduce the block size so that even small files are split into multiple blocks and we still get to experiment with a distributed file system.

A note on configuration.

  • Various properties of the Namenode or Datanode are configured via /etc/hadoop/core-site.xml, /etc/hadoop/hdfs-site.xml, etc.
  • For the docker images, these can be configured easily via the hadoop.env file; each entry becomes an environment variable that sets the corresponding property.

Here’s how we can set the blocksize.

# Set the following in the hadoop.env file
HDFS_CONF_dfs_blocksize=1m

# This setting will automatically be written to the `/etc/hadoop/hdfs-site.xml` file.
# Refer to https://bitbucket.org/uhopper/hadoop-docker/src/master/README.md (Hadoop Configuration) for details.

# Press Ctrl+C so that docker-compose tears down the nodes. Then compose up again!
$ docker-compose -f hadoop-basic.yml up

Now the configuration should be applied to all nodes. The cluster will be up in less than a minute.
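
To double check that the new block size is in effect, we can ask HDFS for the resolved value of the property; hdfs getconf -confKey prints a single configuration key.

$ docker exec namenode hdfs getconf -confKey dfs.blocksize   # should reflect the 1m setting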

Data operations

Let's push some data into HDFS.

Open a bash shell to the namenode.

$ docker exec -it namenode bash
root@namenode:/$ cd tmp

Download the Complete Works of Mark Twain and push it to HDFS.

# Download Complete Works of Mark Twain from Project Gutenberg
root@namenode:/tmp$ curl -Lo marktwain.txt http://www.gutenberg.org/cache/epub/3200/pg3200.txt
root@namenode:/tmp$ hdfs dfs -mkdir -p hdfs://namenode:8020/project/wordcount
root@namenode:/tmp$ hdfs dfs -cp file:///tmp/marktwain.txt hdfs://namenode:8020/project/wordcount/marktwain.txt

Check that the file was uploaded using the command line.

root@namenode:/tmp$ hdfs dfs -ls -R hdfs://namenode:8020/project
drwxr-xr-x   - root supergroup          0 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount
-rw-r--r--   3 root supergroup   16013958 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount/marktwain.txt
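
Since we lowered dfs.blocksize to 1m, this roughly 15 MB file should have been split into about 16 blocks spread across the Datanodes. hdfs fsck can show the block layout; this assumes fs.defaultFS points at the Namenode (as set in hadoop.env), otherwise pass the full hdfs://namenode:8020 URI.

root@namenode:/tmp$ hdfs fsck /project/wordcount/marktwain.txt -files -blocks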

You can also verify the upload in the web portal. Navigate to http://172.18.0.2:50070/explorer.html (replace with your namenode's IP address).

That's all for the setup. We will try some more big data experiments next time. Namaste!