We will talk about a hadoop playground for learning and experimentation. Our goal is to get a hadoop cluster with three nodes running, store some data on it, and learn a few things along the way :)
We will do all of this in a single dev machine using docker.
Big data
So we’re talking about a large volume of data, most of it semi-structured or unstructured, and produced at a high velocity. Our traditional scale-up methods may not work well under such requirements: given the velocity and volume, we will likely exhaust the available computation/storage capacity sooner or later.
We will need to scale out.
Enter hadoop and its distributed file system, HDFS. It’s highly available, replicates data, and runs on commodity hardware. It partitions high-volume data into blocks; the blocks are stored on Datanodes, while the structure/namespace is maintained by a Namenode. Need more storage? Add a few more datanodes!
For our experiment, we will try a setup with 1 namenode and 2 datanodes.
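To get a feel for the numbers: with the default 128M block size and the default replication factor of 3, a 1 GB file splits into 8 blocks, and 24 block replicas end up spread across the datanodes.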
Docker!
Thanks to containers, we don’t need a bunch of machines to create this environment. We will pull in hadoop images for docker and try setting them up.
Secondly, since this is an experiment, we will need to tear down and scale nodes, and containers provide a super fast way to set environments up and tear them down. Let’s get started.
Some notes on installing docker on Arch Linux. If you’re on another operating system, feel free to skip this section.
Let’s install docker tools.
$ sudo pacman -S docker docker-compose
Add self to the docker group.
$ sudo gpasswd -a arun docker
$ newgrp docker # makes current shell aware of new group
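To confirm the change took effect, list the groups visible to the current shell; docker should appear among them.
$ id -nG # "docker" should appear in the output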
I typically partition the root drive to store only files/libraries related to the operating system. This makes recovery from catastrophic failures easy (with the exception of /etc, which is backed up regularly). We will move the docker data directories outside the root partition.
$ cat /etc/systemd/system/docker.service.d/docker-storage.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --data-root=/store/lib/docker -H fd://
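Since this override lives in a systemd drop-in, reload the unit files so systemd picks it up before starting the service.
$ sudo systemctl daemon-reload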
Start the docker service. I am not enabling it for autostart, to save some resources until docker is actually required.
$ systemctl start docker
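Once the daemon is up, it’s worth confirming that the custom data root is in effect; docker info reports it as Docker Root Dir.
$ docker info | grep "Docker Root Dir" # should show /store/lib/docker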
Let’s try the hello-world example to ensure docker is working!
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
5b0f327be733: Pull complete
Digest: sha256:07d5f7800dfe37b8c2196c7b1c524c33808ce2e0f74e7aa00e603295ca9a0972
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
All is well.
A Hadoop cluster
We’ll use the excellent docker images provided by uhopper. We can model our setup of 3 machines using docker-compose.
Here’s what the docker-compose file looks like. It’s available on github too.
version: "3"
services:
  namenode:
    image: uhopper/hadoop-namenode:2.8.1
    hostname: namenode
    container_name: namenode
    networks:
      - hadoop
    volumes:
      - namenode:/hadoop/dfs/name
    env_file:
      - ./hadoop.env
    environment:
      - CLUSTER_NAME=hadoop-cluster
  datanode1:
    image: uhopper/hadoop-datanode:2.8.1
    hostname: datanode1
    container_name: datanode1
    networks:
      - hadoop
    volumes:
      - datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: uhopper/hadoop-datanode:2.8.1
    hostname: datanode2
    container_name: datanode2
    networks:
      - hadoop
    volumes:
      - datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

networks:
  hadoop:

volumes:
  namenode:
  datanode1:
  datanode2:

# vim: sw=2
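The compose file also expects a hadoop.env next to it. At a minimum it should point every node at the namenode. Here’s a minimal sketch, assuming the uhopper convention where CORE_CONF_* entries are rendered into core-site.xml:
# hadoop.env (minimal sketch)
CORE_CONF_fs_defaultFS=hdfs://namenode:8020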
Save both files in a local directory, and let’s start the cluster.
$ docker-compose -f hadoop-basic.yml up
Verify the cluster setup
Let’s find out the IP of the namenode.
$ docker inspect namenode | grep IPAddress
"SecondaryIPAddresses": null,
"IPAddress": "",
"IPAddress": "172.18.0.2",
# open http://172.18.0.2:50070/explorer.html in your favorite browser
You should see a page with our cluster details. Feel free to play around and check the data nodes setup.
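Besides the web UI, you can also ask the namenode itself how many datanodes have registered; with both datanodes up, the report should list two live datanodes.
$ docker exec -it namenode hdfs dfsadmin -report
# the summary should include a line like "Live datanodes (2):"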
Configure the cluster
The default block size for files is 128M. However, for our experiment it is likely that the data is smaller than that. It may be a good idea to reduce the block size so that we can store smaller files and still experiment with the distributed file system.
A note on configuration:
- Various properties of the Namenode or Datanode are configured via /etc/hadoop/core-site.xml, /etc/hadoop/hdfs-site.xml, etc.
- For the docker images, these can be configured easily via the hadoop.env file (by setting an env variable).
Here’s how we can set the blocksize.
# Set following in the hadoop.env file
HDFS_CONF_dfs_blocksize=1m
# This setting will automatically be setup in `/etc/hadoop/hdfs-site.xml` file.
# Refer https://bitbucket.org/uhopper/hadoop-docker/src/master/README.md (Hadoop Configuration) for details.
# Press Ctrl+C so that docker tears down the nodes. Then compose up again!
$ docker-compose -f hadoop-basic.yml up
Now the configuration should be applied to all nodes. The cluster will be up in less than a minute.
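To double-check that the new value landed in the generated hdfs-site.xml, you can query the effective configuration from inside the namenode container.
$ docker exec -it namenode hdfs getconf -confKey dfs.blocksize
# should report 1m (i.e. 1048576 bytes) rather than the 128m default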
Data operations
Let’s try to push some data into the hadoop storage.
Open a bash shell on the namenode.
$ docker exec -it namenode bash
root@namenode:/$ cd tmp
Download the Complete Works of Mark Twain and push it to HDFS.
# Download Complete Works of Mark Twain from Project Gutenberg
root@namenode:/tmp$ curl -Lo marktwain.txt http://www.gutenberg.org/cache/epub/3200/pg3200.txt
root@namenode:/tmp$ hdfs dfs -mkdir -p hdfs://namenode:8020/project/wordcount
root@namenode:/tmp$ hdfs dfs -cp file:///tmp/marktwain.txt hdfs://namenode:8020/project/wordcount/marktwain.txt
Check via the command line that the file was uploaded.
root@namenode:/tmp$ hdfs dfs -ls -R hdfs://namenode:8020/project
drwxr-xr-x - root supergroup 0 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount
-rw-r--r-- 3 root supergroup 16013958 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount/marktwain.txt
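Since we lowered the block size to 1M, this ~16MB file should be split into roughly 16 blocks spread across the two datanodes. As a quick sanity check (assuming fs.defaultFS points at the namenode, as in the hadoop.env sketch above), hdfs fsck can print the block breakdown.
root@namenode:/tmp$ hdfs fsck /project/wordcount/marktwain.txt -files -blocks
# lists each block of the file along with its size and replication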
You can also check via the web portal that the file was uploaded. Navigate to http://172.18.0.2:50070/explorer.html, replacing the IP with your namenode’s address.
That’s all about the setup. We will try some more big data experiments next time. Namaste!