We will talk about a hadoop playground for learning and experimentation. Our goal is to get a hadoop cluster with three nodes, store some data on it. And learn a few things along the way :)
We will do all of this in a single dev machine using docker.
So we’re talking about large volume of data, most of it is of semi or unstructured variety, and lastly it is produced at a high velocity. Our traditional scale up methods may not work great on such requirements. Given the velocity and volume, we will likely consume all the computation/storage power sooner.
We will need to scale out.
Enter hadoop - a distributed file system. It’s highly available, replicates the data, runs on commodity hardware. It partitions the high volume data into blocks, the blocks are stored on Datanodes and the structure/namespace is maintained by a Namenode. Need more storage, add a few more data nodes!
For our experiment, we will try a setup with 1 namenode and 2 datanodes.
Thanks to containers, we don’t require bunch of machines to create this
environment. We will pull in
hadoop images for docker and try setting them up.
Secondly, since this is an experiment, we will need to teardown and scale nodes. Containers provide a super fast way to setup/teardown setups. Let’s get started.
Some notes on installing docker on
Arch Linux. If you’re on another operating
system, feel free to skip this section.
Let’s install docker tools.
$ sudo pacman -S docker docker-compose
Add self to
$ sudo gpasswd -a arun docker $ newgrp docker # makes current shell aware of new group
I typically partition the root drive to store files/libraries related to the
operating system. This allows recovery from catastrophic failures easy (with the
/etc, which is backed up regularly). We will move the docker data
directories outside root partition.
$ cat /etc/systemd/system/docker.service.d/docker-storage.conf [Service] ExecStart= ExecStart=/usr/bin/dockerd --data-root=/store/lib/docker -H fd://
Start docker service. I am not enabling this service for autostart to save some resources until docker is required.
$ systemctl start docker
hello-world example to ensure docker is working!
$ docker run hello-world Unable to find image 'hello-world:latest' locally latest: Pulling from library/hello-world 5b0f327be733: Pull complete Digest: sha256:07d5f7800dfe37b8c2196c7b1c524c33808ce2e0f74e7aa00e603295ca9a0972 Status: Downloaded newer image for hello-world:latest Hello from Docker! This message shows that your installation appears to be working correctly.
All is well.
A Hadoop cluster
We’ll use the excellent docker images provided by uhopper. We can model our
setup of 3 machines using
Here’s how the
docker-compose.yml looks like. It’s available in github too.
version: "3" services: namenode: image: uhopper/hadoop-namenode:2.8.1 hostname: namenode container_name: namenode networks: - hadoop volumes: - namenode:/hadoop/dfs/name env_file: - ./hadoop.env environment: - CLUSTER_NAME=hadoop-cluster datanode1: image: uhopper/hadoop-datanode:2.8.1 hostname: datanode1 container_name: datanode1 networks: - hadoop volumes: - datanode1:/hadoop/dfs/data env_file: - ./hadoop.env datanode2: image: uhopper/hadoop-datanode:2.8.1 hostname: datanode2 container_name: datanode2 networks: - hadoop volumes: - datanode2:/hadoop/dfs/data env_file: - ./hadoop.env networks: hadoop: volumes: namenode: datanode1: datanode2: # vim: sw=2
Save this file in a local directory. And let’s start it.
$ docker-compose -f hadoop-basic.yml up
Verify the cluster setup
Let’s find out the ip for
$ docker inspect namenode | grep IPAddress "SecondaryIPAddresses": null, "IPAddress": "", "IPAddress": "172.18.0.2", # open http://172.18.0.2:50070/explorer.html in your favorite browser
You should see a page with our cluster details. Feel free to play around and check the data nodes setup.
Configure the cluster
The default blocksize of the files are
64M. However for our experiment, it is
likely that the data is less than that. It may be a good idea to reduce the
blocksize so that we can store smaller files and still experiment with a
distributed file system.
A note on configuration.
- Various properties of the
Datanode are configured via
- For the docker images, these can be configured easily with
hadoop.env file (setting an env variable)
Here’s how we can set the blocksize.
# Set following in the hadoop.env file HDFS_CONF_dfs_blocksize=1m # This setting will automatically be setup in `/etc/hadoop/hdfs-site.xml` file. # Refer https://bitbucket.org/uhopper/hadoop-docker/src/master/README.md (Hadoop Configuration) for details. # Press Ctrl+C so that docker will teardown the nodes. Then compose up again! $ docker-compose up
Now the configuration should be applied to all nodes. The cluster will be up in less than a minute.
Let’s try to push some data into the hadoop storage.
Open a bash shell to the
$ docker exec -it namenode bash [email protected]:/$ cd tmp
Complete Works of Mark Twain and push it to HDFS.
# Download Complete Works of Mark Twain from Project Gutenberg [email protected]:/tmp$ curl -Lo marktwain.txt http://www.gutenberg.org/cache/epub/3200/pg3200.txt [email protected]:/tmp$ hdfs dfs -mkdir -p hdfs://namenode:8020/project/wordcount [email protected]:/tmp$ hdfs dfs -cp file:///tmp/marktwain.txt hdfs://namenode:8020/project/wordcount/marktwain.txt
Check if the file is uploaded via command line.
[email protected]:/tmp$ hdfs dfs -ls -R hdfs://namenode:8020/project drwxr-xr-x - root supergroup 0 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount -rw-r--r-- 3 root supergroup 16013958 2017-10-28 14:21 hdfs://namenode:8020/project/wordcount/marktwain.txt
You can also check if the file is uploaded via web portal. Navigate to
http://172.18.0.2:50070/explorer.html. Replace your
namenode ip address.
That’s all about the setup. We will try some more big data experiments next time. Namaste!