Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. More information can be found on the Apache Hadoop website (https://hadoop.apache.org/).

Available containers

Name | sha256 | mpi | labels | description
Singularity.Hadoop-3.2-Java-11.sif | 022a101faab9acc056c0315df903099141d7a8f58ec8ffdcbc22f36edb4c0dfa | none | none | Hadoop v3.2, Java 11
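
The sha256 value in the table can be used to check a downloaded image before running it. A minimal sketch, assuming the GNU coreutils sha256sum tool is available on the host; verify_image is a hypothetical helper name, not something shipped with the container:

```shell
# verify_image FILE EXPECTED_SHA256
# Returns 0 when the SHA-256 digest of FILE equals EXPECTED_SHA256,
# non-zero otherwise. Prints nothing (--status).
verify_image() {
    # sha256sum -c reads lines of the form "DIGEST  FILENAME" (two spaces);
    # the trailing "-" makes it read that list from standard input.
    printf '%s  %s\n' "$2" "$1" | sha256sum -c --status -
}
```

For example:

verify_image Singularity.Hadoop-3.2-Java-11.sif 022a101faab9acc056c0315df903099141d7a8f58ec8ffdcbc22f36edb4c0dfa && echo "image OK"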

Running the container

Settings files

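A minimal configuration consists of the two files shown below. Store them in a directory on the host, e.g. /host/hadoop/conf, and export the locations of the configuration and log directories: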

export HADOOP_CONF_DIR=/host/hadoop/conf
export HADOOP_LOGS=/host/hadoop/logs
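
Both directories must exist on the host before they are bind-mounted or read. A small sketch that creates them and exports the variables in one step; setup_hadoop_env is a hypothetical helper, and the data subdirectory is created on the assumption that hadoop.tmp.dir will live under the same base path, as in the example core-site.xml in this document:

```shell
# setup_hadoop_env BASE_DIR
# Creates BASE_DIR/conf, BASE_DIR/logs and BASE_DIR/data, then exports
# the two variables used by the commands in this document.
setup_hadoop_env() {
    mkdir -p "$1/conf" "$1/logs" "$1/data" || return 1
    export HADOOP_CONF_DIR="$1/conf"
    export HADOOP_LOGS="$1/logs"
}
```

For example:

setup_hadoop_env /host/hadoop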

core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://nodemanager:port</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/host/hadoop/data</value>
    </property>
</configuration>

Replace nodemanager and port with the host name and port of the machine that runs the namenode, and /host/hadoop/data with a writable directory on the host.

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>rep_value</value>
    </property>
</configuration>

Replace rep_value with the number of replicas to keep of each data block.
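
The two files can also be generated from a script instead of being edited by hand. A minimal sketch: write_hadoop_conf is a hypothetical helper that fills the placeholders of the templates above from its arguments, and the host name, port and replication factor in the usage line are illustrative values only.

```shell
# write_hadoop_conf CONF_DIR NAMENODE_HOST PORT DATA_DIR REPLICATION
# Writes core-site.xml and hdfs-site.xml into CONF_DIR.
write_hadoop_conf() {
    mkdir -p "$1" || return 1
    # Unquoted heredoc delimiter: $2, $3, $4 are expanded inside the XML.
    cat > "$1/core-site.xml" <<EOF
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://$2:$3</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>$4</value>
    </property>
</configuration>
EOF
    cat > "$1/hdfs-site.xml" <<EOF
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>$5</value>
    </property>
</configuration>
EOF
}
```

For example:

write_hadoop_conf /host/hadoop/conf namenode.example.org 9000 /host/hadoop/data 3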

Starting cluster

Before the cluster is started for the first time, the DFS must be formatted:

singularity run -B $HADOOP_LOGS:/app/software/Hadoop/3.2Hadoop-3.2-Java-11/logs Singularity.Hadoop-3.2-Java-11.sif hdfs --config $HADOOP_CONF_DIR namenode -format

Starting namenode

singularity run -B $HADOOP_LOGS:/app/software/Hadoop/3.2Hadoop-3.2-Java-11/logs Singularity.Hadoop-3.2-Java-11.sif hdfs --config $HADOOP_CONF_DIR --daemon start namenode

Starting datanodes

On each node that should act as a datanode, run:

singularity run -B $HADOOP_LOGS:/app/software/Hadoop/3.2Hadoop-3.2-Java-11/logs Singularity.Hadoop-3.2-Java-11.sif hdfs --config $HADOOP_CONF_DIR --daemon start datanode

Checking the cluster

singularity run -B $HADOOP_LOGS:/app/software/Hadoop/3.2Hadoop-3.2-Java-11/logs Singularity.Hadoop-3.2-Java-11.sif hdfs --config $HADOOP_CONF_DIR dfsadmin -report
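
The four commands above differ only in their trailing hdfs arguments, so the common part can be wrapped in a shell function. A sketch, not part of the container: hdfs_c, HADOOP_IMAGE, CONTAINER_LOGS and CONTAINER_RUNTIME are hypothetical names introduced here, and the bind path is copied unchanged from the commands above.

```shell
# Image file and in-container log path, as used in the commands above.
HADOOP_IMAGE=Singularity.Hadoop-3.2-Java-11.sif
CONTAINER_LOGS=/app/software/Hadoop/3.2Hadoop-3.2-Java-11/logs

# hdfs_c ARGS...
# Runs "hdfs --config $HADOOP_CONF_DIR ARGS..." inside the container with the
# host log directory bound in. Aborts with a message if HADOOP_LOGS or
# HADOOP_CONF_DIR is unset. CONTAINER_RUNTIME is an optional override of the
# singularity executable (for example "echo" for a dry run).
hdfs_c() {
    "${CONTAINER_RUNTIME:-singularity}" run \
        -B "${HADOOP_LOGS:?set HADOOP_LOGS first}:$CONTAINER_LOGS" \
        "$HADOOP_IMAGE" \
        hdfs --config "${HADOOP_CONF_DIR:?set HADOOP_CONF_DIR first}" "$@"
}
```

With this in place the steps become:

hdfs_c namenode -format
hdfs_c --daemon start namenode
hdfs_c --daemon start datanode
hdfs_c dfsadmin -report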