Building an RPM for Spark 2.x for Vendor Hadoop Distributions

It may be necessary to produce an alternately packaged version of Spark for use with a vendor-provided Hadoop distribution. This became apparent to me many times when deploying Hortonworks HDP into enterprise environments where update/upgrade cycles do not allow HDFS and YARN to be upgraded rapidly. So, over time, I built a Maven project on my home Hadoop cluster that grabs a specific Spark version and creates an RPM. Something similar can also be done with Cloudera CDH 5.15.x (https://github.com/gss2002/gss-spark2/tree/master/cloudera).

The project is located at: https://github.com/gss2002/gss-spark2/

How to:

Install the RPM build tools on a CentOS/RHEL node:

yum install rpm-build

Then build the RPM from the project checkout:

cd gss-spark2
./apache-maven-3.5.4/bin/mvn package -Drpm.version=2.2.2 -Drpm.release=1
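Assuming the build succeeds, the rpm-maven-plugin drops the finished package under the project's target directory; the exact subpath and package file name depend on the pom.xml, so the paths below are illustrative only:

# Locate the RPM produced by the build (path and name vary with the pom.xml)
find . -path "*target/rpm*" -name "*.rpm"

# Install it on a cluster node (package file name here is hypothetical)
sudo yum -y localinstall ./target/rpm/gss-spark2/RPMS/noarch/gss-spark2-2.2.2-1.noarch.rpm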


What the project contains specifically to make this custom version of Spark work with HDP 2.5.x:

The install.sh RPM script injects the settings required to make Spark 2.x work with the Spark 1.x config files shipped with HDP 2.5.x:

Specifically, it disables Spark's use of the YARN Application Timeline Server (ATS), since there is a Jersey 1.19 compatibility issue, and it passes -Dhdp.version= (taken from hdp-select) so Spark can locate the HDP components; note that this flag is needed in the extraJavaOptions of the executor, the driver, and the AM (application master). It also sets spark.yarn.archive to an HDFS path holding a zip of -spark2/lib/* to speed up Spark 2 startup and execution.

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Derive the HDP version from the hdp-select spark-client entry
HDPSELECT=`/usr/bin/hdp-select | /usr/bin/grep spark-client | /bin/cut -d " " -f 3`

export SPARK_SUBMIT_OPTS="-Dhdp.version=$HDPSELECT $SPARK_SUBMIT_OPTS"

# Pass hdp.version to the driver, executor, and AM; disable ATS; point
# spark.yarn.archive at the prebuilt zip in HDFS
ADDTLARGS="--conf spark.driver.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.executor.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.yarn.archive=hdfs:///apps/spark221/spark221.zip"

if [[ "$1" == "pyspark-shell-main" || "$1" == "sparkr-shell-main" ]] ; then
        exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@" $ADDTLARGS
else
        exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit $ADDTLARGS "$@"
fi
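The spark.yarn.archive referenced above must exist in HDFS before jobs are submitted. A minimal sketch of creating and uploading it, assuming the RPM lays the Spark 2 libraries down under a lib directory (the install prefix below is hypothetical; use the one from your RPM spec):

# Zip the installed Spark 2 libraries (install prefix is an assumption)
cd /usr/gss-spark2/lib
zip -r /tmp/spark221.zip ./*

# Upload to the HDFS path used by spark.yarn.archive in install.sh
hdfs dfs -mkdir -p /apps/spark221
hdfs dfs -put /tmp/spark221.zip /apps/spark221/spark221.zip

Serving one archive out of HDFS saves YARN from re-uploading a few hundred jars from the client on every submission, which is where the startup speedup comes from.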

The postinstall.sh RPM script also symlinks the following files from HDP's /etc/spark/conf:

lrwxrwxrwx.  1 root root   35 Oct 15 02:02 spark-defaults.conf -> /etc/spark/conf/spark-defaults.conf
lrwxrwxrwx.  1 root root   28 Oct 15 02:02 spark-env.sh -> /etc/spark/conf/spark-env.sh
lrwxrwxrwx.  1 root root   29 Oct 15 02:02 hive-site.xml -> /etc/spark/conf/hive-site.xml
lrwxrwxrwx.  1 root root   32 Oct 15 02:02 log4j.properties -> /etc/spark/conf/log4j.properties
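A minimal sketch of the postinstall.sh logic that would create those links; SPARK2_CONF is a hypothetical stand-in for the conf directory the RPM actually installs:

# Link the existing HDP Spark 1.x client configs into the Spark 2 conf dir
SPARK2_CONF=/usr/gss-spark2/conf   # assumption: adjust to the RPM's install path
for f in spark-defaults.conf spark-env.sh hive-site.xml log4j.properties; do
    /bin/ln -sfn /etc/spark/conf/$f $SPARK2_CONF/$f
done

Symlinking rather than copying means the Spark 2 package automatically picks up any config changes pushed to /etc/spark/conf (e.g., by Ambari).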
