Building an RPM for Spark 2.x for Vendor Hadoop Distributions
It may be necessary to produce an alternate packaged version of Spark for use with a vendor-provided Hadoop distribution. This became apparent to me many times when loading Hortonworks HDP into an enterprise environment where update/upgrade cycles do not allow HDFS and YARN to be upgraded rapidly. So, over time, I came up with an idea on my home Hadoop cluster: build a Maven project that grabs a specific Spark version and creates an RPM. Something similar can also be done with Cloudera CDH 5.15.x (https://github.com/gss2002/gss-spark2/tree/master/cloudera).
The project is located at: https://github.com/gss2002/gss-spark2/
How to:
Install the RPM build tools on a CentOS/RHEL node:
yum install rpm-build
Then clone the project and build the RPM:
git clone https://github.com/gss2002/gss-spark2.git
cd gss-spark2
./apache-maven-3.5.4/bin/mvn package -Drpm.version=2.2.2 -Drpm.release=1
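After the build completes, install the resulting RPM on the cluster nodes with standard rpm tooling. A minimal sketch, assuming a Maven RPM plugin that writes its output under target/rpm/ (the exact artifact name and path are assumptions; they depend on the project's pom.xml):

# Locate the freshly built RPM under the Maven build output (path is an assumption)
find . -name "*.rpm" -path "*target/rpm/*"
# Install it on the node (the filename shown is hypothetical)
rpm -ivh target/rpm/spark2/RPMS/noarch/spark2-2.2.2-1.noarch.rpm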
What the project contains specifically to make this custom version of Spark work with HDP 2.5.x:
The install.sh RPM script injects the settings required to make Spark2 work with the Spark 1.x config files and HDP 2.5.x.
Specifically, the script disables Spark's use of ATS (the YARN Application Timeline Service), since there is a Jersey 1.19 compatibility issue, and passes -Dhdp.version= set to the version reported by hdp-select, since Spark needs to know where the HDP components are located. Note that this flag is needed in the executor, driver, and AM (application master) extraJavaOptions. Finally, spark.yarn.archive is set to an HDFS path containing a zip of the spark2 lib/* contents, which speeds up Spark2 startup and execution (a sketch of staging that archive follows the script below).
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Pull the active HDP version from hdp-select so -Dhdp.version can be set
HDPSELECT=`/usr/bin/hdp-select | /usr/bin/grep spark-client | /bin/cut -d " " -f 3`
export SPARK_SUBMIT_OPTS="-Dhdp.version=$HDPSELECT $SPARK_SUBMIT_OPTS"

# Inject -Dhdp.version into the driver, executor, and AM JVMs, disable the
# ATS client (Jersey 1.19 incompatibility), and point spark.yarn.archive at
# the pre-staged zip of the Spark2 jars in HDFS
ADDTLARGS="--conf spark.driver.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.executor.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.yarn.archive=hdfs:///apps/spark221/spark221.zip"

if [[ "$1" == "pyspark-shell-main" || "$1" == "sparkr-shell-main" ]]; then
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@" $ADDTLARGS
else
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit $ADDTLARGS "$@"
fi
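The spark.yarn.archive zip referenced above has to be staged in HDFS ahead of time. A minimal sketch, assuming the Spark2 RPM installs under /usr/gss/spark2 (the local path is an assumption; the HDFS path matches the script above):

# Build a flat zip of the Spark2 jars
cd /usr/gss/spark2/lib
zip /tmp/spark221.zip *.jar
# Stage it at the location spark.yarn.archive points to
hdfs dfs -mkdir -p /apps/spark221
hdfs dfs -put -f /tmp/spark221.zip /apps/spark221/spark221.zip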
Also, the postinstall.sh RPM script symlinks the following files from HDP's /etc/spark/conf (a sketch of the relevant commands follows the listing):
lrwxrwxrwx. 1 root root  35 Oct 15 02:02 spark-defaults.conf -> /etc/spark/conf/spark-defaults.conf
lrwxrwxrwx. 1 root root  28 Oct 15 02:02 spark-env.sh -> /etc/spark/conf/spark-env.sh
lrwxrwxrwx. 1 root root  29 Oct 15 02:02 hive-site.xml -> /etc/spark/conf/hive-site.xml
lrwxrwxrwx. 1 root root  32 Oct 15 02:02 log4j.properties -> /etc/spark/conf/log4j.properties
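A minimal sketch of what those postinstall.sh symlink commands might look like, assuming the RPM installs Spark2 under /usr/gss/spark2 (the install prefix is an assumption; the link targets come from the listing above):

# postinstall.sh (sketch) -- reuse the HDP-managed Spark 1.x config files
SPARK2_CONF=/usr/gss/spark2/conf   # assumed install prefix; adjust to your RPM layout
for f in spark-defaults.conf spark-env.sh hive-site.xml log4j.properties; do
    ln -sf /etc/spark/conf/$f $SPARK2_CONF/$f
done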