Building an RPM for Spark 2.x for Vendor Hadoop Distributions

It may be necessary to produce an alternately packaged version of Spark for use with a vendor-provided Hadoop distribution. This became apparent to me many times when deploying Hortonworks HDP into enterprise environments where update/upgrade cycles do not allow HDFS and YARN to be upgraded rapidly. So, over time, I built a Maven project on my home Hadoop cluster that grabs a specific Spark version and creates an RPM. Something similar can also be done with Cloudera CDH 5.15.x (https://github.com/gss2002/gss-spark2/tree/master/cloudera).

The project is located at: https://github.com/gss2002/gss-spark2/

How to:

Install the RPM build tools on a CentOS/RHEL node:

yum install rpm-build

Then build the RPM from the project checkout:

cd gss-spark2
./apache-maven-3.5.4/bin/mvn package -Drpm.version=2.2.2 -Drpm.release=1
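Assuming the build succeeds, the rpm-maven-plugin drops the finished package under the project's target directory; the exact subpath and package file name depend on the pom.xml, so the paths below are illustrative only:

# Locate the RPM produced by the build (path and name vary with the pom.xml)
find . -path "*target/rpm*" -name "*.rpm"

# Install it on a cluster node (package file name here is hypothetical)
sudo yum -y localinstall ./target/rpm/gss-spark2/RPMS/noarch/gss-spark2-2.2.2-1.noarch.rpm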


What the project contains specifically to make this custom version of Spark work with HDP 2.5.x:

The install.sh RPM script injects the settings required to make Spark 2.x work with the Spark 1.x config files shipped with HDP 2.5.x:

Specifically, it disables Spark's use of the YARN Application Timeline Server (ATS), since there is a Jersey 1.19 compatibility issue, and it passes -Dhdp.version= (taken from hdp-select) so Spark can locate the HDP components; note that this flag is needed in the extraJavaOptions of the executor, the driver, and the AM (application master). It also sets spark.yarn.archive to an HDFS path holding a zip of -spark2/lib/* to speed up Spark 2 startup and execution.

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

# Derive the HDP version from the hdp-select spark-client entry
HDPSELECT=`/usr/bin/hdp-select | /usr/bin/grep spark-client | /bin/cut -d " " -f 3`

export SPARK_SUBMIT_OPTS="-Dhdp.version=$HDPSELECT $SPARK_SUBMIT_OPTS"

# Pass hdp.version to the driver, executor, and AM; disable ATS; point
# spark.yarn.archive at the prebuilt zip in HDFS
ADDTLARGS="--conf spark.driver.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.executor.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=$HDPSELECT --conf spark.hadoop.yarn.timeline-service.enabled=false --conf spark.yarn.archive=hdfs:///apps/spark221/spark221.zip"

if [[ "$1" == "pyspark-shell-main" || "$1" == "sparkr-shell-main" ]] ; then
        exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@" $ADDTLARGS
else
        exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit $ADDTLARGS "$@"
fi
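The spark.yarn.archive referenced above must exist in HDFS before jobs are submitted. A minimal sketch of creating and uploading it, assuming the RPM lays the Spark 2 libraries down under a lib directory (the install prefix below is hypothetical; use the one from your RPM spec):

# Zip the installed Spark 2 libraries (install prefix is an assumption)
cd /usr/gss-spark2/lib
zip -r /tmp/spark221.zip ./*

# Upload to the HDFS path used by spark.yarn.archive in install.sh
hdfs dfs -mkdir -p /apps/spark221
hdfs dfs -put /tmp/spark221.zip /apps/spark221/spark221.zip

Serving one archive out of HDFS saves YARN from re-uploading a few hundred jars from the client on every submission, which is where the startup speedup comes from.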

The postinstall.sh RPM script also symlinks the following files from HDP's /etc/spark/conf:

lrwxrwxrwx.  1 root root   35 Oct 15 02:02 spark-defaults.conf -> /etc/spark/conf/spark-defaults.conf
lrwxrwxrwx.  1 root root   28 Oct 15 02:02 spark-env.sh -> /etc/spark/conf/spark-env.sh
lrwxrwxrwx.  1 root root   29 Oct 15 02:02 hive-site.xml -> /etc/spark/conf/hive-site.xml
lrwxrwxrwx.  1 root root   32 Oct 15 02:02 log4j.properties -> /etc/spark/conf/log4j.properties
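A minimal sketch of the postinstall.sh logic that would create those links; SPARK2_CONF is a hypothetical stand-in for the conf directory the RPM actually installs:

# Link the existing HDP Spark 1.x client configs into the Spark 2 conf dir
SPARK2_CONF=/usr/gss-spark2/conf   # assumption: adjust to the RPM's install path
for f in spark-defaults.conf spark-env.sh hive-site.xml log4j.properties; do
    /bin/ln -sfn /etc/spark/conf/$f $SPARK2_CONF/$f
done

Symlinking rather than copying means the Spark 2 package automatically picks up any config changes pushed to /etc/spark/conf (e.g., by Ambari).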
