Category Archives: Apache Spark

Building an RPM for Spark 2.x for Vendor Hadoop Distribution

Building an RPM for Spark 2.x for Vendor Hadoop Distribution It may be necessary to produce an alternate packaged version of Spark for usage in a vendor provided Hadoop Distribution. This became apparent many times to me when loading Hortonworks HDP into an Enterprise Environment where update/upgrade cycles do not allow for upgrade of HDFS… Read More »

How to use the Native IBM MQ Client Receiver with Spark Streaming

How to use the Native IBM MQ Client Receiver with Spark Streaming After using Apache Nifi and IBM MQ I noticed that Nifi could not easily guarantee order of incoming messages as failover can occur at anytime. This becomes a problem specifically with database and table replication when the replicating software puts messages to a… Read More »

Apache Ranger Audit Logs stored in HDFS parsed with Apache Spark

Using Apache Spark to parse a large HDFS archive of Ranger Audit logs using Apache Spark to find and verify if a user attempted to access files in HDFS, Hive or HBase. This eliminates the need to use a Hive SerDe to read these Apache Ranger JSON Files and to have to create an external… Read More »