HDFS Archives - GS Tech Blog

Xml Processing with MapReduce/Spark using an Xml StaX Parser

by GS
in Apache, Apache Hadoop, Apache Spark, Hadoop, Hadoop Mapreduce, Hadoop XML Processing, HDFS, Hortonworks, java, StaX XML Parser, XmlInputFormat, XmlStaxInputFormat
on November 20, 2018

XmlStaxInputFormat / XmlStaxFileRecordReader GitHub project: https://github.com/gss2002/xml-stax-mr
Over time, a gap became apparent in Hadoop MapReduce and Spark: the existing XmlInputFormat classes from Mahout use fseek and search for strings as the file is read in from HDFS. The ability to break up a large XML file becomes extremely important…

Building an RPM for Spark 2.x for Vendor Hadoop Distributions

by GS
in Apache Hadoop, Apache Spark, Hadoop, HDFS, Hortonworks, Pyspark
on October 15, 2018

It may be necessary to produce an alternately packaged version of Spark for use in a vendor-provided Hadoop distribution. This became apparent to me many times when loading Hortonworks HDP into an enterprise environment where update/upgrade cycles do not allow for upgrade of HDFS…

Apache Ranger Audit Logs stored in HDFS parsed with Apache Spark

by GS
in Apache, Apache Hadoop, Apache Hive, Apache Ranger, Apache Spark, audit, Hadoop, HDFS, Pyspark
on August 31, 2018

Apache Spark can parse a large HDFS archive of Ranger audit logs to find and verify whether a user attempted to access files in HDFS, Hive, or HBase. This eliminates the need to use a Hive SerDe to read these Apache Ranger JSON files and to have to create an external…
