Apache Ranger Audit Logs stored in HDFS parsed with Apache Spark

By | August 31, 2018

Using Apache Spark to parse a large HDFS archive of Ranger audit logs and verify whether a user attempted to access files in HDFS, Hive, or HBase. This eliminates the need for a Hive SerDe to read the Apache Ranger JSON files and for an external table over the raw logs, which could compromise the logged data since the user creating the table would need read/write access to the audit folder.

PySpark Batch Script to filter data sets:

from pyspark.sql import SparkSession

# Hive support is only needed so the results can be queried from Hive later;
# the temp view itself lives on the Spark session.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read six months of HDFS audit logs (March-August 2018); each file is
# newline-delimited JSON, one audit event per line.
df = spark.read.json('/ranger/audit/hdfs/20180[3-8]*/*')
df.createOrReplaceTempView('hdfs_audit')

# Keep only events that touched the table's HDFS path, excluding the
# service account's own traffic.
delresults = spark.sql("""
    select reqUser, action, agentHost, cliIp, evtTime, reason, resource
    from hdfs_audit
    where resource like '/source/public/pgsql/db1/table_name%'
      and reqUser != 'blahuser'
""")

delresults.write.save('/user/testuser/results_gss_orc_out', format='orc', mode='append')
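The columns selected above (reqUser, action, resource, and so on) are top-level fields of each JSON audit event. As a stand-alone sanity check of the filter logic, here is a minimal sketch in plain Python over two made-up sample records (field values are hypothetical, not real Ranger output):

```python
import json

# Two hypothetical Ranger-style audit records; field names match the
# columns selected in the Spark SQL query above, values are made up.
sample_lines = [
    '{"reqUser": "alice", "action": "read", "agentHost": "nn1", '
    '"cliIp": "10.0.0.5", "evtTime": "2018-08-01 12:00:00", "reason": "", '
    '"resource": "/source/public/pgsql/db1/table_name/part-0"}',
    '{"reqUser": "blahuser", "action": "write", "agentHost": "nn2", '
    '"cliIp": "10.0.0.6", "evtTime": "2018-08-01 12:01:00", "reason": "", '
    '"resource": "/tmp/other"}',
]

def matches(record):
    """Same predicate as the WHERE clause: path prefix match, user excluded."""
    return (record["resource"].startswith("/source/public/pgsql/db1/table_name")
            and record["reqUser"] != "blahuser")

records = [json.loads(line) for line in sample_lines]
hits = [r for r in records if matches(r)]
```

Only the first record survives the filter, which is exactly what the Spark SQL query does per line at scale.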


Hive Table Creation from Results:

create external table ranger_audit (
  reqUser string, action string, agentHost string,
  cliIp string, evtTime string, reason string, resource string)
STORED AS ORC
LOCATION '/user/testuser/results_gss_orc_out';

Hive Query Setup and Results:

select count(*) as hits, reqUser, resource from ranger_audit group by reqUser, resource order by hits desc;
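The query above counts accesses per user and resource, most active first. The same aggregation can be sketched in plain Python with collections.Counter, over hypothetical (reqUser, resource) pairs standing in for rows of ranger_audit:

```python
from collections import Counter

# Hypothetical (reqUser, resource) rows; values are made up for illustration.
rows = [
    ("alice", "/source/public/pgsql/db1/table_name/part-0"),
    ("alice", "/source/public/pgsql/db1/table_name/part-0"),
    ("bob",   "/source/public/pgsql/db1/table_name/part-1"),
]

# Equivalent of: select count(*) as hits, reqUser, resource
#                group by reqUser, resource order by hits desc
hit_counts = Counter(rows)
results = sorted(hit_counts.items(), key=lambda kv: kv[1], reverse=True)
for (user, resource), n in results:
    print(n, user, resource)
```

Each output row is the access count, the requesting user, and the resource path, mirroring the columns the Hive query returns.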