Category: Apache

Technology Blog

XML Processing with MapReduce/Spark using an XML StAX Parser

XmlStaxInputFormat / XmlStaxFileRecordReader GitHub project – https://github.com/gss2002/xml-stax-mr

Over time it became clear that Hadoop MapReduce and Spark had a gap: the existing XmlInputFormat classes from Mahout were using fseek and searching for strings as the file is read in from HDFS. The ability to break up a large XML file becomes extremely important…
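To illustrate the difference between string searching and a streaming pull parser, here is a minimal sketch. It is not the project's Java code: it uses Python's standard-library `xml.etree.ElementTree.iterparse` as a stand-in for a StAX reader, and the `catalog`/`book` element names and the `iter_records` helper are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, record_tag):
    """Yield each <record_tag> element as a string, one record at a time.

    The parser emits an "end" event when an element closes, so a record is
    found by its structure rather than by searching for a start/end string.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            yield ET.tostring(elem, encoding="unicode")
            # Drop the element's children so memory stays bounded on big files.
            elem.clear()


# Hypothetical sample document; a real input would be a file stream.
sample = io.BytesIO(
    b"<catalog>"
    b"<book id='1'><title>A</title></book>"
    b"<book id='2'><title>B</title></book>"
    b"</catalog>"
)

records = list(iter_records(sample, "book"))
for record in records:
    print(record)
```

Because the parser tracks element boundaries itself, a record tag that also appears inside a comment or CDATA section would not be mistaken for a record, which is one weakness of plain string matching.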

Apache Ranger Audit Logs stored in HDFS parsed with Apache Spark

Use Apache Spark to parse a large HDFS archive of Ranger audit logs and verify whether a user attempted to access files in HDFS, Hive or HBase. This eliminates the need to use a Hive SerDe to read these Apache Ranger JSON files and to create an external…
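The core of that check is a filter over JSON records. The sketch below shows the filter in plain Python with the standard `json` module rather than Spark, so it only illustrates the logic, not the distributed job. The field names (`reqUser`, `repo`, `resource`, `access`, `result`) are assumptions about the audit record layout, and the sample lines and the `accesses_by_user` helper are invented for the example.

```python
import json


def accesses_by_user(lines, user):
    """Return (repo, resource, access, result) for every event by `user`.

    Each input line is expected to hold one JSON audit event. Blank or
    malformed lines are skipped instead of aborting the scan.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        # "reqUser" is an assumed field name for the requesting user.
        if event.get("reqUser") == user:
            hits.append(
                (
                    event.get("repo"),
                    event.get("resource"),
                    event.get("access"),
                    event.get("result"),
                )
            )
    return hits


# Invented sample events, one JSON document per line.
sample = [
    '{"repo": "cluster_hadoop", "reqUser": "alice", '
    '"resource": "/data/a.csv", "access": "read", "result": 1}',
    '{"repo": "cluster_hive", "reqUser": "bob", '
    '"resource": "db/tbl", "access": "select", "result": 0}',
    "not json",
]

hits = accesses_by_user(sample, "alice")
print(hits)
```

In Spark the same predicate would be applied to a distributed dataset read from HDFS instead of a Python list.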

Integrating Apache NiFi with IBM MQ

This is a continuation of the IBM MQ and Hadoop integration article I first posted a few years ago. It explains how to integrate IBM MQ with Apache NiFi or Hortonworks HDF. IBM MQ is extremely important when integrating new technologies with legacy environments, specifically mainframe environments…

CIFS/SMB to HDFS and FTP to HDFS

Over the past few years of working with Hadoop and HDFS, two types of requests have come up pretty regularly. One: can we move files from a Windows SMB/CIFS file share, usually containing thousands of CSV or XLSX/XLS files, into Hadoop/HDFS? The other use case was…

Apache SolrCloud Kerberos Configuration

I’ve been working on securing Apache SolrCloud with Kerberos, which includes configuring ZooKeeper. After struggling and lots of searching, I came up with a working Kerberized solution for SolrCloud with ZooKeeper, using Apache Ranger for authorization. First I tried to secure a standalone Solr instance by updating to the Solr 6x branch, which is a SNAPSHOT…

Hadoop, Java and HTTPD and /etc/security/limits.d/ nproc/pid-max

After successfully running a large Hadoop cluster for a period of time, I started to notice strange things occurring, initially with the MapReduce PI example, where tasks would be marked as failed. Looking more closely, when I attempted to logon/su/ssh to a machine with the userid that was running the job, sshd/su would return: -bash:…
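When a user hits symptoms like these, a quick first check is the per-user process limit and the kernel's PID ceiling named in the title. The sketch below is only a diagnostic aid under some assumptions: it uses Python's standard `resource` module, reports the limits for the account running the script, and reads `/proc/sys/kernel/pid_max`, which exists only on Linux. The helper names are made up for the example.

```python
import os
import resource


def nproc_limits():
    """Return the (soft, hard) per-user process limit for this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def kernel_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the kernel PID ceiling, or None if the file is not present."""
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return int(fh.read().strip())


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nproc_limits()
print(
    f"nproc soft={describe(soft)} hard={describe(hard)} "
    f"pid_max={kernel_pid_max()}"
)
```

Run it as the same userid that runs the failing job, since limits are applied per account.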

Hadoop and Redhat System Tuning /etc/sysctl.conf

One of the most overlooked things after building out a Hadoop cluster is operating system tuning. This post covers how to tune settings in /etc/sysctl.conf, also known as Linux kernel settings.

/etc/sysctl.conf
## ALWAYS INCREASE KERNEL SEMAPHORES especially IF using IBM JDK with SharedClassCache also a separate…

Integrating Apache Hadoop and Apache Flume with IBM MQ

Over the past 2 years of working with Apache Hadoop, a few things have come up: folks want to use Apache Kafka, which definitely has its place in the Hadoop, Big Data and next-generation technology spheres. But there is also the need to integrate with…

Why now?

What is the GS Tech Blog? It’s a place for me to rant and share my thoughts about technology I’ve worked with over many years. After working as a Technology Systems Engineer for almost 20 years, I decided it’s time to create a blog to publish some of my ideas and…