Category Archives: Hadoop

Hadoop and Tuning related to Hadoop

Xml Processing with MapReduce/Spark using an Xml StaX Parser

XmlStaxInputFormat / XmlStaxFileRecordReader Github Project – https://github.com/gss2002/xml-stax-mr After some time it seemed like a gap that existed with Hadoop MapReduce and Spark that the existing XmlInputFormat classes from Mahout were using fseek and searching for strings as the file is read in from HDFS. The ability to break up a large Xml file becomes extremely important… Read More »

Building an RPM for Spark 2.x for Vendor Hadoop Distributions

Building an RPM for Spark 2.x for Vendor Hadoop Distribution It may be necessary to produce an alternate packaged version of Spark for usage in a vendor provided Hadoop Distribution. This became apparent many times to me when loading Hortonworks HDP into an Enterprise Environment where update/upgrade cycles do not allow for upgrade of HDFS… Read More »

How to use the Native IBM MQ Client Receiver with Spark Streaming

How to use the Native IBM MQ Client Receiver with Spark Streaming After using Apache Nifi and IBM MQ I noticed that Nifi could not easily guarantee order of incoming messages as failover can occur at anytime. This becomes a problem specifically with database and table replication when the replicating software puts messages to a… Read More »

Apache Ranger Audit Logs stored in HDFS parsed with Apache Spark

Using Apache Spark to parse a large HDFS archive of Ranger Audit logs using Apache Spark to find and verify if a user attempted to access files in HDFS, Hive or HBase. This eliminates the need to use a Hive SerDe to read these Apache Ranger JSON Files and to have to create an external… Read More »

Integrating Apache Nifi with IBM MQ

Integrating Apache Nifi with IBM MQ This would be a continuation of the IBM MQ and Hadoop integration article I first posted a few years ago. This explains how to integrate IBM MQ with Apache Nifi or Hortonworks HDF. IBM MQ is extremely important when attempting to integrate new technologies with legacy environments specifically mainframe environments… Read More »

Hadoop, Java and HTTPD and /etc/security/limits.d/ nproc/pid-max

After successfully running a Large Hadoop Cluster for a period of time. I started to notice strange things occurring initially with the MapReduce PI example task where tasks would be marked as failed. When looking more closely and attempting to logon/su/ssh to a machine with the userid that was running the job the sshd/su would return: -bash:… Read More »

Hadoop and ip_conntrack: table full, dropping packet

I’m pretty sure many folks have seen this specific error across multiple different linux systems specifically when iptables is enabled and the OS has thousands of connections coming in second. In my case I ran into this Examples of this are with Hadoop NameNode. Someone accidentally executed iptables -L to try to get a list… Read More »

Hadoop and Redhat System Tuning /etc/sysctl.conf

Hadoop and Redhat System Tuning /etc/sysctl.conf One of the most overlooked things after building out a Hadoop cluster is the operating system tuning. This post will cover how to tune settings in /etc/sysctl.conf also known as Linux Kernel Settings. /etc/sysctl.conf ## ALWAYS INCREASE KERNEL SEMAPHORES especially IF using IBM JDK with SharedClassCache also a separate… Read More »