Hortonworks Archives - GS Tech Blog

Xml Processing with MapReduce/Spark using an Xml StaX Parser

by GS
in Apache, Apache Hadoop, Apache Spark, Hadoop, Hadoop Mapreduce, Hadoop XML Processing, HDFS, Hortonworks, java, StaX XML Parser, XmlInputFormat, XmlStaxInputFormat
on November 20, 2018

XmlStaxInputFormat / XmlStaxFileRecordReader GitHub project: https://github.com/gss2002/xml-stax-mr

Over time it became clear that there was a gap in Hadoop MapReduce and Spark: the existing XmlInputFormat classes from Mahout relied on fseek and string searches as the file is read in from HDFS. The ability to break up a large XML file becomes extremely important…

Building an RPM for Spark 2.x for Vendor Hadoop Distributions

by GS
in Apache Hadoop, Apache Spark, Hadoop, HDFS, Hortonworks, Pyspark
on October 15, 2018

It may be necessary to produce an alternately packaged version of Spark for use in a vendor-provided Hadoop distribution. This became apparent to me many times when deploying Hortonworks HDP in an enterprise environment where update/upgrade cycles do not allow for upgrade of HDFS…

How to use the Native IBM MQ Client Receiver with Spark Streaming

by GS
in Apache Hadoop, Apache Spark, Apache Spark Streaming, Hadoop, Hortonworks, IBM, IBM MQ, Messaging, MQ, Nifi
on October 14, 2018

After using Apache Nifi with IBM MQ, I noticed that Nifi could not easily guarantee the order of incoming messages, as failover can occur at any time. This becomes a problem specifically with database and table replication when the replicating software puts messages to a…

Integrating Apache Nifi with IBM MQ

by GS
in Apache, Hadoop, HDF, Hortonworks, IBM, Linux, Messaging, MQ, Nifi
on May 10, 2018

This is a continuation of the IBM MQ and Hadoop integration article I first posted a few years ago. It explains how to integrate IBM MQ with Apache Nifi or Hortonworks HDF. IBM MQ is extremely important when attempting to integrate new technologies with legacy environments, specifically mainframe environments…

CIFS SMB to HDFS and FTP to HDFS

by GS
in Apache, cifs2hdfs, ftp2hdfs, Hadoop, Hortonworks
on March 21, 2018

Over the past few years of working with Hadoop and HDFS, two types of requests have come up pretty regularly. One is whether we can move files from a Windows SMB/CIFS file share, usually containing thousands of CSV or XLSX/XLS files, into Hadoop/HDFS. The other use case was…

Hadoop and Redhat System Tuning /etc/sysctl.conf

by GS
in Apache, Apache Solr, Flume, Hadoop, HDF, Hortonworks, kernel tuning, limits.d, Linux, nproc, pid, sysctl.conf, tcp, tid, Tuning
on February 28, 2016

One of the most overlooked things after building out a Hadoop cluster is operating system tuning. This post covers how to tune the settings in /etc/sysctl.conf, also known as the Linux kernel settings. /etc/sysctl.conf ## ALWAYS INCREASE KERNEL SEMAPHORES especially IF using IBM JDK with SharedClassCache also a separate…

Category: Hortonworks

Xml Processing with MapReduce/Spark using an Xml StaX Parser

Building an RPM for Spark 2.x for Vendor Hadoop Distributions

How to use the Native IBM MQ Client Receiver with Spark Streaming

Integrating Apache Nifi with IBM MQ

CIFS SMB to HDFS and FTP to HDFS

Hadoop and Redhat System Tuning /etc/sysctl.conf
