One of the most overlooked things after setting up a Hadoop cluster is probably OS System Tuning. This entry will cover /etc/sysctl.conf aka the Linux Kernel Params that can be tuned.
## ALWAYS INCREASE KERNEL SEMAPHORES especially IF using IBM JDK with SharedClassCache also a separate discussion
# Controls the default maxmimum size of a mesage queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
## BEGIN INCREASE ALL NETWORK TUNING OPTIONS TO HANDLE ALL HADOOPS NETWORK TRAFFIC ##
net.ipv4.tcp_dsack = 0 – TCP Selective Acknowledgement (TCP SACK), controlled by the boolean tcp_sack, allows the receiving side to give the sender more detail about lost segments, reducing volume of retransmissions. This is useful on high latency networks but disable this to improve throughput on high-speed LANs (internal networks).
net.ipv4.tcp_sack = 0 – Also disable tcp_dsack, if you aren’t sending SACK you certainly don’t want to send duplicates! Forward Acknowledgement works on top of SACK and will be disabled if SACK is.
net.ipv4.tcp_keepalive_probes = 5 – Determines how many times it will probe a connection that is not responding before timing out default is 9
net.ipv4.tcp_fin_timeout = 5 – Determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. With Java Servers and Clients not interrogating the OS Socket state and notifying the OS the connection is gone in some cases having a high value will cause issues.
net.ipv4.tcp_keepalive_intvl = 15 – Determines the wait time between isAlive interval probes. The longer the wait time between probes the longer it takes to destroy unused connections.
net.ipv4.tcp_retries2 = 10 – Determines the wait on retry attempts. Having this at a high value on a high-performance network is detrimental as a lower value will allow systems to determine they are truly unavailable faster.
net.core.netdev_max_backlog = 250000 – Determines how many packets can be stored/queued in the kernel queue before they are discarded with high performance 10Gbps networks the default value of 1000 is to low and packets will be dropped and discarded.
net.ipv4.tcp_max_syn_backlog = 200000 – Allows remembered connections to be stored until the server accepts them. Useful during service restarts when clients start flooding the server’s sockets. Specifically, when the server has not fully ack’d the client yet.
net.ipv4.tcp_frto = 0 – an enhanced recovery algorithm for TCP retransmission timeouts (RTOs). It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference rather than intermediate router congestion. See RFC 4138 for more details. Starting with RHEL6 this parameter default became 2 which is for WIFI cards.
net.core.rmem_max = 134217728 – Maximum receive socket buffer size (size of BDP)
net.core.wmem_max = 134217728 – Maximum send socket buffer size (size of BDP)
net.core.optmem_max = 134217728 – Socket option memory is used in a few cases for storing extra structures relating to usage of the socket(crypto, rdma etc)
net.core.rmem_default = 16777216 – Sets the default receive buffer to 16MB. This is critical if application doesn’t use autotune and there are plenty that are not using autotune
net.core.wmem_default = 16777216 – Sets the default send buffer to 16MB. This is critical if application doesn’t use autotune and there are plenty that are not using autotune
net.ipv4.tcp_mem = 16777216 16777216 16777216 – The tcp_mem variable defines how the TCP stack should behave when it comes to memory usage. … The first value specified in the tcp_mem variable tells the kernel the low threshold. Below this point, the TCP stack do not bother at all about putting any pressure on the memory usage by different TCP sockets. … The second value tells the kernel at which point to start pressuring memory usage down. … The final value tells the kernel how many memory pages it may use maximally. If this value is reached, TCP streams and packets start getting dropped until we reach a lower memory usage again. This value includes all TCP sockets currently in use. Most folks say don’t tune this but if you are using network IO intensive operations you do not want to be dropping packets because the buffers are being collapsed and pruned.
net.ipv4.tcp_rmem = 4096 16777216 67108864 – Minimum,initial,and max TCP Receive buffer size in Bytes. Used by the autotune function. Especially critical to have a large default buffer especially when VMWare and Java where GC cycles can cause buffers to fill up quickly. 16MB for default and 64MB for max seems to work well with multiple 10Gbps NIC(s)
net.ipv4.tcp_wmem = 4096 16777216 67108864 – Minimum,initial,and max TCP Send buffer size in Bytes. Used by the autotune function. Especially critical to have a large default buffer especially when VMWare and Java where GC cycles can cause buffers to fill up quickly. 16MB for default and 64MB for max seems to work well with multiple 10Gbps NIC(s)
net.core.somaxconn = 60000 – somaxconn defines the number of request_sock structures allocated per each listen call. The queue is persistent through the life of the listen socket.
## END INCREASE ALL NETWORK TUNING OPTIONS TO HANDLE ALL HADOOPS NETWORK TRAFFIC ##
net.ipv4.tcp_window_scaling = 1 – turns TCP window scaling support on, default is on for RHEL but adding to be safe as great throughput increase are seen when it’s enabled
net.ipv4.ip_local_port_range = 8196 65535 – increase number of ports available for ephemeral ports for all of the Yarn things like MR/Spark/Tez etc
net.ipv4.tcp_rfc1337 = 1 – Enable a fix for RFC1337 – time-wait assassination hazards in TCP
# VM Memory Tuning
vm.min_free_kbytes = 204800
vm.lower_zone_protection = 1024
vm.page-cluster = 20
vm.swappiness = 10
vm.vm_vfs_scan_ratio = 2
# Increase Max File Size
# Hadoop OpenFiles Problem
fs.epoll.max_user_instances = 4096
# Java performs poor with CFS Scheduler This option came from IBM Websphere Java Tuning Docy
kernel.sched_compat_yield = 1
# Increase conntracks only needed if someone runs iptables -L. This is a whole separate discussion
net.nf_conntrack_max = 10000000
# This is needed for all Hadoop Daemon processes that listen on a port especially on DataNodes or NodeManagers or HBase RegionServers.. You will eventually end up with some process using an ephemeral port on one of these critical ports say for a NodeManager.
net.ipv4.ip_local_reserved_ports = 7337,8000-8088,8141,8188,8440-8485,8651-8670,8788,8983,9083,9898,10000-10033,10200,11000,13562,15000,19888,45454,50010,50020,50030,50060,50070,50075,50090,50091,50470,50475,50100,50105,50111,60010-60030