AIX network debugging

For a couple of days at work I have had to investigate some network issues affecting an application; it seems we have a strange ARP configuration that may be the main issue… Along the way I discovered another oddity: ICMP packets sent in burst mode (like a flood) are lost at variable rates. Here are my notes on this investigation.

Let’s start at the beginning of the story: I had to reproduce a high-frequency request/reply pattern against a server, to mimic what is happening with an application that is in trouble. For this I chose ICMP, as it is easy to simulate. Unfortunately ping flood is only available to root, but since IBM apparently thinks a normal user should also be able to overload an interface, they created the -l (preload) option, which allows the same kind of violent method. Thank you IBM.

Here is my script (tping.sh) to produce a heavy and aggressive ping:

#!/bin/ksh
# tping.sh : send bursts of $num ICMP packets every second and track the loss rate
num=100
server=localhost

totsend=0
totlos=0
while true ; do
   # -l (preload) sends num-1 packets without waiting, -w 2 caps the wait for replies
   res=`ping -q -w 2 -s 1024 -l $(( $num - 1 )) -c $num $server | tail -2 | tr '\n' '#'`
   echo "$res"
   # 100 packets transmitted, 89 packets received, 11% packet loss#round-trip min/avg/max = 0/0/1 ms#
   received=`echo $res | cut -d ',' -f 2 | cut -d ' ' -f 2`
   totsend=$(( $totsend + $num ))
   totlos=$(( $totlos + $num - $received ))
   echo "overall lost : $(( 1000 * $totlos / $totsend )) o/oo == $totlos / $totsend"
   sleep 1
done

Since localhost always loses around 39% of the traffic, it is a good example, so let’s investigate that one.

With netstat we can investigate where the packets have been dropped. The first approach is to check ICMP:

#netstat -p icmp
icmp:
        1132 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echo reply: 743
                destination unreachable: 1132
                echo: 369853
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply: 369853
                destination unreachable: 1137
                echo: 743
        743 message responses generated

So here we have some errors, but in fact these numbers do not change while the tping.sh script is running. That looks strange, but why not… let’s take a look at the IP protocol:

# netstat -p ip
ip:
        64337400 total packets received
        [...]
        579 packets dropped due to the full socket receive buffer
        0 dead gateway detection packets sent

By running this command and the script in parallel, it is easy to see that this counter grows as ICMP packets are being lost.
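For instance, a small watcher like this one (just a sketch, reusing the counter text above) can be run next to tping.sh to see the counter move:

#!/bin/ksh
# Print, every second, the number of packets dropped because the socket
# receive buffer was full (counter taken from "netstat -p ip").
while true ; do
   drops=`netstat -p ip | grep "dropped due to the full socket receive buffer" | awk '{ print $1 }'`
   echo "`date '+%H:%M:%S'` drops: $drops"
   sleep 1
done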

It seems that this socket receive buffer is the reason for the losses. That would be consistent with what I measured on different servers: the faster the response time, the more loss we see:

  • localhost : 39%
  • different LPAR / same frame : 15%
  • different LPAR / different frame : 5%
  • different LPAR / different datacenter : 0.5%
  • Linux box on a VM : no loss

So… we can assume that the faster the responses come back, the quicker the input buffer fills up. What is interesting is that this buffer holds about 61% of 100 ICMP packets of 1024 bytes, i.e. roughly 61 × 1024 ≈ 62KB, so the buffer looks limited to about 64KB. I assume this can be an issue when our application transfers large XML messages in bulk over the loopback interface between internal services.

This can also be seen by running netstat -D, where in my case the Idrops column for the IP protocol keeps growing.
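For instance (a sketch; the exact row label for IP in the netstat -D output may differ between AIX levels):

# show the netstat -D header plus the IP protocol line; Idrops is the input drop column
netstat -D | egrep "Source|^IP"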

Let’s confirm this hypothesis by looking at the configuration…

# no -a

The size of the receive queue is defined by the ipqmaxlen parameter:

# no -a| grep ipqmaxlen
ipqmaxlen = 100

This means 100 packets can be stored in the queue. However, I am only able to process 61 out of 100… so I assume this is because my ICMP payload of 1024 bytes may be split into multiple packets. Let’s change this to 110 packets of 512 bytes…
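Concretely this is just a small tweak at the top of tping.sh (sketch of the two changed lines, everything else unchanged):

# tping.sh, modified: 110 packets of 512 bytes instead of 100 of 1024
num=110
   res=`ping -q -w 2 -s 512 -l $(( $num - 1 )) -c $num $server | tail -2 | tr '\n' '#'`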

ksh ./tping.sh
110 packets transmitted, 103 packets received, 6% packet loss#round-trip min/avg/max = 0/0/1 ms#
overall lost : 63 o/oo == 7 / 110

Considering that some of the packets are processed before all of them have been sent, it makes sense to end up with 103 processed and 7 lost.

So the next step is to ask root to change this parameter to something higher!
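Something along these lines (a sketch; 512 is only an example value, and depending on the AIX level ipqmaxlen may be a reboot-type tunable, hence the -r form):

# as root: raise the IP input queue length
no -o ipqmaxlen=512          # runtime change, if the tunable allows it
no -r -o ipqmaxlen=512       # or record the change for the next reboot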

In fact, later on, after more reading and still no effective solution, I learned that this parameter relates to the ipintrq queue, whose overflow counter was 0 in my case (so no problem there):

# netstat -s | grep ipin
        0 ipintrq overflows

Once that was done, I was expecting a better result! Unfortunately not 🙁

So! Let’s consider the other parameters: the IBM documentation helps.

But at this point I found no parameter that really looked relevant to my problem.

Let’s take a look at the network stack to get a better understanding; the source of the picture below is here.

AIX network stack

So, according to the error counter (full socket receive buffer) and this schema, I would say that a full socket receive buffer is the main issue here. The catch is that I am running ICMP, not TCP/UDP… still, I will assume the mechanisms are similar. This would also confirm that the receive queue (at a lower level) is not the cause.

If the drop were happening at a lower layer, it would be interesting to investigate the interface parameters; they can be accessed with the following command:

# lsattr -E -l enXXX

As I am reproducing the problem on the loopback interface, these parameters impose no limitation and should not be the cause.
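On a physical adapter it would also be worth looking at the driver statistics with entstat (just a hint; the exact field names depend on the adapter driver):

# adapter/driver level statistics for a physical interface (not applicable to lo0)
entstat -d ent0 | grep -i "error"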

Let’s take a look at mbufs: these buffers are allocated to store network data, and they are pre-allocated per CPU.

 # netstat -m
******* CPU 0 *******
By size           inuse     calls failed   delayed    free   hiwat   freed
64                  200    481148      0         6     376  524288       0
128                1082     35437      0        33     198  393216       0
256                4864    101700      0       300     208  786432       0
512                6757   7985069      0       839     163 1638400       0
1024                 73    151994      0        31      55  393216       0
2048               1826    332816      0      1013     338  983040       0
4096                 10       128      0         3       5   65536       0
8192                 16      1627      0        12       4   16384       0
16384               385       495      0        54       1    8192       0
32768               137    471826      0       114     372    4096       0
65536               187       308      0       114      10    4096       0
131072                2        75      0         0    5120   10240       0

These buffers are allocated by size (first column). The next columns are the following:

  • inuse = how many buffers of this size are currently in use
  • calls = how many allocation requests have been made for this size
  • failed = how many requests could not be satisfied
  • delayed = how many requests were delayed (when delaying is authorized)
  • free = how many buffers of this size are currently available for requests
  • hiwat = below this limit, buffers on the free list are not returned (freed) to the system
  • freed = number of buffers returned (freed) to the system

So only the failed column is a direct indicator of an mbuf memory issue, and on my system there are no failures. Apparently, according to some documentation, the network stack uses the 256 and 4096 byte mbufs. The delayed column could still be a point to take into consideration when performance issues are detected.
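A quick way to sum the failed column over all sizes and CPUs (a sketch based on the netstat -m layout above, where failed is the 4th field of each data line):

# total mbuf allocation failures, all CPUs, all sizes
netstat -m | grep "^[0-9]" | awk '{ failed += $4 } END { print "total failed:", failed }'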

Another way to check buffer allocation is to use netstat -M; optionally we can sort the result by failed events:

# netstat -M | sort -n -k 4 | tail -3
 2048          62   72035      477        0           89 F1000E007E368E00
 2048          62  198183      694        0           94 F1000E007E37FE00
 2048         512  464072    84068      512          639 F1000E00A40E3400

It seems there is something to look at here, especially since it keeps growing… but how can the source of it be identified? (I found no solution.)

So let’s get back to the network parameters. Here are the ones related to buffers (accessible with the no -o yyyyy command):

  • sb_max : the maximum buffer size allowed for any single socket, in bytes (2MB on my system)
  • tcp_recvspace : the default buffer size for TCP receive sockets, in bytes (256KB on my system)
  • tcp_sendspace : the default buffer size for TCP send sockets, in bytes (256KB on my system)
  • udp_recvspace : the default buffer size for UDP receive sockets, in bytes (256KB on my system)
  • udp_sendspace : the default buffer size for UDP send sockets, in bytes (256KB on my system)

sb_max must be higher than the tcp* and udp* values. My problem is: which of these does ICMP fall under? Anyway, the current values can be checked with something like the loop below.
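A minimal check loop, simply reading the current values with no:

# display the current socket buffer tunables, in bytes
for p in sb_max tcp_recvspace tcp_sendspace udp_recvspace udp_sendspace ; do
   no -o $p
done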

Based on my various tests, I assume the drops occur when the amount of data involved is around 64–74KB. This is lower than the limits defined here.

The maximum amount of memory that can be used for mbufs is defined by this attribute:

# lsattr -E -l sys0 -a maxmbuf

On my system, 0 indicates no specific limit.

To go further, the netpmon tool is really interesting to get a per-process analysis of network usage.

To use it:

# netpmon

Run trcstop command to signal end of trace.
Fri Apr  4 10:35:09 2014
System: AIX 7.1 Node: xxxaixxxxxx Machine: xxxxxxx

# ...
# trcstop

And we get a report like this:

First Level Interrupt Handler CPU Usage Statistics:
---------------------------------------------------
                                                   Network
FLIH                              CPU Time   CPU %   CPU %
----------------------------------------------------------
data page fault                     4.2324   1.221   0.000
UNKNOWN                             0.3702   0.107   0.000
PPC decrementer                     0.0878   0.025   0.000
external device                     0.0083   0.002   0.000
queued interrupt                    0.0078   0.002   0.000
instruction page fault              0.0000   0.000   0.000
----------------------------------------------------------
Total (all FLIHs)                   4.7066   1.358   0.000

========================================================================
TCP Socket Call Statistics (by Process):
----------------------------------------
                                   ------ Read -----   ----- Write -----
Process (top 20)             PID   Calls/s   Bytes/s   Calls/s   Bytes/s
------------------------------------------------------------------------
df_engine               29294834     92.20   2470242     80.96    107869
java                    12255424     69.29    511481     69.38     46462
sshd:                   27328748      0.14      2269      0.21        13

========================================================================

Detailed TCP Socket Call Statistics (by Process):
-------------------------------------------------

PROCESS: df_engine   PID: 29294834
reads:                  3995
  read sizes (bytes):   avg 26791.2 min 8208    max 65536   sdev NaNQ
  read times (msec):    avg 2.422   min 0.003   max 3013.614 sdev 48.963
writes:                 3508
  write sizes (bytes):  avg 1332.3  min 13      max 14679   sdev 3378.6
  write times (msec):   avg 0.044   min 0.013   max 19.546  sdev 0.365

PROCESS: java   PID: 12255424

This provides good information on per-process network usage.
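In practice it is convenient to send the report to a file and read it after trcstop (a sketch; the output path is just an example):

# start tracing; the report is written to the file when the trace is stopped
netpmon -o /tmp/netpmon.out
# ... reproduce the problem (run tping.sh for a while) ...
trcstop
# then read /tmp/netpmon.out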

 

I also used the following script to check the mbuf delay activity:

#!/bin/ksh
# Sum the "delayed" column (5th field) of netstat -m across all sizes, every second.
# The summing loop works here because ksh runs the last stage of a pipeline
# in the current shell, so $v keeps its value after the loop.
while true ; do
   v=0
   netstat -m | grep "^[0123456789]" | tr -s " " | \
      cut -d " " -f 5 | \
      while read ; do v=$(( $v + $REPLY )) ; done
   echo `date '+%H:%M:%S'` $v
   sleep 1
done

This second script is useful for tracking allocation failures on the 512 and 4096 byte blocks used by the network stack:

#!/bin/sh
# Every 5 seconds, sum the failure column (5th field after squeezing spaces)
# of netstat -M for the 512 and 4096 byte buckets used by the network stack.
while true ; do
  b512=`netstat -M | tr -s " " | grep "^ 512 " | cut -d " " -f 5 | \
   ( v=0 ; while read ; do v=$(( $v + $REPLY )) ; done ; echo $v )`
  b4096=`netstat -M | tr -s " " | grep "^ 4096 " | cut -d " " -f 5 |\
   ( v=0 ; while read ; do v=$(( $v + $REPLY )) ; done ; echo $v )`
  echo `date '+%H:%M:%S'` $b512 $b4096
  sleep 5
done

This analysis is really interesting because there is a correlation between the observed network freezes and the growth of the delayed mbuf count. My reading of it is the following: the frame is stored in the IP stack and tries to go up to TCP; for this an mbuf is required, and if the TCP stack has no more mbufs available, I assume the IP layer parks the frame and processes the others (which, being smaller, may still find an mbuf). What is strange is how long a frame can stay parked, sometimes 30 seconds in my case, while lots of subsequent frames are processed immediately in parallel.

By the way, it should be possible to give more memory to the TCP queues by increasing the tcp_recvspace and tcp_sendspace tunables listed above (keeping sb_max above them).

This article gave me a lot of material for this analysis :o) thank you.

 
