We’ve been working with one client who uses a rails application behind Netscalers, and who has been having issues with connections going in to the Netscaler surge queue. The surge queue is where the Netscaler puts connections when the destination load balancing VIP does not have a service it can send them to – as all the services bound to the VIP are at their max connection limit.
Unfortunately, despite what you may think about how a surge queue works, it is not a single queue per load balancing vserver in the netscaler – instead, the Netscaler will associate a request with a service, and put it in the service’s surge queue.
So, if you have 2 services associated with a vserver, and both services have a connection limit of one, and 3 requests come in, each service will process one a request, and one service will have the extra connection placed into its surge queue. All well and good, but if the other service completes it’s request, while the one with the request in the surge queue is running a 30 second long report, the request in the queue will have to wait, even though there is an idle service.
So having traffic in the surge queue is something to avoid, if possible.
The customer in question is savvy enough to subscribe to LogicMonitor, whos Citrix Netscaler monitoring is very good – so they at least knew they were having surge queue issues.
Their main concern was being assured that production traffic was not going into the surge queue. They did not have the spare thousands to buy a separate Netscaler cluster for staging, so staging systems were running on the same Netscaler cluster as production. Given they were getting alerts for the global surge queue, the concern was whether these were production requests or not.
This customer had worked with LogicMonitor to get access to the next release of LogicMonitor, that tracks requests and surge queue levels down to the service (physical server). However, this did not seem to help. Netscaler reports surge queue levels as a gauge, rather than a counter – even polling every minute, it is quite possible to read all zero’s, when there was a spike of 50 in between the poll’s. (This is why counters are the better choice for most datapoints, if you have a choice – but in this case, Netscaler does not expose a counter.)
So, the customer was seeing global surge queue activity, but the individual services would often not show surge queue entries – so how to determine where the surges were going?
The monitoring was not reliably catching. Doing a “stat lb vserver” on the Netscaler is useless for catching transitory spikes.
The only reliable way seems to be log analysis.
So, the above is a long waffling introduction as to why good monitoring is not enough in some cases, and you still have to fall back on good old manual sysadmin skills. (Although you still need good monitoring to alert you to the problem.)
Netscalers log a bunch of information – as has been mentioned before, this can be accessed by nsconmsg.
So, ssh to the netscaler, drop to the shell, and run:
cd /var/tmp #usually this has a lot of space
nsconmsg -K /var/nslog/newnslog -s ConLb=3 -d oldconmsg > NetscalerLog
(You need to use log level 3 to get the surge queue information.)
Now, scp that NetscalerLog file to a linux host somewhere, as the grep on the netscaler is not powerful enough to process it with the useful flags.
To track each entry in the surge queue, and when it occurred, run:
cat NetscalerLog | grep -B2 'SQ([1-9])\|current' | grep -B4 'SQ' | grep 'current\|S(\|SQ'
This will give you a timestamp, then a line for each service and its surge queue, if it has a non-zero surge queue.
current time is Mon Sep 28 22:04:01 2009
S(10.1.1.213:13411:UP) Hits(23587, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(8, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
S(10.1.1.212:13390:UP) Hits(23491, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(5, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
current time is Mon Sep 28 22:04:57 2009
S(10.1.1.212:13390:UP) Hits(23508, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(7, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
current time is Mon Sep 28 22:17:05 2009
S(10.1.1.237:13249:UP) Hits(68292, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.05) BWlmt(0 kbits) RspTime(100.16 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(1, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
Alternatively, if you want to see the distribution of which servers had how many queued connections in the current logfile:
cat nscon.log | grep -B2 'SQ([1-9])' | grep -v "\-\-"| sed 's/SQ(\(.*\))/\1\n---/' | awk ' {RS="---"; print $1, "Surge queue:", $NF }' | awk ' { services[$1]+=$NF }; END { for (i in services) { print services[i],i; }} '
0
3 S(10.1.1.213:13411:UP)
1 S(10.1.1.212:13295:UP)
83 S(10.1.1.237:13249:UP)
0 Surge
72 S(10.1.1.236:13658:UP)
1 Other:
1 S(10.1.1.212:13764:UP)
4 S(10.1.1.212:13390:UP)
2 S(10.1.1.236:13657:UP)
(Yes, it has some garbage in there, and the awk commands could be combined, but it gets the job done, and we didn’t want to spend the clients time/money perfecting it.)
In this case it is clear now that the affected hosts where staging systems only, so the client can relax that production was not impacted, while their developers figure out why their staging systems are running slowly.
So anyway, hopefully those two command lines will help some people trace down which connections are being placed in the surge queue, and also demonstrate that good monitoring is necessary, but not sufficient.