Database monitoring: are things better or worse?

August 31st, 2010

On a recent project, some application changes meant the application could more intelligently cache its data. (A good application architect, with intimate knowledge of how the application uses data, should generally be able to devise a better memory use (caching) system than a general purpose one, such as the OS file system cache or a database query cache. It’s often not worth it to do so, though.)
The app was tested, and rolled out to some live servers.
Success! All the database monitoring metrics were looking good! Except for those that were worse!

Looking at the various MySQL monitoring graphs below, you can see the “Hmm.” The release was on 8/5.
CPU Load dropped:

Query Cache had less queries hitting it:

All the graphed MySQL operations dropped:

But..
Total Mysql Questions went up…

OK, so we weren’t terribly worried – all the important metrics went the right way. There were less Innodb cache misses; so less disk IO, better query response time, etc.
But what was with the increase in total Questions per second? How could that even be when all the operations graphed went down?
That was exactly the right question to ask – given all the counters that were graphed decreased, it must be one that was not graphed by the monitoring system.
A quick ’show global status’ showed the only counter with a big value that was not graphed was com_set_options.

The new application code sets options on every transaction – and the monitoring system didn’t (by default) graph com_set_options as it’s not usually a statement that affects performance. We added that counter to the data collection of the Mysql monitoring for this client, so we’ll be able to see it clearly when we update the code to not set so many options – and it just makes diagnosis easier. The more you know, the more you know.

Why Apache Monitoring is not as easy as they say…

August 18th, 2010

So when you set up Apache Monitoring, it’s not really as easy as they say.
Or more accurately, getting apache monitored is easy – setting the appropriate alerts is not always.
While Logicmonitor, referenced above, has sensible defaults based on Apache’s httpd default MaxClients, many sites have to tune the MaxClients (either to increase it, if they are serving mostly static content, or decrease it if serving with php, perl or other backends, due to memory/scalability issues.) Monitoring systems have no way of knowing what the real limit of an httpd server has been set to – and while it’s easy enough to adjust the thresholds in the monitoring system, I just wonder why mod_status does not report the configured limits of MaxClients (& ThreadsPerChild).
Then the monitoring systems could read this and set their thresholds appropriately, automatically. (At least decent ones like LogicMonitor could. Any monitoring system that can’t do calculated alert thresholds is not worth deploying, IMHO. )

Running NetApp Simulator on a vSphere virtual machine.

August 13th, 2010

Setting up the NetApp simulator on a CentOS server seemed, from all accounts, like it should be simple.
So, provisioned another Centos server using cobbler on a vSphere host. Installed the NetApp simulator, ran setup:
first minor hiccup was that if you try to give the simulator lots of memory (I tried about 3G) it silently reports a segmentation fault. (Silent from the runsim.sh script – run the called command manually and you can see it.)
Change that to 512M, and it works fine.

Second hiccup – I now had a working NetApp running on my Linux host, but it could not communicate with any other host. I could see it’s ARP requests reaching the default gateway, and the gateway replying, by the replies were not being seen on the Linux host hosting the simulator, nor on the simulator itself.

This tickled the memory that VMware locks down the virtual switch from promiscuous mode (and the simulator puts the host Linux into promiscuous mode, so it can receive packets for the virtual NetApp, which has its own mac and IP).
so:
fire up vCenter, click the Host, Configuration, Networking, properties, and set the VM Network switch to accept Promiscuous mode, Mac address changes, and Forged transmits.

Now my virtual NetApp is reachable, I can mount its volumes from other hosts in the lab, and all is well.
It responds to NetApp monitoring software just like a real NetApp, too, with API monitoring, etc.

Monit and “unable to parse response” errors

May 25th, 2010

Just upgraded a client’s monit installation on staging systems to 5.1.1 from 4.8.
After the restart of the new monit via /etc/init.d/monit stop; /etc/init.d/monit start, everything was fine, but some interactive commands reported “unable to parse response”
e.g. /usr/bin/monit restart all -g my_group
monit: action failed — unable to parse response

A quick google didn’t reveal much except the source code that generated the message, so thought I’d post in case it helps anyone else.
The source code showed it was from a piece of code connecting to the monit http status page.
So..telnet to that port on localhost, showed a banner from monit 4.8.
Apparently /etc/init.d/monit stop doesn’t stop all the bits of monit.
So /etc/init.d/monit stop again; kill the pid of the remaining process of monit; /etc/init.d/monit start.
A telnet to the port shows a banner from monit 5.1.1, and the monit restart command works fine now.

MySQL through Netscalers not a good idea without connection pool.

May 20th, 2010

I like Netscalers.
I like MySQL.
But the two together sometimes do not play well.

One client had an issue where some parts of the application would stall when talking from mongrel to the database. The root cause was an interaction between the mysql protocol and netscaler DoS defences:

  • Mysql client sends a SYN to open a connection to the DB.
  • Netscaler responds with SYN ACK.
  • Client gets it, and sends ACK back. This was the packet that occasionally got lost. (Very rarely, but usually it would occur once in every 10000 new connections or so. Enough to bother the app.)

At this point, client will just sit there, thinking it has an open TCP connection. With the MySQL protocol, after the client has an open connection, it expects the first packet to come from the server, with the server announcing what version of mysql its running. So the client waits. (It doesn’t retransmit as it thinks if its packet got lost, the server side – the Netscaler – should retransmit its SYN-ACK.)

On the netscaler side, it sent a SYN-ACK, but never heard a corresponding ACK back. So it thinks it was a spoofed connection, and by design, never retransmits the SYN-ACK.

So the netscaler forgets its state (not that it had any – it uses SYN cookies). The client is waiting. So nothing happens.

If there are 1000s of connections, and there is 0.01% packet loss, you’ll run into this issue. That level of loss does not cause an issue for anything else – TCP will retransmit if the connection is terminated on any other device except the Netscaler, and the lack of retransmission would not matter for any other application except MySQL (which expects the server to be the first to respond once a connection is established, not the client like everything else. If the client made the first request, it would effectively be retransmitting the missing packet.)

So it wouldn’t happen unless you are using MySQL to a Netscaler with an application that does not use connection pooling. (Once the connection is really established, normal TCP retransmissions would address any loss. A connection pool will keep the TCP connection alive, rather than creating a new one for each request.)

One of the clues was the Netscaler monitoring showed the occasional increase in unacknowledged Syns. The rest was figured out using nstcpdump (a handy tool from the netscaler shell CLI that allows you to capture packets using TCPdump (regular tcp dump doesn’t work as it cant see the packets processed by VIPs.)

Nescaler Surge queue analysis

September 29th, 2009

We’ve been working with one client who uses a rails application behind Netscalers, and who has been having issues with connections going in to the Netscaler surge queue. The surge queue is where the Netscaler puts connections when the destination load balancing VIP does not have a service it can send them to – as all the services bound to the VIP are at their max connection limit.

Unfortunately, despite what you may think about how a surge queue works, it is not a single queue per load balancing vserver in the netscaler – instead, the Netscaler will associate a request with a service, and put it in the service’s surge queue.

So, if you have 2 services associated with a vserver, and both services have a connection limit of one, and 3 requests come in, each service will process one a request, and one service will have the extra connection placed into its surge queue. All well and good, but if the other service completes it’s request, while the one with the request in the surge queue is running a 30 second long report, the request in the queue will have to wait, even though there is an idle service.

So having traffic in the surge queue is something to avoid, if possible.

The customer in question is savvy enough to subscribe to LogicMonitor, whos Citrix Netscaler monitoring is very good – so they at least knew they were having surge queue issues.

Their main concern was being assured that production traffic was not going into the surge queue. They did not have the spare thousands to buy a separate Netscaler cluster for staging, so staging systems were running on the same Netscaler cluster as production. Given they were getting alerts for the global surge queue, the concern was whether these were production requests or not.

This customer had worked with LogicMonitor to get access to the next release of LogicMonitor, that tracks requests and surge queue levels down to the service (physical server). However, this did not seem to help. Netscaler reports surge queue levels as a gauge, rather than a counter – even polling every minute, it is quite possible to read all zero’s, when there was a spike of 50 in between the poll’s. (This is why counters are the better choice for most datapoints, if you have a choice – but in this case, Netscaler does not expose a counter.)

So, the customer was seeing global surge queue activity, but the individual services would often not show surge queue entries – so how to determine where the surges were going?

The monitoring was not reliably catching. Doing a “stat lb vserver” on the Netscaler is useless for catching transitory spikes.

The only reliable way seems to be log analysis.

So, the above is a long waffling introduction as to why good monitoring is not enough in some cases, and you still have to fall back on good old manual sysadmin skills. (Although you still need good monitoring to alert you to the problem.)

Netscalers log a bunch of information – as has been mentioned before, this can be accessed by nsconmsg.
So, ssh to the netscaler, drop to the shell, and run:

cd /var/tmp #usually this has a lot of space
nsconmsg -K /var/nslog/newnslog -s ConLb=3 -d oldconmsg > NetscalerLog

(You need to use log level 3 to get the surge queue information.)

Now, scp that NetscalerLog file to a linux host somewhere, as the grep on the netscaler is not powerful enough to process it with the useful flags.
To track each entry in the surge queue, and when it occurred, run:
cat NetscalerLog | grep -B2 'SQ([1-9])\|current' | grep -B4 'SQ' | grep 'current\|S(\|SQ'
This will give you a timestamp, then a line for each service and its surge queue, if it has a non-zero surge queue.

current time is Mon Sep 28 22:04:01 2009
S(10.1.1.213:13411:UP) Hits(23587, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(8, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
S(10.1.1.212:13390:UP) Hits(23491, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(5, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
current time is Mon Sep 28 22:04:57 2009
S(10.1.1.212:13390:UP) Hits(23508, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.00) BWlmt(0 kbits) RspTime(0.00 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(7, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)
current time is Mon Sep 28 22:17:05 2009
S(10.1.1.237:13249:UP) Hits(68292, 0/sec, P[0, 0/sec]) ATr(2:2) Mbps(0.05) BWlmt(0 kbits) RspTime(100.16 ms) Load(0) LConn_Idx: (C:2; V:2,I:1)
Conn: CSvr(1, 0/sec) MCSvr(1) OE(1) E(1) RP(0) SQ(1)

Alternatively, if you want to see the distribution of which servers had how many queued connections in the current logfile:
cat nscon.log | grep -B2 'SQ([1-9])'   | grep -v "\-\-"| sed 's/SQ(\(.*\))/\1\n---/' | awk ' {RS="---"; print $1, "Surge queue:", $NF }'  | awk ' { services[$1]+=$NF }; END { for (i in services) { print services[i],i; }}  '
0
3 S(10.1.1.213:13411:UP)
1 S(10.1.1.212:13295:UP)
83 S(10.1.1.237:13249:UP)
0 Surge
72 S(10.1.1.236:13658:UP)
1 Other:
1 S(10.1.1.212:13764:UP)
4 S(10.1.1.212:13390:UP)
2 S(10.1.1.236:13657:UP)

(Yes, it has some garbage in there, and the awk commands could be combined, but it gets the job done, and we didn’t want to spend the clients time/money perfecting it.)

In this case it is clear now that the affected hosts where staging systems only, so the client can relax that production was not impacted, while their developers figure out why their staging systems are running slowly.

So anyway, hopefully those two command lines will help some people trace down which connections are being placed in the surge queue, and also demonstrate that good monitoring is necessary, but not sufficient.

FTPS (aka “FTP secure”, aka FTP-SSL (and not SFTP)) and apache proxy

August 13th, 2009

So FTPS is encrypted (via SSL) FTP. It’s a definite step up from regular unencrypted FTP. It is not to be confused with SFTP, which IMHO is a misnomer as it’s not really FTP at all but rather a sub-system of SSH. For the below we’re talking about ‘explicit’ FTPS, meaning it uses the normal FTP port and negotiates encryption (‘implicit’ uses a dedicated port). But anyways, if you’re still here you probably know most of this…

I like to proxy all outgoing connections from any infrastructure I set up. The reasoning is something along the lines of: why should your web server be browsing the web? Basically, servers should never be allowed to initiate connections to the outside world. I’ve had firsthand experience of servers being exploited (an OpenSSL hack comes to mind) where the exploit worked but the server was unable to connect back to the outside world, preventing the bad guy from being…bad. Where a server does need to need to initiate a connection to the outside world, it should be as restricted as possible, and via proxy. With a proxy you can control the outgoing connections of all your hosts at a single point, as well as monitor all traffic if need be. It recently came to be that an app was going ballistic in my current infrastructure, making thousands of requests a second to an outside service and effectively DoS’ing it. One change to the proxy server blocked all requests to that specific external host until the engineers could reign in the app. I didn’t need to find the app making the requests…just do the blocking at the proxy. I generally set up a load balanced VIP which has two proxies behind it, for redundancy.

Anyways, I use Apache proxy as it’s easy and free and works great for my simple proxy-ing needs. The issue came up recently that an app needed to make outgoing FTPS connections. I thought this would work without any changes as FTP outgoing proxy was already configured (mod_proxy_ftp) as was mod_proxy_connect, used for proxy-ing outgoing HTTPS requests. The one gotcha that prevented it from working and took a little bit to figure out was the AllowCONNECT directive:

The AllowCONNECT directive specifies a list of port numbers to which the proxy CONNECT method may connect. Today’s browsers use this method when a https connection is requested and proxy tunneling over HTTP is in effect.

By default, only the default https port (443) and the default snews port (563) are enabled. Use the AllowCONNECT directive to override this default and allow connections to the listed ports only.

As FTPS uses SSL and hence CONNECT when making a proxy request, we need to open up port 21. Basically:

AllowCONNECT 443 21

And all is good.

jeff

Netscaler Tips, Part 4

April 30th, 2009

Monitoring Netscalers
It is possible to monitor Netscalers yourself, but we strongly recommend LogicMonitor.com for Netscaler monitoring. It has predefined everything you need to monitor in a netscaler, requires no setup, automatically finds and monitors all your VIPs, integrated caching, GSLB, policies, etc. And keeps up to date automatically with changes. (And if you’ve ever tried to convert VIP names to snmp OIDs, you’ll appreciate how much time it saves – let alone eliminating the risk of not putting VIPs in monitoring.) Plus you can make cool dashboards easily (as well as monitor all your other devices. Netapp monitoring is also excellent.)

If you are writing your own monitors for Netscalers, once you have figured out which OIDs seem good to monitor, it helps to have some info on what they mean:
CPU goes to 100% during the gzipping of the log file, but this is no cause for concern. The NS process is in control of where the CPU allocates its cycles, and prioritizes traffic management first. Once traffic management has been taken care of, the NS process allows BSD processes to use the remaining cycles. Thus, if there were higher CPU demand from the NS process due to increases in network traffic, gzip would get a smaller percentage of the cycles.
Open Established: established connections between the NetScaler and the servers.
Active Transactions: how many of those connections are being used to handle request/response pairs
Reuse Pool: Open Established minus Active Transactions. In other words, these are connections that have not yet idled out, and are waiting to handle incoming requests.

clientConnRefused – “Client connections added the SurgeQ, and blocked from initiating a server connection to control op/s”
it refers to anytime that a connection is added to the surgeQ. This will increment whenever a client connections is temporarily queued due to SP kicking in, maxClients reached, or the client’s connection had to wait for a new server side TCP connection to be built. It does not indicate timeout issues, 5xx sent, or any other error condition. Seeing this increment is an indication of at least a short term inability of the servers to handle all the connections.

The response time of the server is measured for *every* HTTP request.
-The Least Response Time algorithm uses the average response time for the most recent complete 7-second polling interval. This provides some smoothing, but the algorithm does not strive for any greater complexity.

GSLB:
The GSLB redirects the HTTP request if the request contains the HOST (in host header) as the configured GSLB domain on the NetScaler. No host header, no redirect.

Syslog
Useful to have all netscaler events sent to syslog server.
Edit /nsconfig/syslog.conf to set up remote syslog as normal
*.* @10.1.1.1
However, the default syslog flags don’t work for remote logging.
rc.conf.defaults:syslogd_flags=”-b 127.0.0.1 -n”

That sources the packets from the loopback address when sending to a remote syslog server, which doesn’t work very well.
So add to /nsconfig/rc.conf
syslogd_flags=”-s -n”

NTP
Is not enabled by default.
Set up /nsconfig/ntp.conf
And add
ntpd_enable=”YES”
to /nsconfig/rc.conf

Netscaler implementation tips, Part 3

January 8th, 2008

Security Things
Validate Backend Servers
If you have secure data, and doing SSL to back end, should ensure netscaler checks validity of certs on services. By default it does not, which means basically its just doing ip address based authentication.
set ssl service -serverAuth ENABLED
Updating SSL keys:
Make sure you use:
Update ssl certkey
to update SSL certs – otherwise you need to unbind, remove the old certkey (as two identical certificates with the same “Subject-Identifier” and “Issuer-Identifier” cannot be loaded in the kernel), add new cert and bind again – this means a few seconds of downtime.
Header Insertions
If doing header insertion (for client IP, etc) should drop requests coming in that have that header. Netscaler will just add additional header if it exists, which could lead to insecure or indeterminate behaviour in app if it depends on header.
add service www1 -http www1 HTTP 80 -gslb NONE -maxClient 125 -maxReq 10000 -cacheable NO -cip ENABLED ClientHost
add policy expression ClientHostHead HTTPHEADER ClientHost EXISTS
add ns filter NoClientHost -reqRule ClientHostHead -reqAction RESET

Debugging things
What events did the netscaler see? Services passing/failing healthchecks? Very useful.
nsconmsg -K /var/nslog/newnslog -d event
2246 0 ’server_NSSVC_HTTP_216.52.45.145:80(test)’ UP Thu Jul 26 00:43:00 2007
2255 0 ’server_NSSVC_HTTP_216.52.45.174:80(test-vip)’ UP Thu Jul 26 00:44:29 2007
2257 58522 ’server_NSSVC_HTTP_216.52.45.145:80(test)’ Out Of Service Thu Jul 26 00:45:28 2007
2258 0 ’server_NSSVC_HTTP_216.52.45.174:80(test-vip)’ DOWN Thu Jul 26 00:45:28 2007

Was the netscaler sending traffic to various services?
nsconmsg -K /var/nslog/newnslog -s ConLb=1 -d oldconmsg | grep “time\|IP OF SERVICE or VIP”

See how things are doing:
nsconmsg –d oldconsmsg –s FIELD
Case sensitive for Field.
nsdebug_pe 1 = interface debug
ConDebug Connection info debug. 1= basic, 2= detailed, 3= all sorts of stuff about internal TCP parameters
ConLb 1= Load balancing debug
ConCSW 1=content switching debug
ConSSL 1=ssl Debug
ConCMP 1=compression debug
ConIC 1=integrated caching debug

e.g. Evaluate compression:
nsconmsg -s ConCMP=1 -d oldconmsg
CMPResps:CRes=547 Cin=20304690 Cout=6830730 Cratio=2.97(34%)
Response: Res=17649 Rin=161486642 Rout=148012682 Rratio=1.09(92%)

Compressible traffic being compressed by 66%; total only 8%

nsconmsg -s ConDebug=1 -d oldconmsg
Displaying debug performance information
Performance Data Record Version 2.0

current time is Thu Jul 19 11:56:23 2007
HTTP: Req(41580876512 1.1(39141733520) 1.0(1733429699)Get(38133042089) Postp(1966228573) Others(1481605850)) Res(41496471614 1.1(40630248963) 1.0(866222651) Pipe(11644297))
HTTP: Req/s(2623 1.1(92%) 1.0(5%) time=1) avgReq/s(0 1.1(0%) 1.0(0%) time=0)
HTTP: Res/s(2602 1.1(95%) 1.0(4%) time=1) avgRes/s(0 1.1(0%) 1.0(0%) time=0)

Note: 5% of requests are HTTP1.0. Oddly, so are 4% of responses. (Old servers?)

Examine response time (Time to first byte) of services, vservers:
To see live data:
nsconmsg -f “*svr_ttfb*” -d current
To see data in current log file, from start of log file:
nsconmsg -K /var/nslog/newnslog -f “*svr_ttfb*” -d current #historical

Nstcptrace.sh – very handy.
Can also use
/etc/nsapimgr -K nstrace3 -d netraces
to look at trace files saved with nstrace

SuSE 10 enterprise server RUG errors after SP1 upgrade

September 19th, 2007

After upgrading a bunch of SLES 10 servers to SP1, some (but not all) servers were now getting 401 Login Failed errors when doing a rug refresh (or any rug actions) from novell’s update servers. On some servers, deleting the update sources in Yast2 and reregistering via the customer center solved the issues, but on some it did not. (The issue only occurred on servers I upgraded to SP1, not where I did an install from SP1 sources.)
On the problematic ones:
Those update sources were listed as ZYPP sources, not NU. (I was letting yast create them via synchronizing with zenworks)
I had to:
remove old services:
rug sd SLES10-SP1-Online
rug sd SLES10-SP1-Updates
Add the new one (with the type specified, or it errored):
rug sa -t nu https://nu.novell.com

Subscribe to the new catalogs:
rug ca

Sub’d? | Name |
Service
——-+——————————————————–+——————————————————-
Yes | SUSE-Linux-Enterprise-Server-i386-10-0-20070605-044231 |
SUSE-Linux-Enterprise-Server-i386-10-0-20070605-044231
| SLES10-SP1-Online |

https://nu.novell.com

| SLES10-SP1-Updates |

https://nu.novell.com

| SLE10-SP1-Debuginfo-Updates |

https://nu.novell.com

#rug sub SLES10-SP1-Online
Subscribed to ‘SLES10-SP1-Online’

#rug sub SLES10-SP1-Updates
Subscribed to ‘SLES10-SP1-Updates’

rug worked then, but I still had 401 errors in Yast2 (when going to
online update, customer center, or installation sources)

To resolve them, I then had to remove the update sources in Yast2,
re-run the Novell customer center configuration, and then things seem OK.