Archive for the ‘Uncategorized’ Category

Why Apache Monitoring is not as easy as they say…

Wednesday, August 18th, 2010

Setting up Apache monitoring is not really as easy as they say.
Or, more accurately, getting Apache monitored is easy; setting appropriate alerts is not always.
While LogicMonitor has sensible defaults based on Apache httpd’s default MaxClients, many sites have to tune MaxClients: either increasing it if they serve mostly static content, or decreasing it if they serve PHP, Perl or other backends, because of memory and scalability issues. Monitoring systems have no way of knowing what limit an httpd server has actually been configured with. It’s easy enough to adjust the thresholds in the monitoring system, but I wonder why mod_status does not report the configured limits of MaxClients (and ThreadsPerChild).
Then monitoring systems could read these values and set their thresholds appropriately and automatically. (At least decent ones like LogicMonitor could. Any monitoring system that can’t do calculated alert thresholds is not worth deploying, IMHO.)
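
To make the "calculated threshold" idea concrete, here is a minimal sketch, assuming you poll the machine-readable mod_status view (the ?auto output, which includes a BusyWorkers line). The function names, the sample text and the 80% warning fraction are my own illustration, not anything LogicMonitor ships. Note that max_clients still has to be supplied by hand, which is exactly the gap complained about above.

```python
def parse_mod_status(text):
    """Parse 'Key: value' lines from mod_status ?auto output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def worker_utilization(status_text, max_clients):
    """Fraction of the configured MaxClients that is currently busy."""
    busy = int(parse_mod_status(status_text)["BusyWorkers"])
    return busy / max_clients


def should_alert(status_text, max_clients, warn_fraction=0.8):
    """Alert relative to the configured limit, not a fixed worker count.

    warn_fraction is a hypothetical default chosen for this example.
    """
    return worker_utilization(status_text, max_clients) >= warn_fraction


# Hypothetical sample of ?auto output, trimmed to a few lines.
SAMPLE = "Total Accesses: 1234\nBusyWorkers: 45\nIdleWorkers: 5\n"

if __name__ == "__main__":
    # The same 45 busy workers is alarming at MaxClients 50, harmless at 256.
    print(should_alert(SAMPLE, max_clients=50))
    print(should_alert(SAMPLE, max_clients=256))
```

The point of the design is that the same BusyWorkers reading means very different things depending on the tuned limit, so the threshold has to be computed from it.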

Running NetApp Simulator on a vSphere virtual machine.

Friday, August 13th, 2010

Setting up the NetApp simulator on a CentOS server seemed, from all accounts, like it should be simple.
So I provisioned another CentOS server on a vSphere host using Cobbler, installed the NetApp simulator, and ran setup.
The first minor hiccup: if you try to give the simulator lots of memory (I tried about 3G), it dies with a segmentation fault that the runsim.sh script does not report. (Run the command that the script calls manually and you can see it.)
Change that to 512M, and it works fine.

Second hiccup – I now had a working NetApp running on my Linux host, but it could not communicate with any other host. I could see its ARP requests reaching the default gateway, and the gateway replying, but the replies were not being seen on the Linux host hosting the simulator, nor on the simulator itself.

This tickled the memory that VMware locks down the virtual switch against promiscuous mode (and the simulator puts the host Linux into promiscuous mode, so it can receive packets for the virtual NetApp, which has its own MAC and IP).
So: fire up vCenter, click the Host, Configuration, Networking, Properties, and set the VM Network switch to accept Promiscuous Mode, MAC Address Changes, and Forged Transmits.

Now my virtual NetApp is reachable, I can mount its volumes from other hosts in the lab, and all is well.
It responds to NetApp monitoring software just like a real NetApp, too, with API monitoring, etc.

Monit and “unable to parse response” errors

Tuesday, May 25th, 2010

Just upgraded a client’s monit installation on staging systems to 5.1.1 from 4.8.
After the restart of the new monit via /etc/init.d/monit stop; /etc/init.d/monit start, everything was fine, but some interactive commands reported “unable to parse response”
e.g. /usr/bin/monit restart all -g my_group
monit: action failed — unable to parse response

A quick google didn’t reveal much except the source code that generated the message, so thought I’d post in case it helps anyone else.
The source code showed it was from a piece of code connecting to the monit http status page.
So I telnetted to that port on localhost, and it showed a banner from monit 4.8.
Apparently /etc/init.d/monit stop doesn’t stop all the bits of monit.
So: /etc/init.d/monit stop again; kill the pid of the remaining monit process; /etc/init.d/monit start.
A telnet to the port shows a banner from monit 5.1.1, and the monit restart command works fine now.
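
If you would rather script the banner check than telnet by hand, here is a minimal sketch. It assumes the version shows up in the HTTP Server header of monit's status page and that the page listens on port 2812 (monit's usual default; substitute whatever your monitrc sets). Both function names are made up for this example.

```python
import socket


def server_banner(raw_response):
    """Return the Server header value from a raw HTTP response, or None."""
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"server":
            return value.strip().decode("latin-1")
    return None


def fetch_banner(host="localhost", port=2812, timeout=5.0):
    """Ask whatever is listening on the port which server it claims to be.

    Port 2812 is an assumption; use the port from your own configuration.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return server_banner(b"".join(chunks))


if __name__ == "__main__":
    # If this still reports the old version after an upgrade,
    # an old monit process is probably still holding the port.
    print(fetch_banner())
```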

MySQL through Netscalers not a good idea without connection pool.

Thursday, May 20th, 2010

I like Netscalers.
I like MySQL.
But the two together sometimes do not play well.

One client had an issue where some parts of the application would stall when talking from mongrel to the database. The root cause was an interaction between the mysql protocol and netscaler DoS defences:

  • Mysql client sends a SYN to open a connection to the DB.
  • Netscaler responds with SYN ACK.
  • Client gets it, and sends ACK back. This was the packet that occasionally got lost. (Very rarely, but usually it would occur once in every 10000 new connections or so. Enough to bother the app.)

At this point, the client will just sit there, thinking it has an open TCP connection. With the MySQL protocol, after the client has an open connection, it expects the first packet to come from the server, announcing what version of MySQL it’s running. So the client waits. (It doesn’t retransmit, as it thinks that if its packet got lost, the server side – the Netscaler – should retransmit its SYN-ACK.)

On the netscaler side, it sent a SYN-ACK, but never heard a corresponding ACK back. So it thinks it was a spoofed connection, and by design, never retransmits the SYN-ACK.

So the netscaler forgets its state (not that it had any – it uses SYN cookies). The client is waiting. So nothing happens.
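
The client's side of this stall can be sketched in a few lines. This is an illustration only, not what the MySQL client library does: it connects, then waits for the server to speak first, and a timeout is the only thing that stops the wait. The function names and the timeout values are assumptions for the example; 3306 is MySQL's conventional port.

```python
import socket


def read_greeting(sock, timeout=3.0):
    """Wait for the server to send the first bytes on a connected socket.

    Returns the bytes received, or None if the server stays silent.
    """
    sock.settimeout(timeout)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        # The stall described above: connected as far as the client
        # knows, but the server greeting never arrives.
        return None
    return data or None


def connect_and_greet(host, port=3306, timeout=3.0):
    """Open a new TCP connection and wait for the server greeting."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return read_greeting(sock, timeout)


if __name__ == "__main__":
    # A local socket pair stands in for client and server.
    client, server = socket.socketpair()
    print(read_greeting(client, timeout=0.5))  # server never speaks
    client.close()
    server.close()
```

Without a timeout on that first read, a client in this state waits indefinitely, which is what the application saw as a stall.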

If there are 1000s of connections, and there is 0.01% packet loss, you’ll run into this issue. That level of loss does not cause an issue for anything else – TCP will retransmit if the connection is terminated on any other device except the Netscaler, and the lack of retransmission would not matter for any other application except MySQL (which expects the server to be the first to respond once a connection is established, not the client like everything else. If the client made the first request, it would effectively be retransmitting the missing packet.)
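
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each new connection independently loses its handshake ACK with the same probability, which is a simplification; real loss tends to be bursty.

```python
def stall_probability(loss_rate, new_connections):
    """Chance that at least one of the new connections stalls.

    Assumes independent loss of the final handshake ACK per connection.
    """
    return 1.0 - (1.0 - loss_rate) ** new_connections


def expected_stalls(loss_rate, new_connections):
    """Average number of stalled connections under the same assumption."""
    return loss_rate * new_connections


if __name__ == "__main__":
    # 0.01% loss across 10,000 new connections: one stall expected on
    # average, and roughly a 63% chance of seeing at least one.
    print(expected_stalls(0.0001, 10_000))
    print(round(stall_probability(0.0001, 10_000), 3))
```

An application that opens a new connection per request gets through 10,000 connections quickly, so a loss rate that sounds negligible turns into regular stalls.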

So it wouldn’t happen unless you are using MySQL to a Netscaler with an application that does not use connection pooling. (Once the connection is really established, normal TCP retransmissions would address any loss. A connection pool will keep the TCP connection alive, rather than creating a new one for each request.)

One of the clues was that Netscaler monitoring showed an occasional increase in unacknowledged SYNs. The rest was figured out using nstcpdump, a handy tool from the Netscaler shell CLI that lets you capture packets using tcpdump. (Regular tcpdump doesn’t work, as it can’t see the packets processed by VIPs.)

Netscaler Tips, Part 4

Thursday, April 30th, 2009

Monitoring Netscalers
It is possible to monitor Netscalers yourself, but we strongly recommend LogicMonitor.com for Netscaler monitoring. It has predefined everything you need to monitor in a netscaler, requires no setup, automatically finds and monitors all your VIPs, integrated caching, GSLB, policies, etc. And keeps up to date automatically with changes. (And if you’ve ever tried to convert VIP names to snmp OIDs, you’ll appreciate how much time it saves – let alone eliminating the risk of not putting VIPs in monitoring.) Plus you can make cool dashboards easily (as well as monitor all your other devices. Netapp monitoring is also excellent.)

If you are writing your own monitors for Netscalers, once you have figured out which OIDs seem good to monitor, it helps to have some info on what they mean:
CPU goes to 100% during the gzipping of the log file, but this is no cause for concern. The NS process is in control of where the CPU allocates its cycles, and prioritizes traffic management first. Once traffic management has been taken care of, the NS process allows BSD processes to use the remaining cycles. Thus, if there were higher CPU demand from the NS process due to increases in network traffic, gzip would get a smaller percentage of the cycles.
Open Established: established connections between the NetScaler and the servers.
Active Transactions: how many of those connections are being used to handle request/response pairs
Reuse Pool: Open Established minus Active Transactions. In other words, these are connections that have not yet idled out, and are waiting to handle incoming requests.
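
If your monitoring collects the first two counters, the third can be derived rather than polled. A minimal sketch follows; the function name is hypothetical, and the clamp at zero is my own guard against the two counters being sampled at slightly different moments.

```python
def reuse_pool(open_established, active_transactions):
    """Idle server-side connections: Open Established minus Active Transactions."""
    # Clamp at zero in case the counters were read at different instants (assumption).
    return max(open_established - active_transactions, 0)


if __name__ == "__main__":
    print(reuse_pool(120, 45))  # 75 connections waiting for requests
```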

clientConnRefused – “Client connections added the SurgeQ, and blocked from initiating a server connection to control op/s”
This refers to any time a connection is added to the SurgeQ. It increments whenever a client connection is temporarily queued because SP kicked in, maxClients was reached, or the client’s connection had to wait for a new server-side TCP connection to be built. It does not indicate timeout issues, 5xx responses sent, or any other error condition. Seeing it increment indicates at least a short-term inability of the servers to handle all the connections.

The response time of the server is measured for *every* HTTP request.
The Least Response Time algorithm uses the average response time for the most recent complete 7-second polling interval. This provides some smoothing, but the algorithm does not strive for any greater complexity.
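
As a rough illustration of that averaging (not the NetScaler's actual implementation), the sketch below buckets response times into fixed 7-second intervals and compares servers on the most recent complete one. Aligning the intervals to time zero, skipping servers with no samples, and all of the names are my assumptions; it also ignores any other inputs the real load balancing decision may use.

```python
INTERVAL = 7.0  # seconds, per the description above


def last_complete_interval_average(samples, now, interval=INTERVAL):
    """Mean response time over the most recent complete interval.

    samples is an iterable of (timestamp, response_time) pairs.
    Returns None if no request completed in that interval.
    """
    current_start = (now // interval) * interval  # interval still in progress
    prev_start = current_start - interval         # last complete interval
    window = [rt for ts, rt in samples if prev_start <= ts < current_start]
    if not window:
        return None
    return sum(window) / len(window)


def pick_server(samples_by_server, now, interval=INTERVAL):
    """Pick the server with the lowest average in the last complete interval."""
    averages = {}
    for server, samples in samples_by_server.items():
        avg = last_complete_interval_average(samples, now, interval)
        if avg is not None:
            averages[server] = avg
    if not averages:
        return None
    return min(averages, key=averages.get)


if __name__ == "__main__":
    data = {
        "web1": [(8.0, 0.2), (10.0, 0.4), (14.5, 9.0)],  # 9.0 is in the open interval
        "web2": [(3.0, 0.01), (9.0, 0.5)],               # 0.01 is too old to count
    }
    print(pick_server(data, now=15.0))
```

Because only the last complete interval counts, a single slow response in the current interval has no effect until that interval closes, which is the limited smoothing described above.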

GSLB:
GSLB redirects an HTTP request only if the request’s Host header matches the GSLB domain configured on the NetScaler. No Host header, no redirect.

Syslog
Useful to have all netscaler events sent to syslog server.
Edit /nsconfig/syslog.conf to set up remote syslog as normal
*.* @10.1.1.1
However, the default syslog flags don’t work for remote logging.
rc.conf.defaults:syslogd_flags="-b 127.0.0.1 -n"

That sources the packets from the loopback address when sending to a remote syslog server, which doesn’t work very well.
So add to /nsconfig/rc.conf
syslogd_flags="-s -n"

NTP
Is not enabled by default.
Set up /nsconfig/ntp.conf
And add
ntpd_enable="YES"
to /nsconfig/rc.conf