I like Netscalers.
I like MySQL.
But the two together sometimes do not play well.
One client had an issue where some parts of the application would stall when talking from Mongrel to the database. The root cause was an interaction between the MySQL protocol and the Netscaler's DoS defences:
- The MySQL client sends a SYN to open a connection to the DB.
- Netscaler responds with SYN ACK.
- The client receives it and sends an ACK back. This was the packet that occasionally got lost. (It was rare, roughly once in every 10,000 new connections, but often enough to bother the app.)
At this point the client just sits there, thinking it has an open TCP connection. In the MySQL protocol, once the connection is open the client expects the first packet to come from the server, announcing which version of MySQL it's running. So the client waits. (It doesn't retransmit the ACK: as far as it is concerned, if that packet got lost, the server side, here the Netscaler, should retransmit its SYN-ACK.)
On the Netscaler side, it sent a SYN-ACK but never heard the corresponding ACK back. So it assumes the SYN was spoofed and, by design, never retransmits the SYN-ACK.
So the Netscaler forgets the connection (not that it held any state for it; it uses SYN cookies). The client keeps waiting, and nothing happens.
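The deadlock described above can be written down as a tiny decision function. This is only a toy model of the reasoning in this post, not a real TCP stack; the function name and its three flags are made up for illustration.

```python
def handshake_outcome(final_ack_lost: bool,
                      server_retransmits_synack: bool,
                      client_speaks_first: bool) -> str:
    """Toy model: what happens to a new connection if the client's final ACK is lost."""
    if not final_ack_lost:
        # The normal three-way handshake completed.
        return "established"
    if server_retransmits_synack:
        # A server that keeps handshake state resends the SYN-ACK,
        # and the client answers with a fresh ACK.
        return "established"
    if client_speaks_first:
        # The client's first request carries the same acknowledgement,
        # and TCP retransmits data until it is acknowledged.
        return "established"
    # SYN-cookie server plus a server-speaks-first protocol: both sides wait.
    return "stalled"


# MySQL through a SYN-cookie device: server speaks first, no SYN-ACK retransmit.
print(handshake_outcome(final_ack_lost=True,
                        server_retransmits_synack=False,
                        client_speaks_first=False))
```

Flipping either of the last two flags to True is enough to get an established connection, which is why the problem needs both ingredients at once.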
If there are thousands of new connections and there is 0.01% packet loss, you'll run into this issue. That level of loss does not cause an issue for anything else: TCP will retransmit if the connection is terminated on any other device except the Netscaler, and the lack of retransmission would not matter for any other application except MySQL, which expects the server, not the client like everything else, to be the first to speak once a connection is established. (If the client made the first request, it would effectively be retransmitting the missing packet.)
So it won't happen unless you are running MySQL through a Netscaler with an application that does not use connection pooling. (Once the connection is really established, normal TCP retransmissions take care of any loss. A connection pool keeps the TCP connection alive rather than creating a new one for each request.)
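To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every new connection independently loses its final ACK with the same probability (0.01%, the figure above); real loss is usually burstier, so treat the result as an order of magnitude only. The function names are mine.

```python
def expected_stalls(new_connections: int, loss_rate: float = 1e-4) -> float:
    """Average number of stalled connections, assuming independent losses."""
    return new_connections * loss_rate


def p_at_least_one_stall(new_connections: int, loss_rate: float = 1e-4) -> float:
    """Probability that at least one of the new connections stalls."""
    return 1.0 - (1.0 - loss_rate) ** new_connections


for n in (1_000, 10_000, 100_000):
    print(n, expected_stalls(n), round(p_at_least_one_stall(n), 3))
```

An application that opens a fresh connection per request gets through thousands of handshakes quickly, so even a one-in-10,000 event shows up regularly.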
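To show where the client actually blocks, here is a minimal sketch of a client for a server-speaks-first protocol that waits for the greeting with a timeout and opens a fresh connection if nothing arrives. It is not what the client did and not how any MySQL driver is implemented; the function name, timeout, and attempt count are assumptions for illustration, and it only uses Python's standard socket module.

```python
import socket


def read_server_greeting(host: str, port: int,
                         timeout_s: float = 3.0, attempts: int = 3) -> bytes:
    """Connect and wait for the server to speak first; reconnect on silence."""
    last_error = None
    for _ in range(attempts):
        try:
            # The timeout applies to the connect and to the recv below.
            with socket.create_connection((host, port), timeout=timeout_s) as sock:
                data = sock.recv(4096)  # this is where a stalled client sits
                if data:
                    return data
                last_error = ConnectionError("server closed before greeting")
        except OSError as exc:  # includes socket timeouts
            last_error = exc
    raise ConnectionError(f"no greeting after {attempts} attempts") from last_error
```

A new attempt means a new SYN and a new handshake, so a lost final ACK on one attempt does not affect the next one.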
One of the clues was that the Netscaler monitoring showed an occasional increase in unacknowledged SYNs. The rest was figured out using nstcpdump, a handy tool from the Netscaler shell CLI that lets you capture packets tcpdump-style. (Regular tcpdump doesn't work, as it can't see the packets processed by VIPs.)
This is actually not a “NetScaler” problem; it is a problem with any server using SYN cookies. In this case the NetScaler is terminating the client request, so it is a server.
The statement “TCP will retransmit if the connection is terminated on any other device except the NetScaler” is not accurate. What would be accurate is “TCP will retransmit if the connection is terminated on any device not implementing SYN Cookies, such as the NetScaler”.
True enough. SYN cookies are not unique to Netscalers, although most other OSes have a threshold of unacknowledged SYNs before they start relying on cookies.
So the statement should really be “terminated on a device that did not use a SYN cookie to manage this specific connection.”
Thanks for the clarification.
Because the Netscaler does SYN flood protection on a TCP virtual server, it is normal to hit this issue occasionally.
If you set up a virtual server of type “Any” instead of a TCP virtual server on the Netscaler, you should not have this problem anymore.
Good point.