lighttpd is great (it serves a lot, quickly), but under very heavy load it behaves rather erratically. i'm not sure where the fault lies, though: in lighttpd, in FastCGI (which is how lighttpd runs PHP), in FreeBSD, or in some combination of the three.
the symptoms are error messages like these:
2005-12-19 10:11:52: (mod_fastcgi.c.1561) connect failed: 242 Connection refused 61 0 /tmp/php-fastcgi.socket-0
2005-12-19 10:11:52: (mod_fastcgi.c.2663) socket failed: No buffer space available 263 4086
2005-12-19 10:11:52: (mod_fastcgi.c.2663) socket failed: No buffer space available 263 4086
2005-12-19 10:11:52: (mod_fastcgi.c.2663) socket failed: No buffer space available 263 4086
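to see which failure dominates and when it starts, the error log can be tallied by source location and message. this is a minimal sketch, assuming every relevant line follows the "timestamp: (source.c.line) message: ..." layout shown above. tally_errors is a hypothetical helper, not anything that ships with lighttpd.

```python
import re
import sys
from collections import Counter
from typing import Iterable, Tuple

# Timestamp, "(source.c.line)", then the short message up to the first colon,
# e.g. "2005-12-19 10:11:52: (mod_fastcgi.c.2663) socket failed: ..."
# (assumes the error-log layout quoted in the post).
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): \(([^)]+)\) ([^:]+):")


def tally_errors(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count error lines by (source location, message) and by second."""
    by_kind: Counter = Counter()
    by_second: Counter = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # not in the assumed format; skip rather than guess
        ts, src, msg = m.groups()
        by_kind[(src, msg)] += 1
        by_second[ts] += 1
    return by_kind, by_second


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        kinds, seconds = tally_errors(log)  # streams the file line by line
    for (src, msg), n in kinds.most_common():
        print(f"{n:8d}  {src}  {msg}")
    print("busiest seconds:", seconds.most_common(5))
```

pass it the path of the error log. because the file is read one line at a time, memory use stays proportional to the number of distinct messages and seconds, not to the size of the log.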
"no buffer space available" is supposed to mean that the system has run out of network buffers (mbufs). if that were the case, netstat -m should show the buffer pool exhausted, but here is what netstat reports:
158/1552/18176 mbufs in use (current/peak/max):
127 mbufs allocated to data
31 mbufs allocated to packet headers
74/496/4544 mbuf clusters in use (current/peak/max)
1380 Kbytes allocated to network (10% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
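which says there's plenty of buffer space: peak usage was 1552 of 18176 mbufs and 496 of 4544 clusters, and no requests for memory were denied or delayed.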
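the same check can be scripted so it's easy to rerun the next time the errors show up. this is a minimal sketch that assumes the older FreeBSD netstat -m layout quoted above; newer releases may print different lines, in which case the two patterns would need adjusting. summarize_netstat_m is a hypothetical name.

```python
import re
import sys
from typing import Dict, Tuple

# "158/1552/18176 mbufs in use (current/peak/max):" (assumed output layout)
TRIPLE = re.compile(r"^\s*(\d+)/(\d+)/(\d+) (.+?) in use \(current/peak/max\)")
# "0 requests for memory denied", "0 calls to protocol drain routines", ...
COUNTER = re.compile(
    r"^\s*(\d+) (requests for memory (?:denied|delayed)|calls to protocol drain routines)"
)


def summarize_netstat_m(text: str) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Return (peak/max ratio per buffer pool, failure counters) from netstat -m text."""
    peak_ratio: Dict[str, float] = {}
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        m = TRIPLE.match(line)
        if m:
            _current, peak, maximum, pool = m.groups()
            peak_ratio[pool] = int(peak) / int(maximum)
            continue
        m = COUNTER.match(line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return peak_ratio, counters


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects a file holding saved "netstat -m" output.
    with open(sys.argv[1]) as f:
        ratios, failures = summarize_netstat_m(f.read())
    for pool, r in ratios.items():
        print(f"{pool}: peak at {r:.1%} of max")
    print(failures)
```

against the output above this gives peak ratios of about 0.085 for mbufs and 0.11 for clusters, with all three failure counters at zero: the pools were nowhere near full.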
then there's this ticket on the lighttpd bug tracker… it's somewhat out of date, since lighttpd no longer pegs the cpu at 100%, but it still becomes unresponsive and hangs under the same conditions.
all in all i probably just need to add another server. first i need to compile some statistics on exactly how many pets etc. i serve up, but i stopped running a stats package on the image/swf servers because the log files were too large to analyze with the amount of RAM i've got. i guess i can just grep -c for something appropriate, or heck, even just count the lines with wc -l.
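counting doesn't need much memory as long as the log is streamed one line at a time instead of loaded whole. this is a minimal sketch that goes one step past wc -l by tallying requests per file extension (.swf, .gif and so on) in a single pass. it assumes the access log is in common log format, with the request as the first quoted field; request_path and count_by_extension are hypothetical helper names.

```python
import os
import sys
from collections import Counter
from typing import Iterable, Optional


def request_path(line: str) -> Optional[str]:
    """Pull the URL path out of a log line, assuming common log format."""
    # The request is assumed to be the first quoted field, e.g. "GET /pets/cat.swf HTTP/1.1".
    parts = line.split('"')
    if len(parts) < 2:
        return None
    fields = parts[1].split()
    if len(fields) < 2:
        return None
    return fields[1].split("?", 1)[0]  # drop any query string


def count_by_extension(lines: Iterable[str]) -> Counter:
    """Tally requests by file extension, holding only the counter in memory."""
    counts: Counter = Counter()
    for line in lines:
        counts["total"] += 1
        path = request_path(line)
        if path is None:
            counts["(unparsed)"] += 1
            continue
        ext = os.path.splitext(path)[1].lower() or "(none)"
        counts[ext] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        for key, n in count_by_extension(log).most_common():
            print(f"{n:10d}  {key}")
```

the "total" entry matches what wc -l would report; the per-extension split is what would otherwise take one grep -c pass per pattern.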