Samba upgraded to lenny

Samba.grep.be, my main server (which occasionally also hosts a domU for p2), was upgraded to lenny this Monday evening. I wish I could say there were no issues, but unfortunately that's not true.

At first, everything seemed to run smoothly; of course, I did have to recompile a few locally written programs against some SONAME changes (programs that are too trivial to package, really, and too ugly, too), but other than that, everything was fine.

It went wrong sometime during the night. Apparently something didn't like the new kernel or the new Xen, because everything started to lock up. Since I only noticed this when I was at a customer's site the next day, I couldn't fix it until I got home again in the evening (meaning downtime for a full day); and unfortunately, the exact failure mode meant that not much was written to the log -- certainly not enough to figure out what went wrong:

Mar 10 04:04:41 samba kernel: [ 7033.011523] INFO: task cron:21006 blocked for more than 120 seconds.
Mar 10 06:43:46 samba kernel: [ 7033.011614] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 10 17:35:04 samba kernel: [ 7033.011701] cron          D 7fffffffffffffff     0 21006   4147

After that, nothing until the reboot, which happened at around 21:00. Take good note of the timings on those messages, and the apparent discrepancy between the syslog timestamps and the kernel timestamps: all three lines carry virtually the same kernel timestamp, yet their syslog timestamps are hours apart. 7033 seconds after boot was around 3:30 AM...
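For the record, and purely as a sketch (the offset is simply the one from the log above): a kernel timestamp like 7033.011523 is seconds since boot, so as long as the machine hasn't been rebooted in the meantime, it can be mapped back to wall-clock time by deriving the boot time from /proc/uptime and adding the offset. In Python, something like:

import time

# Derive the boot time from /proc/uptime, then add the kernel message
# offset (here: the 7033 s timestamp from the log above) to get an
# approximate wall-clock time for when the message was generated.
# This only makes sense on the same boot that produced the message.
with open("/proc/uptime") as f:
    seconds_since_boot = float(f.read().split()[0])

boot_time = time.time() - seconds_since_boot
kernel_offset = 7033.011523  # kernel timestamp copied from the log
print(time.ctime(boot_time + kernel_offset))

Not that this helps much after the fact, of course, but it's one way to cross-check the "around 3:30 AM" figure above.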

When I arrived at the physical server, this kind of message was scrolling over the console several times a second, with just the name and PID of the blocked process changing (and with several more lines following each message, of course). As a result, the system was not responsive at all.

I didn't file a bug since, frankly, there was so little information that I wouldn't even know what to file it against (the kernel? Xen? Some rogue process that runs as root and changes settings that cause a driver to block? I wouldn't know). Instead, all I could do was reboot the server and hope for the best.

Not that I like that kind of situation. Fortunately, the problem didn't reproduce itself today; here's hoping it never will.