NBD: not enough magic
There's a bug report that I keep receiving against NBD upstream, one that is important for people trying to use it in high availability situations, but also one I cannot locate the source of. This is rather annoying.
When people use NBD over a fast line (gigabit or above), it works fine for a while; at some point, however, it suddenly stops and exits with the message 'Not enough magic'.
I know where that message comes from; the NBD protocol requires every packet to begin with a 'magic' number, and the connection is dropped if the magic number isn't correct. However, at the point where this error message appears, the connection has been up for quite a while already, and the magic number is the same for every packet; it's unlikely that it would suddenly be wrong. Thus, I can think of only two reasons why it might be wrong:
- The network somehow mangled a few bits in the data; as a result, a packet's data is corrupt and the magic number isn't correct anymore. I would be surprised if this were the case, but it isn't entirely impossible. It would mean that gigabit lines aren't entirely error-free[1], and that the TCP stack in the kernel doesn't handle corruption very well. I would find that quite strange, and don't think this is the case.
- There's a bug somewhere in the kernel (the TCP stack or the nbd client code) or in nbd-server that only shows up under high stress, and that results in the magic number not being handled properly. This would be even weirder.
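To make the failure concrete, here is a minimal sketch in Python of the kind of check that produces the message. It is not the nbd-server code. The magic value and the header layout are what I recall from the kernel's nbd header and should be treated as assumptions, and every function name below is made up for illustration. The sketch shows that a flipped bit and a read that starts one byte too late both end in the same error, which is why the message alone doesn't say which of the two explanations applies.

```python
import struct

# Assumption: request magic as recalled from the Linux nbd header.
NBD_REQUEST_MAGIC = 0x25609513
# Assumed classic request header, big-endian:
# magic (4), type (4), handle (8), offset (8), length (4) = 28 bytes.
REQUEST_HEADER = struct.Struct(">II8sQI")


class NotEnoughMagic(Exception):
    """Raised when a header does not start with the expected magic number."""


def build_request(req_type, handle, offset, length):
    """Pack a request header the way a client would put it on the wire."""
    return REQUEST_HEADER.pack(NBD_REQUEST_MAGIC, req_type, handle, offset, length)


def parse_request(header):
    """Unpack a 28-byte header and refuse it if the magic number is wrong."""
    magic, req_type, handle, offset, length = REQUEST_HEADER.unpack(header)
    if magic != NBD_REQUEST_MAGIC:
        raise NotEnoughMagic(f"got 0x{magic:08x}")
    return req_type, handle, offset, length


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted (simulated corruption)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    good = build_request(0, b"handle01", 4096, 512)
    candidates = {
        "intact": good,
        "one flipped bit": flip_bit(good, 3),
        # Two back-to-back requests, but the reader is one byte out of step.
        "read started one byte late": (good + good)[1:29],
    }
    for label, header in candidates.items():
        try:
            parse_request(header)
            print(label, "-> accepted")
        except NotEnoughMagic as err:
            print(label, "-> Not enough magic:", err)
```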
If anyone has enough hardware knowledge to tell me whether the first explanation (corruption on the network) is likely, I'd appreciate a hint.
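One thing that bears on the first explanation: TCP protects every segment with a 16-bit one's complement checksum, so corrupted data should normally be dropped and retransmitted rather than handed to the application. The sketch below computes that style of checksum over a plain byte string, following RFC 1071. It is not the kernel's implementation, it leaves out the pseudo-header that real TCP includes, and the function name is mine. It illustrates that a single flipped bit always changes the checksum, while some multi-bit changes, such as two 16-bit words swapping places, do not.

```python
def internet_checksum(data):
    """16-bit one's complement checksum over data, in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = b"\x25\x60\x95\x13\x00\x00\x00\x01"
    one_bit_off = bytes([payload[0] ^ 0x01]) + payload[1:]
    # A single-bit error changes the checksum, so it is detected.
    print(internet_checksum(payload) != internet_checksum(one_bit_off))
    # Swapping two 16-bit words leaves the sum unchanged, so it is not detected.
    print(internet_checksum(b"\x12\x34\x56\x78") == internet_checksum(b"\x56\x78\x12\x34"))
```

So for the first explanation to hold, the damage would have to be of a kind that slips past this check, or it would have to happen somewhere the checksum doesn't cover.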
[1] In contrast to (at least) 10Mbit, which is what I have used for testing for quite a while, and which does not exhibit that behaviour. I recently got a 100Mbit network hub, though, and I'll start testing with that as well soon.