So, I upgraded the kernel on one of my personal boxes today for the first time in 2.5 years. It was running a rather shaky 2.6.17 vserver kernel. In order to support IPv6 on my vservers, I needed to upgrade to linux-vserver 2.3+. So that is what I did, upgrading my kernel to 2.6.28 in the process. Unfortunately, as you'd expect when making such a massive version jump, there were lots and lots of driver changes, one of which can give you a severe headache.
This particular headache was centered around your favorite SATA card and mine... the Marvell MV88SX6081. I've written about this card in years past. It's the model of card that is contained within the innards of Sun X4500s. I'm not aware of the whole sordid story, but it definitely appears that Marvell sucks at implementation and apparently doesn't help at all with the Linux drivers. Who uses that Linux crap anyway? The error in question is random drive timeouts that present themselves like this:
[11363.029485] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11363.029495] ata5.00: cmd 25/00:08:1f:ce:b7/00:00:1f:00:00/e0 tag 0 dma 4096 in
[11363.029496] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[11363.029500] ata5.00: status: { DRDY }
[11363.029510] ata5: hard resetting link
[11363.502233] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11363.550271] ata5.00: configured for UDMA/133
[11363.550287] ata5: EH complete
It freezes up your I/O in the process. This server ran for 2.5 years straight prior to this on 2.6.17 and never exhibited this issue. After much Googling and a few dead ends, I stumbled upon this gem of a thread. The current maintainer (or at minimum honorary maintainer) of all things Marvell, Mark Lord, tracked down the problem with the help of many users' reports. According to the thread, it all comes down to a single line of code in drivers/ata/sata_mv.c that results in a race condition.
The patch can be found here.
So if you're running 2.6.28, patch this single line and all will be fixed. At least, I hope so. I'm about to reboot with what is effectively a 6-byte patch. Big round of thanks to Mark Lord.
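If you want to check whether a box is still hitting these timeouts after a reboot, something like the following rough Python sketch could count them per device. It assumes the kernel log looks like the excerpt above, where the "(timeout)" line has no ataN prefix of its own and so gets attributed to the most recently mentioned device. The function name and the embedded sample are mine, purely for illustration, and are not part of any tool.

```python
import re
from collections import Counter

# Matches libata device prefixes such as "ata5:" or "ata5.00:".
DEVICE_RE = re.compile(r"\b(ata\d+(?:\.\d+)?):")

# Assumption: timeouts show up as "(timeout)", as in the log excerpt above.
TIMEOUT_MARKER = "(timeout)"


def count_timeouts(log_text):
    """Count timeout lines per ATA device in kernel log text.

    The timeout line may not name a device itself, so it is credited
    to the last device seen on an earlier (or the same) line.
    """
    counts = Counter()
    current = None
    for line in log_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            current = match.group(1)
        if TIMEOUT_MARKER in line and current is not None:
            counts[current] += 1
    return counts


# Sample copied from the log excerpt in this post.
SAMPLE = """\
[11363.029485] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11363.029495] ata5.00: cmd 25/00:08:1f:ce:b7/00:00:1f:00:00/e0 tag 0 dma 4096 in
[11363.029496] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[11363.029500] ata5.00: status: { DRDY }
[11363.029510] ata5: hard resetting link
"""

if __name__ == "__main__":
    for device, n in sorted(count_timeouts(SAMPLE).items()):
        print(f"{device}: {n} timeout(s)")
```

To use it on a live box you would feed it the output of dmesg instead of the sample. A count that keeps climbing after the patch would suggest the problem isn't actually fixed.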
P.S. Thanks, Alan Cox, for adamantly rejecting the idea that it could have anything to do with the kernel or driver despite the evidence pointing to the contrary. I also love how there was never a response after Mark discovered the issue really was in the driver.
3 comments
The issue was still there with a 2.6.29 prerelease (I think; the guys that supplied the hardware tried a newer kernel, but it was still in Debian beta).
