From a.travis at abdn.ac.uk Tue Oct 12 15:49:47 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Tue, 12 Oct 2010 20:49:47 +0100
Subject: [Bio-linux-dev] bio-linux-backups 'sync' problem
Message-ID: <4CB4BBDB.1000003@abdn.ac.uk>

Hello, Bio-Linux and NBX developers.

I've encountered a potentially serious problem with "bio-linux-backups"
because someone unplugged a USB device from one of our NBX's without
'safely' removing it first. This left writes pending to the USB device
making "sync" hang indefinitely waiting for the unplugged USB device to
respond and effectively blocking "/etc/cron.daily/backup" from running.

Many people take a rather cavalier approach to unplugging USB devices,
but especially if they have NTFS filesystems on them you *must* either
unmount or "Safely Remove" them before unplugging USB storage devices
(sticks, keys, drives etc.). I think it might be useful for "backup" to
set an alarm before running "sync" so that it is possible to continue
execution past this type of potential deadlock after issuing a warning.

Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From tbooth at ceh.ac.uk Thu Oct 14 06:20:37 2010
From: tbooth at ceh.ac.uk (Tim Booth)
Date: Thu, 14 Oct 2010 11:20:37 +0100
Subject: [Bio-linux-dev] bio-linux-backups 'sync' problem
In-Reply-To: <4CB4BBDB.1000003@abdn.ac.uk>
References: <4CB4BBDB.1000003@abdn.ac.uk>
Message-ID: <1287051637.20632.1651.camel@barsukas>

Hi Tony,

Interesting. I just took a USB stick, made 2 partitions and formatted
one as NTFS and the other as VFAT. I then plugged it in and let the
hotplug magic auto-mount the two partitions under /media.
I then started 2 simultaneous processes doing "cat /dev/urandom > foo"
on each partition and when they were under way I yanked the stick out.

In this case, my system sorted itself out right away. "IO Error" is
printed, the writing processes are killed and the mounts are cleared up.
Sync works fine. This is what I hoped would happen on a modern Linux
kernel. If I end up with my system in a state where "sync" hangs
indefinitely then I generally reckon there is a big problem and an
urgent reboot is required. What I don't know is how commonly this
results from untimely removal of a USB stick or other factors (Novell
network mounts are bad for this). Most probably triggering of the
problem by untimely removal of a USB device is dependent on the exact
hardware, the kernel drivers, DBUS quirks and any number of timing
conditions.

My inclination would be to not try and make the backup script work
around this specific problem. Rather than pressing on after a failed
sync the user should really be alerted that the machine is not behaving.
Perhaps a more general solution would be to start a watchdog process at
the start of the backup script. After ten minutes the watchdog looks
for evidence that the backup is running properly and if not it shouts
loudly for sysadmin intervention.

What do you think?

TIM

On Tue, 2010-10-12 at 20:49 +0100, Tony Travis wrote:
> Hello, Bio-Linux and NBX developers.
>
> I've encountered a potentially serious problem with "bio-linux-backups"
> because someone unplugged a USB device from one of our NBX's without
> 'safely' removing it first. This left writes pending to the USB device
> making "sync" hang indefinitely waiting for the unplugged USB device to
> respond and effectively blocking "/etc/cron.daily/backup" from running.
>
> Many people take a rather cavalier approach to unplugging USB devices,
> but especially if they have NTFS filesystems on them you *must* either
> unmount or "Safely Remove" them before unplugging USB storage devices
> (sticks, keys, drives etc.). I think it might be useful for "backup" to
> set an alarm before running "sync" so that it is possible to continue
> execution past this type of potential deadlock after issuing a warning.
>
> Tony.

--
Tim Booth
NERC Environmental Bioinformatics Centre
Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB
+44 1491 69 2705

--
This message (and any attachments) is for the recipient only. NERC is
subject to the Freedom of Information Act 2000 and the contents of this
email and any reply you make may be disclosed by NERC unless it is
exempt from release under the Act. Any material supplied to NERC may be
stored in an electronic records management system.

From a.travis at abdn.ac.uk Sun Oct 17 15:57:15 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Sun, 17 Oct 2010 20:57:15 +0100
Subject: [Bio-linux-dev] bio-linux-backups 'sync' problem
In-Reply-To: <1287051637.20632.1651.camel@barsukas>
References: <4CB4BBDB.1000003@abdn.ac.uk> <1287051637.20632.1651.camel@barsukas>
Message-ID: <4CBB551B.6080106@abdn.ac.uk>

On 14/10/10 11:20, Tim Booth wrote:
> Hi Tony,
>
> Interesting. I just took a USB stick, made 2 partitions and formatted
> one as NTFS and the other as VFAT. I then plugged it in and let the
> hotplug magic auto-mount the two partitions under /media. I then
> started 2 simultaneous processes doing "cat /dev/urandom > foo" on each
> partition and when they were under way I yanked the stick out.
>
> In this case, my system sorted itself out right away. "IO Error" is
> printed, the writing processes are killed and the mounts are cleared up.
> Sync works fine. This is what I hoped would happen on a modern Linux
> kernel.
> If I end up with my system in a state where "sync" hangs
> indefinitely then I generally reckon there is a big problem and an
> urgent reboot is required. What I don't know is how commonly this
> results from untimely removal of a USB stick or other factors (Novell
> network mounts are bad for this). Most probably triggering of the
> problem by untimely removal of a USB device is dependent on the exact
> hardware, the kernel drivers, DBUS quirks and any number of timing
> conditions.

Hi, Tim.

In our case, "mount.ntfs-3g" itself was the dead-locked process: It
seems likely that the USB stick had been removed while it was still
being auto-mounted, as in plug the stick in then yank it out because
the user changed their mind or nothing seemed to be happening...

However unlikely this scenario is, it actually happened on one of our
NBX's and, as a consequence, the NBX concerned was not backed up for
ten days. I do monitor the systems, of course, but I didn't notice this
failure because it looked like the dumps were still in progress! In
fact, the dump had stalled, waiting for "sync" to complete...

Actually, in the good old days, we used to run "sync; sync" before
rebooting Unix because the first "sync" does not return until all the
deferred writes are flushed to disk. The snag is that if "sync" can't
flush the buffers to disk, it never returns.

> My inclination would be to not try and make the backup script work
> around this specific problem. Rather than pressing on after a failed
> sync the user should really be alerted that the machine is not behaving.
> Perhaps a more general solution would be to start a watchdog process at
> the start of the backup script. After ten minutes the watchdog looks
> for evidence that the backup is running properly and if not it shouts
> loudly for sysadmin intervention.
>
> What do you think?

I think that's a good idea, but it might be better to make the backup
script check and report if /backups is already mounted and exit instead
of failing silently if it can't mount /backups (for any reason) as it
does now. This will also detect if the backups take longer than 24h!

Bye,

Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt
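
The two checks discussed in this thread, a time limit on "sync" and a
test for an already mounted /backups, could be sketched roughly as
follows. This is only an illustration in Python, not the actual
"bio-linux-backups" code: the function names, the 60 second limit and
the exit status are all assumptions made for the example. It stops with
a warning rather than pressing on after a stalled "sync"; continuing
after the warning would be an equally small change.

```python
import os
import subprocess
import sys


def sync_with_timeout(seconds=60.0, command=("sync",)):
    """Run `command`; return True only if it finished in time with status 0.

    The 60 second default is an arbitrary value chosen for this sketch.
    """
    proc = subprocess.Popen(command)
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # Deliberately no kill-and-wait here: a sync stuck in the kernel
        # may not respond to signals, and waiting for it would hang us too.
        return False
    return proc.returncode == 0


def backups_mount_problem(mount_point="/backups", is_mounted=os.path.ismount):
    """Return a warning string if `mount_point` is already mounted, else None."""
    if is_mounted(mount_point):
        return (mount_point + " is already mounted: a previous backup "
                "may still be running or stalled")
    return None


def preflight():
    """Hypothetical entry point a backup script could call before dumping."""
    problem = backups_mount_problem()
    if problem is None and not sync_with_timeout():
        problem = "sync did not complete within the time limit"
    if problem is not None:
        # Report on stderr and exit non-zero instead of failing silently.
        print("backup: " + problem, file=sys.stderr)
        sys.exit(1)
```

A backup job would call preflight() once at the start. When run from
cron, anything written to stderr is normally mailed to the job's owner
if local mail delivery is set up, so the warning should not go
unnoticed.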