A. Other notes

The preceding notes are all you will need to get a fully functional and robust system. For those more ambitious, reckless or paranoid, or those with a system that no longer boots, I include the following supplementary notes regarding hardware, system recovery and stability.

A.1 Removing the Hard Drive

At some point you may:

decide you want to upgrade the hard drive

do something or suffer something that results in the system being unbootable.

Since you are not able to boot from a rescue disc or any other sort of alternate media, both events require that you remove the hard drive. In the following I supply only one piece of information that is not otherwise completely obvious.

Removing the drive is straightforward. Take out the obvious screws on the back of the case and remove the back. Then remove the obvious screws holding the hard drive harness to the motherboard, slide the whole assembly over to disconnect it, and pull it out. You may or may not have to remove the drive from the harness.

Plug the old drive or the replacement drive into another system, using whatever adaptors and jumper settings you need to support a secondary hard drive. Then run fsck, edit some init script, fix the kernel and install it, or install a new system via debootstrap and move your data, as appropriate.
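The first of those steps might look roughly like the following sketch. The device name and mount point are only examples of what you would pass in, and the `run` helper is a hypothetical convenience that prints the commands unless RUN=1 is set, so you can look them over before letting them touch the disk.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the basic repair steps for a Progear
# partition attached to another machine. Nothing here is Progear-specific.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

repair_progear() {
    part=$1        # the Progear partition as seen by the host, e.g. /dev/hdb2
    mnt=$2         # where to mount it on the host, e.g. /mnt/progear
    run fsck -f "$part"          # check it interactively while it is unmounted
    run mkdir -p "$mnt"
    run mount "$part" "$mnt"     # now edit init scripts, copy a kernel, ...
}
```

Calling `repair_progear /dev/hdb2 /mnt/progear` prints the three commands; prefix it with `RUN=1` as root to actually run them.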

If any of your activity required replacing the kernel, you must use lilo to make it bootable. Create a lilo.conf separate from both the one on the Progear drive and the one on the boot drive of the system the Progear drive is currently attached to. This version is used only in this setting; once the drive is reinstalled you will use the usual copy. Configure it to install a boot sector on the Progear drive by naming the device it is currently assigned to, e.g., 'boot=/dev/hdb', and give the kernel image relative to its current location in the filesystem, e.g., 'image=/mnt/progear/boot/vmlinuz-2.6.5....'. Both are used right now, when you run lilo to install the kernel. But specify the root filesystem relative to where it will be when the drive is back in the Progear, e.g., 'root=/dev/hda2', as this path is not used until lilo actually boots the system.

Also, and here is the useful bit of information, after 'boot=...' include

disk=/dev/hdc
    bios=0x80

This corrects for the drive currently being attached as a secondary drive when it will be the primary drive once it is used to boot the Progear. I don't really understand the problem, or why this fixes it, but I have found that (sometimes?) it doesn't work if you don't do this.
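Put together, the temporary lilo.conf could be generated along the lines of this sketch. Several things here are assumptions rather than part of the notes above: the function name, the label 'progear', the use of one device name for both boot= and disk= (the text above shows /dev/hdc for disk=), and the explicit map= line, which I include on the guess that the map file also needs to live on the Progear drive rather than on the host.

```shell
#!/bin/sh
# Print a temporary lilo.conf for a Progear drive attached to another machine.
# Arguments (all supplied by you, nothing is guessed):
#   $1  device the Progear drive is attached to right now, e.g. /dev/hdb
#   $2  kernel image path as seen from the host, under the mount point
#   $3  root device as the Progear will see it once reinstalled, e.g. /dev/hda2
#   $4  map file path on the mounted Progear drive (assumption, see lead-in)
temp_lilo_conf() {
    cur_dev=$1
    image=$2
    root_dev=$3
    map_file=$4
    cat <<EOF
boot=$cur_dev
disk=$cur_dev
    bios=0x80
map=$map_file
image=$image
    label=progear
    root=$root_dev
    read-only
EOF
}
```

You would redirect the output to a scratch file of your choosing and then point lilo at it with `lilo -C` followed by that file name, leaving both regular lilo.conf files untouched.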

Now put the hard drive back in the Progear and try again. Watch out for all those rubbery bits on the ends of the case, and good luck getting the lid flap back in right. You may want to put off putting all the screws back in for a while; it is frustrating to be pulling them out again right away because of some lame oversight.

A.2 Stability Notes

In the middle of a lot of mucking around I once had such a long string of trouble that my confidence in the robustness of the hardware fell dramatically. I devised a number of elaborate strategies for reducing the likelihood of a problem causing a failure to boot, and for making it easier to recover should one happen anyway. In the meantime my trouble went away (hint: don't mount an ext3 partition as ext2 and then crash; this makes fsck highly likely to fail when it runs non-interactively during the next boot) and I only continue to use the first two of these ideas.

A.2.1 Compiled in Modules

As already mentioned, the real trouble with fsck failing comes when you don't have a usable keyboard, since you then can't run fsck manually to repair the drive. Keyboard drivers built as modules are not loaded until after the root filesystem is mounted, so even if you had the modules in an initrd or on another hard drive partition, the kernel would never get as far as loading them. Configure the kernel so that they are compiled in. This includes basic USB and HID support as well as the actual keyboard drivers. The keyboard will then be functional immediately after the USB bus is detected, though it won't be good for anything until a bit later.
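A quick way to check a kernel configuration for this is sketched below. The option names in KEYBOARD_OPTS are my assumption for a 2.6-era kernel and are certainly not complete: the USB host controller driver also has to be built in, and which one that is depends on the hardware.

```shell
#!/bin/sh
# Report which of the given options are NOT built in (=y) in a kernel .config.
# An option set to =m (module) or left unset is reported.
# The list below is an assumed starting point, not a verified Progear list.
KEYBOARD_OPTS="CONFIG_INPUT CONFIG_USB CONFIG_USB_HID"

not_builtin() {
    config=$1
    shift
    for opt in "$@"; do
        grep -q "^$opt=y\$" "$config" || echo "$opt"
    done
}
```

Running `not_builtin /usr/src/linux/.config $KEYBOARD_OPTS` (adjust the path to your kernel tree) should print nothing if everything listed is compiled in.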

A.2.2 Use a Journaling Filesystem

Journaled filesystems start up more gracefully after a crash. I have been using ext3 happily for the duration and have never had a problem that didn't repair itself non-interactively. Be careful, though: since ext3 partitions can be mounted as ext2, at some point you will want to do so, such as when booted into the factory Progear system, which by default has no support for ext3 filesystems. If the partition is then not unmounted cleanly, the result is worse than if it were just a plain ext2 partition. In my experience this always results in fsck failing during the next boot.

Figure out something to do about it: add ext3 support (compiled in or as a module) to the factory Progear system, unmount the partition before doing anything suspicious, or put shared data onto an ext2 partition. Or just expect trouble, which is what I do since I don't use the factory Progear system much.
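One more cautious option, sketched here as an assumption of mine rather than something I have used, is to refuse to mount the partition at all on a system without ext3 support. The function names are hypothetical, and the MOUNT variable exists only so the sketch can be tried with MOUNT=echo.

```shell
#!/bin/sh
# Succeed if the running kernel lists the given filesystem type.
# Note: a filesystem built as a module only shows up once the module is loaded.
fs_supported() {
    fstype=$1
    list=${2:-/proc/filesystems}
    grep -qw "$fstype" "$list"
}

# Mount as ext3 or not at all. Giving -t ext3 explicitly makes mount fail
# outright instead of ending up with the partition mounted as ext2.
mount_ext3_only() {
    dev=$1
    mnt=$2
    if fs_supported ext3 "${3:-/proc/filesystems}"; then
        ${MOUNT:-mount} -t ext3 "$dev" "$mnt"
    else
        echo "no ext3 support here, leaving $dev unmounted" >&2
        return 1
    fi
}
```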

A.2.3 Rescue Partition

The central problem when you can't boot the machine is that you cannot direct lilo to boot a different kernel, and you can't boot off of a rescue disk on external media. lilo -R lets you safely play with new kernel configurations, but eventually you will want to boot a new kernel by default so you don't have to remember to set it manually every time. At the same time you may be queasy about some unexpected problem coming up later that screws up booting (a broken libc, messed up init scripts so that login consoles are not started...) and that would not be an issue for the previous kernel, which is no longer booted by default.

To that end I have, in the past, kept lilo configured to boot by default from a small, statically linked kernel in a separate partition, and run lilo -R in the init scripts to set the next boot to the working kernel, provided it booted fine previously. This separate partition then acts as a rescue partition that only ends up being used when the working kernel failed to boot far enough to run its final init script, or its first shutdown script.

The first thing my shutdown scripts do is use lilo -R to set my usual fully functional kernel to boot next. Then if the system is not shut down cleanly for some reason, on the next boot lilo uses the rescue partition, which is otherwise never mounted and so should always stay usable.
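The line in the shutdown (or init) script amounts to very little. A minimal sketch, assuming the working kernel has the hypothetical label 'working' in lilo.conf; the LILO variable is only there so you can try the function with LILO=echo instead of touching the boot sector.

```shell
#!/bin/sh
# Ask lilo to boot the given image label on the next boot only.
# After that one boot, lilo goes back to its default entry, which in this
# scheme is the rescue kernel.
set_next_boot() {
    label=$1
    ${LILO:-/sbin/lilo} -R "$label"
}
```

In the script itself you would then call `set_next_boot working`, with your own label in place of 'working'.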

But maybe the shutdown script never ran last time just because X locked up, or the battery died. In that case nothing was really wrong with the working system and you would want to boot into it again. It would probably be fine to set the next boot to the full system at the very end of the boot sequence. By then all the login terminals, as well as networking, ftpd and sshd, will have been started, and about the only thing left that can fail is X. Though I suppose there might be some reason that you are unable to log in, in which case you would also be stuck even if the network has started. Maybe it should wait until after someone has logged in, or run from a cron job, or just wait 5 minutes and then do it, since something must have happened by then?
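The wait-five-minutes variant could be sketched as below. This is untested speculation in the same spirit as the paragraph above; the function name and the label passed to it are hypothetical, and LILO can again be set to echo for a dry run.

```shell
#!/bin/sh
# Select the working kernel for the next boot only after the system has
# stayed up for a while. Runs in the background so the boot is not held up.
delayed_next_boot() {
    label=$1
    delay=${2:-300}              # seconds; 300 is the "5 minutes" idea
    ( sleep "$delay" && ${LILO:-/sbin/lilo} -R "$label" ) &
}
```

Called as the last step of the boot sequence, e.g. `delayed_next_boot working`, it leaves the rescue kernel as the next boot if the machine dies within the first five minutes.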

In practice I don't consider it much of a burden to have to cycle through the rescue partition once if, for example, the battery dies while using the working system. For that case I have the shutdown scripts in the rescue partition set lilo to boot the working system next, in case I don't happen to have a keyboard around. My rescue kernel and system have no graphics, networking, etc., or at least are not set to start them up automatically.

Finally, if you don't want to wait around, or you find a forced manual power-down annoying, you can add something like 'append="panic=60"' to lilo.conf. Then, 60 seconds after a kernel panic during a failed boot, the machine reboots itself and you are automatically taken to the rescue kernel.

A.2.4 Mount root and /usr read-only

In principle the working system could be damaged during normal use by something like a mistyped command or even filesystem corruption. To guard against such things, I thought for a while about running with as much of the filesystem as possible mounted read-only. With everything in at least /sbin, /lib, /boot and /etc intact you should always be able to boot to a console, and if /usr is OK you should have the rest of the system.

This is a little difficult in practice, as /var and /tmp must not be read-only and some services even require write access in /etc (mtab?). I already had /var on a separate partition, but I decided that moving the rest around was too much work. It seemed that messing up the system was more likely when intentionally upgrading it, for which these same filesystems would need to be remounted read-write at least temporarily, than from any accident or damage that would be prevented by keeping them read-only by default.

A.2.5 X launch dependence on keyboard availability

I always wanted to arrange that X is started only if the keyboard is not plugged in. It is annoying to have to go through X when I just want to use a keyboard at the console, and it is fatal if I forget to make X start automatically again before I take the thing out without a keyboard.

Never got to this.
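For what it is worth, one way it might be approached is sketched below. It is untested on a Progear and rests on two assumptions: that the kernel exposes /proc/bus/input/devices, and that the keyboard reports a name containing "keyboard". Both function names are hypothetical.

```shell
#!/bin/sh
# Succeed if an input device whose name contains "keyboard" is registered.
# The device list file can be passed in; it defaults to the usual /proc path.
keyboard_present() {
    devices=${1:-/proc/bus/input/devices}
    grep -qi '^N: Name=.*keyboard' "$devices" 2>/dev/null
}

# Decide what an init script should start: prints "console" when a keyboard
# is found, "x" otherwise (including when the device list cannot be read).
choose_session() {
    if keyboard_present "$1"; then echo console; else echo x; fi
}
```

An init script could then start X only when `choose_session` prints "x". Falling back to X when in doubt matches the failure described above, where having no X and no keyboard is the fatal case.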