December 26, 2008 - Ubuntu Intrepid Ibex Server Installation Doesn't Boot
Background
I host several web sites for non-profit organizations in the Seattle area since I have fixed IPs and bandwidth to spare. Of course, it always turns out to be more hassle than I expect because the hardware we've been using hasn't really been up to the task of 24/7, year-after-year uptime and thrashing by hackers. (A hint to those planning to do something like this: Do not use consumer-rated hard drives for a server! You will regret it. Instead, go for industrial-strength SATA drives like the Western Digital Raptor, or SCSI drives. They will be more expensive and noisier but are designed to last longer under the constant stresses of use in a server.)
Details
Since our 1-year-old rack mount server with cheap IDE drives was failing in multiple ways, I bought a used SuperMicro 6013L-8 dual Xeon server with 1 GB of RAM, a Compaq SmartArray 431 hardware RAID controller and 3 SCSI drives. This machine is built like a tank. I'm just tired of debugging failing hard drives and recovering and reinstalling systems, and this hardware should be more up to the task. The failing system is primarily an email server, hosting several mailboxes and a dozen or so email lists for the Northwest Mahler Festival. (In an ideal world, it really shouldn't be under much of a load, but when you consider the constant barrage of spam and hack attempts we get, it is constantly busy.) So, I did a bit of research on which version of Linux would have a good combination of ease of installation and maintenance, good support for GUI-less install and remote admin, built-in support for hardware RAID and general Linux neophyte friendliness. I settled on Ubuntu 8.10 Server (Intrepid Ibex) for all those reasons and because it is based on Debian Linux, which we were already using on the dying server.
The installation documentation on the Ubuntu site made it sound pretty easy, so I downloaded the image and burned the CD-ROM. One piece of information that may be relevant is that the 3 SCSI hard drives are set up as a RAID-5 array using the SmartArray controller. Support for this controller is built in to the default Ubuntu kernel. The RAID array was configured as a single logical drive and was clean. I expected the next steps to be pretty smooth and at first, they appeared to be:
- Place the Ubuntu CD in the drive and boot - great, setup screens appear.
- Go through the installation process, selecting language, network settings, hardware detection, etc. This all seemed normal except for one thing: the hardware detection said it detected SATA RAID and asked if I wanted to enable it... but I have SCSI RAID. Hmmm. I don't have SATA RAID so I said "no." Moving on, it detected the array and allowed me to partition just as I expected and installed the LAMP and mail servers without issue.
- Reboot. The BIOS POST runs, then nothing. Just a blinking cursor.
Well, that was annoying, but I figured it must have been the SATA RAID thing, so I reformatted the drive and started over, choosing "yes" at the SATA RAID prompt. Same thing, no boot. I verified that the boot partition was flagged as bootable. I verified that the RAID array was marked as bootable in the BIOS... yes to both. I tried again and again. I tried different partitioning schemes (all in one, default, single boot partition, LVM for the rest, etc.). Same result every time. Of course, it gets more annoying with each try because it takes at least 30 minutes to run through the entire setup process.
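The bootable-flag check mentioned above can also be scripted from a live disk. What follows is only a minimal sketch in Python, not the procedure actually used here: it reads the standard MBR layout (four 16-byte partition entries starting at byte 446, and the 0x55 0xAA signature in the last two bytes) and reports which partitions are marked active. The function name inspect_mbr is made up for illustration, and the search for the text "GRUB" in the boot code is just a rough heuristic for whether GRUB's first stage was ever written there.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # the boot code occupies the bytes before this
ENTRY_SIZE = 16               # four entries of 16 bytes each


def inspect_mbr(sector):
    """Summarize the boot-relevant fields of a 512-byte MBR sector.

    inspect_mbr is a hypothetical helper name, not part of any tool.
    """
    if len(sector) < SECTOR_SIZE:
        raise ValueError("an MBR is 512 bytes; got %d" % len(sector))

    partitions = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        entry = sector[offset:offset + ENTRY_SIZE]
        status = entry[0]    # 0x80 means active (bootable)
        type_id = entry[4]   # 0x00 means the slot is unused
        start_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if type_id == 0:
            continue
        partitions.append({
            "number": index + 1,
            "active": status == 0x80,
            "type": type_id,
            "start_lba": start_lba,
            "sectors": sector_count,
        })

    return {
        "has_boot_signature": sector[510:512] == b"\x55\xaa",
        # Heuristic only: looks for the text "GRUB" in the boot code area.
        "mentions_grub": b"GRUB" in sector[:PARTITION_TABLE_OFFSET],
        "partitions": partitions,
    }
```

To try it, read the first 512 bytes of the disk as root, for example with open(device, "rb").read(512), where device is a placeholder for whatever name the RAID volume has on your system, and pass the result to inspect_mbr. A valid signature and an active partition with no sign of GRUB in the boot code would point toward the boot loader rather than the partitioning.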
I spent many hours digging through Google searches trying to find someone else with the same problem. I found many people begging for help with the issue and a few Linux experts and developers trying to help, but nobody who had SOLVED it.
Booting a Linux live disk let me mount and browse the partitions and verify that files had been copied to them, so it seems that this wasn't an issue with improper mounting of the RAID drive.
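Browsing the partitions by hand can be narrowed down to one question: did the installer copy the boot loader's own files? Ubuntu 8.10 uses GRUB legacy, which keeps stage1, stage2 and menu.lst in its grub directory. The sketch below is an illustration under assumptions, not something from the original troubleshooting: missing_grub_files is a made-up name, and the mount point is wherever you mounted the installed system from the live disk.

```python
import os

# Files GRUB legacy keeps in its directory on an installed system.
GRUB_LEGACY_FILES = ("stage1", "stage2", "menu.lst")


def missing_grub_files(mount_point, separate_boot=False):
    """Return the GRUB legacy files that are absent under mount_point.

    Hypothetical helper. Set separate_boot=True when mount_point is a
    dedicated /boot partition, where the files sit in <mount_point>/grub
    rather than <mount_point>/boot/grub.
    """
    if separate_boot:
        grub_dir = os.path.join(mount_point, "grub")
    else:
        grub_dir = os.path.join(mount_point, "boot", "grub")
    return [
        name for name in GRUB_LEGACY_FILES
        if not os.path.isfile(os.path.join(grub_dir, name))
    ]
```

An empty list only shows that the files are on disk. It says nothing about whether the first stage was written to the start of the drive, which is a separate step.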
At this point, I tried installing the normal, desktop version of Ubuntu 8.10. It installed and booted fine (although it also misidentified the RAID as SATA). But I don't want all the workstation, GUI and office applications that come with this version. I just want a bare-bones server installation with the tools I need. Anyway, this verified that the hardware is fine and that the SCSI RAID works and can be recognized and used by Ubuntu, but it left the question: What is different between installing and booting the server edition and the standard edition?
I never did find an answer to that question, but in my searching, I found someone who had an installation fail because the GRUB boot loader was not properly set up during installation. It sounded like a good possibility, so I dug into this one a bit. Several people recommended using the SuperGRUBDisk to repair the GRUB installation. Normally, the idea of going to some independent site and downloading tools to make a secure server boot would be anathema to me, but I couldn't get the Ubuntu install to work on its own. So, I downloaded the SuperGRUBDisk and tried it. The menus are cryptic (translated from the original Spanish geek talk to English geek talk, I think) but I was able to figure it out. With a few menu selections, I had installed the proper GRUB configuration, rebooted and was up and running. Problem solved!
So, it appears that the cause of the failure was that the Ubuntu 8.10 Server install did not properly configure the GRUB boot loader. Was it related to the incorrect SATA RAID hardware detection? Was it related to RAID at all? Is it specific to SuperMicro motherboards of a certain age? Why does installation of the desktop version succeed where the server version fails? I just don't know the answer to any of these questions, but the solution for me was to use the SuperGRUBDisk after the normal installation process to properly configure the boot loader.
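For anyone who would rather attempt the repair by hand, GRUB legacy (the version Ubuntu 8.10 uses) can reinstall itself from its own command shell with the root and setup commands. This write-up does not establish what SuperGRUBDisk actually changed, so treat the following as a rough sketch of a manual alternative rather than a description of what that tool does. The function name grub_legacy_install_script is invented for illustration, and the disk and partition numbers are assumptions you would have to adjust. GRUB legacy counts both from zero, so (hd0,0) is the first partition on the first disk.

```python
def grub_legacy_install_script(disk=0, boot_partition=0):
    """Build the text to type at a GRUB legacy 'grub>' prompt.

    Hypothetical helper. disk and boot_partition are zero-based, as in
    GRUB legacy's (hdX,Y) notation; boot_partition is the partition that
    holds the /boot/grub directory.
    """
    if disk < 0 or boot_partition < 0:
        raise ValueError("disk and partition numbers cannot be negative")
    lines = [
        # Tell GRUB where its stage files and menu.lst live.
        "root (hd%d,%d)" % (disk, boot_partition),
        # Write GRUB's first stage to the start of that disk.
        "setup (hd%d)" % disk,
        "quit",
    ]
    return "\n".join(lines) + "\n"
```

Calling print(grub_legacy_install_script()) shows the three lines for the simplest layout, with everything on the first partition of the first disk. If /boot is its own partition, pass that partition's zero-based number instead.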
One More Thing
I also tried the same OS install on a similar but not identical piece of hardware. The SuperMicro 6023P-8R also has dual Xeons but has 2 GB of RAM and an Adaptec ASR-2120S hardware RAID controller with 6 drives, again set up as a single RAID-5 array. I got exactly the same behavior on this system as on the one above. So, it's not directly related to the Compaq 431 controller.
Maybe by posting this rant, someone else will come across it in their own Google search and shorten their own frustrating search for a solution. By the way, if you do come across this post and have questions that do not relate to this specific situation - I don't know the answer. I am clearly not a Linux expert and don't know what may be ailing your setup. On the other hand, if you have questions or comments directly relating to this issue, feel free to contact me - see the home page for details.
Now for a Proper Ranting...
Coming from the Windows world, I expect the occasional software glitch, really I do, but I do expect every operating system installation to succeed at its most basic task - create a bootable system or tell the user exactly why it cannot. It seems to me that there is a systemic problem with open source software like Linux. Everyone wants to be a developer - "Look at me, I designed the FrackleGoober application!", "I coded the IPv6 protocol to enable the future of the Internet," or especially, "I patched the Kernel." It's sexy and prestigious and in the geek world, it's certainly more appealing to the opposite sex than expensive fragrances, sports cars or even pills to enlarge body parts. Testing, documentation and support, though... that's another story. Nobody is impressed by the guy who wrote the most up-to-date and clearly worded man pages or found that obscure bug in the installation scripts and wrote up a perfectly detailed bug report.
When you are spending weeks, months or your entire life working on software, there has to be some incentive to do it. In the world of commercial software, there are multiple incentives, but the biggie is usually the paycheck or some promise of riches down the road (stock options). So, even if it is not particularly sexy, people are working on the testing and documentation because they are paid to do it. Of course, the developers also have the cachet that goes along with directly creating the product, even if they are not paid, but there is nothing in the free software world to really encourage excellence in the supporting aspects of project development like testing and documentation. The open source movement would really help its cause if this could be addressed somehow. It won't happen with the current model - the incentives just aren't built into the system - but there must be a way to encourage excellence in the entirety of the software development process rather than just the act of writing code.
I would like to thank Adrian Gibanel Lopez, who wrote and maintains SuperGRUBDisk. It was very helpful!