CentOS 5.1 with software RAID 1 hard drives, and a nifty backup arrangement for the hard drive data

How I installed CentOS 5.1 with dual IDE hard drives in a software RAID 1 configuration, without requiring great expertise in Linux software RAID, or an already-operating machine to prepare the hard drives.  The same principles should work for SATA drives.  I also describe a backup system by which I copy the entire byte-for-byte contents of one drive of the pair to a backup drive.  That drive boots fine in any machine (subject to the need to manually configure new NICs, though I am contemplating a modification to kudzu to remove this requirement) and can easily have its contents copied to another drive, which becomes the second drive in the reconstituted system.  I also describe shell scripts which alert me by email to any failure in the RAID system.

Robin Whittle rw@firstpr.com.au 2008-05-17  (Typos of "hdb" fixed - should be "hdc" - 2008-10-24, thanks Ned!)   (2008-11-13: note on horde being insecure unless properly password protected.)

../ Back to the parent directory concerning web-mail, how I used this server for Postfix, Courier IMAPD, Courier Maildrop etc.

Introduction

I chose CentOS because it is a free version of Red Hat, and it is likely that Red Hat and CentOS will be providing security updates for this system for quite a few years.  My impression is that these systems have a much longer "update" life than Debian, which I run on my desktop Linux machine.  Also, my server in the USA, which hosts this site, runs CentOS.  It is at http://www.layeredtech.com - who support the CentOS project with advertising.  I have been very happy with this server since getting it in early 2008.

Hard drive redundancy is a good idea for any kind of server.  (However, these single-head ST380215A drives supposedly have a mean time between failures of 700,000 hours - about 80 years - and a failure rate of 0.34% per year.)

One approach is to use a hardware RAID controller board - but that introduces extra hardware dependencies, potential gotchas with software (Q: How could you tell remotely that the RAID system is failing, unless the board integrates well with your OS?  A: http://linux-ata.org/faq-sata-raid.html ) and of course the extra financial cost of the hardware.  Also, a computer without extra hardware consumes minimal power.  A RAID board, or perhaps RAID functions on the motherboard, might be better than software RAID, since maybe the machine will still boot if one of the drives is dead.  (This depends on how dead the drive is, and whether the motherboard will boot from /dev/hdc if it finds no drive at /dev/hda.)

I can take these two drives (or just one of them, or a clone of one of them) and plug them into any motherboard (subject to the kudzu_gotcha), on their two separate cables - and the server will come to life on that machine.  Also, it can boot with just one drive on /dev/hda: either the original /dev/hda drive or the original /dev/hdc drive boots fine on its own as /dev/hda.

I also have a way of taking the machine offline and cloning the entire contents of /dev/hda to another drive, which I temporarily plug into /dev/hdc.  That clone can be booted on its own, to become the server (again, subject to kudzu_gotcha).  This back-up approach should make for fast restoration of a server, compared to mucking around with backup media, reconstituting a system etc.

This system has worked well for me for years, with RH 7.2 since 2002.  I never had to recreate the server from a backup drive, but I did have one or two failed hard drives, without the server losing any data.


Documentation for software RAID

The installation information for Red Hat / CentOS software RAID is in Chapter 4 of the Deployment Guide at http://www.centos.org/docs/5/ .  This needs to be read in conjunction with the Installation Guide.

The discussion list for developers of Linux software RAID is: http://vger.kernel.org/vger-lists.html#linux-raid

Before about 2004, software RAID was managed with the raidtools set of software.  Since then, mdadm has been used: http://en.wikipedia.org/wiki/Mdadm .  mdadm is written by Neil Brown at the University of New South Wales (odd name for an Australian state) in Sydney: http://neil.brown.name/blog/front .  mdadm information can be obtained from "man mdadm" or from http://man-wiki.net/index.php/8:mdadm .

The best general information about Linux RAID - and Linux software RAID in particular - seems to be http://linux-raid.osdl.org/index.php/Main_Page .  Some of that material is still based on raidtools rather than mdadm.

In particular, please read http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk because it covers a similar scope to this page, and is written by someone who probably has a better understanding of RAID than I do.

Also, this might be useful: http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch26_:_Linux_Software_RAID.

The Linux software-RAID HOWTO http://tldp.org/HOWTO/Software-RAID-HOWTO.html was last updated in 2004 and I think it relates to the old raidtools system, rather than mdadm.

Installing CentOS 5.1 with Software RAID 1

I got two new 80G IDE Seagate drives and put them in removable drive bay caddies - each on its own cable to the motherboard (Asus CUSL2-C with a Pentium III 750MHz).  The second cable also goes to a third pluggable drive bay, and into that I plugged a CD-ROM drive.  This is all with IDE cables - but you should be able to do the same stuff with SATA, as far as I know.

So:

Cable 1:
    /dev/hda   "A" 80G HD as master

Cable 2:
    /dev/hdc   "B" 80G HD as master
    /dev/hdd   CD-ROM drive as slave

The CD-ROM was on an old slow IDE cable, so it probably made /dev/hdc run slowly, but that was fine.  I only plug in the CD-ROM drive for installation and for backing up /dev/hdc.  But beware that some hard drives and some CD-ROM drives may not work on the same cable.  A better approach would be to somehow boot into Linux Rescue mode from a USB memory stick or similar.  That wouldn't involve any extra cables.  With SATA motherboards, probably an old IDE CD-ROM drive would work fine, since there is no slowing down of the SATA cables to the hard drives.

I downloaded the 6 CD-ROM ISO files and burnt them to CDs.  (A single DVD might have done the trick too.) These are from a local mirror server I found at "http://www.centos.org > Downloads > CentOS-5 Isos".  (The small net-install CD requires access to a server with a complete set of files from the entire CD set or DVD, mounted as files, so I didn't use that - but perhaps I could have if I found a public site with the files available.)  Booting the first CD enables that and the other CDs to be tested before installation.  This is a good idea, since perhaps they won't read well on this particular CD-ROM drive.

Booting from the CDs I selected the Graphic Installer at first.  The Graphic installer does things which are not possible with the text mode installer.  (128M RAM was not enough for the graphic installer - 256M was fine.)  I wanted to use the graphic installer to create software RAID partitions on each drive and then stitch them together to be ext3 partitions, on which I would install my system.  The text mode installer doesn't do this, but I think it can do the stitching together.

In the end, I made my software RAID partitions in another way, and then used the graphic installer to stitch them together.  So I could probably have done the stitching with the text mode installer.

I did not use the graphic installer to create the basic Linux software RAID partitions, because (as with RH 7.2 years ago) I found it mucked around with the partition numbers and did not allow me to explicitly control the results.  It looks OK at first, but later, it started messing things up.

So here is what I did instead.

I used CD1 to boot into Rescue mode, by giving the command "linux rescue" after the blue graphic screen appears.  From there I ran fdisk and gave commands as detailed below.  That resulted in the correct partitions being made on /dev/hda.  Then I copied that partition table to /dev/hdc.  Then I used the graphic installer to stitch these together into ext3 partitions for the actual installation.

Here is the result I arrived at.  Details on how I did it follow.

A RAID      B RAID     ext3       Mount        Size 
partition   partition  partition  point        final 
                                               Kbyte
                                               blocks
/dev/hda1   /dev/hdc1  /dev/md0   /boot        200,781      Primary

/dev/hda2   /dev/hdc2  /dev/md1   /         10,008,495      Primary

/dev/hda3   /dev/hdc3                                       Extended (contains the logical partitions below).
/dev/hda4   /dev/hdc4                                       Primary, not used.

/dev/hda5   /dev/hdc5  /dev/md2   /home     20,008,957      Logical
/dev/hda6   /dev/hdc6  /dev/md3   (swap)     2,000,893      Logical
/dev/hda7   /dev/hdc7  /dev/md4   /var/log   2,000,893      Logical
/dev/hda8   /dev/hdc8  /dev/md5   /audio    43,913,646      Logical          


I have a separate /boot directory.  Maybe I don't need it with this motherboard, but some motherboards can't find the boot directory if it is beyond cylinder 1024.  I chose 200M on a whim, but I later read somewhere this was a good choice for /boot.  According to: http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk  /boot is readable by both the  grub and lilo boot loaders when it is on an ext3 partition created from one or more Linux software RAID partitions (type FD) - but only for RAID 1.  grub or lilo just read one such FD partition directly as if it was an ordinary ext3 partition.  They don't actually run the RAID software AFAIK.

Then I have the / directory, which I chose to put on a primary partition as well, for no particular reason.

I have a separate partition for /home, which is where my email will accumulate.

/var/log gets its own partition so that if things go wrong, the log messages won't grow into other parts of the file system.

I have a big partition at the end "/audio" for various things.  I wanted to keep the main partitions which matter (/ and /home mainly, but of course /boot , swap and /var also) near the outside of the disc.  Also, I think, limiting the size of each partition means that if errors appear on the drive in one location (which will cause the FD partition it is in not to be used any more by the RAID system) then this may limit the likelihood of losing important data on the partitions which matter.  I would have been happy to keep using 40G drives, but I wanted new ones and the 80G drives were the smallest I could get.

I chose a rather large swap partition - 2G.  This may be a bit excessive, since I am only running 256M of RAM, though this may be increased later.  Guidance on swap size is in Chapter 5 of the Deployment Guide.  However, various people have differing views on how best to choose the size of swap.

Having swap on a partition created by RAID 1 from two FD partitions, one on each drive, means the computer can keep running fine if one drive fails.


There can be zero to four primary partitions, numbered 1 to 4.  There can be one extended partition, which takes the number just beyond the highest-numbered primary partition used directly for data, so it can be number 1, 2, 3 or 4.  The extended partition (usually) extends all the way to the end of the drive.  Within it, multiple logical partitions are created, numbered from 5 upwards, with some limit on how many there can be (but I am not sure what).

I could have made partition 1 the extended partition and put everything in logical partitions from 5 onwards.

Initially, I make these ordinary Linux partitions in fdisk, but then I change them to be "Linux Software RAID" partitions of type FD.

The Barracuda 7200.10 ST380215A drive has 80,026,361,856 bytes.  (AUD$50 each, including 10% tax.)  It has a single head, which is good for reliability, I think.

The physical arrangement of the drive is not visible to the BIOS or operating system.  It presents itself to the OS as having 63 512-byte sectors per track, with 16 "heads", meaning 16 such tracks per "cylinder".

fdisk sees it as having 8,225,280 bytes per cylinder:

16,065 sectors x 512 bytes = 8,225,280 bytes = 8032.5 blocks (of 1024 bytes) per cylinder.

If I want about 10,000,000 blocks (just under 10GB, since a GB is 1024 MB and a MB is 1024 KB - "K" means 1024 and "k" means 1000), I divide 10,000,000 by 8032.5 and round up to the next integer number of cylinders.  Since these block numbers are going to be staring at me for years when I run "df" to get a report on how full the partitions are, I chose to make them numbers which look good in the 1K blocks df uses.
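
As a sanity check on this arithmetic, here is a sketch using bc.  The figures come from the geometry above; the / partition in the table above works out to exactly 1246 cylinders:

echo "scale=2; 10000000 / 8032.5" | bc    # 1244.94 cylinders - so round up
echo "1246 * 8032.5" | bc                 # 10008495.0 - the / partition's 1K block count in the table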

Here is a summary of what I did in fdisk:

fdisk /dev/hda

n         new

p         primary
1         Partition 1
1         first cylinder
25        end in cylinders         200,781 blocks /boot
p         print partition table

n         new
p         primary
2         Partition 2
enter     default start cylinder
+1245     size in cylinders     10,008,495 blocks /
p         print partition table

n         new
e         extended
3         Partition 3 will contain the rest as logical partitions  
enter     default start cylinder
enter     default size in cylinders    = rest of disk.
p         print partition table

n         new
l         logical  (creates /dev/hda5)
enter     default start cylinder
+2490     size in cylinders    20,008,957 blocks /home
p         print partition table

n         new
l         logical  (creates /dev/hda6)
enter     default start cylinder
+249      size in cylinders    2,000,893 blocks (swap)
p         print partition table

n         new
l         logical  (creates /dev/hda7)
enter     default start cylinder
+249      size in cylinders    2,000,893 blocks /var/log
p         print partition table

n         new
l         logical  (creates /dev/hda8)
enter     default start cylinder
enter     The rest of the disk  43,913,646 blocks /audio
p         print partition table

w         Writes all this to the drive.

Now to edit the partition types and make them all Linux software RAID, type FD:

fdisk     /dev/hda

t         Change partition type
1         Partition 1
fd        Type FD - "Linux raid autodetect"

Repeat for partitions 2, 5, 6, 7 and 8.

w         Writes all this to the drive.

This was labour-intensive, but I got exactly what I wanted.

Now to copy this partition table to /dev/hdc:

sfdisk -d /dev/hda > blah
sfdisk /dev/hdc < blah

Then check that /dev/hdc has the right partitions:

fdisk /dev/hdc  
p
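
A quicker check is to dump both partition tables and compare them - a minimal sketch, normalising the device names with sed so identical tables produce no diff output:

sfdisk -d /dev/hda | sed 's/hda/hdc/g' > blah1
sfdisk -d /dev/hdc > blah2
diff blah1 blah2      # no output means the two partition tables match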


Now, I could run the Graphic Installer (CD1).  At the point where it presents the two drives on which the system can be installed, there is a pull-down list (or whatever) which only shows one option, but which can be altered to select:

Create Custom Layout

Do not tick Review and modify partitioning layout.

At the next screen, there is a RAID button.  I used that to join two FD (Linux Software RAID) partitions into one partition as listed in the table above.

In each case, there is a list of all the FD partitions.  I need to deselect all but the two I want to join.  I select RAID 1 for all these, not the default RAID 0.

First, I create /dev/md0, from /dev/hda1 and /dev/hdc1, with a type of ext3 and a mount point of /boot.

Then, /dev/md1, from /dev/hda2 and /dev/hdc2, with a type of ext3 and a mount point of /.

Then, /dev/md2, from /dev/hda5 and /dev/hdc5, with a type of ext3 and a mount point of /home.

Then, /dev/md3, from /dev/hda6 and /dev/hdc6, with a type of swap and no mount point.

Then, /dev/md4, from /dev/hda7 and /dev/hdc7, with a type of ext3 and a mount point of /var/log.

Then, /dev/md5, from /dev/hda8 and /dev/hdc8, with a type of ext3 and a mount point of /audio.

From this, I could proceed to the rest of the installation process.
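
For reference, the same arrays could have been built from the command line with mdadm instead of the installer.  A minimal sketch - not what I actually did, since the installer also created the file systems, mount points and fstab entries for me:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
mke2fs -j /dev/md0      # -j adds a journal, i.e. ext3
# ... similarly for md1, md2, md4 and md5, then the swap array:
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/hda6 /dev/hdc6
mkswap /dev/md3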

See below for my backup system for /dev/hda.


Main installation


I won't comment on most of the installation, since I don't have any special insights except perhaps the following:

I selected the use of the CentOS Extras repository.  I don't know much about these, but it enabled me to install the Horde IMP webmail program.  

Note: 2008-11-13: See below on how this is insecure unless you password protect the /horde/ directory.  Anyone can run command line commands as user apache!  

I want this machine to run X Windows, so I can run Thunderbird for email access when my desktop machines are down - such as during a power failure, when this server runs on its UPS.

Normally, this server is not used directly with a keyboard or screen, but I have a Logitech Trackman, keyboard and screen connected to it at all times.

The machine will later become gair.firstpr.com.au.  For now, it is nair.firstpr.com.au and like the other machines here (apart from gair) has a private address on the LAN.

The old gair (actually, another PC running the old gair's hard drives, since I am doing all this on the PC for the past and future gair) is connected to the Net via a fixed IP address ADSL service via eth1.  Its eth0 is for the LAN, in which it has a fixed address 10.0.0.1, and other machines and printers have their own fixed addresses.  I am not using WiFi, DHCP etc.  It is just Ethernet cables and two switches.

Initially I make eth0 on nair have its own private address.  Once I have got most things installed, I will turn off the ADSL connection (to stop new emails arriving) and then copy over all the email and other data from (old) gair to nair.  Then I will change nair into gair, with a new name and with the 10.0.0.1 IP address, taking the old gair off the system.  Then I need to configure eth1 to do the ADSL stuff, and configure a firewall, NAT etc.  Then, the new machine will be the new gair.

Most of this installation doesn't need an Internet connection, but I put the new machine on the LAN so it could get other packages, such as Horde IMP.

The system installs by default with a kernel which supports the Xen virtualization approach to running multiple operating systems at the same time.  I don't ever want to do this, so it would probably be best not to use Xen.  I understand that to stop using Xen, I can install a non-Xen kernel.

Here are some rough notes on how I selected packages for installation.  

Install GRUB on /dev/md0.  The default operating system is in /dev/md1 (the / file system).
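
(A note on the second drive's MBR: the whole-drive clone procedure below copies the MBR too, so a cloned drive carries the boot loader.  If a drive ever needs GRUB written to its MBR manually, the commonly documented approach for RAID 1 - a sketch from the grub shell, assuming /boot is the first partition - is:)

grub
grub> root (hd1,0)     # the /boot partition on the second BIOS drive
grub> setup (hd1)      # write GRUB to that drive's MBR
grub> quit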

eth0   10.0.0.2  255.255.255.0

Enable IPv6 support?  Yes - automatic neighbour discovery.

eth1   Disable for now.

nair.firstpr.com.au

Gateway 10.0.0.1
Primary DNS 10.0.0.1

Tick System clock uses UTC.

Select packages:

Desktop Gnome
Server
Server GUI

CentOS Extra?   Yes.  See:      http://wiki.centos.org/Repositories

CentOSPlus is already included.

Customize now.

Applications
Graphical Internet
Turn on GFTP (SFTP file transfer to my server in the USA), Firefox and Thunderbird. Turn others off.

No games, engineering, graphics, office productivity, sound, video etc.

Text based Internet:
Keep:
elinks (text mode browser)

Development
Dev libraries - turn on, keep defaults
Likewise Development tools
All of Legacy Software Development

Servers

DNS
No FTP
Legacy network servers - only xinetd

Mail Server
Turn off sendmail - in order to use Postfix
Keep only:
Postfix
SpamAssassin
Squirrelmail  (I later installed it separately and am happy with it - I haven't yet documented how I installed it.)
Network servers
dhcp only
Turn off MySQL and select PostgreSQL.  I previously determined that PostgreSQL was better (but I am a complete database newbie).  I figure any program which only works with MySQL and not with PostgreSQL is probably suspect.  See: http://www.google.com/search?q=%22MySQL+vs.+PostgreSQL . http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL etc.

PostgreSQL - defaults

Base System
Admin tools - defaults
Base
Turn on AIDE - intrusion detection.  (I used to use Tripwire.  It was difficult to set up, but I thought it was good.)

Turn off sendmail and wireless tools.
Legacy software support - defaults

System tools - turn on, but turn these off:
Off - the bluez-* Bluetooth packages
(What is hwbrowser?  A GUI thing to view hardware devices.)
On - mc  Midnight Commander  (I know the real geeks get by with pure command-line control, but I like Midnight Commander.)

X-Window - On with defaults

CentOS Extras  

2008-11-13: Turning on Horde was a mistake, because I did not realise the /horde/ directory enables anyone to run command line commands as user apache.  For months, I was getting hit by people downloading a tgz file into my /tmp/ directory, unzipping it and then running a bunch of ssh-scan programs, peppering computers all over the Net with attempts to get in, using the lists of usernames and passwords in the tgz file.  If you use Horde, make sure there is password protection on the /horde/ directory.  Attackers can also run PHP commands.  It is completely insecure out of the box.
Turn on Horde, but only for gollem (web based file management) and Imp


The installation went fine and I rebooted the machine.  It is running X-windows at 1024 x 768 (with an old Trident 4M AGP board, and an obscure 1024 x 768 LCD monitor).

For now, I turned off the firewall.  There is no need for a firewall on eth0 later, since this is the LAN.  I will need a firewall on eth1 - the ADSL port, but will do that later.

SELinux:  Keep the default "Enforcing".  This turned out to be a bad idea.  I don't have time to figure out how SELinux works, and it later screwed up the Courier IMAPD server, so I later turned it off.

Kdump: I was going to reserve 64 Megs of RAM for this kernel dump facility.  I do want to know what happened if the system ever crashes that badly.  However, there was a warning message, to do with Xen (virtualization) and other complexities, which I didn't have time to think about.  It turned out that with 256M RAM there was not enough space anyway, so Kdump stayed disabled.

Date and Time . .  Network Time Protocol.  NTP is great.  I enabled it and selected some local time servers, from  http://support.ntp.org/bin/view/Servers/NTPPoolServers  http://www.pool.ntp.org/zone/oceania :

0.oceania.pool.ntp.org
1.oceania.pool.ntp.org
2.oceania.pool.ntp.org
3.oceania.pool.ntp.org

Turn off "Use Local Time Source".

After adding a non-root user, the machine rebooted into X Windows and I was able to log in with that username.

Within a minute, a notification appeared saying: "Updates available - There are 44 package updates available."

I installed them all - but at first there was no sign the process was actually running.  After a few minutes, it displayed a box: "Resolving dependencies for updates".  This is a 750 MHz machine, so these things take a while.  Then it started downloading packages.  System > Administration > System Monitor showed network activity.


[I did document more of the installation process, but lost that version of the file.  I can't remember all that happened, but it went smoothly from here, and I was able to boot the machine, have X Windows run etc.]


Cloning: Whole Hard Drive backup with Software RAID 1

The purpose of this technique is to have three or so spare drives (all, ideally, the same model), ideally stored offsite, each of which is made to be a bit-for-bit clone of /dev/hda.

Then, that drive can be plugged into any motherboard, to become the server (subject to the kudzu_gotcha).  It will complain about not having its /dev/hdc, so all the partitions will be complaining of RAID failure, but the machine will run fine.  Making a /dev/hdc drive for it to recreate all the data onto could be a little tricky, but later I investigated and found that by cloning /dev/hda to /dev/hdc, the system seemed to work fine with both sets of partitions working as normal with software RAID 1.

The idea is to boot from CD-ROM (or a USB Flash drive - that would not slow down the second IDE cable) and use a simple cp command to copy all bytes from one drive to the other.

Here is what I do:

Shut the machine down ("shutdown -h now" or "reboot -f") and turn off the power.

Remove the drive from /dev/hdc and plug in another drive there which will be the backup.

Plug in a CD-ROM drive as noted above for installation, and boot from CD1 of the CentOS 5.1 set.

When this boots, give the command "linux rescue" and then take the following steps, which result in a single-user mode shell environment, in which the hard drives are not mounted to the file system, and in which the software RAID process is not running.  The drives are physical devices and can be read and written as a linear file.

Choose the language (English) and the keyboard (US).

Do not turn on the network connection - No.

The next step is not so obvious.  Do not use "Continue" or "Read-Only".  Use "Skip".  This gives a shell in single user mode: no software RAID running, and no mounting of the hard drives' data into the current file system (that is what "Continue" does) - just a shell, with no other processes running, and the two drives appearing as files at /dev/hda and /dev/hdc.

Now, copy the contents of one drive to the other: partition table, RAID partitions etc. all together:

cp /dev/hda /dev/hdc

This took about 50 minutes for my 80G drives.  It would probably be faster if I didn't have the old 40 wire ATA cable slowing down the 80 wire Ultra-ATA cable to /dev/hdc.
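
An alternative sketch does the same copy with dd, which lets you choose a larger block size (this may be faster than cp's default reads):

dd if=/dev/hda of=/dev/hdc bs=1M    # copies MBR, partition table and all partitions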

Then, reset the machine or turn it off and take out the drive in /dev/hdc.  Replace the original /dev/hdc drive.

The machine will now reboot in RAID 1 mode and operate as usual.


In this entire backup procedure, the RAID software has not run.  Therefore the counters in the superblock of each RAID FD partition have not been altered, and the drives boot as they would have before - so the machine returns to normal operation as if nothing happened.

(If I had booted the machine normally (not single user rescue mode) with just the /dev/hda drive, the counters in some or all of the superblocks would be higher than the counters in the corresponding superblocks in /dev/hdc.  Then, rebooting normally with the two drives would cause the data in those partitions in /dev/hda to take precedence over the data in those partitions in /dev/hdc.  The RAID system would automatically copy the /dev/hda data to /dev/hdc for those partitions.)
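
(To inspect those counters, mdadm can read a member partition's RAID superblock directly - a sketch:)

mdadm --examine /dev/hda1 | grep -i events    # the event counter for this member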


System restoration from a single drive


Not counting the kudzu_gotcha, here is how I restore the server starting from a single drive.
In all these cases, I have a drive which has the complete set of current data.  I may be restoring the system on the same motherboard, or on another machine.  This doesn't cope with having only two drives, where one drive has a failure in one or more partitions and the other has failures on some others.  (I think the best way to handle that is to mount each drive on some other system which can do software RAID - one which won't be confused by these FD partitions and accidentally stitch them into an already-existing ext3 partition - and then read the data from them.)

All these procedures work fine, with absolute minimal fuss:
  1. Use just one drive, plugged into /dev/hda.  It will boot and the machine will run fine - but without any RAID backup if something goes wrong.
  2. Use the backup procedure to clone the copy of the good drive over to a second drive, and then boot with them both.  At least for me, the system is perfectly happy and is running with full RAID 1 protection.  With my setup at least, it doesn't matter whether the drive I use with the data was in /dev/hda or /dev/hdc, or a clone of either.
This makes for system restoration with minimal reading of the complex software RAID documentation!

However, see the next section for some nasty gotchas if you run the drive with different Ethernet boards than those of the original system.



Kudzu Gotcha: Running the system with different NICs

Please see a separate page "kudzu-mods" from the parent page ../ for how I planned to modify kudzu to solve this problem.  However, I have not yet done this.  That page has a detailed analysis of some parts of the kudzu source code.

If I change one or both of the two Ethernet NICs in this machine, this gotcha applies.  "Change" means a different type of card, or the same type but a different physical unit - and therefore with a different Ethernet MAC address.  (According to my understanding of kudzu, based on reading the source code: if I kept the same two NICs A and B, previously eth0 and eth1, and plugged them into different slots so the kernel discovered them in a different order, kudzu would swap over the configuration information so they still functioned as A = eth0 and B = eth1, despite B being discovered first and so otherwise being known as eth0.  I am not sure if this is what it is supposed to do, or how useful it would be.)

Let's say I have two perfectly good files:

/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/network-scripts/ifcfg-eth1

which work with particular cards A and B, where A is in a lower-numbered PCI slot than B.  (How are these things ordered on a motherboard with an on-board NIC?  I guess the on-board one is found by the kernel before any card plugged into a PCI slot.)

The machine runs fine, since these files were created by the installer, or were manually edited by me.

However, if I replace one or both A and B with different cards, or if I take this hard drive data (in the drive, or two RAID 1 drives, or in clones of them) and boot it on a machine which does not have the exact cards A and B in the same order in its PCI arrangement, then kudzu will do some stuff which in these circumstances is not helpful.

It renames ifcfg-eth0 (or ifcfg-eth1) to ifcfg-eth0.bak (or ifcfg-eth1.bak), overwriting any existing file of that name.  Then it creates a new ifcfg-eth0 file, set up to be a DHCP client.  It tries for a while during boot-up to get this going - and if DHCP is what we want, then I guess there is no problem.  However, it is not what I want.

In my setup, eth0 is for the LAN, and this machine has the fixed IP address 10.0.0.1 on it.  (I might also make it the DHCP server.)  So kudzu's attempt to make it work as a DHCP client is pointless.  Now, if I restart the machine with the old board A as eth0, it still won't work, because my original ifcfg-eth0 file has been renamed to a backup file.  Probably this time even that backup will be lost, overwritten by a copy of the current DHCP-mode ifcfg-eth0 file!

So this is not just unhelpful, it is data loss - loss of vital configuration information.

According to some things I read (oops - where was it?), this action occurs at boot time when kudzu finds a device which is not mentioned in /etc/sysconfig/hwconf.  That file contains details of the devices kudzu found the last time the machine booted.

If it is impossible to boot, I use Linux Rescue as discussed below to edit out the sections of /etc/sysconfig/hwconf which mention NICs from the past, and to get rid of any /etc/sysconfig/network-scripts/ifcfg-ethx files I don't want, probably all of them.


When I have set the machine up with good ifcfg-ethx files, I make copies of them in the /root directory as good-ifcfg-eth0 and good-ifcfg-eth1, so I can manually copy them back to ifcfg-ethx if I need to (such as after a boot which gives me console access, or with Linux Rescue).

So bringing the machine up with different NICs, or booting a copy of this machine's hard drive in another machine with different NICs will require some careful work to get the machine to boot correctly.  The initial attempt will clobber the eth0 setup, making it unworkable.  My eth1 setup is minimal.  The ifcfg-eth1 only tells the machine to enable it at boot time.  It doesn't specify DHCP or anything else - and in my case, since the ADSL modem has a DHCP server, eth1 certainly should not be running a DHCP client.  eth1 is only used by ppp0, and all that is needed is that the kernel installs the driver for the NIC.  The software which is part of the rp-pppoe package handles everything else.  

This would be a tricky thing if the server was not accessible via keyboard and screen - manually copying files and rebooting the machine after the initial attempt fails to get eth0 going.  If administrative access to the machine was via eth0, then this would be impossible.  Since many machines use an Ethernet interface on the motherboard, running the disk image on another motherboard would cause this problem.  How then to copy the files, if access via the network was the only approach?

What if some screwy stuff with NICs and these ifcfg files prevents the machine from booting sufficiently to run a terminal session?  Then it would be necessary (after this boot attempt presumably writes the current NIC details to /etc/sysconfig/hwconf) to boot from CD-ROM with the first disc of the set, and use Rescue to fix the problem.  After typing "linux rescue" at the blue CD-ROM boot screen, and selecting the language and keyboard type, it might be best to allow it to start the network interfaces, perhaps without any cables plugged into them:  

Ah-ha . . . assuming these interfaces are currently "unconfigured" (as they will be, since the rescue system has not looked at the hard drives yet), we can see what type each NIC is, with the Edit function.  I have multiple similar-looking NICs which appear physically identical to the "Intel Corporation 82557/8/9 Ethernet Pro 100" (except for serial number stickers), but which are labeled as IBM, HP or whatever.  They are all recognised as being the one type by this Linux system.

Maybe I don't want to configure these NICs to do anything right now, but it is good to see that the system (the rescue system at least) recognises them.

Now the Rescue system asks if I want to Continue (it will try to mount the hard-drives' file system under /mnt/sysimage) or to do this "Read-Only".  (Skip is what I use when copying the entire drive's contents, as documented in the software RAID 1 page via the parent directory ../ , but now I want to mount the file system and edit some files.)  I don't want "Read-Only" since I want to change some files.

This seems to work OK with software RAID 1.  It mounts the / directory from the hard drives at /mnt/sysimage/ - though not the other partitions mentioned in /etc/fstab.  This is single user mode, and I can't run Midnight Commander (mc) because it gets a segmentation fault.  I can run joe, so I can edit files.

My general procedure for coping with new NICs is, after installing them:

1 - Delete all the files /etc/sysconfig/network-scripts/ifcfg-ethx (see the sketch after this list).  However, I keep my backup files, such as good-ifcfg-eth0, in /root or somewhere else outside this directory.  Any file whose name starts with "ifcfg-eth" in the /etc/sysconfig/network-scripts/ directory, or any subdirectory of it, will be seen by the boot-up scripts, which may then try to set up eth0, eth1 etc. more than once.

2 - Edit /etc/sysconfig/hwconf to remove the sections concerning NICs.

3 - Exit the rescue system (exit) and turn the machine off.

4 - The new NICs are now installed, but not connected to anything yet.

5 - Reboot the machine.  If the boot does not get far enough for me to edit the files manually, force a restart with "reboot -f" and do the following:

6 - Reboot with Linux Rescue and edit the ifcfg-ethx files, based on backup files.  Power down.

7 - Reboot and hopefully the system will run normally.
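
Here is a minimal sketch of steps 1 and 2 as shell commands, run from the Linux Rescue shell after choosing "Continue", so that the installed system is mounted at /mnt/sysimage (adjust the paths if your rescue environment differs):

cd /mnt/sysimage/etc/sysconfig/network-scripts
cp ifcfg-eth0 /mnt/sysimage/root/good-ifcfg-eth0   # keep backups outside this directory
cp ifcfg-eth1 /mnt/sysimage/root/good-ifcfg-eth1
rm -f ifcfg-eth*                                   # step 1: remove all ifcfg-ethx files
joe /mnt/sysimage/etc/sysconfig/hwconf             # step 2: delete the sections describing the old NICs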



See the page (from ../) "CentOS 5.1 Miscellaneous configuration items" for some samples of the ifcfg-ethx files I used.

Reporting RAID failures


There is probably a better way of doing this, but it works for me.
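
(One probably-better way is mdadm's own monitor mode, which can email a nominated address when an array degrades.  A minimal sketch - the address is a placeholder:)

# In /etc/mdadm.conf:
MAILADDR xx@yy.zz

# Then run the monitor as a daemon.  (I believe the mdadm package also
# ships an "mdmonitor" init script which does much the same.)
mdadm --monitor --scan --daemonise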

I want to get an email if there is a failure of some RAID partition.  For instance, if there is an unrecoverable read error in one of the two FD partitions which are used as a RAID 1 pair to create an ext3 partition (or the swap partition!) then I want to know about it.  The computer will run fine with a single such failure.

The principle is to have an hourly cron job (maybe it should run more often than this) which looks in the output of "cat /proc/mdstat" for any underscore characters.

Normally, there are none, since it looks like this:

Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      200704 blocks [2/2] [UU]

md2 : active raid1 hdc5[1] hda5[0]
      20008832 blocks [2/2] [UU]

md3 : active raid1 hdc6[1] hda6[0]
      2008000 blocks [2/2] [UU]

md4 : active raid1 hdc7[1] hda7[0]
      2008000 blocks [2/2] [UU]

md5 : active raid1 hdc8[1] hda8[0]
      43913536 blocks [2/2] [UU]

md1 : active raid1 hdc2[1] hda2[0]
      10008384 blocks [2/2] [UU]

unused devices: <none>

However, when one or two FD partitions have unrecoverable read errors, the lines look like this:

      10080384 blocks [2/1] [_U] 
      10080384 blocks [2/1] [U_] 
      10080384 blocks [2/0] [__] 

I create a file raid-test.sh in /etc/cron.hourly:

#!/bin/bash
/root/raid-hourly-test.sh

Then the test script is in /root/raid-hourly-test.sh:

#!/bin/bash
# For debugging, comment the next line and doctor the file to be bad.

cat /proc/mdstat > /root/mdstat-for-test.txt

# I want to look for any occurrence of underscore, except for the one
# in "read_ahead".
#
# One way is to look for the word "block" and then for any underscore
# which follows, but I can't see how to do that with grep.
#
# Instead, looking for any character not "d" followed by an underscore.

# Quote the pattern so the shell cannot expand it as a filename glob.

if grep '[^d]_' /root/mdstat-for-test.txt > /dev/null; then

   echo "Mailing RAID hard disk error report!!!"
   /root/raid-mail-problem.sh
fi
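
To test this, comment out the "cat /proc/mdstat" line in the script above, as its own comment suggests, and doctor the file by hand - a sketch:

sed -i 's/\[UU\]/[U_]/' /root/mdstat-for-test.txt   # fake a failed member
/root/raid-hourly-test.sh                           # should now send the report email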

The other script is called when there is a problem: /root/raid-mail-problem.sh:

#!/bin/bash
#
# Generate an email to me reporting a problem with hard drives
# in the RAID system.
#
# Include the current /proc/mdstat and some recent lines from
# /var/log/messages.  Assemble the email in a file.

echo "To: xx@yy.zz"                                          > /root/mdstat-for-email.txt
echo "Subject: !! Hard drive failure in xxxx RAID array !!" >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt

cat /proc/mdstat              >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt

echo "Some recent /var/log/messages lines:" >> /root/mdstat-for-email.txt
echo                          >> /root/mdstat-for-email.txt
tail -n 50 /var/log/messages  >> /root/mdstat-for-email.txt

# -t makes sendmail read the recipient from the To: header above.
sendmail -t < /root/mdstat-for-email.txt

The sendmail command is the one which comes with Postfix - not the ordinary sendmail MTA program.  Without -t (or a recipient on the command line) it would have nowhere to deliver the message.
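
Note that the scripts in /etc/cron.hourly are run by run-parts, which silently skips files that are not executable, so all three scripts need execute permission:

chmod +x /etc/cron.hourly/raid-test.sh /root/raid-hourly-test.sh /root/raid-mail-problem.sh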



Some useful RedHat Package Manager query commands

rpm -ql packagename > list.txt   Lists all the files in a currently installed package - there is no need to use the .rpm extension or the version number.  So if the package's full file name is wget-1.8.2-4.72.i386.rpm, just use wget.

rpm -qpl packagename.rpm > list.txt   Lists all the files in an RPM file, irrespective of whether it is installed.  Use the full file name or, for instance, a shortened version such as wget* .

rpm -qa > rpmlot.txt   Lists the names of all installed packages, I assume in order of their installation.

rpm -qa | sort > rpmlot.txt   Ditto alphabetically sorted.

rpm -qal > rpmlotfiles.txt   Lists all the files of all installed packages.  There is no clear indication of which package each file belongs to - but that is easy to find with:

rpm -qf file-name   Lists the package a file belongs to.

Conclusion

The above account explains how I got my basic server software installed.

Please see the CentOS-misc-config page (via the parent directory ../) for a bunch of configuration items I did before tackling the mailserver stuff.