CentOS 5.1 with software RAID 1 hard drives, and a nifty backup arrangement for the hard drive data
How I installed CentOS 5.1 with dual IDE hard drives in a software RAID 1 configuration, without requiring great expertise in Linux software RAID, or an already operating machine to prepare the hard drives. The same principles should work for SATA drives. I also describe a backup system by which I copy the entire byte-for-byte contents of one drive of the pair to a backup drive. That drive boots fine in any machine (subject to the need to manually configure new NICs, though I am contemplating a modification to kudzu to remove this requirement) and can easily have its contents copied to another drive, which becomes its second drive in the reconstituted system. I also describe shell scripts which alert me by email of any failure in the RAID system.
Robin Whittle   rw@firstpr.com.au   2008-05-17
(Typos of "hdb" fixed - should be "hdc" - 2008-10-24, thanks Ned!)
(2008-11-13: Note on Horde being insecure unless properly password protected.)
../   Back to the parent directory concerning web-mail and how I used this server for Postfix, Courier IMAPD, Courier Maildrop etc.
Introduction
I chose
CentOS because it is a free version of Red Hat, and it is likely that
Red Hat and CentOS will be providing security updates for this system
for quite a few years. My impression is that these systems have a
much longer "update" life than Debian, which I run on my desktop Linux
machine. Also, my server in the USA, which hosts this site, runs
CentOS. It is at
http://www.layeredtech.com
- who support the CentOS project with advertising. I have been
very happy with this server since getting it in early 2008.
Hard
drive redundancy is a good idea for any kind of server.
(However these single head
ST380215A
drives supposedly have a mean time between failures of 700,000 hours -
80 years - and a failure rate of 0.34% per year.)
One approach
is to use a hardware RAID controller board - but
that introduces extra hardware dependencies, potential gotchas with
software (Q: How could you tell remotely the RAID system is failing
unless
the board integrates well with your OS? A:
http://linux-ata.org/faq-sata-raid.html
) and of course the extra financial cost of the hardware. Also, a
computer without extra hardware consumes minimal power. A
RAID board, or perhaps RAID functions on the motherboard might be
better than software RAID, since maybe the machine will still boot if
one of the drives is dead. (This depends on how dead the drive is, and whether the motherboard will boot from /dev/hdc if it finds no drive at /dev/hda.)
I can take these two drives (or just
one of them, or a clone of one of them) and plug them into any
motherboard (subject to the kudzu gotcha described below), on
their two separate cables - and the server will come to
life on that machine. Also, it can boot when one drive is on
/dev/hda. Either the original /dev/hda or the original /dev/hdc
drive boots fine on its own in /dev/hda.
I also have a way of
taking the machine
offline and cloning the entire contents of /dev/hda to another drive,
which I temporarily plug into /dev/hdc. That clone can be booted
on its own, to become the server (again, subject to the kudzu gotcha described below). This back-up approach
should
make for fast restoration of a server, compared to mucking around with
backup media, reconstituting a system etc.
This system has
worked well for me for years,
with RH 7.2 since 2002. I never had to recreate the server from a
backup drive, but I did have one or two failed hard drives, without the
server losing any data.
Documentation for software
RAID
Installing
CentOS 5.1 with Software RAID 1
I
got two new 80G IDE Seagate drives and put them in removable drive bay
caddies - each on its own cable to the motherboard (Asus CUSL2-C with a
Pentium III 750MHz). The second cable
also goes to a third pluggable drive bay, and into that I plugged a
CD-ROM drive. This is all with IDE cables - but you should be
able to do the same stuff with SATA, as far as I know.
So:
Cable 1:
    /dev/hda   "A"  80G HD as master

Cable 2:
    /dev/hdc   "B"  80G HD as master
    /dev/hdd   CD-ROM drive as slave
The CD-ROM was
on an old slow IDE cable, so it probably made /dev/hdc run slowly, but
that was fine. I only plug in the CD-ROM drive for installation and for the drive-cloning backups described below. But beware that some hard drives and
some CD-ROM drives may not work on the same cable. A better
approach would be to somehow boot into Linux Rescue mode from a USB
memory stick or similar. That wouldn't involve any extra cables.
With SATA motherboards, probably an old IDE CD-ROM drive would
work fine, since there is no slowing down of the SATA cables to the
hard drives.
I downloaded
the 6 CD-ROM ISO files and burnt them to CDs. (A single DVD might
have done the trick too.) These came from a local mirror server I found via http://www.centos.org > Downloads > CentOS-5 ISOs. (The small net-install CD requires
access
to a server with a complete set of files from the entire CD set or DVD,
mounted as files, so I didn't use that - but perhaps I could have if I
found a public site with the files available.) Booting the first
CD
enables that and the other CDs to be tested before installation.
This is a good idea, since perhaps they won't read well on this
particular CD-ROM drive.
Booting from the CDs I selected the
Graphic Installer at first. The Graphic installer does things
which are not possible with the text mode installer. (128M RAM
was not enough for the graphic installer - 256M was fine.) I
wanted to use the graphic installer to create software RAID partitions
on each drive and then stitch them together to be ext3 partitions, on
which I would install my system. The text mode installer doesn't
do this, but I think it can do the stitching together.
In the
end, I made my software RAID partitions in another way, and then used
the graphic installer to stitch them together. So I could
probably have done the stitching with the text mode installer.
I
did not use the graphic installer to create the basic Linux software
RAID partitions, because (as with RH 7.2 years ago) I found it mucked
around with the partition numbers and did not allow me to explicitly
control the results. It looks OK at first, but later, it started
messing things up.
So here is what I did instead.
I used
CD1 to boot into Rescue mode, by giving the command "linux rescue"
after the blue graphic screen appears. From there I ran fdisk and
gave
commands as detailed below. That resulted in the correct
partitions being made on /dev/hda. Then I copied that partition
table to /dev/hdc. Then I used the graphic installer to stitch
these together into ext3 partitions for the actual installation.
Here
is the result I arrived at. Details on how I did it follow.
  A RAID       B RAID       ext3        Mount      Size in final
  partition    partition    partition   point      Kbyte blocks

  /dev/hda1    /dev/hdc1    /dev/md0    /boot         200,781   Primary
  /dev/hda2    /dev/hdc2    /dev/md1    /          10,008,495   Primary
  /dev/hda3    /dev/hdc3                                         Extended - holds the logical partitions below
  /dev/hda4    /dev/hdc4                                         Not used
  /dev/hda5    /dev/hdc5    /dev/md2    /home      20,008,957   Logical
  /dev/hda6    /dev/hdc6    /dev/md3    (swap)      2,000,893   Logical
  /dev/hda7    /dev/hdc7    /dev/md4    /var/log    2,000,893   Logical
  /dev/hda8    /dev/hdc8    /dev/md5    /audio     43,913,646   Logical
I have a separate /boot directory. Maybe I
don't need it with this motherboard, but some motherboards can't find
the boot directory if it is beyond cylinder 1024. I chose 200M on
a whim, but I later read somewhere this was a good choice for /boot.
According to:
http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk
/boot is readable by both the grub and lilo boot loaders when it
is on an ext3 partition created from one or more Linux software RAID
partitions (type FD) - but only for RAID 1. grub or lilo just
read one such FD partition directly as if it was an ordinary ext3
partition. They don't actually run the RAID software AFAIK.
Then
I have
the / directory, which I chose to put on a primary partition as well,
for no particular reason.
I have a separate partition for /home,
which is where my email will accumulate.
/var/log gets its own
partition so if things go wrong, the messages won't grow into other
parts of the filing system.
I have a big partition at the end
"/audio" for various things. I wanted to keep the main partitions
which matter (/ and /home mainly, but of course /boot , swap and /var
also) near the outside of the disc. Also, I think, limiting the
size of each partition means that if errors appear on the drive in one
location (which will cause the FD partition it is in not to be used
any more by the RAID system) then this may limit the likelihood of
losing important data
on the partitions which matter. I would have been happy to keep
using 40G drives, but I wanted new ones and the 80G drives were the
smallest I could get.
I chose a rather large swap file - 2G.
This may be a bit excessive since I am only running 256M of RAM, though this may be increased later. Guidance on swap file size is at
Chapter 5 of the Deployment Guide. However, I think various
people have differing views on how best to choose the size of the swap
file.
Having the swap file on a partition created by RAID 1 from
two or more FD partitions on two drives means the computer can run fine
if one drive fails.
There can be 0 to 4 primary partitions, numbered 1 to 4. There can be 1 extended partition, which takes the slot just beyond the highest numbered primary partition used directly for data, so it can be partition 1, 2, 3 or 4. The extended partition (usually) extends all the way to the end of the drive. Within that, multiple logical partitions are created, numbered from 5, with some limit to how many (but I am not sure what).
I could have made partition 1 the extended partition and put all of these in logical partitions 5 onwards.
Initially,
I make these ordinary Linux partitions in fdisk, but then I change them
to be "Linux Software RAID" partitions of type FD.
The Barracuda
7200.10 ST380215A drive has
80,026,361,856 bytes. (AUD$50 each, including 10% tax.) It
has a single head, which is good for reliability, I think.
The physical arrangement of the drive is not visible to the BIOS or operating system. It presents itself to the OS as 63 512-byte sectors per track, and fdisk uses a geometry of 255 such tracks ("heads") per "cylinder".
fdisk therefore sees 16,065 512-byte sectors = 8,225,280 bytes = 8,032.5 blocks (of 1024 bytes) per cylinder.
If I want 10,000,000 blocks (just under 10GB, since a GB is 1024 MB, each of which is 1024 KB - "K" means 1024 and "k" means 1000) I divide 10,000,000 by 8032.5 and round up to the next integer. Since these block numbers are going to be staring at me for years when I run "df" to get a report on how full the partitions are, I chose to make them numbers which looked good in the 1K blocks df uses.
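To illustrate the arithmetic (a sketch only - bc is assumed to be available, and the figures are approximate):

  echo "scale=2; 10000000 / 8032.5" | bc   # About 1244.9 - round up to 1245 cylinders for /
  echo "scale=2; 20000000 / 8032.5" | bc   # About 2489.9 - round up to 2490 cylinders for /home
  echo "scale=2; 2000000 / 8032.5"  | bc   # About 249.0  - round up to 249 cylinders for swap and /var/log

These rounded-up cylinder counts are the "+1245", "+2490" and "+249" figures in the fdisk session below.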
Here is a summary of what I did in fdisk:

fdisk /dev/hda

  n       new
  p       primary
  1       Partition 1
  1       first cylinder
  25      end cylinder                   200,781 blocks   /boot

  p       print partition table

  n       new
  p       primary
  2       Partition 2
  enter   default start cylinder
  +1245   size in cylinders           10,008,495 blocks   /

  p       print partition table

  n       new
  e       extended
  3       Partition 3 will contain the rest as logical partitions
  enter   default start cylinder
  enter   default size in cylinders = rest of disk

  p       print partition table

  n       new
  l       logical (creates /dev/hda5)
  enter   default start cylinder
  +2490   size in cylinders           20,008,957 blocks   /home

  p       print partition table

  n       new
  l       logical (creates /dev/hda6)
  enter   default start cylinder
  +249    size in cylinders            2,000,893 blocks   (swap)

  p       print partition table

  n       new
  l       logical (creates /dev/hda7)
  enter   default start cylinder
  +249    size in cylinders            2,000,893 blocks   /var/log

  p       print partition table

  n       new
  l       logical (creates /dev/hda8)
  enter   default start cylinder
  enter   the rest of the disk        43,913,646 blocks   /audio

  p       print partition table

  w       Writes all this to the drive.
Now to edit the partition types and make them all software RAID, type FD:

fdisk /dev/hda

  t       Change partition type
  1       Partition 1
  fd      Linux raid autodetect

  (Repeat the t / partition number / fd sequence for partitions 2, 5, 6, 7 and 8.)

  w       Writes all this to the drive.
This was
labour-intensive, but I got exactly what I wanted.
Now to copy
this partition table to /dev/hdc:
sfdisk -d /dev/hda > blah
sfdisk /dev/hdc < blah
Then
check /dev/hdc has the right partitions.
fdisk /dev/hdc
p
Now, I could run the Graphic Installer (CD1).
At the point where it presents the two drives on which the system
can be installed, there is a pull-down list (or whatever) which only
shows one option, but which can be altered to select:
Create Custom Layout
Do not
tick Review and modify partitioning layout.
At the next screen,
there is a RAID button. I used that to join two FD (Linux
Software RAID) partitions into one partition as listed in the table
above.
In
each case, there is a list of all the FD partitions. I need to
deselect all but the two I want to join. I select RAID 1 for all
these, not the default RAID 0.
First, I create /dev/md0, from /dev/hda1 and /dev/hdc1, with a type of ext3 and a mount point of /boot.

Then /dev/md1, from /dev/hda2 and /dev/hdc2, with a type of ext3 and a mount point of /.

Then /dev/md2, from /dev/hda5 and /dev/hdc5, with a type of ext3 and a mount point of /home.

Then /dev/md3, from /dev/hda6 and /dev/hdc6, with a type of swap and no mount point.

Then /dev/md4, from /dev/hda7 and /dev/hdc7, with a type of ext3 and a mount point of /var/log.

Then /dev/md5, from /dev/hda8 and /dev/hdc8, with a type of ext3 and a mount point of /audio.
From this, I
could proceed to the rest of the installation process.
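For reference, this is roughly what the installer is doing behind the scenes for each pair - a sketch only, not commands I ran, and the installer no doubt does more:

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
  mkfs.ext3 /dev/md0          # Likewise for md1, md2, md4 and md5.
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/hda6 /dev/hdc6
  mkswap /dev/md3             # The swap pair gets mkswap instead of mkfs.ext3.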
See below
for my backup system for /dev/hda
Main installation
I won't comment on most of the
installation, since I don't have any special insights except perhaps
the following:
I selected the use of the CentOS Extras
repository. I don't know much about these, but it enabled me to
install the Horde IMP
webmail program.
Note:
2008-11-13: See below on how this is insecure unless you password
protect the /horde/ directory. Anyone can run command line
commands as user apache!
I want this machine to
run X Windows, so I can run Thunderbird for email access when my
desktop machines are down - such as during a power failure and when
this server runs on a UPS.
Normally, this server is not
used directly with a keyboard or screen, but I have a Logitech
Trackman, keyboard and screen connected to it at all times.
The
machine will later become gair.firstpr.com.au. For now, it is
nair.firstpr.com.au and like the other machines here (apart from gair)
has a private address on the LAN.
The old gair (actually,
another PC running the old gair's hard drives, since I am doing all
this on the PC for the past and future gair) is connected to the Net
via a fixed IP address ADSL service via eth1. Its eth0 is for the
LAN, in which it has a fixed address 10.0.0.1, and other machines and
printers have their own fixed addresses. I am not using WiFi,
DHCP etc. It is just Ethernet cables and two switches.
Initially
I make eth0 on nair have its own private address. Once I have got
most things installed, I will turn off the ADSL connection (to stop new
emails arriving) and then copy over all the email and other data from
(old) gair to nair. Then I will change nair into gair, with a new
name and with the 10.0.0.1 IP address, taking the old gair off the
system. Then I need to configure eth1 to do the ADSL stuff, and
configure a firewall, NAT etc. Then, the new machine will be the
new gair.
Most of this installation doesn't need an Internet
connection, but I put the new machine on the LAN so it could get other
packages, such as Horde IMP.
The system installs by default with
a kernel which supports the Xen virtualization approach to running
multiple operating systems at the same time. I don't ever want to
do this, so it would probably be best not to use Xen. I
understand that to stop using Xen, I can install a non-Xen kernel.
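A sketch of how that would probably go - I have not done it on this machine, and the package names are my understanding of CentOS 5 conventions ("kernel-xen" is the Xen kernel, "kernel" is the ordinary one):

  yum install kernel
  # Then edit /boot/grub/grub.conf so the new non-Xen kernel is the
  # default entry, and reboot.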
Here
are some rough notes on how I
selected packages for installation.
Install GRUB on /dev/md0. The default
operating system is in /dev/md1 ( the / file system)
eth0
10.0.0.2 255.255.255.0
Enable IPv6 support? Yes -
automatic neighbour discovery.
eth1 Disable for now.
nair.firstpr.com.au
Gateway
10.0.0.1
Primary DNS 10.0.0.1
Tick System clock uses UTC.
Select
packages:
Desktop Gnome
Server
Server
GUI
CentOS Extras? Yes. See:
http://wiki.centos.org/RepositoriesCentOSPlus
is already included.
Customize now.
Applications
Graphical Internet
Turn on GFTP (SFTP file transfer to my
server in the USA), Firefox and Thunderbird. Turn others off.
No
games, engineering, graphics, office productivity, sound, video etc.
Text
based Internet:
Keep:
elinks
(text mode browser)
Development
Dev libraries Turn on - keep defaults
Likewise
Development tools
All of Legacy Software Development
Servers
Servers
DNS
No FTP
Legacy network servers -
only xinetd
Mail Server
Turn off sendmail - in order to use Postfix
Keep
only:
Postfix
SpamAssassin
Squirrelmail (I later
installed it separately and am happy with it - I haven't yet documented
how I installed it.)
Network
servers
dhcp only
Base System
Admin tools - defaults
Base.
Turn on AIDE - intrusion detection. (I used to use TripWire. It was difficult to set up, but I thought it was good.)
Turn off sendmail and wireless tools.
Legacy
software support - defaults
System tools - turn on, but turn these off:
Off: bluez-* (the Bluetooth packages).
(What is hwbrowser? A GUI thing to view hardware devices.)
On - mc Midnight
Commander (I know the real geeks get by with pure
command-line control, but I like Midnight Commander.)
X-Window
- On with defaults
CentOS Extras
2008-11-13 Turning on Horde was a mistake,
because I did not realise the /horde/ directory enables anyone to run
command line commands, as user apache. For months, I was getting
hit by people downloading a tgz file into my /tmp/ directory, unzipping
it and then running a bunch of ssh-scan programs, peppering computers
all over the Net with attempts to get in, using the lists of usernames and passwords in the tgz file. If you use Horde, make sure there
is password protection on this. They can also run PHP commands.
It is completely insecure out of the box.
Turn on
Horde, but only for gollem (web
based file management) and Imp
The installation
went
fine and I rebooted the machine. It is running X-windows at 1024
x 768 (with an old Trident 4M AGP board, and an obscure 1024 x 768 LCD
monitor).
For now, I turned off the firewall. There is no
need for a firewall on eth0 later, since this is the LAN. I will
need a firewall on eth1 - the ADSL port, but will do that later.
SELinux:
Keep the default "Enforcing". This turned out to be a bad
idea. I don't have time to figure out how SELinux works, and it
later screwed up the Courier IMAPD server, so I later turned it off.
Kdump:
I was going
to reserve 64 Megs of RAM for this Kernel dump facility. I
do want to know what happened if the system crashes this badly.
However, there was a warning message I didn't have time to think
about to do with XEN (virtualization) and other complexities. It
turned out that with 256M RAM, there was not enough space anyway.
Kdump disabled.
Date and Time . . Network Time
Protocol. NTP is great. I enabled it and selected some
local time servers, from
http://support.ntp.org/bin/view/Servers/NTPPoolServers
http://www.pool.ntp.org/zone/oceania
:
0.oceania.pool.ntp.org
1.oceania.pool.ntp.org
2.oceania.pool.ntp.org
3.oceania.pool.ntp.org
Turn
off "Use Local Time Source".
After adding a non-root user, the
machine rebooted into X Windows and I was able to log in with that
username.
Within a minute, a notification appeared saying: "Updates available - There are 44 package updates available."
I installed
them all - but there was no sign this process was actually running.
After a few minutes, it displayed a box: "Resolving dependencies
for updates". This is a 750 MHz machine, so these things take a
while. Then it started downloading packages. System >
Administration > System Monitor showed network activity.
[I
did document more of the installation process, but lost that version of
the file. I can't remember all that happened, but it went
smoothly from here, and I was able to boot the machine, have X Windows
run etc.]
Cloning: Whole
Hard Drive backup with Software RAID 1
The purpose of this technique is to have
three or so spare drives (all, ideally, the same model), ideally stored
offsite, each of which
is made to be a bit-for-bit clone of /dev/hda.
Then, that drive
can be plugged into any motherboard, to become the server (subject to the kudzu gotcha described below). It
will complain about not having its /dev/hdc, so all the partitions will
be complaining of RAID failure, but the machine will run fine.
Making a /dev/hdc drive for it to recreate all the data onto
could be a little tricky, but later I investigated and found that by
cloning /dev/hda to /dev/hdc, the system seemed to work fine with both
sets of partitions working as normal with software RAID 1.
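(A more orthodox way to rebuild onto a fresh /dev/hdc, rather than cloning the whole drive, would be roughly the following - a sketch only, since cloning and rebooting worked fine for me:

  sfdisk -d /dev/hda | sfdisk /dev/hdc   # Copy just the partition table.
  mdadm /dev/md0 --add /dev/hdc1         # Hot-add each partition to its array;
  mdadm /dev/md1 --add /dev/hdc2         # repeat for md2 to md5.  The RAID
                                         # system then resynchronises each
                                         # pair in the background.)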
The
idea is to boot from
CD-ROM (or a USB Flash drive . . that would not slow down the second
IDE cable) and use a simple cp command to copy all bytes from one drive
to the other.
Here is what I do:
Shut the machine down ("shutdown -h now" or
"reboot -f") and
turn off the power.
Remove the drive from /dev/hdc and plug in
another drive there which will be the backup.
Plug in a CD-ROM
drive as noted above for installation, and boot from CD1 of the CentOS
5.1 set.
When this boots, give the command "linux rescue" and then take the
following steps, which result in a single-user mode shell environment,
in which the hard drives are not mounted to the file system, and in
which the software RAID process is not running. The drives are
physical devices and can be read and written as a linear file.
Choose the language (English) and the keyboard (US).
Do not
turn on the network connection - No.
The
next step is not so obvious. Do not use "Continue" or "Read-Only".
Use "Skip". This gives a
shell account in single user mode. There is no software raid, no
mounting of the hard-drive's data in the current file system (that is
via "Continue"). Just a shell account, with no other processes
running, and the two drives appearing to be like files at /dev/hda and
/dev/hdc.
Now,
copy the contents of one drive to the other: partition table, RAID
partitions etc. all together:
cp
/dev/hda /dev/hdc
This took about 50 minutes for my 80G
drives. It would probably be faster if I didn't have the old 40
wire ATA cable slowing down the 80 wire Ultra-ATA cable to /dev/hdc.
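(dd does the same job and lets me pick a larger block size, which may be faster on some setups - a sketch, since plain cp is what I actually use:

  dd if=/dev/hda of=/dev/hdc bs=1M
  # Sending the dd process a USR1 signal from another virtual console
  # (Alt-F2 in the rescue environment) makes it report how far it has got:
  #   kill -USR1 <pid-of-dd>
)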
Then,
reset the machine or turn it off and take out the drive in /dev/hdc.
Replace the original /dev/hdc drive.
The machine will now
reboot in RAID 1 mode and operate as usual.
In this entire backup procedure, the RAID software has not run. Therefore the
counters in the superblock of each RAID FD partition have not been
altered, and the drives boot as they would have before - so the machine
returns to normal operation as if nothing happened.
(If I had booted the machine normally
(not single user rescue mode) with just the /dev/hda drive, the
counters in some or all of the superblocks would be higher than the
counters in the corresponding superblocks in /dev/hdc. Then,
rebooting normally with the two drives would cause the data in those
partitions in /dev/hda to take precedence over the data in those
partitions in /dev/hdc. The RAID system would automatically copy
the /dev/hda data to /dev/hdc for those partitions.)
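(For reference, those counters can be inspected with mdadm - not something this procedure needs:

  mdadm --examine /dev/hda1 | grep -i events
  mdadm --examine /dev/hdc1 | grep -i events

The member with the higher event count is the one whose data the RAID system treats as current.)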
System
restoration from a single drive
Not counting the kudzu gotcha (see the next section), here
is how I restore the server starting from a single drive, either:
- The
original /dev/hda drive, which has no problems with any of its
partitions. (For instance, /dev/hdc failed entirely, or has one
or more partitions which have errors and so were no longer used by the
RAID system. Therefore, the computer kept running for a while,
depending only on the partition on /dev/hda.)
- A clone of
the original /dev/hda drive - as per the previous section.
- The
original /dev/hdc drive. (For instance, /dev/hda failed in some
way.)
- A clone of the original /dev/hdc drive (though my backup
procedure only creates clones of the original /dev/hda drive.)
In
all these cases, I have a drive which has the complete set of current
data. I may be restoring the system on the same motherboard, or
on another machine. This doesn't cope with having only two
drives, where one drive has a failure in one or more partitions and
the other has failures on some others. (I think the best way to
handle that is to mount each drive on some other system which can do
software RAID, and which won't be confused by these FD partitions and
accidentally stitch them into an already-existing ext3 partition, and
then read the data from them.)
All these procedures work fine,
with absolute minimal fuss:
- Use just one drive, plugged into
/dev/hda. It will boot and the machine will run fine - but
without any RAID backup if something goes wrong.
- Use the backup
procedure to clone the copy of the good drive over to a second drive,
and then boot with them both. At least for me, the system is
perfectly happy and is running with full RAID 1 protection. With
my setup at least, it doesn't matter whether the drive I use with the
data was in /dev/hda or /dev/hdc, or a clone of either.
This
makes for system restoration with minimal reading of the complex
software RAID documentation!
However, see the next section for
some nasty gotchas if you run the drive with different Ethernet boards
than those of the original system.
Kudzu Gotcha:
Running the
system with different NICs
Please
see a separate page "kudzu-mods" from the parent page
../
for how I planned to modify kudzu to solve this problem. However,
I have not yet done this. That page has a detailed analysis of
some parts of the kudzu source code.
If I
change one or both of the two Ethernet NICs in this machine, this
Gotcha applies. "Change" means a different type of card, or the
same type, but a different physical unit - and therefore with a
different Ethernet MAC address. (According to my understanding of
kudzu, based on reading the source code, if I kept the same two NICs A
and B, previously eth0 and eth1 and plugged them in differing slots so
the kernel discovered them in a different order, kudzu would swap over
the configuration information so they still functioned as A = eth0 and
B = eth1, despite B being discovered first and so otherwise being
known as eth0. I am not sure if this is what it is supposed to
do, or how useful it would be.)
Let's
say I have two perfectly good files:
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/network-scripts/ifcfg-eth1
which
work with particular cards A and B, where A is in a lower number PCI
slot than B. (How are these things ordered in a motherboard with
on-board NICs? - I guess the on-board one is found by the kernel before
any plugged into a PCI slot.)
The machine runs fine, since these
files were created by the installer, or were manually edited by me.
However,
if I replace one or both A and B with different cards, or if I take
this hard drive data (in the drive, or two RAID 1 drives, or in clones
of them) and boot it on a machine which does not have the exact cards A
and B in the same order in its PCI arrangement, then kudzu will do some
stuff which in these circumstances is not helpful.
It renames ifcfg-eth0 (or ifcfg-eth1) to ifcfg-eth0.bak (or ifcfg-eth1.bak), overwriting any file of that name. Then it creates a new ifcfg-eth0 file set up to be a
DHCP client. It tries for a while to get this going, during boot
up - and if this is what we want, then I guess there is no problem.
However it is not what I want.
In my setup, eth0 is to the
LAN, and this machine has a fixed IP address 10.0.0.1 on it. (I
might also make it the DHCP server.) So kudzu's attempt to get it
to work as a DHCP client is pointless. Now, if I restart the
machine with the old board A as eth0, it still won't work, because my
original ifcfg-eth0 file has been renamed to a backup file.
Probably this time, that will be overwritten by copying the DHCP
mode current ifcfg-eth0 file to ifcfg-eth0.bak!
So this is not
just unhelpful, it is data loss - loss of vital configuration
information.
According to some things I read at:
(Oops, where was it?) this action occurs at boot time when kudzu
finds a device which is not mentioned in /etc/sysconfig/hwconf
That file contains details of devices kudzu found last time the
machine booted.
If it is impossible to boot, I use Linux Rescue
as discussed below to edit out the sections of /etc/sysconfig/hwconf
which mention NICs from the past, and to get rid of any
/etc/sysconfig/network-scripts/ifcfg-ethx files I don't want, probably
all of them.
When I have set the machine up with good ifcfg-ethx files, I make copies of them into the /root directory as good-ifcfg-eth0 and good-ifcfg-eth1, so that I can manually copy them back to ifcfg-ethx if I need to (such as after a boot which gives me console access, or with Linux Rescue).
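In other words, once the files are known to be good, something like:

  cp /etc/sysconfig/network-scripts/ifcfg-eth0 /root/good-ifcfg-eth0
  cp /etc/sysconfig/network-scripts/ifcfg-eth1 /root/good-ifcfg-eth1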
So bringing the
machine up with different NICs, or booting a copy of this machine's
hard drive in another machine with different NICs will require some
careful work to get the machine to boot correctly. The initial
attempt will clobber the eth0 setup, making it unworkable. My
eth1 setup is minimal. The ifcfg-eth1 only tells the machine to
enable it at boot time. It doesn't specify DHCP or anything else
- and in my case, since the ADSL modem has a DHCP server, eth1
certainly should not be running a DHCP client. eth1 is only used
by ppp0, and all that is needed is that the kernel installs the driver
for the NIC. The software which is part of the rp-pppoe package
handles
everything else.
This would be a tricky thing if the
server was not accessible via keyboard and screen - to manually copy
files and reboot the machine after the initial attempt fails to get
eth0 going. If administrative access to the machine was via eth0,
then this would be impossible. Since many machines use an
Ethernet interface on the motherboard, running the disk image on
another motherboard would cause this problem. How then to copy
the files if access via the network was the only approach???
What
if some screwy stuff with NICs and these ifcfg files prevents the
machine from booting sufficiently to run a terminal session? Then
it would be necessary (after this boot attempt presumably writes the
current NIC details to /etc/sysconfig/hwconf) to boot from CD-ROM with
the first disc of the set, and use Rescue to fix the problem.
After typing "linux rescue" at the blue CD-ROM boot screen, and
selecting the language and keyboard type, it might be best to allow it
to start the network interfaces, perhaps without any cables plugged
into them:
Ah-Ha . . .

Assuming these things are currently "unconfigured" (as they will be, since the rescue system has not looked at the hard drives yet), we can see what type of NIC each one is, with the Edit function. I have multiple NICs which look physically identical to the "Intel Corporation 82557/8/9 Ethernet Pro 100" (apart from serial number stickers), but which are labelled as IBM, HP or whatever. They are all recognised as being the one type by this Linux system.
Maybe I
don't want to configure these NICs to do anything right now, but it is
good to see that the system (the rescue system at least) recognises
them.
Now the Rescue system asks if I want to Continue (it
will
try to mount the hard-drives' file system under /mnt/sysimage) or to do
this "Read-Only". (Skip is what I use when copying the entire
drive's contents, as documented in the software RAID 1 page via the
parent directory
../ , but now I want to mount the
file system and edit some
files.) I don't want "Read-Only" since I want to change some
files.
This seems to work OK with software RAID 1. It mounts the / directory from the hard drives, at least, at /mnt/sysimage/ - but not the other partitions mentioned in /etc/fstab. This is single user mode, and I can't run Midnight Commander (mc) because there is a segmentation fault. I can run joe, so I can edit files.
My general procedure for coping with new NICs is,
after installing them:
1 -
Delete all files /etc/sysconfig/network-scripts/ifcfg-ethx.
However, I keep my backup files such as good-ifcfg-eth0, or
backups somewhere else, such as /root. Keeping any name starting
with "ifcfg-eth" in the /etc/sysconfig/network-scripts/ directory or
any subdirectory thereof will cause them to be seen by the boot up
stuff and it may try to set up eth0, eth1 etc. more than once.
2 - Edit /etc/sysconfig/hwconf to remove the sections concerning NICs.

3 - Exit the rescue system (exit) and turn the machine off.

4 - The new NICs are installed, but are not connected to anything yet.

5 - Reboot the machine. If the boot is not sufficient to edit the files manually, power it down with "reboot -f" and do the following:

6 - Reboot with Linux Rescue and edit the ifcfg-ethx files, based on the backup files. Power down.

7 - Reboot and hopefully the system will run normally.
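A sketch of how steps 1, 2 and 6 look from the Linux Rescue shell, assuming the hard drive file system has been mounted at /mnt/sysimage (the "Continue" option):

  cd /mnt/sysimage/etc/sysconfig/network-scripts
  rm -f ifcfg-eth0 ifcfg-eth1                     # Step 1
  joe /mnt/sysimage/etc/sysconfig/hwconf          # Step 2 - delete the NIC sections
  # Step 6, later: restore the known-good files from the backups in /root.
  cp /mnt/sysimage/root/good-ifcfg-eth0 ifcfg-eth0
  cp /mnt/sysimage/root/good-ifcfg-eth1 ifcfg-eth1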
See the page (from
../) "CentOS 5.1 Miscellaneous configuration items" for
some samples of the ifcfg-ethx files I used.
Reporting
RAID failures
There is probably
a better way of doing
this, but it works for me.
I want to get an email if there is a
failure of some RAID partition. For instance, if there is an
unrecoverable read error in one of the two FD partitions which are used
as a RAID 1 pair to create an ext3 partition (or the swap partition!)
then I want to know about it. The computer will run fine with a
single such failure.
The principle is to have an hourly cron job
(maybe do it more frequently than this) to look into "cat /proc/mdstat"
for any underscore characters.
Normally, there are none, since
it looks like this:
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
      200704 blocks [2/2] [UU]
md2 : active raid1 hdc5[1] hda5[0]
      20008832 blocks [2/2] [UU]
md3 : active raid1 hdc6[1] hda6[0]
      2008000 blocks [2/2] [UU]
md4 : active raid1 hdc7[1] hda7[0]
      2008000 blocks [2/2] [UU]
md5 : active raid1 hdc8[1] hda8[0]
      43913536 blocks [2/2] [UU]
md1 : active raid1 hdc2[1] hda2[0]
      10008384 blocks [2/2] [UU]
unused devices: <none>
However, when one or two FD
partitions have unrecoverable read errors, the lines look like this:
10080384 blocks [2/1] [_U]
10080384 blocks [2/1] [U_]
10080384 blocks [2/0] [__]
I
create a file raid-test.sh in /etc/cron.hourly:
#!/bin/bash
/root/raid-hourly-test.sh
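(This wrapper and the two scripts below must be executable, e.g.:

  chmod +x /etc/cron.hourly/raid-test.sh /root/raid-hourly-test.sh /root/raid-mail-problem.sh
)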
Then
the test script is in /root/raid-hourly-test.sh:
#!/bin/bash
#
# For debugging, comment the next line and doctor the file to be bad.

cat /proc/mdstat > /root/mdstat-for-test.txt

# I want to look for any occurrence of underscore, except for the one
# in "read_ahead".
#
# One way is to look for the word "block" and then for any underscore
# which follows, but I can't see how to do that with grep.
#
# Instead, look for any character which is not "d" followed by an underscore.

if grep '[^d]_' /root/mdstat-for-test.txt > /dev/null; then
    echo Mailing RAID hard disk error report!!!
    /root/raid-mail-problem.sh
fi
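(For what it is worth, restricting the test to underscores which follow the word "blocks" on the same line - which is what the comment above is getting at - could be done with a test like this instead:

  if grep 'blocks.*_' /root/mdstat-for-test.txt > /dev/null; then
)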
The other script is called when there is a problem:
/root/raid-mail-problem.sh:
#!/bin/bash
#
# Generate an email to me reporting a problem with hard drives
# in the RAID system.
#
# Include the current /proc/mdstat and some recent lines from
# /var/log/messages.  Assemble the email in a file.

echo "To: xx@yy.zz"                                          >  /root/mdstat-for-email.txt
echo "Subject: !! Hard drive failure in xxxx RAID array !!"  >> /root/mdstat-for-email.txt
echo                                                         >> /root/mdstat-for-email.txt
cat /proc/mdstat                                             >> /root/mdstat-for-email.txt
echo                                                         >> /root/mdstat-for-email.txt
echo "Some recent /var/log/messages lines:"                  >> /root/mdstat-for-email.txt
echo                                                         >> /root/mdstat-for-email.txt
tail -n 50 /var/log/messages                                 >> /root/mdstat-for-email.txt

sendmail -t < /root/mdstat-for-email.txt
The sendmail command is the one which comes with Postfix - not the ordinary sendmail MTA program. The -t option makes it take the recipient address from the To: header of the assembled file.
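(For what it is worth, mdadm itself has a monitoring mode which can email warnings when an array degrades - roughly as follows, with the destination set by a MAILADDR line in /etc/mdadm.conf. CentOS ships an init script, mdmonitor, which does this, as far as I know. The cron scripts above are what this page actually uses.

  mdadm --monitor --scan --daemonise
)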
Some
useful RedHat Package Manager query commands
rpm -ql packagename > list.txt
    Lists all the files in a currently installed package. There's no need
    to use the .rpm extension or its version number, so if the package's
    full file name is wget-1.8.2-4.72.i386.rpm, just use wget.

rpm -qpl packagename.rpm > list.txt
    Lists all the files in an RPM file, irrespective of whether it is
    installed. Use the full file name, or, for instance, a shortened
    version such as wget*.

rpm -qa > rpmlot.txt
    Lists the names of all installed packages, I assume in order of their
    installation.

rpm -qa | sort > rpmlot.txt
    Ditto, alphabetically sorted.

rpm -qal > rpmlotfiles.txt
    Lists all the files of all installed packages. There is no clear
    indication of which package each file belongs to, but it is not too
    hard to figure that out with:

rpm -qf file-name
    Lists the package a file belongs to.
Conclusion
The above account explains how I got my
basic server software installed.
Please see the
CentOS-misc-config page (via the parent directory
../)
for a bunch of configuration items I did before tackling the mailserver
stuff.