Category Archives: Sysadmin

Diagnosing a slow server

Our server at Alzatex has been sluggish for a while now, and until today I had been unable to find a cause. With what I found out, I’m beginning to think that we just have an old, slow hard drive in our RAID6. What do you think?

$ sudo hdparm -tT --direct /dev/sd[abcd]
/dev/sda:
 Timing O_DIRECT cached reads:   420 MB in  2.00 seconds = 209.87 MB/sec
 Timing O_DIRECT disk reads:  318 MB in  3.01 seconds = 105.60 MB/sec

/dev/sdb:
 Timing O_DIRECT cached reads:   492 MB in  2.00 seconds = 245.53 MB/sec
 Timing O_DIRECT disk reads:  268 MB in  3.10 seconds =  86.40 MB/sec

/dev/sdc:
 Timing O_DIRECT cached reads:   408 MB in  2.01 seconds = 203.34 MB/sec
 Timing O_DIRECT disk reads:  146 MB in  3.12 seconds =  46.76 MB/sec

/dev/sdd:
 Timing O_DIRECT cached reads:   478 MB in  2.01 seconds = 238.25 MB/sec
 Timing O_DIRECT disk reads:  272 MB in  3.01 seconds =  90.50 MB/sec

sdc’s cached read speed looks OK, but its raw disk read speed is less than half that of the next slowest drive. Next up, I looked at the S.M.A.R.T. attributes of each drive. I’ve trimmed the output to only the most interesting attributes.

$ sudo smartctl -A /dev/sda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   240   236   021    Pre-fail  Always       -       1000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13603
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdb
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   236   021    Pre-fail  Always       -       1016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24607
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdc
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0003   225   224   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42601
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       2

$ sudo smartctl -A /dev/sdd
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   238   021    Pre-fail  Always       -       1025
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26475
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

First, a quick review of S.M.A.R.T. attributes for the uninitiated. RAW_VALUE is the actual value of the attribute, such as degrees Celsius or hours powered on, whereas VALUE is the attribute normalized to a scale of 1–253. The normalized value allows easy analysis without much regard to the meaning of the attribute: 1 represents the worst case and 253 the best. RAW_VALUE, on the other hand, often increases as things get worse, like the number of bad sectors. WORST is the lowest VALUE ever recorded. There are two flavors of attributes: pre-fail and old-age. When a pre-fail attribute’s VALUE crosses the manufacturer-defined threshold (THRESH), it means failure is imminent. Old-age attributes just indicate wear and tear and don’t usually have a meaningful threshold.
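
For a quick verdict based on those thresholds, smartctl can also print the drive’s overall self-assessment. The exact wording varies a little between smartmontools versions, but the output should look something like this:

$ sudo smartctl -H /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED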

After reviewing the S.M.A.R.T. attributes, there are definitely issues with sdc. It is the only drive showing a non-zero Raw_Read_Error_Rate or Offline_Uncorrectable. Its Power_On_Hours, which really just indicates age, is nearly double that of the second-oldest drive. That’s not necessarily a problem by itself, but what really worries me is that the Spin_Up_Time is much higher than on the other drives. The Start_Stop_Count is also the highest on sdc.

Through the normal wear and tear of a hard drive, sectors can go bad and become unusable. Modern hard drives normally ship with a number of hidden, unused sectors reserved for this situation. When the drive’s controller detects a failure to write a sector, it remaps that sector’s address to one of the unused reserved sectors instead. When this happens but there are no free reserved sectors left, you get an uncorrectable sector. So when Offline_Uncorrectable goes above zero, it means the drive has accumulated more bad sectors than the manufacturer reserved spares for, and those sectors can no longer be used.
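
The attributes involved in remapping are easy to pull out in one shot. As a rough illustration (attribute names can vary slightly between vendors), something like this shows the full remapping picture for a drive:

$ sudo smartctl -A /dev/sdc | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'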

Currently, none of the attributes have crossed a threshold indicating potential failure, and, interestingly, attributes 1 and 198 don’t appear to be reflected in the VALUE field. I also run regular, nightly S.M.A.R.T. tests on all drives, and no issues have been reported by those tests. There is also nothing in the system logs about any recent errors or warnings; the only symptom is that the server is, overall, a little sluggish. Here’s a snippet of the self-test log as retrieved from the hard drive itself:

$ sudo smartctl -l selftest /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
...
# 8  Short offline       Completed without error       00%     42414         -
# 9  Extended offline    Completed without error       00%     42395         -
...
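
These entries come from regularly scheduled tests (more on that in a moment), but you can also kick off a test by hand and check the result once it finishes; something like this should work:

$ sudo smartctl -t short /dev/sdc
$ sudo smartctl -l selftest /dev/sdc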

If there were any failed attributes or tests, I would normally get an email about it, and the event would be logged via syslog. I have the smartd daemon installed and configured to run regular tests with something similar to this:

$ cat /etc/smartd.conf
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

This says to enable S.M.A.R.T. and monitor hard drives /dev/sd[a-d], running a short test every night at 2 AM and an extended (long) test every Saturday at 3 AM (the 6 in the second schedule is the day of the week, with Monday as 1). If there are any warnings or errors, smartd reports via syslog and sends an email to root. I normally have most daemons send critical mail like this to root and make root an alias that redirects the email to all the primary system administrators.
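
After editing smartd.conf, it’s worth verifying that the directives parse and that every drive responds. smartd has a one-shot debug mode for exactly this; run it, then restart the daemon so the new configuration takes effect (on Debian-style systems the service is typically called smartmontools):

$ sudo smartd -q onecheck
$ sudo invoke-rc.d smartmontools restart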

Well, next up, I plan to run a full-fledged benchmark on the drives and replace any that don’t quite meet expectations.
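
If you want a head start on that, a minimal sketch of a sequential-read benchmark with fio (assuming it’s installed) looks something like the following; the --readonly flag ensures fio can’t write to the device:

$ sudo fio --name=seqread --filename=/dev/sdc --readonly \
      --direct=1 --rw=read --bs=1M --size=1G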

Monitoring Dæmons with CFEngine 3

I’ve been looking for a nice, simple method for verifying that all key services and dæmons are running on my UNIX servers. It’s pretty rare for a service to die, but, when it does, I want it restarted as soon as possible and I want to be notified about it! I’ve looked at process supervisors like daemontools and runit that are designed to run continuously and monitor the dæmons they start, but they tend to require more effort to maintain than I like. Normally, the sysadmin has to write the start-up script that sets up the environment and starts the dæmon, but that script must also somehow coerce the dæmon not to do its normal double-fork and disassociation from its parent process and controlling terminal. I mainly just want a way to check for a process that’s not running and run the appropriate command to restart it.

CFEngine 3 seems to be a little closer to what I want and after reading through the excellent Learning CFEngine 3 book from O’Reilly, I think I’ve finally figured out the right recipe. All I want to do is to specify a process to look for and, if that process is not running, to specify a command to run that will restart the process. I would also like a report if a process ever needs to be restarted since that normally represents an abnormal event. Here’s the basic configuration I have to monitor a few services:

  any::
    "services[ssh][name]"           string => "OpenSSH";
    "services[ssh][process]"        string => "/usr/sbin/sshd";
    "services[ssh][restart]"        string => "/usr/sbin/invoke-rc.d ssh restart";

  web|dns::
    "services[ldap][name]"          string => "OpenLDAP";
    "services[ldap][process]"       string => "/usr/sbin/slapd";
    "services[ldap][restart]"       string => "/usr/sbin/invoke-rc.d slapd restart";
    "services[bind][name]"          string => "BIND";
    "services[bind][process]"       string => "/usr/sbin/named";
    "services[bind][restart]"       string => "/usr/sbin/invoke-rc.d bind9 restart";

  web::
    "services[apache][name]"        string => "Apache";
    "services[apache][process]"     string => "/usr/sbin/apache2";
    "services[apache][restart]"     string => "/usr/sbin/invoke-rc.d apache2 restart";

And that’s it! The above says that every server must be running OpenSSH, servers in the web and dns classes must be running OpenLDAP and BIND, and servers in the web class must also be running Apache. It also gives the name of the process to look for and the command necessary to restart the service if that process isn’t running. I just repeat the same three lines for each service I want to monitor and place the correct classes in front to select which servers will run those services. Here’s the full file to run this:

body common control
{
    bundlesequence => { "services" };
    inputs => { "cfengine_stdlib.cf" };
}

bundle agent services
{
vars:
  any::
    "services[ssh][name]"           string => "OpenSSH";
    "services[ssh][process]"        string => "/usr/sbin/sshd";
    "services[ssh][restart]"        string => "/usr/sbin/invoke-rc.d ssh restart";

  web|dns::
    "services[ldap][name]"          string => "OpenLDAP";
    "services[ldap][process]"       string => "/usr/sbin/slapd";
    "services[ldap][restart]"       string => "/usr/sbin/invoke-rc.d slapd restart";
    "services[bind][name]"          string => "BIND";
    "services[bind][process]"       string => "/usr/sbin/named";
    "services[bind][restart]"       string => "/usr/sbin/invoke-rc.d bind9 restart";

  web::
    "services[apache][name]"        string => "Apache";
    "services[apache][process]"     string => "/usr/sbin/apache2";
    "services[apache][restart]"     string => "/usr/sbin/invoke-rc.d apache2 restart";

  any::
    "services" slist => getindices("services");

processes:
    "$(services[$(services)][process])"
        restart_class => "service_$(services)_restart";

commands:
    "$(services[$(services)][restart])"
        classes => if_notkept("service_$(services)_failed"),
        ifvarclass => "service_$(services)_restart";

reports:
    "$(services[$(services)][name]) is not running, restarting..."
        ifvarclass => "service_$(services)_restart";

    "$(services[$(services)][name]) failed to start!"
        ifvarclass => "service_$(services)_failed";
}

All the configuration of each service takes place entirely in the vars: section. The processes: section searches for each process from the services array and defines a class if the process is not found. The commands: section runs the appropriate command to restart the service if the corresponding class was set by processes:. Normally, CFEngine is silent, but the reports: section generates output, which ends up in an email and a log file, whenever a process needed to be restarted or failed to restart.
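
To try the policy out, you can run the agent against the file by hand. As a sketch, assuming the policy is saved as services.cf in a directory where cfengine_stdlib.cf can be found, -K skips the usual lock-based rate limiting and -I makes the agent print what it did:

$ sudo cf-agent -K -I -f ./services.cf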

This solution isn’t perfect: unlike a real process supervisor, CFEngine does not get immediate notification when a process dies, much less know the cause of death, but it does offer reasonable response time. My example also doesn’t handle rate limiting or cap the number of attempts to restart a dæmon, though CFEngine does have some built-in rate limiting of its own. Overall, I’ve found this solution simple to maintain and scale, and it’s now live and running on my servers.

Making System Administration Easy for the UNIX Sysadmin

I think it’s about time I start collecting my random thoughts and arranging them into something useful. I’ve been collecting the various scripts, configuration templates, and procedures I’ve written into an internal wiki for my own benefit, but I’d love to publish them in blog format and get some feedback. To kick things off, I’m planning a series on making system administration easier for the UNIX sysadmin. My first post will be on automating the installation of a Linux distribution.

I’ve been working on making it easier to deploy additional Linux and Windows workstations, with the goal of treating them as disposable machines. After seeing what my users can do to their machines in about five minutes, I’m thinking that just replacing a machine may be better for both of us. Other topics will include automating Windows installation, deploying software updates, and managing system configuration. Later on, I’d like to post some guides to setting up LDAP, Samba, NFS, and Kerberos. There are also some less common topics I can cover, like setting up OpenAFS, deploying WPA2 Enterprise with FreeRADIUS, and managing a private public-key infrastructure within an organization without shelling out money or relying on self-signed certificates. I also have some experience with a few different backup systems and have done two bare-metal restorations, plus a number of selective restores when a user decided to experiment a little.

If you have any requests or recommendations on topics I should cover, please leave a comment below.