Diagnosing a slow server

Our server at Alzatex has been acting slow for a while, but I was unable to find a cause until now. With what I found out today, I’m beginning to think that we just have an old, slow hard drive in our RAID6. What do you think?

$ sudo hdparm -tT --direct /dev/sd[abcd]
/dev/sda:
 Timing O_DIRECT cached reads:   420 MB in  2.00 seconds = 209.87 MB/sec
 Timing O_DIRECT disk reads:  318 MB in  3.01 seconds = 105.60 MB/sec

/dev/sdb:
 Timing O_DIRECT cached reads:   492 MB in  2.00 seconds = 245.53 MB/sec
 Timing O_DIRECT disk reads:  268 MB in  3.10 seconds =  86.40 MB/sec

/dev/sdc:
 Timing O_DIRECT cached reads:   408 MB in  2.01 seconds = 203.34 MB/sec
 Timing O_DIRECT disk reads:  146 MB in  3.12 seconds =  46.76 MB/sec

/dev/sdd:
 Timing O_DIRECT cached reads:   478 MB in  2.01 seconds = 238.25 MB/sec
 Timing O_DIRECT disk reads:  272 MB in  3.01 seconds =  90.50 MB/sec
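
To compare these numbers at a glance, here is a rough Python sketch that pulls the "disk reads" rate for each drive out of the hdparm output and reports how the slowest drive compares with the next slowest. The function names are my own and only the output format shown above is assumed; treat it as a sketch rather than a robust parser.

```python
import re

# hdparm output copied from the run above.
HDPARM_OUTPUT = """\
/dev/sda:
 Timing O_DIRECT cached reads:   420 MB in  2.00 seconds = 209.87 MB/sec
 Timing O_DIRECT disk reads:  318 MB in  3.01 seconds = 105.60 MB/sec

/dev/sdb:
 Timing O_DIRECT cached reads:   492 MB in  2.00 seconds = 245.53 MB/sec
 Timing O_DIRECT disk reads:  268 MB in  3.10 seconds =  86.40 MB/sec

/dev/sdc:
 Timing O_DIRECT cached reads:   408 MB in  2.01 seconds = 203.34 MB/sec
 Timing O_DIRECT disk reads:  146 MB in  3.12 seconds =  46.76 MB/sec

/dev/sdd:
 Timing O_DIRECT cached reads:   478 MB in  2.01 seconds = 238.25 MB/sec
 Timing O_DIRECT disk reads:  272 MB in  3.01 seconds =  90.50 MB/sec
"""


def disk_read_rates(text):
    """Return {device: MB/sec} taken from the 'disk reads' lines."""
    rates = {}
    device = None
    for line in text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            device = header.group(1)
            continue
        timing = re.search(r"disk reads:.*=\s*([\d.]+) MB/sec", line)
        if timing and device:
            rates[device] = float(timing.group(1))
    return rates


def slowest_vs_next(rates):
    """Return (slowest device, its rate divided by the next slowest rate)."""
    ordered = sorted(rates.items(), key=lambda item: item[1])
    (slow_dev, slow_rate), (_, next_rate) = ordered[0], ordered[1]
    return slow_dev, slow_rate / next_rate


rates = disk_read_rates(HDPARM_OUTPUT)
device, ratio = slowest_vs_next(rates)
print(f"{device} reads at {ratio:.0%} of the next slowest drive")
# prints: /dev/sdc reads at 54% of the next slowest drive
```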

sdc’s cached read rate looks ok, but its raw disk read rate is only a little over half that of the next slowest drive (46.76 MB/sec versus 86.40 MB/sec on sdb). Next up, I tried looking at the S.M.A.R.T. attributes of each drive. I’ve trimmed the output to only the most interesting attributes.

$ sudo smartctl -A /dev/sda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   240   236   021    Pre-fail  Always       -       1000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13603
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdb
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   236   021    Pre-fail  Always       -       1016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24607
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdc
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0003   225   224   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42601
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       2

$ sudo smartctl -A /dev/sdd
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   238   021    Pre-fail  Always       -       1025
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26475
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

First, a quick review of S.M.A.R.T. attributes for the uninitiated. RAW_VALUE is the actual value of the attribute, such as degrees Celsius or hours powered on, whereas VALUE is the attribute normalized to a scale of 1-253. The normalized value allows easy analysis without much regard to the meaning of the attribute: 1 represents the worst case and 253 the best. RAW_VALUE, on the other hand, often increases as things grow worse, like the number of bad sectors. WORST is the lowest value ever recorded for VALUE. There are two flavors of attributes, old age and pre-fail. When the VALUE of a pre-fail attribute drops to or below the manufacturer’s defined threshold (THRESH), it means failure is imminent. Old age attributes just indicate wear and tear and don’t usually have a threshold.
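
The VALUE versus THRESH check described above can be sketched in Python. This is a minimal example that assumes the ten-column `smartctl -A` layout shown earlier and plain integer raw values; real output can carry raw values that are not simple integers, which this sketch does not handle. The names `Attribute`, `parse_attributes` and `failing` are my own.

```python
from typing import NamedTuple

# Trimmed attribute table for sdc, copied from the smartctl output above.
SDC_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0003   225   224   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42601
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       2
"""


class Attribute(NamedTuple):
    id: int
    name: str
    value: int    # normalized, higher is better
    worst: int    # lowest normalized value ever recorded
    thresh: int   # 0 means no threshold is defined
    type: str     # "Pre-fail" or "Old_age"
    raw: int


def parse_attributes(text):
    """Parse data rows; the header row is skipped because its first field is not a number."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(Attribute(
            id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), thresh=int(fields[5]),
            type=fields[6], raw=int(fields[9]),
        ))
    return attrs


def failing(attrs):
    """Attributes whose normalized VALUE has dropped to or below a defined threshold."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


attrs = parse_attributes(SDC_ATTRIBUTES)
for a in attrs:
    if a.thresh > 0:
        print(f"{a.name}: VALUE {a.value} is {a.value - a.thresh} above THRESH {a.thresh}")
print("failing:", [a.name for a in failing(attrs)])
```

On the sdc table this reports no failing attributes, even though the raw values of attributes 1 and 198 are non-zero.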

After reviewing the S.M.A.R.T. attributes, there are definitely issues with sdc. It is the only drive showing a non-zero Raw_Read_Error_Rate as well as a non-zero Offline_Uncorrectable. Its Power_On_Hours, which really just indicates age, is about 60% higher than that of the second oldest drive (42601 versus 26475 on sdd). That’s not necessarily a problem, but what really worries me is that its Spin_Up_Time raw value is more than five times that of the other drives (5741 versus roughly 1000). The Start_Stop_Count is also the highest on sdc.
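
To put numbers on that comparison, here is a small Python sketch that divides each raw value on sdc by the highest raw value among the other three drives. The figures are copied from the smartctl output above; the dictionary layout and the function name are only illustrative.

```python
# RAW_VALUE per attribute and drive, copied from the smartctl output above.
RAW_VALUES = {
    "Raw_Read_Error_Rate":   {"sda": 0,     "sdb": 0,     "sdc": 2,     "sdd": 0},
    "Spin_Up_Time":          {"sda": 1000,  "sdb": 1016,  "sdc": 5741,  "sdd": 1025},
    "Start_Stop_Count":      {"sda": 23,    "sdb": 30,    "sdc": 48,    "sdd": 37},
    "Power_On_Hours":        {"sda": 13603, "sdb": 24607, "sdc": 42601, "sdd": 26475},
    "Offline_Uncorrectable": {"sda": 0,     "sdb": 0,     "sdc": 2,     "sdd": 0},
}


def ratio_to_highest_other(drive, raw_values):
    """For each attribute, divide the drive's raw value by the highest raw
    value among the other drives. None means every other drive reports 0."""
    ratios = {}
    for name, per_drive in raw_values.items():
        highest_other = max(v for d, v in per_drive.items() if d != drive)
        ratios[name] = per_drive[drive] / highest_other if highest_other else None
    return ratios


ratios = ratio_to_highest_other("sdc", RAW_VALUES)
for name, ratio in ratios.items():
    if ratio is None:
        print(f"{name}: {RAW_VALUES[name]['sdc']} on sdc, 0 on every other drive")
    else:
        print(f"{name}: {ratio:.1f}x the highest other drive")
```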

Through the normal wear and tear of a hard drive, sectors can go bad and become unusable. Modern hard drives normally have a number of hidden, unused sectors reserved for this situation. When the hard drive controller detects a failure to update a sector, it remaps the sector address to one of the unused reserved sectors instead. A sector that can be neither read nor remapped is counted as uncorrectable. Offline_Uncorrectable reports how many such sectors the drive found during its offline scans, so a value above zero means the drive already holds sectors that can no longer be used.

Currently, none of the attributes have crossed a threshold indicating potential failure, and, interestingly, the non-zero raw values of attributes 1 and 198 don’t appear to be reflected in the VALUE field, which still reads 200 for both. I also run regular, nightly S.M.A.R.T. tests on all drives and no issues have been reported from the tests. There is also nothing in the system logs about any recent errors or warnings; the only issue is that the server is, overall, a little sluggish. Here’s a snippet of the self-test log as retrieved from the hard drive itself:

$ sudo smartctl -l selftest /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
...
# 8  Short offline       Completed without error       00%     42414         -
# 9  Extended offline    Completed without error       00%     42395         -
...
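
Checking a log like this by eye gets tedious across four drives, so here is a rough Python sketch that parses the self-test rows and confirms every one completed without error. It assumes the column layout shown above, with columns separated by at least two spaces, and only covers the two rows visible in the snippet; the names are my own.

```python
import re
from typing import NamedTuple

# The two self-test rows visible in the snippet above (the elided rows are not guessed at).
SELFTEST_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 8  Short offline       Completed without error       00%     42414         -
# 9  Extended offline    Completed without error       00%     42395         -
"""

ROW = re.compile(
    r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)$"
)


class SelfTest(NamedTuple):
    num: int
    description: str
    status: str
    remaining_percent: int
    lifetime_hours: int
    first_error_lba: str   # "-" when no error was recorded


def parse_selftest_log(text):
    """Return one SelfTest per row that starts with '#'; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.rstrip())
        if match:
            num, desc, status, remaining, hours, lba = match.groups()
            entries.append(SelfTest(int(num), desc, status, int(remaining), int(hours), lba))
    return entries


entries = parse_selftest_log(SELFTEST_LOG)
problems = [e for e in entries if e.status != "Completed without error"]
print(f"{len(entries)} tests parsed, {len(problems)} with problems")
```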

If there were any failed attributes or tests, I would normally get an email about it as well as have the event logged by syslog. I have the smartd daemon installed and configured to run regular tests with something similar to this:

$ cat /etc/smartd.conf
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

This says enable S.M.A.R.T. and monitor hard drives /dev/sd[a-d]. Run a short test every night at 2 AM and an extended (long) test every Saturday at 3 AM (the 6 in the schedule is the day of the week, counted from 1 for Monday to 7 for Sunday). If there are any warnings or errors, report via syslog and send an email to root. I normally have most daemons send critical mail like this to root and make root an alias that redirects the email to all the primary system administrators.

Well, next up, I plan to take a look at running a full-fledged benchmark on the drives and replacing any drives that don’t quite meet expectations.

North Winds

A Look into the Life of a Developer and Sysadmin
