NFC Tags and Nexus 4 Woes

For those who prefer fortune cookies to novels: “If you want a happy life, only buy NFC Tag Types 1, 2, 3, or 4. MIFARE Classic NFC tags don’t talk with all devices.”

A while back, when I was first researching NFC and NFC tags, I was a little disappointed to realize that typical NFC tags use a proprietary protocol called MIFARE Classic on top of the standard base NFC protocol.

"It uses an NXP proprietary security protocol...This means, only devices
with an NXP nfc controller chip can read/write these tags."

-- http://en.wikipedia.org/wiki/MIFARE#MIFARE_Classic

Well, it turns out we now have our first casualty of war: the Nexus 4 (and Nexus 10) do not use an NXP NFC controller chip, unlike previous devices such as the Nexus 7 and Galaxy Nexus. Instead, they use a Broadcom NFC controller chip which doesn’t support NXP’s proprietary protocol. I’ve also heard that the NFC chips used in both Windows Phone 8 and BlackBerry devices do not support MIFARE Classic either.

Most other NFC communication seems to be pretty standardized. Android Beam relies on pure NFC Forum protocols with an Android-specific protocol on top, and it still works. PayPass also works through Google Wallet, as it’s based on existing, published EMV smartcard standards. I have successfully used both with my Nexus 4. However, if I want to use NFC tags with my Nexus 4, I need to stick with “standard” NFC Forum Type 1-4 tags, which are part of the official NFC specification.

Unfortunately, MIFARE tags are much more common and might even be a little cheaper. They also tend to include more storage space: a 1K MIFARE Classic tag has 752 bytes of usable storage, whereas a typical NFC Forum Type 1 or 2 tag offers 96-144 bytes. However, tags up to 4096 bytes are supported by the standard NFC Forum protocols, and I have seen Topaz Type 1 tags as large as 512 bytes for sale.

Searching Amazon, I can find a lot of NFC tags, but it’s harder to reliably weed out the MIFARE Classic tags from the results. I’ve read that any tags claiming to be specifically for Windows Phone 8 are safe, since WP8 doesn’t support MIFARE Classic, but something in me just doesn’t like that solution. I have switched to purchasing my NFC tags from tagstand; any of their tags that don’t say MIFARE are standard NFC Forum tags, and I’ve successfully used them with my Nexus 4. Andy Tags also seems to be a good place to look, but I have not yet tried them out.

Tags labeled NTAG203, Topaz, DESFire, and FeliCa are standard NFC Forum tags and should work. Tags labeled MIFARE Ultralight C (the C in the name is important) also work, but be careful: MIFARE Ultralight (without the C) uses the same protocol as MIFARE Classic. Some of the well-known proprietary MIFARE Classic tags to avoid include Samsung TecTiles, Tags for Droid, and the Movaluate tag.

Diagnosing a slow server

Our server at Alzatex has been acting slow for a while now, but I have been unable to find a cause until now. With what I found out today, I’m beginning to think that we just have an old, slow hard drive in our RAID6. What do you think?

$ sudo hdparm -tT --direct /dev/sd[abcd]
/dev/sda:
 Timing O_DIRECT cached reads:   420 MB in  2.00 seconds = 209.87 MB/sec
 Timing O_DIRECT disk reads:  318 MB in  3.01 seconds = 105.60 MB/sec

/dev/sdb:
 Timing O_DIRECT cached reads:   492 MB in  2.00 seconds = 245.53 MB/sec
 Timing O_DIRECT disk reads:  268 MB in  3.10 seconds =  86.40 MB/sec

/dev/sdc:
 Timing O_DIRECT cached reads:   408 MB in  2.01 seconds = 203.34 MB/sec
 Timing O_DIRECT disk reads:  146 MB in  3.12 seconds =  46.76 MB/sec

/dev/sdd:
 Timing O_DIRECT cached reads:   478 MB in  2.01 seconds = 238.25 MB/sec
 Timing O_DIRECT disk reads:  272 MB in  3.01 seconds =  90.50 MB/sec

sdc’s cached read time looks OK, but its raw disk read is less than half that of the next-slowest drive. Next up, I tried looking at the S.M.A.R.T. attributes of each drive. I’ve trimmed the output to only the most interesting attributes.

$ sudo smartctl -A /dev/sda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   240   236   021    Pre-fail  Always       -       1000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13603
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdb
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   236   021    Pre-fail  Always       -       1016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       30
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24607
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

$ sudo smartctl -A /dev/sdc
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0003   225   224   021    Pre-fail  Always       -       5741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       48
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42601
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       2

$ sudo smartctl -A /dev/sdd
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   238   021    Pre-fail  Always       -       1025
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       37
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26475
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0

First, a quick review of S.M.A.R.T. attributes for the uninitiated. RAW_VALUE is the actual value of the attribute, such as degrees Celsius or hours powered on, whereas VALUE is the attribute normalized to a scale of 1-253. The normalized value allows easy analysis without much regard to the meaning of the attribute: 1 represents the worst case and 253 the best. RAW_VALUE, on the other hand, often increases as things get worse, as with a count of bad sectors. WORST is the lowest VALUE ever recorded. There are two flavors of attributes: old-age and pre-fail. When a pre-fail attribute’s VALUE crosses the manufacturer-defined threshold (THRESH), failure is imminent. Old-age attributes just indicate wear and tear and don’t usually have a threshold.
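
For a quick pass/fail summary derived from those thresholds, smartctl can also print the drive’s overall self-assessment. This is just the standard health check; the output shown is typical, not captured from this server:

$ sudo smartctl -H /dev/sdc
SMART overall-health self-assessment test result: PASSED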

After reviewing the S.M.A.R.T. attributes, it’s clear there are issues with sdc. It is the only drive showing a non-zero Raw_Read_Error_Rate as well as a non-zero Offline_Uncorrectable. Its Power_On_Hours, which really just indicates age, is nearly double that of the second-oldest drive. That’s not necessarily a problem, but what really worries me is that its Spin_Up_Time is much higher than on the other drives. The Start_Stop_Count is also highest on sdc.

Through the normal wear and tear of a hard drive, sectors can go bad and become unusable. Modern hard drives normally keep a number of hidden, unused sectors in reserve for this situation. When the drive controller detects a failure to update a sector, it remaps that sector’s address to one of the unused reserved sectors instead. When a sector fails but there are no free reserved sectors left for remapping, you get an uncorrectable sector. So when the value of Offline_Uncorrectable goes above zero, it means you have more bad sectors than the manufacturer originally reserved, and those sectors can no longer be used.
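
The remapping activity itself is tracked by a couple of attributes I trimmed from the listings above. If you want to pull just those out, something like this works; Reallocated_Sector_Ct and Current_Pending_Sector are the standard attribute names, though the exact naming can vary by vendor:

$ sudo smartctl -A /dev/sdc | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'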

Currently, none of the attributes have crossed a threshold indicating potential failure, and, interestingly, attributes 1 and 198 don’t appear to be reflected in the VALUE field. I also run regular, nightly S.M.A.R.T. tests on all drives, and none of them have reported any issues. There is also nothing in the system logs about recent errors or warnings; the only symptom is that the server is, overall, a little sluggish. Here’s a snippet of the self-test log as retrieved from the hard drive itself:

$ sudo smartctl -l selftest /dev/sdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
...
# 8  Short offline       Completed without error       00%     42414         -
# 9  Extended offline    Completed without error       00%     42395         -
...

If there were any failed attributes or tests, I would normally get an email about it, as well as have the event logged by syslog. I have the smartd daemon installed and configured to run regular tests with something similar to this:

$ cat /etc/smartd.conf
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

This says to enable S.M.A.R.T. and monitor hard drives /dev/sd[a-d], run a short test every night at 2 AM and an extended (long) test every Saturday at 3 AM (smartd counts days of the week from 1 = Monday, so the 6 in the schedule is Saturday), and, if there are any warnings or errors, report via syslog and send an email to root. I normally have most daemons send critical mail like this to root and make root an alias that redirects the email to all the primary system administrators.
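
For the alias itself, a single line in /etc/aliases (followed by a run of newaliases) is all it takes on a sendmail-compatible MTA; the addresses below are placeholders:

$ grep '^root:' /etc/aliases
root: alice@example.com, bob@example.com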

Well, next up, I plan to run a full-fledged benchmark on the drives and replace any that don’t quite meet expectations.
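
For a first pass at the benchmark, a direct sequential read over a decent chunk of each drive should be enough to confirm the hdparm numbers. Here is a rough sketch with dd; it only reads, so it’s safe on live disks, though it will add I/O load:

$ sudo dd if=/dev/sdc of=/dev/null bs=1M count=4096 iflag=direct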

Monitoring Dæmons with CFEngine 3

I’ve been looking for a nice, simple method for verifying that all key services and dæmons are running on my UNIX servers. It’s pretty rare for a service to die, but when it does, I want it restarted as soon as possible and I want to be notified about it! I’ve looked at process supervisors like daemontools and runit that are designed to run continuously and monitor the dæmons they start, but they tend to require a little more effort to maintain than I’d like. Normally, the sysadmin has to write the start-up script that sets up the environment and starts the dæmon, but the script also must somehow coerce the dæmon not to do its normal double-fork and disassociate from its parent process and foreground terminal. I mainly just want a way to check for a process that’s not running and run the appropriate command to restart it.

CFEngine 3 seems to be a little closer to what I want and after reading through the excellent Learning CFEngine 3 book from O’Reilly, I think I’ve finally figured out the right recipe. All I want to do is to specify a process to look for and, if that process is not running, to specify a command to run that will restart the process. I would also like a report if a process ever needs to be restarted since that normally represents an abnormal event. Here’s the basic configuration I have to monitor a few services:

  any::
    "services[ssh][name]"           string => "OpenSSH";
    "services[ssh][process]"        string => "/usr/sbin/sshd";
    "services[ssh][restart]"        string => "/usr/sbin/invoke-rc.d ssh restart";

  web|dns::
    "services[ldap][name]"          string => "OpenLDAP";
    "services[ldap][process]"       string => "/usr/sbin/slapd";
    "services[ldap][restart]"       string => "/usr/sbin/invoke-rc.d slapd restart";
    "services[bind][name]"          string => "BIND";
    "services[bind][process]"       string => "/usr/sbin/named";
    "services[bind][restart]"       string => "/usr/sbin/invoke-rc.d bind9 restart";

  web::
    "services[apache][name]"        string => "Apache";
    "services[apache][process]"     string => "/usr/sbin/apache2";
    "services[apache][restart]"     string => "/usr/sbin/invoke-rc.d apache2 restart";

And that’s it! The above says that every host must be running OpenSSH, the web and dns servers must be running OpenLDAP and BIND, and the web server must be running Apache.  It also gives the name of the process to look for and the command necessary to restart the service if that process isn’t running.  I just repeat the same three lines for each service that I want to monitor and place the correct classes in front to select which servers will run those services.  Here’s the full file to run this:

body common control
{
    bundlesequence => { "services" };
    inputs => { "cfengine_stdlib.cf" };
}

bundle agent services
{
vars:
  any::
    "services[ssh][name]"           string => "OpenSSH";
    "services[ssh][process]"        string => "/usr/sbin/sshd";
    "services[ssh][restart]"        string => "/usr/sbin/invoke-rc.d ssh restart";

  web|dns::
    "services[ldap][name]"          string => "OpenLDAP";
    "services[ldap][process]"       string => "/usr/sbin/slapd";
    "services[ldap][restart]"       string => "/usr/sbin/invoke-rc.d slapd restart";
    "services[bind][name]"          string => "BIND";
    "services[bind][process]"       string => "/usr/sbin/named";
    "services[bind][restart]"       string => "/usr/sbin/invoke-rc.d bind9 restart";

  web::
    "services[apache][name]"        string => "Apache";
    "services[apache][process]"     string => "/usr/sbin/apache2";
    "services[apache][restart]"     string => "/usr/sbin/invoke-rc.d apache2 restart";

  any::
    "services" slist => getindices("services");

processes:
    "$(services[$(services)][process])"
        restart_class => "service_$(services)_restart";

commands:
    "$(services[$(services)][restart])"
        classes => if_notkept("service_$(services)_failed"),
        ifvarclass => "service_$(services)_restart";

reports:
    "$(services[$(services)][name]) is not running, restarting..."
        ifvarclass => "service_$(services)_restart";

    "$(services[$(services)][name]) failed to start!"
        ifvarclass => "service_$(services)_failed";
}

All the configuration for each service lives entirely in the vars: section.  The processes: section searches for each process listed in the services array and defines a class if that process was not found.  The commands: section runs the appropriate command to restart the service if the corresponding class was set by processes:.  Normally, CFEngine is silent, but the reports: section will generate output that goes into an email and log file if a process needed to be restarted or if there were any errors restarting it.

This solution isn’t perfect: unlike a real process supervisor, CFEngine does not get immediate notification when a process dies, much less the cause of death, but it does offer reasonable response time.  My example also doesn’t handle rate limiting or cap how many attempts are made to restart a dæmon, though CFEngine does have some amount of built-in rate limiting.  Overall, I’ve found this solution simple to maintain and scale, and it’s now live and running on my servers.
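
How quickly a dead dæmon gets restarted is bounded by how often cf-agent runs.  A minimal sketch of the scheduling, assuming cf-agent is driven from cron rather than the usual cf-execd, with paths that may differ by packaging:

# /etc/cron.d/cfengine (hypothetical)
*/5 * * * * root /var/cfengine/bin/cf-agent -K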

Revocation doesn’t work: an alternate take

I’ve never been a big fan of the Certificate Authority system used by web browsers for HTTPS, but we can’t throw it away just yet.  The alternative I’d like to see take hold is storing a certificate or public-key fingerprint in DNS along with the A/AAAA/SRV record for the service being accessed.  This record would be secured using DNSSEC and would have to be fresh.  With this, the certificate has the same lifetime as the associated DNS record, and revocation happens by rotating public keys.  This eliminates the need to maintain separate hierarchies for DNS and SSL certificates and puts revocation into the hands of the person in charge of DNS.
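
For what it’s worth, this is roughly the idea that the DANE TLSA record type standardizes.  A hypothetical record pinning the SHA-256 digest of a server’s public key might look like this (the name is made up and the digest is left as a placeholder):

_443._tcp.www.example.com. IN TLSA 3 1 1 <sha256-digest-of-the-server-public-key>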

Unfortunately, this is not a viable option yet, so we still need to maintain the existing system.  Adam Langley wrote an interesting article on the certificate revocation support in existing browsers.  I agree with him that it should be improved upon, but I believe there is a better solution.  He proposed reducing the lifetime of certificates to a matter of days as an alternative to supplying up-to-date revocation information.  The problem with this is that the CA must continually re-sign updated certificates for all of its users, which requires more computing power, and it also needs a way to automate the process and distribute the updated certificates to users.  If something goes wrong, a user may be stuck with an expired certificate, which is worse than the current situation.

A better way to distribute that load would be to rely on more intermediate Certificate Authorities.  Unfortunately, with current software there is no way to further restrict the authority of an intermediate CA, so access to its private key has to be controlled as tightly as the issuing CA’s.  Now, if there were a new critical extension for CA certificates that, say, only allowed the holder to sign certificates for a specific domain name, then the issuing CA could delegate the task of signing SSL certificates to the site using them.  That site could then keep a secure, off-line machine with this restricted intermediate CA that signs a new SSL certificate daily without requiring regular communication with the issuing CA.
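
To sketch what such a restriction might look like, the closest existing analogue is the X.509 name constraints extension.  A hypothetical openssl.cnf fragment for a domain-limited intermediate is below; treat it as illustrative only, since spotty client support is exactly the problem discussed next:

[ v3_restricted_intermediate ]
basicConstraints = critical, CA:TRUE, pathlen:0
keyUsage         = critical, keyCertSign, cRLSign
nameConstraints  = critical, permitted;DNS:.example.com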

Again, this idea is not very viable either, as all kinds of software would need to be updated to support the new critical extension, and absent proper support the result would be a failed SSL connection due to an unsupported critical extension.  I believe the best solution is simply to improve the distribution and caching of the existing revocation information.  Certificates normally include one or more URLs pointing to either CRLs or OCSP servers, and at least one CRL or OCSP server must be accessible to validate a certificate.

OCSP is a protocol for querying the revocation status of individual certificates.  The OCSP server will assert either that a certificate is valid, that it is revoked, or that nothing is known about the certificate in question.  This assertion has a lifetime associated with it and can be cached until either that lifetime expires or the browser decides it wants fresh data.
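
As a concrete example, you can make an OCSP query by hand with the openssl command-line tool; the file names and URL here are placeholders:

$ openssl ocsp -issuer ca.pem -cert server.pem -url http://ocsp.example-ca.com -text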

CRLs are much simpler: a CRL is nothing more than a file, signed by the CA, containing a list of all revoked certificates.  Like an OCSP response, it has a lifetime stored with it and covered by the CA’s signature.  Lifetimes for CRLs range from a few days or less up to 30 days or more.  A CRL only has to be signed once during its lifetime for all certificates issued by a CA, which puts much less strain on the CA than re-signing all of its issued certificates.  Also, since a CRL is a simple file, it can be distributed in many different ways; the URL included in the certificate is merely a convenient place to find it.  Proxy servers can be set up to cache old, but still valid, copies of the CRL, and alternative repositories can be set up for finding missing CRLs.  A mechanism could even be set up to auto-detect and download a copy of the CRL from the end site the browser is connecting to, perhaps an optional TLS extension that presents a cached copy of all relevant CRLs to the browser, or a standard, non-secure HTTP URL where a site could host copies of the CRLs.  A web server could then be set up to retrieve any CRLs relevant to its SSL service.
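
Since a CRL is just a signed file, checking its lifetime is straightforward with openssl; the filename is a placeholder, and some CAs publish PEM rather than DER:

$ openssl crl -in ca.crl -inform DER -noout -lastupdate -nextupdate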

Say a given CRL is regenerated once a day and has a lifetime of 7 days: a web server could attempt to retrieve the latest copy once a day, retrying once an hour on failure.  Even after six days of continual failure, the web server would still have a valid CRL to offer.  That leaves plenty of margin for a website to retrieve the latest CRLs for its HTTPS service, and when a client connects, it could retrieve those CRLs directly from the web server even during a DoS attack against the CA.  This method is resilient, since all the necessary information is now retrievable from one location; backwards-compatible with the existing system, since CAs will still host their CRLs; and secure, since all the critical information is signed directly by the CA.
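
A minimal sketch of the web-server side, assuming a plain daily cron job and a hypothetical CA URL; the cached copy is only replaced if the download succeeds, so a stale-but-valid CRL survives an outage:

# /etc/cron.d/refresh-crl (hypothetical)
0 4 * * * root wget -q -O /var/cache/crl/ca.crl.tmp http://crl.example-ca.com/ca.crl && mv /var/cache/crl/ca.crl.tmp /var/cache/crl/ca.crl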

Making System Administration Easy for the UNIX Sysadmin

I think it’s about time I start collecting my random thoughts and arranging them into something useful.  I’ve been collecting various scripts, configuration templates, and procedures I’ve written into an internal wiki for my own benefit, but I’d love to publish it in blog format and get some feedback.  To kick things off, I’m planning a series on making system administration easier for the UNIX sysadmin.  My first post will be on automating the installation of a Linux distribution.

I’ve been working on making it easier to deploy additional Linux and Windows workstations, with the goal of treating them as disposable machines.  After seeing what my users can do to their machines in about 5 minutes, I’m thinking just replacing them may be better for both of us.  Other topics will include automating Windows installation, deploying software updates, and managing system configuration.  Later on I’d like to post some guides to setting up LDAP, Samba, NFS, and Kerberos.  There are also some less common topics I can cover, like setting up OpenAFS, deploying WPA2 Enterprise with FreeRADIUS, and managing a private Public-Key Infrastructure within an organization without shelling out money or relying on self-signed certificates.  I also have some experience with a few different backup systems and have done two bare-metal restorations, plus a number of selective restores when a user decided to experiment a little.

If you have any requests or recommendations on topics I should cover, please leave a comment below.