Mark McBride

Simplified SMART and Btrfs Reporting

Maintaining one data storage device is trivial. Maintaining twenty is tedious. This post outlines my solution for getting a simple view of SMART and Btrfs attributes from many disks across many servers with one command.

The Report

Source: github.com/markmcb/storage-scripts

Let’s start with the result. Here’s what I get emailed to me twice a week, or on-demand if I’m ever investigating something storage-related.

CORE up 2 days on 5.4.12-200.fc31.x86_64

DEV  CAPAC  AT  TMP  AGE  SLFTEST  PREFAIL
sda  466Gi  8   24C  1.8  6/043/.  ....
sdh  466Gi  15  27C  1.8  6/104/.  ....
sdd  932Gi  3   33C  1.6  3/012/.  ....
sdb  7.3Ti  0   30C  2.7  5/072/.  ........
sdc  7.3Ti  2   33C  1.3  4/103/.  .......
sde  7.3Ti  4   29C  2.7  1/010/.  ........
sdf  7.3Ti  6   29C  2.7  1/042/.  ........
sdg  7.3Ti  10  29C  2.7  7/072/.  ........

BTRFS_PATH  SIZE  AVAIL  USE%  SCB  ERR
/           111G  75G    32%   3    .
/mnt/ops    932G  356G   54%   28   .
/mnt/store  19T   4.4T   76%   18   .

BTRFS_PATH  MAP        DEVICE   WRFCG
/           luks-8506  nvme0n1  .....
/mnt/ops    luks-a4aa  sda      .....
/mnt/ops    luks-fd8b  sdh      .....
/mnt/ops    luks-0b9c  sdd      .....
/mnt/store  luks-6e45  sde      .....
/mnt/store  luks-f5bd  sdf      .....
/mnt/store  luks-fe18  sdb      .....
/mnt/store  luks-d375  sdg      .....
/mnt/store  luks-cb43  sdc      .....

In the actual report, that block of text is repeated several times, once for each multi-disk server I monitor.

Plain Text

As you probably noticed, the report is all text. This is important for two reasons:

  1. Mobile: With text only, if I keep the output 45’ish characters wide, I can use a non-microscopic fixed-width font in an HTML email report that I look at on my phone. This avoids lines that wrap and break the column alignment.
  2. Headless: In addition to getting a mobile report, I’d like to be able to run the reports whenever I like. As I have no graphical desktop environment (i.e., headless) on my servers, they must produce only text output such that I can run them from my shell.

With that in mind, let’s break down what we’re looking at.

Recycled Content

This report is a bash script using recycled command output. It’s 100% generated by manipulating the output of common commands like lsblk, smartctl, btrfs and others, and then using some simple logic and formatting to get them into a denser view. The sections below walk through how the data is captured and assembled.

The first line is basic server info. No real magic here.

CORE up 2 days on 5.4.12-200.fc31.x86_64

The value of the $HOST environment variable with the domain removed is shown first and identifies which server the tables below belong to. The uptime shown after the server’s name is the result of uptime | sed -E 's/.*(up[^,]*),.*/\1/'. No real utility, just interesting to know. The last part is the Linux kernel version, which is a result of uname -r. This is very important in the context of Btrfs. The pace of development on that file system is fast and furious, so the kernel version will clarify what features are/aren’t available on that server.
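Put together, the whole header line is just a handful of substitutions. A rough sketch (assuming $HOST is set as described; the script’s exact formatting and case handling differ slightly):

# Hostname without the domain, the trimmed uptime, and the kernel release
printf "%s %s on %s\n" \
  "${HOST%%.*}" \
  "$(uptime | sed -E 's/.*(up[^,]*),.*/\1/')" \
  "$(uname -r)"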

Table 1: Block Device & SMART Information

The first table focuses on block devices (and not file systems).

DEV  CAPAC  AT  TMP  AGE  SLFTEST  PREFAIL
sda  466Gi  8   24C  1.8  6/043/.  ....
sdh  466Gi  15  27C  1.8  6/104/.  ....
sdd  932Gi  3   33C  1.6  3/012/.  ....
sdb  7.3Ti  0   30C  2.7  5/072/.  ........
sdc  7.3Ti  2   33C  1.3  4/103/.  .......
sde  7.3Ti  4   29C  2.7  1/010/.  ........
sdf  7.3Ti  6   29C  2.7  1/042/.  ........
sdg  7.3Ti  10  29C  2.7  7/072/.  ........

DEV

The device (DEV) column is gathered from lsblk -l -o NAME,TYPE -n | grep disk. This gives us the values we’ll loop through in the script to collect the remaining information.
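In script form, that becomes the outer loop of the table. A minimal sketch (variable names here are illustrative, not necessarily what the script uses):

for dev in $(lsblk -l -o NAME,TYPE -n | grep disk | awk '{print $1}'); do
  # each of the per-device lookups below (capacity, path, SMART data) runs in here
  echo "collecting data for /dev/${dev}"
done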

CAPAC, AT

The capacity of the device (CAPAC) and path to the device (AT) are accessible from Linux in the /dev and /sys locations.

To get the capacity, I use this one-liner: echo "($(cat /sys/block/sda/size)*512)" | bc | numfmt --to=iec-i, which grabs the raw value from /sys, then uses bc to multiply by 512, and finally numfmt to give a nice Gi or Ti suffix depending on the value.

To get the path, there is a little bit of complexity. I collect the value at /dev/disk/by-path and then use a regex (that I pass via an argument so it can be server specific) to reduce it to a single number: find -L /dev/disk/by-path/ -samefile /dev/sda | sed -E "s/.*\///" | sed -E "s${pathsubst}". For example, the find command will return /dev/disk/by-path/pci-0000:02:00.0-sas-phy8-lun-0, which has a lot of info that’s not useful. The first sed reduces it to pci-0000:02:00.0-sas-phy8-lun-0, and the second, when given /.*phy([[:digit:]]+).*/\1/ as an arg, will result in 8.
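Wrapped up as a function, it might look like this (a sketch; the function name and argument handling here are mine):

# Hypothetical helper: the second argument is the server-specific substitution,
# e.g. '/.*phy([[:digit:]]+).*/\1/'
disk_location() {
  local dev="$1" pathsubst="$2"
  find -L /dev/disk/by-path/ -samefile "/dev/${dev}" \
    | sed -E "s/.*\///" \
    | sed -E "s${pathsubst}"
}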

Why is this useful? When something fails, you usually want to act with certainty, especially if you’re pulling it “hot,” i.e., while the system is operational. If I wanted to pull sda for whatever reason, I could reference this report, physically locate the disk, and pull it.

TMP, AGE, PREFAIL

The temperature (TMP), years of power-on time (AGE), and failure flags (PREFAIL) are all derived from the smartctl -A /dev/sda command, which produces this output:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.12-200.fc31.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       15645
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       44
177 Wear_Leveling_Count     0x0013   097   097   000    Pre-fail  Always       -       38
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   076   050   000    Old_age   Always       -       24
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       36
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       48496243602

The temperature is gathered simply by looking for ID 190 or 194 and grabbing the 10th column with awk, i.e., smartctl -A /dev/sda | grep -m1 -E "^19(0|4)" | awk '{print $10}', and then appending a “C”.

A similar approach is taken for the age, but with some division to go from hours to years.
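For example, for sda (Power_On_Hours raw value 15645), the math works out to roughly 1.8 years, which is what the report shows. A sketch of the calculation, assuming a straight divide by 8760 hours per year:

hours=$(smartctl -A /dev/sda | grep -m1 "Power_On_Hours" | awk '{print $10}')
printf "%.1f\n" "$(echo "${hours}/8760" | bc -l)"   # 15645 / 8760 ≈ 1.8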

The Pre-fail flags work like this. For each “Pre-fail” SMART attribute, take VALUE and subtract THRESH. If the result is 0 or less, display an x; otherwise display a .. So in the report we see ...., which corresponds to the 4 pre-fail attributes we see above. They’re all . because every value is well above its pre-failure threshold. If the report showed x... then I’d know to go check the SMART attributes and see what’s below the threshold.
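The whole check fits in a short awk expression. Something like this captures the idea (the real script builds the string per device inside its loop):

# Column 4 is VALUE, column 6 is THRESH, column 7 is TYPE
smartctl -A /dev/sda \
  | awk '$7 == "Pre-fail" { printf "%s", ($4 - $6 <= 0) ? "x" : "." } END { print "" }'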

Alternatively, I can run the report and pass an argument to show SMART pre-fail attributes for all disks side-by-side.

storage_report -d -C -F
DEV  CAPAC  RRE  TP   SUT  RSC  SER  STP  SRC  HL   WLC  URB  RBB
sda  466Gi  ---  ---  ---  90   ---  ---  ---  ---  97   90   90
sdh  466Gi  ---  ---  ---  90   ---  ---  ---  ---  97   90   90
sdd  932Gi  ---  ---  ---  90   ---  ---  ---  ---  99   90   90
sdb  7.3Ti  84   76   127  95   33   108  40   75   ---  ---  ---
sdc  7.3Ti  84   73   160  95   33   108  40   ---  ---  ---  ---
sde  7.3Ti  84   78   129  95   33   108  40   75   ---  ---  ---
sdf  7.3Ti  84   77   128  95   33   108  40   75   ---  ---  ---
sdg  7.3Ti  84   77   130  95   33   108  40   75   ---  ---  ---

I like this view. The numbers shown are VALUE minus THRESH, i.e., 0 or less is bad. --- means the attribute isn’t relevant for that device. I sort by storage capacity as it puts similar drives near each other. So if the pre-fail values for my 7.3Ti drives were 80, 80, 80, 80, 20, then I might investigate the last disk as it would appear to be trending toward failure faster than the others.

Since those headers can be hard to remember, I added an easy way to transpose the table and use long names for each row with -y. In this example I’ve also included the Old_age attributes with -O.

storage_report -d -C -F -O -y
DEVICE                   sda    sdh    sdd    sdb    sdc    sde    sdf    sdg
CAPACITY                 466Gi  466Gi  932Gi  7.3Ti  7.3Ti  7.3Ti  7.3Ti  7.3Ti
Raw_Read_Error_Rate      ---    ---    ---    84     84     84     84     84
Throughput_Performance   ---    ---    ---    76     73     78     77     77
Spin_Up_Time             ---    ---    ---    127    160    129    128    130
Reallocated_Sector_Ct    90     90     90     95     95     95     95     95
Seek_Error_Rate          ---    ---    ---    33     33     33     33     33
Seek_Time_Performance    ---    ---    ---    108    108    108    108    108
Spin_Retry_Count         ---    ---    ---    40     40     40     40     40
Helium_Level             ---    ---    ---    75     ---    75     75     75
Wear_Leveling_Count      97     97     99     ---    ---    ---    ---    ---
Used_Rsvd_Blk_Cnt_Tot    90     90     90     ---    ---    ---    ---    ---
Runtime_Bad_Block        90     90     90     ---    ---    ---    ---    ---
Start_Stop_Count         ---    ---    ---    100    100    100    100    100
Power_On_Hours           96     96     97     97     99     97     97     97
Power_Cycle_Count        99     99     99     100    100    100    100    100
Program_Fail_Cnt_Total   90     90     90     ---    ---    ---    ---    ---
Erase_Fail_Count_Total   90     90     90     ---    ---    ---    ---    ---
Uncorrectable_Error_Cnt  100    100    100    ---    ---    ---    ---    ---
Airflow_Temperature_Cel  76     73     68     ---    ---    ---    ---    ---
Power-Off_Retract_Count  ---    ---    ---    62     70     63     62     62
Load_Cycle_Count         ---    ---    ---    62     70     63     62     62
Temperature_Celsius      ---    ---    ---    206    203    206    200    214
ECC_Error_Rate           200    200    200    ---    ---    ---    ---    ---
Reallocated_Event_Count  ---    ---    ---    100    100    100    100    100
Current_Pending_Sector   ---    ---    ---    100    100    100    100    100
Offline_Uncorrectable    ---    ---    ---    100    100    100    100    100
CRC_Error_Count          100    100    100    200    200    200    200    200
POR_Recovery_Count       99     99     99     ---    ---    ---    ---    ---
Total_LBAs_Written       99     99     99     ---    ---    ---    ---    ---

This detailed view is great while investigating, but to answer “should I investigate anything?” I came up with the . or x approach.

SLFTEST

The self-test (SLFTEST) column uses the command smartctl -l selftest /dev/sda, which produces:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.12-200.fc31.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15487         -
# 2  Short offline       Completed without error       00%     15319         -
# 3  Short offline       Completed without error       00%     15151         -
# 4  Short offline       Completed without error       00%     14983         -
# 5  Short offline       Completed without error       00%     14815         -
# 6  Short offline       Completed without error       00%     14647         -
# 7  Extended offline    Completed without error       00%     14605         -
# 8  Short offline       Completed without error       00%     14479         -
# 9  Short offline       Completed without error       00%     14310         -
#10  Short offline       Completed without error       00%     14142         -

I have my short tests scheduled to run once a week and extended tests scheduled every 3 months, such that every day and every month some disk is running its self-tests.

The first two values of 6/043/. are days since the last short and extended tests, respectively. The days are calculated by taking the Power_On_Hours SMART attribute (shown in the previous section) and subtracting from it the LifeTime(hours) of the most recent short and extended tests in the table, then converting the remaining hours to days. If no test has been run, - or --- is shown instead.

As with the pre-fail flags, the last character is a . if the last test result is Completed without error and an x otherwise.

So 6/043/. means a short test ran 6 days ago, an extended test ran 43 days ago, and the last test completed without error.
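As an illustration, here’s roughly how the short-test age for sda could be computed, assuming the newest test is the first “Short offline” line in the log above:

# days since last short self-test = (Power_On_Hours - LifeTime at test time) / 24
poh=$(smartctl -A /dev/sda | grep -m1 "Power_On_Hours" | awk '{print $10}')
last_short=$(smartctl -l selftest /dev/sda | grep -m1 "Short offline" | awk '{print $(NF-1)}')
echo $(( (poh - last_short) / 24 ))   # (15645 - 15487) / 24 = 6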

Table 2: Btrfs Filesystem Information

This table focuses on the Btrfs file systems, reporting where they’re mounted, their size and available space, the last time they were scrubbed, and whether any errors were reported during the last scrub.

BTRFS_PATH  SIZE  AVAIL  USE%  SCB  ERR
/           111G  75G    32%   3    .
/mnt/ops    932G  356G   54%   28   .
/mnt/store  19T   4.4T   76%   18   .

For this report we loop through all Btrfs mount points. We get them all into an array with paths=( $(df -T | grep btrfs | sed -E "s/.*% //" | tr '\n' ' ') ).

BTRFS_PATH, SIZE, AVAIL, USE%

The first four columns come from a single command: df -h -t btrfs --output=size,avail,pcent ${path} | tail -n1. That returns the size of the mount point, its remaining available space, and its usage as a percentage; the path itself is just the ${path} loop variable.
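Inside the loop over ${paths[@]}, that might look something like this (a sketch; the real script handles column alignment more carefully):

for path in "${paths[@]}"; do
  read -r size avail pcent <<< "$(df -h -t btrfs --output=size,avail,pcent "${path}" | tail -n1)"
  printf "%-11s %-5s %-6s %-4s\n" "${path}" "${size}" "${avail}" "${pcent}"
done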

SCB, ERR

The last two columns come from variations of btrfs scrub status. The days since the last scrub (SCB) are obtained with "$(( ($(date +%s) - $(date --date="$(btrfs scrub status ${path} | grep "started" | sed -E "s/Scrub started: *//")" +%s) )/(60*60*24) ))", which converts today’s date to seconds since the epoch, converts the last scrub start to seconds since the epoch, subtracts the two, then converts the resulting seconds to days. The errors (ERR) column simply checks whether “no errors found” appears in the scrub status and reports a . if so and an x otherwise: [[ $(btrfs scrub status ${path} | grep "summary") == *"no errors found"* ]] && echo -n "." || echo -n "x".
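Spelled out over multiple lines, the same logic is a bit easier to read:

# Days since the last scrub started, plus a ./x error flag (equivalent to the one-liners above)
scrub_start=$(btrfs scrub status "${path}" | grep "started" | sed -E "s/Scrub started: *//")
scrub_days=$(( ( $(date +%s) - $(date --date="${scrub_start}" +%s) ) / 86400 ))
if [[ $(btrfs scrub status "${path}" | grep "summary") == *"no errors found"* ]]; then
  err="."
else
  err="x"
fi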

Table 3: Btrfs Device Information

This table focuses on a few per-device statuses Btrfs tracks for each physical device that is part of the file system.

BTRFS_PATH  MAP        DEVICE   WRFCG
/           luks-8506  nvme0n1  .....
/mnt/ops    luks-a4aa  sda      .....
/mnt/ops    luks-fd8b  sdh      .....
/mnt/ops    luks-0b9c  sdd      .....
/mnt/store  luks-6e45  sde      .....
/mnt/store  luks-f5bd  sdf      .....
/mnt/store  luks-fe18  sdb      .....
/mnt/store  luks-d375  sdg      .....
/mnt/store  luks-cb43  sdc      .....

To better understand this report table, let’s look at the raw output of btrfs device stats. This command expects 1 arg so let’s consider our first multi-disk path:

btrfs device stats /mnt/ops
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].write_io_errs    0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].read_io_errs     0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].flush_io_errs    0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].corruption_errs  0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].generation_errs  0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].write_io_errs    0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].read_io_errs     0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].flush_io_errs    0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].corruption_errs  0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].generation_errs  0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].write_io_errs    0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].read_io_errs     0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].flush_io_errs    0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].corruption_errs  0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].generation_errs  0

This output is simple, but very verbose when several disks are involved. Five counters are shown for each device. In my case, the devices are all /dev/mapper devices since I’ve used full disk encryption with LUKS. It’s not immediately obvious which /dev/sdX devices they map to. As far as the numbers go, 0=good. Anything else is bad.

Just like in the previous table, we use df to get a list of all the Btrfs mount points.

BTRFS_PATH, MAP

As we loop through the mount points, we output each one as BTRFS_PATH. We then collect all the devices listed in [square brackets] from the btrfs device stats command and remove duplicates. To keep output minimal, I decided to keep only the first 9 characters, e.g., luks-a4aa, in the report, but a -v option exists to show the full path.
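A sketch of that extraction (the exact command chain here is mine; the script may differ in the details):

# Pull the bracketed device names, deduplicate, strip the path, keep 9 characters
btrfs device stats /mnt/ops \
  | sed -E 's/^\[([^]]+)\].*/\1/' \
  | sort -u \
  | sed -E 's|.*/||' \
  | cut -c1-9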

DEVICE

In order to show sda alongside luks-a4aa, we need something that associates the two. Once again lsblk has what we need, though it requires a bit of manipulation. Let’s consider just one device to illustrate the challenge.

lsblk -n -i -p -o NAME,TYPE,UUID /dev/sda
/dev/sda                                                 disk  a4aa4856-393b-4620-8c76-88c883cdb632
`-/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632 crypt 709f7252-4bdb-43a2-9601-7ee94c15d501

You can see the graphical connection, but it’s much easier to match and look things up when it’s all on one line. lsblk -n -i -p -o NAME,TYPE,UUID | while read line; do printf "%s" "$([[ "${line}" == *"disk"* ]] && printf $'\n'"%s" "$line" || printf "%s " "$line")"; done does exactly that. With that output, we can grep a line for luks-a4aa... and then pipe the result through awk '{printf $1}' to get the first column of the results.
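Putting it together, a lookup for one mapper name could look like this (a sketch built on the same one-liner):

# Flatten each disk and its children onto one line, then find the parent of luks-a4aa...
flat=$(lsblk -n -i -p -o NAME,TYPE,UUID | while read line; do
  [[ "${line}" == *"disk"* ]] && printf $'\n'"%s " "${line}" || printf "%s " "${line}"
done)
echo "${flat}" | grep "luks-a4aa" | awk '{printf $1}'   # prints /dev/sda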

WRFCG

This column is just the first letter of write_io_errs, read_io_errs, flush_io_errs, corruption_errs, and generation_errs. As with other tables in this report, I check the values of each and report . if the value is 0 and x if it’s greater than 0.
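For a single device, the flag string could be produced like this (a sketch, not necessarily the script’s exact code):

# Five counters per device: 0 becomes ".", anything else becomes "x"
btrfs device stats /mnt/ops \
  | grep "luks-a4aa" \
  | awk '{printf "%s", ($NF == 0) ? "." : "x"} END {print ""}'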

More Than One Server In The Report

The report as viewed on an iPhone.

So everything above describes the steps to get a report locally on a single server. But I have three servers. One solution would be to have each server run its own report and email it to me, but then I’d get three emails. While not terrible, I’m far more likely to actually scan the report if it arrives as a single email.

To accomplish this single email approach, I have one server scheduled via a cronjob to run a script that runs the report locally, and then ssh into each server I’m interested in, run the report, and pull the results into a single email. I also leverage this script to make a multi-part email as I’ve found most email clients will render plain text email with a non-fixed-width font. I use the html portion of the email to explicitly declare a monospace font-family, which ensures the report columns stay aligned and readable.

To ssh into all my servers in the cron’ed script, I use the keychain command, which interfaces with ssh-agent and allows me to ssh to my other servers without entering my ssh key’s password each time.

Consider this snippet. The source calls ensure I’m using an existing keychain instance. The first plain_text_report assignment is a local one. The second is over ssh.

source "/home/${USER}/.keychain/${HOSTNAME}-sh"
plain_text_report="CORE $(uptime -p) on $(uname -r)"$'\n\n'
plain_text_report+="ECHO $(source "/home/${USER}/.keychain/${HOSTNAME}-sh"; ssh -l ${USER} -i /home/${USER}/.ssh/id_ed25519 10.0.1.202 "uptime -p")"

Final Notes

This isn’t a particularly fast report. For my purposes, running it every few days or on demand, it works just fine and it’s very hackable. There are some obvious areas that could be optimized (e.g., caching smartctl calls) that maybe I’ll get to someday. If you’d like to improve it, it’s on GitHub.