Mark McBride

Simplified SMART and Btrfs Reporting

Maintaining one data storage device is trivial. Maintaining twenty is tedious. This post outlines my solution for getting a simple view of SMART and Btrfs attributes from many disks across many servers with one command.

The Report

Source: github.com/markmcb/storage-scripts

Let’s start with the result. Here’s what I get emailed to me twice a week, or on-demand if I’m ever investigating something storage-related.

CORE up 2 days on 5.4.12-200.fc31.x86_64

DEV  CAPAC  AT  TMP  AGE  SLFTEST  PREFAIL
sda  466Gi  8   24C  1.8  6/043/.  ....
sdh  466Gi  15  27C  1.8  6/104/.  ....
sdd  932Gi  3   33C  1.6  3/012/.  ....
sdb  7.3Ti  0   30C  2.7  5/072/.  ........
sdc  7.3Ti  2   33C  1.3  4/103/.  .......
sde  7.3Ti  4   29C  2.7  1/010/.  ........
sdf  7.3Ti  6   29C  2.7  1/042/.  ........
sdg  7.3Ti  10  29C  2.7  7/072/.  ........

BTRFS_PATH  SIZE  AVAIL  USE%  SCB  ERR
/           111G  75G    32%   3    .
/mnt/ops    932G  356G   54%   28   .
/mnt/store  19T   4.4T   76%   18   .

BTRFS_PATH  MAP        DEVICE   WRFCG
/           luks-8506  nvme0n1  .....
/mnt/ops    luks-a4aa  sda      .....
/mnt/ops    luks-fd8b  sdh      .....
/mnt/ops    luks-0b9c  sdd      .....
/mnt/store  luks-6e45  sde      .....
/mnt/store  luks-f5bd  sdf      .....
/mnt/store  luks-fe18  sdb      .....
/mnt/store  luks-d375  sdg      .....
/mnt/store  luks-cb43  sdc      .....

In the actual report, that block of text is repeated several times, once for each multi-disk server I monitor.

Plain Text

As you probably noticed, the report is all text. This is important for two reasons:

  1. Mobile: With text only, if I keep the output 45’ish characters wide, I can use a non-microscopic fixed-width font in an HTML email report that I look at on my phone. This avoids lines that wrap and break the column alignment.
  2. Headless: In addition to getting a mobile report, I’d like to be able to run the reports whenever I like. As I have no graphical desktop environment (i.e., headless) on my servers, they must produce only text output such that I can run them from my shell.

With that in mind, let’s break down what we’re looking at.

Recycled Content

This report is a bash script using recycled command output. It’s 100% generated by manipulating the output of common commands like lsblk, smartctl, btrfs and others, and then using some simple logic and formatting to get them into a denser view. The sections below walk through how the data is captured and assembled.

The first line is basic server info. No real magic here.

CORE up 2 days on 5.4.12-200.fc31.x86_64

The value of the $HOST environment variable with the domain removed is shown first and identifies which server the tables below belong to. The uptime shown after the server’s name is the result of uptime | sed -E 's/.*(up[^,]*),.*/\1/'. No real utility, just interesting to know. The last part is the Linux kernel version, which is a result of uname -r. This is very important in the context of Btrfs. The pace of development on that file system is fast and furious, so the kernel version will clarify what features are/aren’t available on that server.
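Put together, the whole header line is just a handful of substitutions. A rough sketch (assuming $HOST is set as described; the script’s exact formatting and case handling differ slightly):

# Hostname without the domain, the trimmed uptime, and the kernel release
printf "%s %s on %s\n" \
  "${HOST%%.*}" \
  "$(uptime | sed -E 's/.*(up[^,]*),.*/\1/')" \
  "$(uname -r)"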

Table 1: Block Device & SMART Information

The first table focuses on block devices (and not file systems).

DEV  CAPAC  AT  TMP  AGE  SLFTEST  PREFAIL
sda  466Gi  8   24C  1.8  6/043/.  ....
sdh  466Gi  15  27C  1.8  6/104/.  ....
sdd  932Gi  3   33C  1.6  3/012/.  ....
sdb  7.3Ti  0   30C  2.7  5/072/.  ........
sdc  7.3Ti  2   33C  1.3  4/103/.  .......
sde  7.3Ti  4   29C  2.7  1/010/.  ........
sdf  7.3Ti  6   29C  2.7  1/042/.  ........
sdg  7.3Ti  10  29C  2.7  7/072/.  ........

DEV

The device (DEV) column is gathered from lsblk -l -o NAME,TYPE -n | grep disk. This gives us the values we’ll loop through in the script to collect the remaining information.
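In script form, that becomes the outer loop of the table. A minimal sketch (variable names here are illustrative, not necessarily what the script uses):

for dev in $(lsblk -l -o NAME,TYPE -n | grep disk | awk '{print $1}'); do
  # each of the per-device lookups below (capacity, path, SMART data) runs in here
  echo "collecting data for /dev/${dev}"
done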

CAPAC, AT

The capacity of the device (CAPAC) and path to the device (AT) are accessible from Linux in the /dev and /sys locations.

To get the capacity, I use this one-liner: echo "($(cat /sys/block/sda/size)*512)" | bc | numfmt --to=iec-i, which grabs the raw value from /sys, then uses bc to multiply by 512, and finally numfmt to give a nice Gi or Ti suffix depending on the value.

To get the path, there is a little bit of complexity. I collect the value at /dev/disk/by-path and then use a regex (that I pass via an argument so it can be server specific) to reduce it to a single number: find -L /dev/disk/by-path/ -samefile /dev/sda | sed -E "s/.*\///" | sed -E "s${pathsubst}". For example, the find command will return /dev/disk/by-path/pci-0000:02:00.0-sas-phy8-lun-0, which has a lot of info that’s not useful. The first sed reduces it to pci-0000:02:00.0-sas-phy8-lun-0, and the second, when given /.*phy([[:digit:]]+).*/\1/ as an arg, will result in 8.
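Wrapped up as a function, it might look like this (a sketch; the function name and argument handling here are mine):

# Hypothetical helper: the second argument is the server-specific substitution,
# e.g. '/.*phy([[:digit:]]+).*/\1/'
disk_location() {
  local dev="$1" pathsubst="$2"
  find -L /dev/disk/by-path/ -samefile "/dev/${dev}" \
    | sed -E "s/.*\///" \
    | sed -E "s${pathsubst}"
}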

Why is this useful? When something fails, you usually want to act with certainty, especially if you’re pulling it “hot,” i.e., while the system is operational. If I wanted to pull sda for whatever reason, I could reference this report, physically locate the disk, and pull it.

TMP, AGE, PREFAIL

The temperature (TMP), years of power-on time (AGE), and failure flags (PREFAIL) are all derived from the smartctl -A /dev/sda command, which produces this output:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.12-200.fc31.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       15645
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       44
177 Wear_Leveling_Count     0x0013   097   097   000    Pre-fail  Always       -       38
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   076   050   000    Old_age   Always       -       24
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       36
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       48496243602

The temperature is gathered simply by looking for ID 190 or 194 and grabbing the 10th column with awk, i.e., smartctl -A /dev/sda | grep -m1 -E "^19(0|4)" | awk '{print $10}', and then appending a “C”.

A similar approach is taken for the age, but with some division to go from hours to years.
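For example, for sda (Power_On_Hours raw value 15645), the math works out to roughly 1.8 years, which is what the report shows. A sketch of the calculation, assuming a straight divide by 8760 hours per year:

hours=$(smartctl -A /dev/sda | grep -m1 "Power_On_Hours" | awk '{print $10}')
printf "%.1f\n" "$(echo "${hours}/8760" | bc -l)"   # 15645 / 8760 ≈ 1.8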

The Pre-fail flags work like this. For each “Pre-fail” SMART attribute, take VALUE and subtract THRESH. If the result is 0 or less, display an x; otherwise display a .. So in the report we see ...., which corresponds to the 4 pre-fail attributes we see above. They’re all . because every value is well above its pre-failure threshold. If the report showed x... then I’d know to go check the SMART attributes and see what’s below the threshold.
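The whole check fits in a short awk expression. Something like this captures the idea (the real script builds the string per device inside its loop):

# Column 4 is VALUE, column 6 is THRESH, column 7 is TYPE
smartctl -A /dev/sda \
  | awk '$7 == "Pre-fail" { printf "%s", ($4 - $6 <= 0) ? "x" : "." } END { print "" }'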

Alternatively, I can run the report and pass an argument to show SMART pre-fail attributes for all disks side-by-side.

storage_report -d -C -F
DEV  CAPAC  RRE  TP   SUT  RSC  SER  STP  SRC  HL   WLC  URB  RBB
sda  466Gi  ---  ---  ---  90   ---  ---  ---  ---  97   90   90
sdh  466Gi  ---  ---  ---  90   ---  ---  ---  ---  97   90   90
sdd  932Gi  ---  ---  ---  90   ---  ---  ---  ---  99   90   90
sdb  7.3Ti  84   76   127  95   33   108  40   75   ---  ---  ---
sdc  7.3Ti  84   73   160  95   33   108  40   ---  ---  ---  ---
sde  7.3Ti  84   78   129  95   33   108  40   75   ---  ---  ---
sdf  7.3Ti  84   77   128  95   33   108  40   75   ---  ---  ---
sdg  7.3Ti  84   77   130  95   33   108  40   75   ---  ---  ---

I like this view. The numbers shown are VALUE minus THRESH, i.e., 0 or less is bad. --- means the attribute isn’t relevant for that device. I sort by storage capacity as it puts similar drives near each other. So if the pre-fail values for my 7.3Ti drives were 80, 80, 80, 80, 20, then I might investigate the last disk as it would appear to be trending toward failure faster than the others.

Since those headers can be hard to remember, I added an easy way to transpose the table and use long names for each row with -y. In this example I’ve also included the Old_age attributes with -O.

storage_report -d -C -F -O -y
DEVICE                   sda    sdh    sdd    sdb    sdc    sde    sdf    sdg
CAPACITY                 466Gi  466Gi  932Gi  7.3Ti  7.3Ti  7.3Ti  7.3Ti  7.3Ti
Raw_Read_Error_Rate      ---    ---    ---    84     84     84     84     84
Throughput_Performance   ---    ---    ---    76     73     78     77     77
Spin_Up_Time             ---    ---    ---    127    160    129    128    130
Reallocated_Sector_Ct    90     90     90     95     95     95     95     95
Seek_Error_Rate          ---    ---    ---    33     33     33     33     33
Seek_Time_Performance    ---    ---    ---    108    108    108    108    108
Spin_Retry_Count         ---    ---    ---    40     40     40     40     40
Helium_Level             ---    ---    ---    75     ---    75     75     75
Wear_Leveling_Count      97     97     99     ---    ---    ---    ---    ---
Used_Rsvd_Blk_Cnt_Tot    90     90     90     ---    ---    ---    ---    ---
Runtime_Bad_Block        90     90     90     ---    ---    ---    ---    ---
Start_Stop_Count         ---    ---    ---    100    100    100    100    100
Power_On_Hours           96     96     97     97     99     97     97     97
Power_Cycle_Count        99     99     99     100    100    100    100    100
Program_Fail_Cnt_Total   90     90     90     ---    ---    ---    ---    ---
Erase_Fail_Count_Total   90     90     90     ---    ---    ---    ---    ---
Uncorrectable_Error_Cnt  100    100    100    ---    ---    ---    ---    ---
Airflow_Temperature_Cel  76     73     68     ---    ---    ---    ---    ---
Power-Off_Retract_Count  ---    ---    ---    62     70     63     62     62
Load_Cycle_Count         ---    ---    ---    62     70     63     62     62
Temperature_Celsius      ---    ---    ---    206    203    206    200    214
ECC_Error_Rate           200    200    200    ---    ---    ---    ---    ---
Reallocated_Event_Count  ---    ---    ---    100    100    100    100    100
Current_Pending_Sector   ---    ---    ---    100    100    100    100    100
Offline_Uncorrectable    ---    ---    ---    100    100    100    100    100
CRC_Error_Count          100    100    100    200    200    200    200    200
POR_Recovery_Count       99     99     99     ---    ---    ---    ---    ---
Total_LBAs_Written       99     99     99     ---    ---    ---    ---    ---

This detailed view is great while investigating, but to answer “should I investigate anything?” I came up with the . or x approach.

SLFTEST

The self-test (SLFTEST) column uses the command smartctl -l selftest /dev/sda, which produces:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.12-200.fc31.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15487         -
# 2  Short offline       Completed without error       00%     15319         -
# 3  Short offline       Completed without error       00%     15151         -
# 4  Short offline       Completed without error       00%     14983         -
# 5  Short offline       Completed without error       00%     14815         -
# 6  Short offline       Completed without error       00%     14647         -
# 7  Extended offline    Completed without error       00%     14605         -
# 8  Short offline       Completed without error       00%     14479         -
# 9  Short offline       Completed without error       00%     14310         -
#10  Short offline       Completed without error       00%     14142         -

I have my short tests scheduled to run once a week and extended tests scheduled every 3 months, such that every day and every month some disk is running its self-tests.

The first two values of 6/043/. are days since the last short and extended tests, respectively. The days are calculated by taking the Power_On_Hours SMART attribute (shown in the previous section) and subtracting from it the LifeTime(hours) of the most recent short and extended tests in the table, then converting the remaining hours to days. If no test has been run, - or --- is shown instead.

As with the pre-fail flags, the last character is a . if the last test result is Completed without error and an x otherwise.

So 6/043/. means a short test ran 6 days ago, an extended test ran 43 days ago, and the last test completed without error.
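As an illustration, here’s roughly how the short-test age for sda could be computed, assuming the newest test is the first “Short offline” line in the log above:

# days since last short self-test = (Power_On_Hours - LifeTime at test time) / 24
poh=$(smartctl -A /dev/sda | grep -m1 "Power_On_Hours" | awk '{print $10}')
last_short=$(smartctl -l selftest /dev/sda | grep -m1 "Short offline" | awk '{print $(NF-1)}')
echo $(( (poh - last_short) / 24 ))   # (15645 - 15487) / 24 = 6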

Table 2: Btrfs Filesystem Information

This table focuses on the Btrfs file systems, reporting where they’re mounted, their size and available space, the last time they were scrubbed, and whether any errors were reported during the last scrub.

BTRFS_PATH  SIZE  AVAIL  USE%  SCB  ERR
/           111G  75G    32%   3    .
/mnt/ops    932G  356G   54%   28   .
/mnt/store  19T   4.4T   76%   18   .

For this report we loop through all Btrfs mount points. We get them all into an array with paths=( $(df -T | grep btrfs | sed -E "s/.*% //" | tr '\n' ' ') ).

BTRFS_PATH, SIZE, AVAIL, USE%

The first four columns come from a single command: df -h -t btrfs --output=size,avail,pcent ${path} | tail -n1. That returns the size of the mount point, its remaining available space, and its usage as a percentage; the path itself is just the ${path} loop variable.
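Inside the loop over ${paths[@]}, that might look something like this (a sketch; the real script handles column alignment more carefully):

for path in "${paths[@]}"; do
  read -r size avail pcent <<< "$(df -h -t btrfs --output=size,avail,pcent "${path}" | tail -n1)"
  printf "%-11s %-5s %-6s %-4s\n" "${path}" "${size}" "${avail}" "${pcent}"
done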

SCB, ERR

The last two columns come from variations of btrfs scrub status. The days since the last scrub (SCB) are obtained with "$(( ($(date +%s) - $(date --date="$(btrfs scrub status ${path} | grep "started" | sed -E "s/Scrub started: *//")" +%s) )/(60*60*24) ))", which converts today’s date to seconds since the epoch, converts the last scrub start to seconds since the epoch, subtracts the two, then converts the resulting seconds to days. The errors (ERR) column simply checks whether “no errors found” appears in the scrub status and reports a . if so and an x otherwise: [[ $(btrfs scrub status ${path} | grep "summary") == *"no errors found"* ]] && echo -n "." || echo -n "x".
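Spelled out over multiple lines, the same logic is a bit easier to read:

# Days since the last scrub started, plus a ./x error flag (equivalent to the one-liners above)
scrub_start=$(btrfs scrub status "${path}" | grep "started" | sed -E "s/Scrub started: *//")
scrub_days=$(( ( $(date +%s) - $(date --date="${scrub_start}" +%s) ) / 86400 ))
if [[ $(btrfs scrub status "${path}" | grep "summary") == *"no errors found"* ]]; then
  err="."
else
  err="x"
fi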

Table 3: Btrfs Device Information

This table focuses on a few per-device statuses Btrfs tracks for each physical device that is part of the file system.

BTRFS_PATH  MAP        DEVICE   WRFCG
/           luks-8506  nvme0n1  .....
/mnt/ops    luks-a4aa  sda      .....
/mnt/ops    luks-fd8b  sdh      .....
/mnt/ops    luks-0b9c  sdd      .....
/mnt/store  luks-6e45  sde      .....
/mnt/store  luks-f5bd  sdf      .....
/mnt/store  luks-fe18  sdb      .....
/mnt/store  luks-d375  sdg      .....
/mnt/store  luks-cb43  sdc      .....

To better understand this report table, let’s look at the raw output of btrfs device stats. This command expects 1 arg so let’s consider our first multi-disk path:

btrfs device stats /mnt/ops
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].write_io_errs    0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].read_io_errs     0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].flush_io_errs    0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].corruption_errs  0
[/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632].generation_errs  0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].write_io_errs    0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].read_io_errs     0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].flush_io_errs    0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].corruption_errs  0
[/dev/mapper/luks-fd8b8b99-7171-4c66-a8b9-bffe96e2c8af].generation_errs  0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].write_io_errs    0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].read_io_errs     0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].flush_io_errs    0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].corruption_errs  0
[/dev/mapper/luks-0b9c674e-1520-4c6c-aa8d-291f3b304a1f].generation_errs  0

This output is simple, but very verbose when several disks are involved. Five counters are shown for each device. In my case, the devices are all /dev/mapper devices since I’ve used full disk encryption with LUKS. It’s not immediately obvious which /dev/sdX devices they map to. As far as the numbers go, 0=good. Anything else is bad.

Just like in the previous table, we use df to get a list of all the Btrfs mount points.

BTRFS_PATH, MAP

As we loop through the mount points, we output each one as BTRFS_PATH. We then collect all the devices listed in [square brackets] from the btrfs device stats command and remove duplicates. To keep output minimal, I decided to keep only the first 9 characters, e.g., luks-a4aa, in the report, but a -v option exists to show the full path.
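A sketch of that extraction (the exact command chain here is mine; the script may differ in the details):

# Pull the bracketed device names, deduplicate, strip the path, keep 9 characters
btrfs device stats /mnt/ops \
  | sed -E 's/^\[([^]]+)\].*/\1/' \
  | sort -u \
  | sed -E 's|.*/||' \
  | cut -c1-9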

DEVICE

In order to show sda alongside luks-a4aa, we need something that associates the two. Once again lsblk has what we need, though it requires a bit of manipulation. Let’s consider just one device to illustrate the challenge.

lsblk -n -i -p -o NAME,TYPE,UUID /dev/sda
/dev/sda                                                 disk  a4aa4856-393b-4620-8c76-88c883cdb632
`-/dev/mapper/luks-a4aa4856-393b-4620-8c76-88c883cdb632 crypt 709f7252-4bdb-43a2-9601-7ee94c15d501

You can see the graphical connection, but it’s much easier to match and look things up when it’s all on one line. lsblk -n -i -p -o NAME,TYPE,UUID | while read line; do printf "%s" "$([[ "${line}" == *"disk"* ]] && printf $'\n'"%s" "$line" || printf "%s " "$line")"; done does exactly that. With that output, we can grep a line for luks-a4aa... and then pipe the result through awk '{printf $1}' to get the first column of the results.
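Putting it together, a lookup for one mapper name could look like this (a sketch built on the same one-liner):

# Flatten each disk and its children onto one line, then find the parent of luks-a4aa...
flat=$(lsblk -n -i -p -o NAME,TYPE,UUID | while read line; do
  [[ "${line}" == *"disk"* ]] && printf $'\n'"%s " "${line}" || printf "%s " "${line}"
done)
echo "${flat}" | grep "luks-a4aa" | awk '{printf $1}'   # prints /dev/sda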

WRFCG

This column is just the first letter of write_io_errs, read_io_errs, flush_io_errs, corruption_errs, and generation_errs. As with other tables in this report, I check the values of each and report . if the value is 0 and x if it’s greater than 0.
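For a single device, the flag string could be produced like this (a sketch, not necessarily the script’s exact code):

# Five counters per device: 0 becomes ".", anything else becomes "x"
btrfs device stats /mnt/ops \
  | grep "luks-a4aa" \
  | awk '{printf "%s", ($NF == 0) ? "." : "x"} END {print ""}'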

More Than One Server In The Report

The report as viewed on an iPhone.

So everything above describes the steps to get a report locally on a single server. But I have three servers. One solution would be to have each server run its own report and email it to me, but then I’d get three emails. While not terrible, I’m far more likely to actually scan the report if it arrives as a single email.

To accomplish this single email approach, I have one server scheduled via a cronjob to run a script that runs the report locally, and then ssh into each server I’m interested in, run the report, and pull the results into a single email. I also leverage this script to make a multi-part email as I’ve found most email clients will render plain text email with a non-fixed-width font. I use the html portion of the email to explicitly declare a monospace font-family, which ensures the report columns stay aligned and readable.

To ssh into all my servers in the cron’ed script, I use the keychain command, which interfaces with ssh-agent and allows me to ssh to my other servers without entering my ssh key’s password each time.

Consider this snippet. The source calls ensure I’m using an existing keychain instance. The first plain_text_report assignment is a local one. The second is over ssh.

source "/home/${USER}/.keychain/${HOSTNAME}-sh"
plain_text_report="CORE $(uptime -p) on $(uname -r)"$'\n\n'
plain_text_report+="ECHO $(source "/home/${USER}/.keychain/${HOSTNAME}-sh"; ssh -l ${USER} -i /home/${USER}/.ssh/id_ed25519 10.0.1.202 "uptime -p")"

Final Notes

This isn’t a particularly fast report. For my purposes, running it every few days or on demand, it works just fine and it’s very hackable. There are some obvious areas that could be optimized (e.g., caching smartctl calls) that maybe I’ll get to someday. If you’d like to improve it, it’s on GitHub.