Maintenance#

Software#

Database#

PostgreSQL#

Important

All these instructions have been tested with PostgreSQL version 13 only!

Changing collate#

Sometimes you might create a new database without specifying the collation information. PostgreSQL will use the default collation and there is no way to change it once the database is created. The only solution is to dump the database, create a new one with the correct collation, import the dump into it and finally drop the original database.

See also

  • What is Collation in Databases? [1]
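
Before dumping, you can check the collation settings of the existing databases from psql (a quick check, run as a database superuser):

    SELECT datname, datcollate, datctype FROM pg_database;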

  1. backup old database

    pg_dump -U myuser olddb > olddb.bak.sql
    
  2. create new database with correct collate

    CREATE DATABASE newdb WITH OWNER myuser TEMPLATE template0 ENCODING UTF8 LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8';
    
  3. import dump into new database

    psql -U myuser -d newdb < olddb.bak.sql
    
  4. rename old database

    ALTER DATABASE olddb RENAME TO olddb_bak;
    
  5. rename the new database to the original database name

    ALTER DATABASE newdb RENAME TO olddb;
    
  6. restart services and check that everything works

  7. drop old database

    DROP DATABASE olddb_bak;
    
Moving data directory#

If the data directory in the root partition is getting too large, you can create a new partition (mounted on /postgresql in this example) and point PostgreSQL to that one instead.

  1. install the dependencies

    apt-get install rsync
    
  2. stop PostgreSQL

    systemctl stop postgresql
    
  3. copy the database directory

    rsync -avAX /var/lib/postgresql/13/main /postgresql/13
    
  4. change the data_directory setting in /etc/postgresql/13/main/postgresql.conf

    data_directory = '/postgresql/13/main'      # use data in another directory
    
  5. restart PostgreSQL

    systemctl start postgresql
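
    To confirm that PostgreSQL picked up the new location, query the running server (a quick check, assuming you can connect as the postgres superuser):

    psql -U postgres -c 'SHOW data_directory;'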
    

High availability#

In this example we configure two nodes. When the master node goes offline, the backup node takes over the floating IP address. We also replicate services such as Apache and Unbound on the backup node.

Keepalived is a tool that handles failover at layers 3 and 4 via VRRP. Below you can find the configuration file for the master node (the backup node needs only minor edits).

A script, shown further down, copies the content of the webservers, DNS server, etc. to the backup node and restarts those services automatically.

See also

  • Keepalived for Linux [2]

  • Is it possible to add a static mac address for a vrrp ip? · Issue #34 · osixia/docker-keepalived · GitHub [3]

  • Building Layer 3 High Availability | Documentation [4]

Key#

Node          IP              Hostname   Network interface name
MASTER        192.168.0.10    mst        eno1
BACKUP        192.168.0.11    bak        eno1
floating IP   192.168.0.100   -          -

Basic setup#

  1. install the dependencies. Keepalived must be installed on both nodes

    apt-get install keepalived rsync
    
  2. create the configuration for the master node

    /etc/keepalived/keepalived.conf#
    global_defs {
        max_auto_priority -1
    }

    ########################
    ## VRRP configuration ##
    ########################

    # Identify the VRRP instance as, in this case, "failover_link".
    vrrp_instance failover_link {

        # Initial state of the keepalived VRRP instance on this host
        # (MASTER or BACKUP). Once started, only priority matters.
        state MASTER

        # interface this VRRP instance is bound to.
        interface eno1

        # Arbitrary value between 1 and 255 to distinguish this VRRP
        # instance from others running on the same device. It must match
        # with other peering devices.
        virtual_router_id 1

        # Highest priority value takes the MASTER role and the
        # virtual IP (default value is 100).
        priority 110

        # Time, in seconds, between VRRP advertisements. The default is 1,
        # but in some cases you can achieve more reliable results by
        # increasing this value.
        advert_int 2

        use_vmac
        vmac_xmit_base

        # Authentication method: AH indicates ipsec Authentication Header.
        # It offers more security than PASS, which transmits the
        # authentication password in plaintext. Some implementations
        # have complained of problems with AH, so it may be necessary
        # to use PASS to get keepalived's VRRP working.
        #
        # The auth_pass will only use the first 8 characters entered.
        authentication {
            auth_type AH
            auth_pass f5K.*0Bq
        }

        # VRRP advertisements ordinarily go out over multicast. This
        # configuration parameter causes keepalived to send them
        # as unicasts. This specification can be useful in environments
        # where multicast isn't supported or in instances where you want
        # to limit which devices see your VRRP announcements. The IP
        # address(es) can be IPv4 or IPv6, and indicate the real IP of
        # other members.
        unicast_peer {
            192.168.0.11
        }
        # Virtual IP address(es) that will be shared among VRRP
        # members. "Dev" indicates the interface the virtual IP will
        # be assigned to. And "label" allows for clearer description of the
        # virtual IP.
        virtual_ipaddress {
            192.168.0.100 dev eno1 label eno1:vip
        }
    }
    

    Note

    Copy this file in the backup node as well and change:

    • state MASTER to state BACKUP

    • unicast_peer { 192.168.0.11 } to unicast_peer { 192.168.0.10 }

    • priority 110 to priority 100

  3. restart keepalived on both nodes

    systemctl restart keepalived
    
  4. ping the floating IP address

    ping -c1 192.168.0.100
    
  5. test failover by stopping Keepalived on the master node only and pinging the floating IP address; the backup node should keep answering. Finally, restart Keepalived

    systemctl stop keepalived
    ping -c1 192.168.0.100
    systemctl start keepalived
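
    You can verify which node currently holds the virtual IP by looking for it in the address list on each node; only the current MASTER should show it:

    ip address show | grep -F 192.168.0.100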
    

Service replication#

Make sure you are on a trusted network, because we allow root login via SSH to simplify operations. In this example we copy files from:

  • Apache

  • Unbound

  • dnscrypt-proxy

  • Certbot (Let’s encrypt)

The enabled_files directory on the master node contains files listing the files and directories that rsync will copy to the backup server.
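
The deploy script shown below also reads a systemd_deploy_services.txt file from the same directory, containing the names of the services to restart on the backup node on a single space-separated line. The exact service names depend on your setup; a hypothetical example:

    /home/jobs/scripts/by-user/root/keepalived/systemd_deploy_services.txt#
    apache2.service unbound.service dnscrypt-proxy.service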

  1. create the script

    /home/jobs/scripts/by-user/root/keepalived/keepalived_deploy.sh#
    #!/usr/bin/env bash
    #
    # keepalived_deploy.sh
    #
    # Copyright (C) 2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program.  If not, see <http://www.gnu.org/licenses/>.

    set -euo pipefail

    # Source the configuration file passed as the first argument.
    . "${1}"

    SRC='/'
    DST='/'
    ENABLED_FILES=$(find enabled_files/* -type f)
    SYSTEMD_DEPLOY_SERVICES=$(cat systemd_deploy_services.txt)

    # Sync files.
    for f in ${ENABLED_FILES}; do
        printf "%s\n" "${RSYNC_BASE} --files-from=${f} ${SRC} ${USER}@${HOST}:${DST}"
        ${RSYNC_BASE} --files-from="${f}" "${SRC}" "${USER}"@"${HOST}":"${DST}"
    done

    # Restart systemd services.
    ssh "${USER}"@"${HOST}" "\
        systemctl daemon-reload \
        && systemctl reenable --all ${SYSTEMD_DEPLOY_SERVICES} \
        && systemctl restart --all ${SYSTEMD_DEPLOY_SERVICES} \
        && systemctl status --all --no-pager ${SYSTEMD_DEPLOY_SERVICES}"
    
  2. create a configuration file

    /home/jobs/scripts/by-user/root/keepalived/keepalived_deploy.conf#
    # The --archive (-a) option's behavior does not imply --recursive (-r)
    # when used with --files-from, so specify it explicitly if you want it.
    # Use --dry-run to simulate.
    RSYNC_BASE='rsync -avAX -r --delete'
    USER='root'
    HOST='192.168.0.11'
    
  3. create an SSH key and save it as /root/.ssh/bak_root, the path referenced in the SSH configuration below. Do not set a passphrase for it

    ssh-keygen -t rsa -b 16384 -C "$(whoami)@$(hostname)-$(date +%F)" -f /root/.ssh/bak_root
    
  4. add the following to the SSH configuration

    /root/.ssh/config#
    Match host 192.168.0.11 user root
      IdentityFile=/root/.ssh/bak_root
    
  5. go to the backup node and copy the newly created public key in /root/.ssh/authorized_keys
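
    If password authentication for root is still possible at this point, ssh-copy-id can append the key for you; otherwise paste the contents of /root/.ssh/bak_root.pub into that file manually. Hypothetical one-liner, assuming the key path from step 3:

    ssh-copy-id -i /root/.ssh/bak_root.pub root@192.168.0.11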

  6. edit the SSH server configuration

    /etc/ssh/sshd_config#
    # [ ... ]

    PermitRootLogin yes
    # Add root to your existing AllowUsers line, if any.
    AllowUsers root
    Match user root
        PasswordAuthentication no

    # [ ... ]
    
  7. restart the SSH service in the backup node

    systemctl restart ssh
    
  8. go back to the master node and test if the key is working

    ssh root@192.168.0.11
    
  9. create a Systemd service unit file

    /home/jobs/services/by-user/root/keepalived-deploy.service#
    [Unit]
    Description=Copy files for keepalived
    Requires=network-online.target
    After=network-online.target

    [Service]
    Type=simple
    WorkingDirectory=/home/jobs/scripts/by-user/root/keepalived
    ExecStart=/home/jobs/scripts/by-user/root/keepalived/keepalived_deploy.sh /home/jobs/scripts/by-user/root/keepalived/keepalived_deploy.conf
    User=root
    Group=root
    
  10. create a Systemd timer unit file

    /home/jobs/services/by-user/root/keepalived-deploy.timer#
    [Unit]
    Description=Once every day copy files for keepalived

    [Timer]
    OnCalendar=*-*-* 5:30:00
    Persistent=true

    [Install]
    WantedBy=timers.target
    

Apache replication#

See also

  • How to redirect all pages to one page? [5]

  1. in your master node, separate replicable services from non-replicable ones. You can do this by splitting the configuration into multiple files and then including those files in the main one (/etc/apache2/apache2.conf).

  2. add these files in the master node.

    The first one copies the Apache configuration

    /home/jobs/scripts/by-user/root/keepalived/enabled_files/apache2.txt#
    /etc/apache2/apache2.conf
    /etc/apache2/replicated-servers.conf
    

    Important

    You must change the VirtualHost directive to use the floating IP, like this: <VirtualHost 192.168.0.100:443>
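
    For example, the skeleton of one replicated virtual host inside /etc/apache2/replicated-servers.conf would look like this (the ServerName is only an illustration):

    <IfModule mod_ssl.c>
    <VirtualHost 192.168.0.100:443>
        ServerName blog.franco.net.eu.org
        # [ ... ]
    </VirtualHost>
    </IfModule>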

    The second file copies the server data. You can replicate static data (Jekyll websites, plain HTML, etc…), but programs that rely on databases cannot be replicated without extra work

    /home/jobs/scripts/by-user/root/keepalived/enabled_files/replicated_webservers_data.txt#
    /var/www/franco.net.eu.org
    /var/www/assets.franco.net.eu.org
    /var/www/blog.franco.net.eu.org
    /var/www/docs.franco.net.eu.org
    /var/www/keepachangelog.franco.net.eu.org
    

    The third file copies all the HTTPS certificates

    /home/jobs/scripts/by-user/root/keepalived/enabled_files/letsencrypt.txt#
    /etc/letsencrypt
    
  3. in the backup node you must “patch” the non-replicable services. You can set up an error message for each server like this:

    /etc/apache2/standard-servers.conf#
    <IfModule mod_ssl.c>
    <VirtualHost 192.168.0.100:443>
        UseCanonicalName on
        Keepalive On
        SSLCompression      off
        ServerName software.franco.net.eu.org
        RewriteEngine On

        Include /etc/apache2/standard-servers-outage-text.conf

        Include /etc/letsencrypt/options-ssl-apache.conf
        SSLCertificateFile /etc/letsencrypt/live/software.franco.net.eu.org/fullchain.pem
        SSLCertificateKeyFile /etc/letsencrypt/live/software.franco.net.eu.org/privkey.pem
    </VirtualHost>
    </IfModule>
    
    /etc/apache2/standard-servers-outage-text.conf#
    DocumentRoot "/var/www/standard-servers"
    <Directory "/var/www/standard-servers">
        Options -ExecCGI -Includes -FollowSymLinks -Indexes
        AllowOverride None
        Require all granted
    </Directory>

    # Redirect all requests to the root directory of the virtual server.
    RewriteEngine On
    RewriteRule \/.+ / [L,R]
    

    Create the file /var/www/standard-servers/index.html with your outage message.

DNS replication#

I use dnscrypt-proxy as the DNS resolver and Unbound as the caching server. The systemd socket file is useful for setting the listening address and port.
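
A minimal sketch of the socket unit, assuming the Debian packaging's default listen address of 127.0.2.1 (adjust the address and port to whatever Unbound forwards to):

    /etc/systemd/system/dnscrypt-proxy.socket#
    [Socket]
    ListenStream=127.0.2.1:53
    ListenDatagram=127.0.2.1:53

    [Install]
    WantedBy=sockets.target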

See also

  • Unbound DNS server behind a VIP - solving reply from unexpected source [6]

  1. use these configurations to replicate the two services.

    /home/jobs/scripts/by-user/root/keepalived/enabled_files/dnscrypt-proxy.txt#
    /etc/dnscrypt-proxy/dnscrypt-proxy.toml
    /etc/systemd/system/dnscrypt-proxy.socket
    
    /home/jobs/scripts/by-user/root/keepalived/enabled_files/unbound.txt#
    /etc/unbound/unbound.conf
    

Important

Add interface-automatic: yes to the unbound configuration in the server section.
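
A minimal sketch of the relevant part of /etc/unbound/unbound.conf:

    server:
        # Answer from the same IP address the query arrived on; needed when
        # queries come in via the VRRP floating IP.
        interface-automatic: yes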

Final steps#

  1. run the deploy script

Kernel#

See also

  • filesystem - Where does update-initramfs look for kernel versions? - Ask Ubuntu [7]

RAID#

Run periodic RAID data scrubs on hard drives and SSDs.

See also

  • ubuntu - How to wipe md raid meta? - Unix & Linux Stack Exchange [8]

  • RAID data scrubbing [9]

  1. install the dependencies

    apt-get install mdadm python3-yaml python3-requests
    
  2. install fpyutils. See reference

  3. create the jobs directories. See reference

    mkdir -p /home/jobs/{scripts,services}/by-user/root
    
  4. create the script

    /home/jobs/scripts/by-user/root/mdadm_check.py#
    #!/usr/bin/env python3
    #
    # Copyright (C) 2014-2017 Neil Brown <neilb@suse.de>
    #
    #
    #    This program is free software; you can redistribute it and/or modify
    #    it under the terms of the GNU General Public License as published by
    #    the Free Software Foundation; either version 2 of the License, or
    #    (at your option) any later version.
    #
    #    This program is distributed in the hope that it will be useful,
    #    but WITHOUT ANY WARRANTY; without even the implied warranty of
    #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #    GNU General Public License for more details.
    #
    #    Author: Neil Brown
    #    Email: <neilb@suse.com>
    #
    # Copyright (C) 2019-2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    r"""Run RAID tests."""

    import collections
    import multiprocessing
    import os
    import pathlib
    import sys
    import time

    import fpyutils
    import yaml

    # Constants.
    STATUS_CLEAN = 'clean'
    STATUS_ACTIVE = 'active'
    STATUS_IDLE = 'idle'


    class UserNotRoot(Exception):
        """The user running the script is not root."""


    class NoAvailableArrays(Exception):
        """No available arrays."""


    class NoSelectedArraysPresent(Exception):
        """None of the arrays in the configuration file exists."""


    def get_active_arrays():
        active_arrays = list()
        with open('/proc/mdstat') as f:
            line = f.readline()
            while line:
                if STATUS_ACTIVE in line:
                    active_arrays.append(line.split()[0])
                line = f.readline()

        return active_arrays


    def get_array_state(array: str):
        return open('/sys/block/' + array + '/md/array_state').read().rstrip()


    def get_sync_action(array: str):
        return open('/sys/block/' + array + '/md/sync_action').read().rstrip()


    def run_action(array: str, action: str):
        with open('/sys/block/' + array + '/md/sync_action', 'w') as f:
            f.write(action)


    def main_action(array: str, config: dict):
        # 'devices' is inherited by the child process (multiprocessing fork
        # start method).
        action = devices[array]
        go = True
        while go:
            if get_sync_action(array) == STATUS_IDLE:
                message = 'running ' + action + ' on /dev/' + array + '. pid: ' + str(
                    os.getpid())
                run_action(array, action)
                message += '\n\n'
                message += 'finished pid: ' + str(os.getpid())
                print(message)

                if config['notify']['gotify']['enabled']:
                    m = config['notify']['gotify']['message'] + ' ' + '\n' + message
                    fpyutils.notify.send_gotify_message(
                        config['notify']['gotify']['url'],
                        config['notify']['gotify']['token'], m,
                        config['notify']['gotify']['title'],
                        config['notify']['gotify']['priority'])
                if config['notify']['email']['enabled']:
                    fpyutils.notify.send_email(
                        message, config['notify']['email']['smtp_server'],
                        config['notify']['email']['port'],
                        config['notify']['email']['sender'],
                        config['notify']['email']['user'],
                        config['notify']['email']['password'],
                        config['notify']['email']['receiver'],
                        config['notify']['email']['subject'])

                go = False
            if go:
                print('waiting ' + array + ' to be idle...')
                time.sleep(config['generic']['timeout_idle_check'])


    if __name__ == '__main__':
        if os.getuid() != 0:
            raise UserNotRoot

        configuration_file = sys.argv[1]
        config = yaml.load(open(configuration_file), Loader=yaml.SafeLoader)
        # Map of array names to actions ('check', 'repair', 'idle', 'ignore'),
        # as defined in the 'devices' section of the configuration file.
        devices = config['devices']

        active_arrays = get_active_arrays()
        dev_queue = collections.deque()
        if len(active_arrays) > 0:
            for dev in active_arrays:
                if pathlib.Path('/sys/block/' + dev + '/md/sync_action').is_file():
                    state = get_array_state(dev)
                    if state == STATUS_CLEAN or state == STATUS_ACTIVE or state == STATUS_IDLE:
                        # Skip arrays that are missing from the configuration
                        # file or that are marked as 'ignore'.
                        if dev in devices and devices[dev] != 'ignore':
                            dev_queue.append(dev)

        if len(active_arrays) == 0:
            raise NoAvailableArrays
        if len(dev_queue) == 0:
            raise NoSelectedArraysPresent

        while len(dev_queue) > 0:
            processes = list()
            for i in range(0, config['generic']['max_concurrent_checks']):
                if len(dev_queue) > 0:
                    ready = dev_queue.popleft()
                    p = multiprocessing.Process(target=main_action,
                                                args=(
                                                    ready,
                                                    config,
                                                ))
                    p.start()
                    processes.append(p)
            # Wait for the whole batch to finish before starting the next one.
            for p in processes:
                p.join()
    
  5. create a configuration file

    /home/jobs/scripts/by-user/root/mdadm_check.yaml#
    #
    # mdadm_check.yaml
    #
    # Copyright (C) 2014-2017 Neil Brown <neilb@suse.de>
    #
    #
    #    This program is free software; you can redistribute it and/or modify
    #    it under the terms of the GNU General Public License as published by
    #    the Free Software Foundation; either version 2 of the License, or
    #    (at your option) any later version.
    #
    #    This program is distributed in the hope that it will be useful,
    #    but WITHOUT ANY WARRANTY; without even the implied warranty of
    #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #    GNU General Public License for more details.
    #
    #    Author: Neil Brown
    #    Email: <neilb@suse.com>
    #
    # Copyright (C) 2019-2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)

    generic:
        # The maximum number of concurrent processes.
        max_concurrent_checks: 2

        # In seconds.
        timeout_idle_check: 10

    # key:      RAID array name without '/dev/'.
    # value:    action.
    devices:
        md1: 'check'
        md2: 'ignore'
        md3: 'check'

    notify:
        email:
            enabled: true
            smtp_server: 'smtp.gmail.com'
            port: 465
            sender: 'myusername@gmail.com'
            user: 'myusername'
            password: 'my awesome password'
            receiver: 'myusername@gmail.com'
            subject: 'mdadm operation'
        gotify:
            enabled: true
            url: '<gotify url>'
            token: '<app token>'
            title: 'mdadm operation'
            message: 'starting mdadm operation'
            priority: 5
    

    Important

    • do not prepend /dev to RAID device names

    • possible values: check, repair, idle, ignore

      • ignore will make the script skip the device

      • use repair at your own risk

    • absent devices are ignored

    • run these commands to get the names of RAID arrays

      lsblk
      cat /proc/mdstat
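
    Once the configuration file is in place you can run the script by hand to verify it before wiring up the timer (essentially the same invocation the service unit below uses); scrub progress is then visible in /proc/mdstat:

    python3 /home/jobs/scripts/by-user/root/mdadm_check.py /home/jobs/scripts/by-user/root/mdadm_check.yaml
    cat /proc/mdstat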
      
  6. create a Systemd service unit file

    /home/jobs/services/by-user/root/mdadm-check.service#
    [Unit]
    Description=mdadm check
    Requires=sys-devices-virtual-block-md1.device
    Requires=sys-devices-virtual-block-md2.device
    Requires=sys-devices-virtual-block-md3.device
    After=sys-devices-virtual-block-md1.device
    After=sys-devices-virtual-block-md2.device
    After=sys-devices-virtual-block-md3.device

    [Service]
    Type=simple
    ExecStart=/home/jobs/scripts/by-user/root/mdadm_check.py /home/jobs/scripts/by-user/root/mdadm_check.yaml
    User=root
    Group=root

    [Install]
    WantedBy=multi-user.target
    
  7. create a Systemd timer unit file

    /home/jobs/services/by-user/root/mdadm-check.timer#
    [Unit]
    Description=Once a month check mdadm arrays

    [Timer]
    OnCalendar=monthly
    Persistent=true

    [Install]
    WantedBy=timers.target
    
  8. fix the permissions

    chmod 700 /home/jobs/{scripts,services}/by-user/root
    
  9. run the deploy script

S.M.A.R.T.#

Run periodic S.M.A.R.T. tests on hard drives and SSDs. The provided script supports only /dev/disk/by-id names.

See also

  • A collection of scripts I have written and/or adapted that I currently use on my systems as automated tasks [10]

  1. install the dependencies

    apt-get install hdparm smartmontools python3-yaml python3-requests
    
  2. install fpyutils. See reference

  3. identify the drives whose S.M.A.R.T. values you want to check

    ls /dev/disk/by-id
    

    See also the udev rule file /lib/udev/rules.d/60-persistent-storage.rules. You can also use this command to get more details about a specific drive

    hdparm -I /dev/disk/by-id/${drive_name}
    # or
    hdparm -I /dev/sd${letter}
    
  4. create the jobs directories. See reference

    mkdir -p /home/jobs/{scripts,services}/by-user/root
    chmod 700 -R /home/jobs/{scripts,services}/by-user/root
    
  5. create the script

    /home/jobs/scripts/by-user/root/smartd_test.py#
    #!/usr/bin/env python3
    #
    # smartd_test.py
    #
    # Copyright (C) 2019-2021 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program.  If not, see <http://www.gnu.org/licenses/>.
    r"""Run S.M.A.R.T tests on hard drives."""

    import json
    import os
    import pathlib
    import re
    import shlex
    import subprocess
    import sys

    import fpyutils
    import yaml


    class UserNotRoot(Exception):
        """The user running the script is not root."""


    def get_disks() -> list:
        r"""Scan all the disks."""
        disks = list()
        for d in pathlib.Path('/dev/disk/by-id').iterdir():
            # Ignore disks ending with part-${integer} to avoid duplicates (names
            # corresponding to partitions of the same disk).
            disk = str(d)
            if re.match('.+-part[0-9]+$', disk) is None:
                try:
                    ddict = json.loads(
                        subprocess.run(
                            shlex.split('smartctl --capabilities --json ' +
                                        shlex.quote(disk)),
                            capture_output=True,
                            check=False,
                            shell=False,
                            timeout=30).stdout)
                    try:
                        # Check for smart test support.
                        if ddict['ata_smart_data']['capabilities'][
                                'self_tests_supported']:
                            disks.append(disk)
                    except KeyError:
                        pass
                except subprocess.TimeoutExpired:
                    print('timeout for ' + disk)
                except subprocess.CalledProcessError:
                    print('device ' + disk +
                          ' does not support S.M.A.R.T. commands, skipping...')

        return disks


    def disk_ready(disk: str, busy_status: int = 249) -> bool:
        r"""Check if the disk is ready."""
        # Raises a KeyError if the disk has no S.M.A.R.T. status capabilities.
        ddict = json.loads(
            subprocess.run(shlex.split('smartctl --capabilities --json ' +
                                       shlex.quote(disk)),
                           capture_output=True,
                           check=True,
                           shell=False,
                           timeout=30).stdout)
        if ddict['ata_smart_data']['self_test']['status']['value'] != busy_status:
            return True
        else:
            return False


    def run_test(disk: str, test_length: str = 'long') -> bytes:
        r"""Run the smartd test."""
        return subprocess.run(
            shlex.split('smartctl --test=' + shlex.quote(test_length) + ' ' +
                        shlex.quote(disk)),
            capture_output=True,
            check=True,
            shell=False,
            timeout=30).stdout


    if __name__ == '__main__':
        if os.getuid() != 0:
            raise UserNotRoot

        configuration_file = shlex.quote(sys.argv[1])
        config = yaml.load(open(configuration_file), Loader=yaml.SafeLoader)

        # Do not prepend '/dev/disk/by-id/'.
        disks_to_check = shlex.quote(sys.argv[2])
        disks_available = get_disks()

        for d in config['devices']:
            dev = '/dev/disk/by-id/' + d
            if config['devices'][d]['enabled'] and dev in disks_available:
                if disks_to_check == 'all' or disks_to_check == d:
                    if disk_ready(dev, config['devices'][d]['busy_status']):
                        print('attempting ' + d + ' ...')
                        message = run_test(
                            dev, config['devices'][d]['test']).decode('utf-8')
                        print(message)
                        if config['devices'][d]['log']:
                            if config['notify']['gotify']['enabled']:
                                m = config['notify']['gotify'][
                                    'message'] + ' ' + d + '\n' + message
                                fpyutils.notify.send_gotify_message(
                                    config['notify']['gotify']['url'],
                                    config['notify']['gotify']['token'], m,
                                    config['notify']['gotify']['title'],
                                    config['notify']['gotify']['priority'])
                            if config['notify']['email']['enabled']:
                                fpyutils.notify.send_email(
                                    message,
                                    config['notify']['email']['smtp_server'],
                                    config['notify']['email']['port'],
                                    config['notify']['email']['sender'],
                                    config['notify']['email']['user'],
                                    config['notify']['email']['password'],
                                    config['notify']['email']['receiver'],
                                    config['notify']['email']['subject'])
                    else:
                        # Drop test requests if a disk is already running a test.
                        # This avoids putting the disks under too much stress.
                        print('disk ' + d + ' not ready, checking the next...')
    
  6. create a configuration file

    /home/jobs/scripts/by-user/root/smartd_test.yaml#
    #
    # smartd_test.yaml
    #
    # Copyright (C) 2019-2020 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program.  If not, see <http://www.gnu.org/licenses/>.

    devices:
        ata-disk1:
            enabled: true
            test: 'long'
            log: true
            busy_status: 249
        ata-disk2:
            enabled: true
            test: 'long'
            log: false
            busy_status: 249
        ata-diskn:
            enabled: true
            test: 'long'
            log: true
            busy_status: 249

    notify:
        gotify:
            enabled: true
            url: '<gotify url>'
            token: '<app token>'
            title: 'smart test'
            message: 'starting smart test on'
            priority: 5
        email:
            enabled: true
            smtp_server: 'smtp.gmail.com'
            port: 465
            sender: 'myusername@gmail.com'
            user: 'myusername'
            password: 'my awesome password'
            receiver: 'myusername@gmail.com'
            subject: 'smartd test'
    

    Important

    • absent devices are ignored

    • devices must be explicitly enabled

    • do not prepend /dev/disk/by-id/ to drive names

    • run a short test to get the busy_status value.

      smartctl -t short /dev/disk/by-id/${drive_name}
      

      You should be able to capture the value while the test is running by looking at the Self-test execution status: line. In my case it is always 249, but this value is not hardcoded in smartmontools’ source code

      smartctl --all /dev/disk/by-id/${drive_name}
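
      Since the script compares this value with the ata_smart_data.self_test.status.value field of smartctl's JSON output, you can also read it directly while the test runs (assuming jq is installed):

      smartctl --capabilities --json /dev/disk/by-id/${drive_name} | jq '.ata_smart_data.self_test.status.value'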
      
  7. use this Systemd service unit file

    /home/jobs/services/by-user/root/smartd-test.ata_disk1.service#
    [Unit]
    Description=execute smartd on ata-disk1

    [Service]
    Type=simple
    ExecStart=/home/jobs/scripts/by-user/root/smartd_test.py /home/jobs/scripts/by-user/root/smartd_test.yaml ata-disk1
    User=root
    Group=root
    
  8. use this Systemd timer unit file

    /home/jobs/services/by-user/root/smartd-test.ata_disk1.timer#
    [Unit]
    Description=Once every two months smart test ata-disk1

    [Timer]
    OnCalendar=*-01,03,05,07,09,11-01 00:00:00
    Persistent=true

    [Install]
    WantedBy=timers.target
    
  9. fix the permissions

    chmod 700 -R /home/jobs/scripts/by-user/root/smartd_test.*
    chmod 700 -R /home/jobs/services/by-user/root
    
  10. run the deploy script

Important

To avoid interrupting the tests you must not let the disks go to sleep: programs like hd-idle must be stopped before running the tests.

Services#

Notify unit status#

This script is useful to get notified about failed Systemd services.

Some time ago my Gitea instance could not start after an update. Had I used this script, I would have known about the problem immediately instead of several days later.

See also

  • linux - get notification when systemd-monitored service enters failed state - Server Fault [11]

  • GitHub - caronc/apprise: Apprise - Push Notifications that work with just about every platform! [12]

  1. install the dependencies

    apt-get install python3-pip python3-venv
    
  2. create the jobs directories. See reference

    mkdir -p /home/jobs/{scripts,services}/by-user/root
    
  3. create a new virtual environment

    cd /home/jobs/scripts/by-user/root
    python3 -m venv .venv_notify_unit_status
    . .venv_notify_unit_status/bin/activate
    
  4. create the requirements_notify_unit_status.txt file

    /home/jobs/scripts/by-user/root/requirements_notify_unit_status.txt#
    apprise
    PyYAML
    
  5. install the dependencies

    pip3 install -r requirements_notify_unit_status.txt
    deactivate
    
  6. create the script

    /home/jobs/scripts/by-user/root/notify_unit_status.py#
    #!/usr/bin/env python3
    #
    # notify_unit_status.py
    #
    # Copyright (C) 2015 Pablo Martinez @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2018 Davy Landman @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2020-2024 Franco Masotti
    #
    # This script is licensed under a
    # Creative Commons Attribution-ShareAlike 4.0 International License.
    #
    # You should have received a copy of the license along with this
    # work. If not, see <http://creativecommons.org/licenses/by-sa/4.0/>.
    r"""Send a notification when a Systemd unit fails."""

    import shlex
    import sys

    import apprise
    import yaml

    if __name__ == '__main__':
        configuration_file = shlex.quote(sys.argv[1])
        config = yaml.load(open(configuration_file), Loader=yaml.SafeLoader)
        failed_unit = shlex.quote(sys.argv[2])

        message = 'Systemd service failure: ' + failed_unit

        # Create an Apprise instance.
        apobj = apprise.Apprise()

        # Add all of the notification services by their server url.
        for uri in config['apprise_notifiers']['dest']:
            apobj.add(uri)

        apobj.notify(body=message, title=config['apprise_notifiers']['title'])
    
  7. create a configuration file

    /home/jobs/scripts/by-user/root/notify_unit_status.yaml#
    #
    # notify_unit_status.yaml
    #
    # Copyright (C) 2015 Pablo Martinez @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2018 Davy Landman @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2020-2024 Franco Masotti
    #
    # This script is licensed under a
    # Creative Commons Attribution-ShareAlike 4.0 International License.
    #
    # You should have received a copy of the license along with this
    # work. If not, see <http://creativecommons.org/licenses/by-sa/4.0/>.

    apprise_notifiers:
      dest:
        - 'nctalks://<string>/'
        - 'mailtos://<string>'
      title: 'notify unit status'
    
  8. use this Systemd service unit file

    /home/jobs/services/by-user/root/notify-unit-status@.service#
    #
    # notify-unit-status@.service
    #
    # Copyright (C) 2015 Pablo Martinez @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2018 Davy Landman @ Stack Exchange (https://serverfault.com/a/701100)
    # Copyright (C) 2020 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This script is licensed under a
    # Creative Commons Attribution-ShareAlike 4.0 International License.
    #
    # You should have received a copy of the license along with this
    # work. If not, see <http://creativecommons.org/licenses/by-sa/4.0/>.

    [Unit]
    Description=Unit Status Mailer Service
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    WorkingDirectory=/home/jobs/scripts/by-user/root
    ExecStart=/bin/sh -c '. .venv_notify_unit_status/bin/activate && ./notify_unit_status.py ./notify_unit_status.yaml %I; deactivate'
    
  9. edit the Systemd service you want to monitor. In this example the service to be monitored is Gitea

    systemctl edit gitea.service
    
  10. add this content

    # [ ... ]
    [Unit]
    # [ ... ]
    OnFailure=notify-unit-status@%n.service
    # [ ... ]
    

Updates#

Update action#

This script can be used to update software not supported by the package manager, for example Docker images.

Important

Any arbitrary command can be configured.

See also

  • A collection of scripts I have written and/or adapted that I currently use on my systems as automated tasks [10]

  1. install the dependencies

    apt-get install python3-yaml python3-requests
    
  2. install fpyutils. See reference

  3. create the script

    /home/jobs/scripts/by-user/root/update_action.py#
    #!/usr/bin/env python3
    #
    # update_action.py
    #
    # Copyright (C) 2021-2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program.  If not, see <http://www.gnu.org/licenses/>.
    r"""update_action.py."""

    import shlex
    import sys

    import fpyutils
    import yaml


    def send_notification(message: str, notify: dict):
        m = notify['gotify']['message'] + '\n' + message
        if notify['gotify']['enabled']:
            fpyutils.notify.send_gotify_message(notify['gotify']['url'],
                                                notify['gotify']['token'], m,
                                                notify['gotify']['title'],
                                                notify['gotify']['priority'])
        if notify['email']['enabled']:
            fpyutils.notify.send_email(
                message, notify['email']['smtp_server'], notify['email']['port'],
                notify['email']['sender'], notify['email']['user'],
                notify['email']['password'], notify['email']['receiver'],
                notify['email']['subject'])


    if __name__ == '__main__':

        def main():
            configuration_file = shlex.quote(sys.argv[1])
            config = yaml.load(open(configuration_file), Loader=yaml.SafeLoader)

            # Action types. Preserve this order.
            types = ['pre', 'update', 'post']
            services = config['services']

            for service in services:
                for type in types:
                    for cmd in services[service]['commands'][type]:
                        for name in cmd:
                            retval = fpyutils.shell.execute_command_live_output(
                                cmd[name]['command'], dry_run=False)
                            if cmd[name]['notify']['success'] and retval == cmd[
                                    name]['expected_retval']:
                                send_notification(
                                    'command "' + name + '" of service "' +
                                    service + '": OK', config['notify'])
                            elif cmd[name]['notify']['error'] and retval != cmd[
                                    name]['expected_retval']:
                                send_notification(
                                    'command "' + name + '" of service "' +
                                    service + '": ERROR', config['notify'])

        main()
    
  4. create a configuration file

    /home/jobs/scripts/by-user/root/update_action.mypurpose.yaml#
    #
    # update_action.mypurpose.yaml
    #
    # Copyright (C) 2021-2022 Franco Masotti (franco \D\o\T masotti {-A-T-} tutanota \D\o\T com)
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program.  If not, see <http://www.gnu.org/licenses/>.

    notify:
        email:
            enabled: true
            smtp_server: 'smtp.gmail.com'
            port: 465
            sender: 'myusername@gmail.com'
            user: 'myusername'
            password: 'my awesome password'
            receiver: 'myusername@gmail.com'
            subject: 'update action'
        gotify:
            enabled: true
            url: '<gotify url>'
            token: '<app token>'
            title: 'update action'
            message: 'update action'
            priority: 5

    services:
        hello:
            commands:
                pre:
                    - stop_service:
                        # string
                        command: 'systemctl stop docker-compose.hello.service'
                        # integer
                        expected_retval: 0
                        # boolean: {true,false}
                        notify:
                            success: true
                            error: true
                update:
                    - pull:
                        command: 'pushd /home/jobs/scripts/by-user/root/docker/hello && docker-compose pull'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
                    - build:
                        command: 'pushd /home/jobs/scripts/by-user/root/docker/hello && docker-compose build --pull'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
                post:
                    - start_service:
                        command: 'systemctl start docker-compose.hello.service'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
        goodbye:
            commands:
                pre:
                    - stop_service:
                        command: 'systemctl stop docker-compose.goodbye.service'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
                update:
                    - pull_only:
                        command: 'pushd /home/jobs/scripts/by-user/root/docker/goodbye && docker-compose pull'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
                post:
                    - start_service:
                        command: 'systemctl start docker-compose.goodbye.service'
                        expected_retval: 0
                        notify:
                            success: true
                            error: true
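
    Before enabling the timer you can run the script by hand against this configuration to check the commands and the notifications (essentially the same invocation the service unit below uses):

    python3 /home/jobs/scripts/by-user/root/update_action.py /home/jobs/scripts/by-user/root/update_action.mypurpose.yaml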
    
  5. use this Systemd service unit file

    /home/jobs/services/by-user/root/update-action.mypurpose.service#
    [Unit]
    Description=Update action mypurpose
    Wants=network-online.target
    After=network-online.target

    [Service]
    Type=simple
    ExecStart=/home/jobs/scripts/by-user/root/update_action.py /home/jobs/scripts/by-user/root/update_action.mypurpose.yaml
    User=root
    Group=root
    
  6. use this Systemd timer unit file

    /home/jobs/services/by-user/root/update-action.mypurpose.timer#
    [Unit]
    Description=Update action mypurpose monthly

    [Timer]
    OnCalendar=monthly
    Persistent=true

    [Install]
    WantedBy=timers.target
    
  7. fix the permissions

    chmod 700 -R /home/jobs/scripts/by-user/root/update_action.*
    chmod 700 -R /home/jobs/services/by-user/root
    
  8. run the deploy script

Footnotes