Up again, but not public yet

Well, except, you’re reading this so it is public.

Lost interest in maintaining this server and website when I lost my job and couldn't get another. The server's Ubuntu, the web server is Apache, and the CMS is WordPress. It's been running for a number of years without issue. I wouldn't call it production because I don't rely on it for anything. It's just a test bed to familiarize myself with the software stack and gain some understanding of its setup and administration. I'm self hosting; it's an old computer repurposed as a server.

One other thing I experimented with is DNS. I wanted to be able to get to my server on my home network using wp.boba.org, whether on the public Internet or my home network. That worked fine for years with BIND9 and isc-dhcp.

I developed the habit of running upgrades periodically without testing. If there was a problem then no big deal, not production, figure out the issue, repair and proceed. Problems happened a few times with that approach and were always easily rectified.

DNS on the server stopped working after an upgrade. I tried many things and couldn't figure out why. Rather than roll back the upgrade or restore the system from a backup I kept mucking with it to try and get it to work. No success. Eventually I just lost interest and let the server go dark. I wasn't working, so I didn't have anyone to talk tech with about my server project, and there seemed no point to fixing it.

I did want to dip my toe in the water again after a while. I decided to rebuild the server and bring all components up to the latest release. I still couldn’t get BIND9 DNS to work. Searching BIND9 issues I found other Ubuntu users were also having problems with it. After searching for alternate DNS servers I decided to try dnsmasq. That got me to a working DNS on my home network. And that got me to the point of having the server up and publicly available again.

All development of the server configuration and settings was done on a virtual machine (VM) in a virtual network with virtual clients, with VirtualBox as the hypervisor. Once everything worked as expected I migrated the server VM to a physical host. That took surprisingly little tweaking. Network addresses had to be changed from the virtual network settings to the home network settings and a different Ethernet device name entered where needed. That was about it to migrate from a virtual to a physical server.
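
For what it's worth, the change on the physical host amounted to a netplan edit along these lines; the interface name and addresses below are placeholders, not my actual settings.

# /etc/netplan/01-server.yaml (illustrative sketch only)
network:
  version: 2
  ethernets:
    enp3s0:                          # physical NIC name; the VM used a different device name
      addresses: [192.168.1.10/24]   # home network address instead of the virtual network one
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [192.168.1.10]
# apply with: sudo netplan apply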

For all the world to see, in all its underwhelming glory, wp.boba.org is back. Enjoy.

Mount an external LFS drive

It’s easy. Just took a while to recall.

Original server was hardware installed from a thumb drive ISO. Set up LFS on that server install.

New server came from a VirtualBox VM. Used ext4 there. Have it running on a different drive in the original server hardware. The LFS drive is set aside.

Want to get at some info from the LFS drive. Trying to mount the external LFS drive has run into many dead ends so far.

And of course it was simply a question of installing the correct file system drivers. In this case # apt update && apt install lvm2, and the volume can be mounted read/write.
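
For next time, the whole thing boils down to a few commands; the volume group and logical volume names below are just the typical Ubuntu defaults, so check what lvs actually reports.

# apt update && apt install lvm2        # volume manager tools and support
# vgscan                                # detect volume groups on the attached drive
# vgchange -ay                          # activate the logical volumes
# lvs                                   # list logical volumes and their volume groups
# mount /dev/ubuntu-vg/ubuntu-lv /mnt   # example names; mounts read/write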

I will keep the old drive around for a while in the external housing. I’m sure there will be times I want to find stuff to pluck off. But I need to put a label on it for a hard date to be DBANed.

bind9 and DHCP

Some emphasis on rndc freeze could save headaches.

Want to get full services on my home LAN such that devices that get DHCP addresses can be called by their host names. In other words, Dynamic DNS on the LAN. In a Windows domain it isn't something I've had to think about; it's inherent in setting up the DNS and DHCP servers in the same domain. Or maybe doing that just masks NetBIOS name sharing. In any case, Windows can do DNS for DHCP hosts and address them by name very easily.

Want the same for home network but am using Ubuntu server. DNS is BIND9 and DHCP is ISC-DHCP. Both work. DNS for the fixed IP devices, home servers, router, printer, works fine. Can ping by hostname or FQDN. The DHCP devices, not so much. They get an IP just fine and can all be seen by dhcp-lease-list. They just can’t be pinged by hostname or FQDN.

At least the home DNS has primary and secondary servers. And for DHCP clients, IP for <name> is available via dhcp-lease-list. But ping <name> fails with error … .

All the above was written before an eventual solution was found. The error was one part me (syntax) and one part bind9.

Ping by hostname requires that the host's A record appear in the domain's zone file. But the majority of hosts get dynamic IP addresses, so there's no fixed hostname-to-IP-address list for LOTS of hosts.

The server providing IP addresses is isc-dhcp-server.service and the server providing DNS is bind9.service. The method: isc-dhcp-server.service updates bind9.service when an IP address is leased.
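
The wiring between the two services looks roughly like the excerpts below; the domain, key name, and addresses are placeholders rather than my actual configuration.

# /etc/dhcp/dhcpd.conf (excerpt, illustrative)
ddns-update-style standard;
ddns-updates on;
include "/etc/bind/ddns.key";            # shared TSIG key, also included by bind9
zone home.example.org. {
  primary 127.0.0.1;
  key ddns-key;
}
zone 1.168.192.in-addr.arpa. {
  primary 127.0.0.1;
  key ddns-key;
}

# /etc/bind/named.conf.local (excerpt, illustrative)
include "/etc/bind/ddns.key";
zone "home.example.org" {
  type master;
  file "/var/lib/bind/db.home.example.org";
  allow-update { key ddns-key; };
};
zone "1.168.192.in-addr.arpa" {
  type master;
  file "/var/lib/bind/db.192.168.1";
  allow-update { key ddns-key; };
};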

Of course. But it worked initially then didn’t. What happened?

CARDINAL RULE of BIND9: never update zone files while the bind server is running or while bind is actively maintaining the zone files. And twice as emphatically, once zone file replication to secondary server(s) has been established and .jnl files have been created, never update zone files unless the bind server has been frozen with rndc freeze or stopped with systemctl stop!

  • Use rndc freeze to freeze the zone files while leaving the name server running and responding to queries.
  • Update the zone file's serial number.
  • Delete any dynamic entries in the file (when troubleshooting, not for routine maintenance).
  • Delete any .jnl files (again for troubleshooting, not routine).
  • Unfreeze the zone files with rndc thaw.

Excepting “troubleshooting options”, if the steps above are not followed then the zone files will not properly update going forward. And no freeze, maintain, unfreeze, will fix the failures to update.
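
Concretely, routine zone maintenance on the primary ends up looking something like this; example.org stands in for my actual zone.

# rndc freeze example.org             # flush the .jnl into the zone file and suspend dynamic updates
# vi /var/lib/bind/db.example.org     # make the edits and bump the serial number
# named-checkzone example.org /var/lib/bind/db.example.org
# rndc thaw example.org               # reload the zone and resume dynamic updates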

Plus named-checkconf and named-checkzone didn't detect any errors even after bind and dhcp were no longer updating the zone and .jnl files. Nor did named-compilezone.

And I was confounding that with a failure of reverse zone lookups. Couldn't get a host name for any dynamic IP address. "But it works in the virtual setup", and it did. Reverse lookups and all.

Eventually I found a different spelling of in-addr.arpa between the primary and secondary zone files. With that fixed, zone update of dynamic IPs still not happening.

The final fix? The procedure above, including the "for troubleshooting" steps: with bind9 frozen by rndc on both the primary and secondary, clear the zones of dynamic A records and remove the managed-keys .jnl file and the zone .jnl files on both. Then restart both. Then it all works.

Lesson learned, ALWAYS rndc freeze before doing any bind9 maintenance.

Retiring some hardware

…when a computer’s been around too long

Time to retire some old tech. That display is a whopping 15″ diagonal. Resolution was limited. Only used it for a terminal for a server these last six years or so. And this is it under my arm on the way to the dumpster.

Right after the monitor, the old server was carried out to rubbish.

BEFORE delivering to rubbish I made sure to wipe the HD with DBAN, Darik’s Boot and Nuke. Have relied on it for years.

The computer’s manufacturing date stamp was 082208. Didn’t think to take a photo. It was a Dell OptiPlex 330 SFF, Pentium Dual Core E2160 1.8GHz, 4GiB RAM, 90 GB HD. They looked like this.

I got it in 2015. It had been replaced during a customer hardware upgrade then sat on the shelf unpowered for about a year before I joined that office. On hardware clean-out day it was in a pile to take home or put in the dumpster.

It became my boba.org server sometime in 2015 and served that function until December 2022.

Six years of service, then it sat on the shelf for a year, then another seven-plus years hosting boba.org. Fourteen years of service is a LONG life for a computer!

The replacement "server" is an old laptop; old, but new enough that it doesn't have an Ethernet port. I got a USB Ethernet adapter, a Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter, and plugged a cable in. Better performance than WiFi.

Hardware is several steps above the old server too. Intel Core i3-5015U CPU @ 2.10GHz, 6GiB RAM, 320 GB HD (I should replace it with an SSD). Date of manufacture isn't as clear; maybe late 2015 or early 2016.

The CPU Benchmark comparison of the two processors, Intel Core i3-5015U vs Intel Pentium E2160, shows clear differences in processing power.

Now that the new server is up (well, it has been for a few months, but I didn't want to add new services until I got secondary DNS running), it's time to add features and services on the network.

Powershell – install a program with no .MSI

Don’t let the quoting drive you mad!

In an earlier post, Powershell – love it / hate it, I described needing to check the install status of a program that didn’t have an .MSI installer. That post provided details of parsing the install file names to know which pcs got the target install. This post provides details on what I did to make the install happen and create the files that logged the process.

With no software deployment tool and only an .exe for install you can still keep track of deployment with powershell.

In this case the program needed to be targeted at specific computers, not particular users. Easy enough to create a list of target pcs. Without an .MSI file, GPO software install isn't available unless… the GPO runs a startup script to do the install. But it can't be a powershell script if that's disabled in the environment, so .bat files it is. Still want to know which pcs get the install and which don't, so have to log that somewhere.

How to make it all happen? This is how…

An install .bat file that makes use of powershell Invoke-Command -ScriptBlock {} which will run even if powershell is disabled. The quoting to run the commands within -ScriptBlock {} gets really convoluted. Avoided that by calling .bat files from the -ScriptBlock {} to have simpler quoting in the called .bat files.

The prog_install.bat file checks if the runtime dependency is installed and calls the .bat file to install it if it isn’t. Then it checks if the target program is installed and installs it if it isn’t found. For each of the steps the result is appended to a log file based on the hostname.

REM prog_install.bat
GOTO EoREM

REM prog name install
REM
REM This routine checks that both Windows Desktop Runtime (a dependency) 
REM and prog name are installed and writes the status to a file to have  
REM install results history.
REM 
REM The install results file must be in a share writeable by the process
REM running this install routine which is after boot and before logon.
REM 
REM A file is created or appended to based on the hostname the process
REM runs on. 
REM

:EoREM
 
@echo off

REM Check if required Microsoft Windows Desktop Runtime is installed. 
REM Install if not found. 
REM Write result to results file.
Powershell Invoke-Command -ScriptBlock { if ^( Get-ItemProperty HKLM:\\Software\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\* ^| Where-Object { $_.DisplayName -like """Microsoft Windows Desktop Runtime - 3.*""" } ^) { Add-Content -Path \\server\prog\prog_$Env:COMPUTERNAME.txt -Value """$(Get-Date) $Env:COMPUTERNAME Microsoft Windows Desktop Runtime is installed.""" } else { Start-Process -Wait -NoNewWindow \\server.local\SysVol\server.local\scripts\prog\inst_run.bat; Add-Content -Path \\server\prog\prog_$Env:COMPUTERNAME.txt -Value """$(Get-Date) $Env:COMPUTERNAME Microsoft Windows Desktop Runtime NOT installed. Installing""" } }

REM Check if prog name is installed. 
REM Install if not found.
REM Write result to results file.
REM NOTE: Add-Content before Start-Process (reverse order compared to runtime install above)
REM       Above Add-Content after Start-Process so "installing" not written until after actual install.
REM       For prog name install, if Add-Content after Start-Process then Add-Content fails to write to file.
REM
Powershell Invoke-Command -ScriptBlock { if ^( Get-ItemProperty HKLM:\\Software\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\* ^| Where-Object { $_.DisplayName -like """prog name""" } ^) { Add-Content -Path \\server\prog\prog_$Env:COMPUTERNAME.txt -Value """$(Get-Date) $Env:COMPUTERNAME ver $($(Get-ItemProperty HKLM:\\Software\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\* | Where-Object { $_.DisplayName -like """prog name""" }).DisplayVersion) prog name is installed.""" } else { Add-Content -Path \\server\prog\prog_$Env:COMPUTERNAME.txt -Value """$(Get-Date) $Env:COMPUTERNAME prog name NOT installed. Installing"""; Start-Process -Wait -NoNewWindow \\server.local\SysVol\server.local\scripts\prog\inst_prog.bat } }

The batch files that do the actual installs refer to the SysVol folder for the programs to run. Using the SysVol folder because I need a share that's accessible early in the boot.

REM inst_run.bat
REM To work prog requires the following Windows runtime package be installed

start /wait \\server.local\SysVol\server.local\scripts\prog\dotnet-sdk-3.1.415-win-x64.exe /quiet /norestart

REM inst_prog.bat
REM Install the prog name package.

start /wait \\server.local\SysVol\server.local\scripts\prog\prog_installer_0.8.5.1.exe /SILENT /NOICONS /Key="secret_key"


So there you have it. To install a program with its .exe installer via GPO in an environment with no .MSI packager, no deployment tool, and powershell.exe disabled by GPO, use powershell Invoke-Command -ScriptBlock {} in a .bat file to do the install and log results. And call .bat files to simplify quoting where needed.

Perils of a part time web server admin

Not being “in it” all the time can make simple things hard.

Recently one of the domain names I've held for a while expired. Or actually, I let it expire. It was hosted on this same web server along with several other websites and had a secure connection using a Let's Encrypt SSL certificate. All good.

The domain name expired, I disabled the website, and all the other websites on the server continued to be available. Until they weren’t! When I first noticed I just tried restarting the web server. No joy, that didn’t get the other sites back up.

And here are the perils of part-time admin. Where to start with the troubleshooting? For all my sites and the hosting server I really don't do much except keep the patches current and occasionally post content using the WordPress CMS. Not much troubleshooting, monitoring logs, etc. because there isn't much going on. And, though some might say otherwise, I don't spend all my time at the computer dissecting how it operates.

I put off troubleshooting for a while. This web server’s experimental, not production, so sometimes I cut some slack and don’t dive right in when things aren’t working. Had other things pending that required more attention.

When I did get to it I was very much at a loss where to start because, as noted, I had disabled a web site and everything continued to work for a while. When it stopped working I hadn't made any additional changes.

Logs are always a good place to look, yes? This web server is set up to create separate logs for most of the sites it’s hosting. Two types of logs are created, access logs and error logs. Access logs showed what was expected, no more access to that site after I disabled it.

Error logs confused me though. The websites use Let's Encrypt SSL certificates. And they use Certbot to set up https on the Apache HTTP server. A very common setup. The confusing thing about the error log was it showed the SSL configuration for the expired web site failing to load. Why was the site trying to load at all??? I had disabled the site using the a2dissite program provided by the server distribution. The thing I hadn't thought about is that the Certbot script for Apache sets up SSL by modifying the <site_name>.conf file AND creating a <site_name>-le-ssl.conf file.

So even though the site had been disabled by a2dissite <site_name>.conf I hadn't thought to a2dissite <site_name>-le-ssl.conf. Once I recognized that issue and ran the second a2dissite command the web server again started right up. No more failing to load SSL for the expired site. And, surprisingly, failing to load the SSL for the one site prevented the server from starting rather than disabling the one site and loading the others that didn't have configuration issues.
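
In the end the whole fix was a couple of commands; example.org stands in for the expired domain.

$ sudo a2dissite example.org.conf           # what I had already done
$ sudo a2dissite example.org-le-ssl.conf    # the Certbot-generated config I had missed
$ sudo apachectl configtest                 # confirm the remaining sites parse cleanly
$ sudo systemctl restart apache2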

Something for another time… I expect there must be a way for the server to start and serve correctly configured sites while not loading incorrectly configured sites and not allowing presence of an incorrectly configured site to prevent all sites from loading. It just does not seem likely that such a widely used web server would fail to serve correctly configured sites when only one or some of multiple hosted sites is misconfigured.

The perils of part-time admin, or jack of all trades and master of none, are that these sorts of gotchas pop up all the time because of limited exposure to the full breadth of dependencies for a program to perform in a particular way. It isn't a bad thing. Just something to be aware of, so rather than blame the software for not doing something, be aware that there are often additional settings needed to achieve the desired effect.

Be patient. Expect to need to continue learning. And always, always, RTFM and any other supporting documents.

Powershell – love it / hate it

Sometimes it’s hard for me to wrap my head around things.

Powershell makes so many things easier than before it existed. At least for me. I'm not a programmer, but with simple commands piped one to another, like bash in Linux, I can get a lot done.

One of the “things” I need to get done is checking how many computers got a program installed. Because of the environment I’m in and the program itself, there’s no GPO based install for an MSI and there’s no third party tool based install. This stumped me for a while until I came up with the idea of using a startup script for the install.

Another challenge, powershell scripting is disabled. However I learned from "PowerShell Security" by Michael Pietroforte and Wolfgang Sommergut that powershell can be called within a .bat using Powershell Invoke-Command -ScriptBlock {} even if powershell is disabled by policy. So I wrote a startup script that relied on .bat files that had Powershell Invoke-Command -ScriptBlock {} in them to run the program install. The -ScriptBlock {} checked first if the dependencies were installed, installed them if not, then checked if the desired program was installed and installed it if not. It also wrote a log file for each pc named "progname_<hostname>.txt" and appended to the file with each restart.

The startup script wasn’t running reliably every time a pc booted. Seemed to be NIC initialization or network initialization related. In any case, the pcs that were to be installed were listed in an AD group. The pcs that had run the startup script output that info into a file named “progname_<hostname>.txt”. One way to see which of the pcs had not gotten the install was by comparing the members of the AD group, the computer names, to the <hostname> portion of the log file names that were being created. Computers from the group without a corresponding file hadn’t gotten installed.

Easy, right? Get the list of computers to install with Get-ADGroupMember and compare that list to the <hostname> portion of the log files. How to get only the <hostname> portion? Get-ChildItem makes it easy to get the list of file names. But then need to parse it to get only the <hostname> part. Simple in a spreadsheet but I really wanted to get a listing of only the <hostname> without having to take any other steps.

I knew I needed to look at the Name portion of the file name, handle it as a string, chop off the "progname_", and drop the ".txt" portion. But how to do that? After what seemed like way too much searching and experimenting I finally came up with…

$( Foreach ( $name in $(Get-ChildItem progname* -Name) ) { $name.split('_')[1].split('.')[0] } ) | Sort

The first .split('_')[1] lops off the common part of the filename that’s there only to identify the program the log is being created for, “progname_”, and keeps the rest for the second split(). The next split(), .split('.')[0], cuts off the file extension, .txt, and keeps only the part that precedes it. And so the output is only the hostname portion of the filename that the startup script has created.

Compare that list to the list from Get-ADGroupMember and, voila, I know which of the targeted pcs have and have not had the program installed without doing any extra processing to trim the file names. Simple enough, but for some reason it took me a while to see how to handle the file names as strings and parse them the way I needed.
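
As a rough sketch of that comparison (the group name and log share path here are made up for the example, not the real ones):

# Requires the ActiveDirectory (RSAT) module for Get-ADGroupMember
$targets   = (Get-ADGroupMember -Identity "ProgTargets").Name | Sort-Object
$installed = $( Foreach ( $name in $(Get-ChildItem \\server\prog\progname* -Name) ) {
                  $name.split('_')[1].split('.')[0] } ) | Sort-Object
# Group members with no matching log file have not run the install yet
Compare-Object -ReferenceObject $targets -DifferenceObject $installed |
  Where-Object SideIndicator -eq '<=' |
  Select-Object -ExpandProperty InputObject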

Ubuntu, ZFS and running out of drive space

For userland, system operations really need to be invisible.

I have been using Ubuntu as my desktop OS for over 15 years now. I came to it by way of using it on a backup computer to do a task that my primary computer, using Windows XP, failed at if I tried to do anything else concurrently. I didn’t want to buy another computer or a license for XP and put it on an old computer with specs that were very marginal for XP. So I thought, try this free OS and see whether I can use the old computer to do the task that my new XP computer would only do if nothing else was done at the same time.

Ubuntu got the job done. I recorded my vinyl albums to sound files, broke the sound files into tracks like on the albums, and then burned the tracks onto CDs so I could carry my music around more conveniently and listen to it in more places. XP did this too, but the sound files were corrupt or the burn failed if I did something else like open my word processor, spreadsheet, or web browser while recording or burning were going on.

Now I had two pcs running and would switch between them as needed to keep the vinyl to CD process going. That was a little inconvenient because my workspace didn't let me put the two pcs right next to each other. Switching wasn't a case of moving my hands from one keyboard and mouse to another, or flipping a switch on a KVM. Delaying getting to the Ubuntu pc after the album side was over meant extra time spent trimming the audio file to delete the tail of the file. This led to me trying some things on the Ubuntu pc like opening the word processor or spreadsheet or browsing the web while the recording or burning were going on to see if it caused problems. And amazingly, it didn't! A lower spec pc with Ubuntu could do more of what I wanted without errors than my much better XP desktop.

That led me to using the Ubuntu pc while converting my albums to CD and that led me to Ubuntu for home use. Professional life continued and continues to be Windows, but at home Ubuntu. And Ubuntu is still preferred at home because it doesn’t mysteriously prevent me from doing things, inconveniently interrupt me, or insist on having information I don’t want to share like Windows does. That is until ZFS in Ubuntu 20.04 started preventing me from doing updates on my primary and backup pc because of lack of space.

I’ve run out of space on Windows and Ubuntu before. It just meant time to finally do some housekeeping and get rid of large chunks of files, like virtual machines, that I hadn’t used in a while. Do that and boom, back to work! Not so with ZFS. Do that and gloom, still can’t do updates.

There were different error messages on the two pcs: one said bpool space free below 20%, the other said rpool space free below 20%. Rpool and bpool, what are they? And why, when there's nearly 20% free space, is updating prevented? And why, after deleting or moving tens of gigabytes of files off the drive and purging old kernels (a Linux thing), are updates still prevented and why do rpool and bpool still report less than 20% free? Gigs of files were just moved off the drive and these rpool and bpool things don't reflect that!

It was my first experience in more than 15 years of using Ubuntu where keeping it up to date wasn't just a case of using it and running updates every once in a while.

Windows has a feature called “Recovery Points” that I’ve used to get back to a working system when things have been broken to the point of making it hard or impossible to use the pc. Ubuntu hasn’t really had anything equivalent until the introduction of ZFS. And as I’ve learned, that’s way too simple minded an explanation and doesn’t give credit to the capabilities of ZFS that go way beyond Windows Recovery Points. True, and so be it.

I dug through many ZFS web pages and tried many things until finally getting more than 20% rpool or bpool free on each pc; a list of links is at the end of this post. Now the pcs are back to updating without complaint.
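
The gist of the cleanup, once I understood the space was being held by snapshots, was along these lines; the dataset and snapshot names are only examples of the kind of thing zfs list shows.

$ zfs list -o name,used,avail rpool bpool         # pool-level space check
$ zfs list -t snapshot -o name,used -s used       # see which snapshots are holding space
$ sudo zfs destroy rpool/ROOT/ubuntu_abc123@autozsys_0ld0ne   # example name; remove an old snapshot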

What I’ve learned is Ubuntu has a way to go to make ZFS user friendly. Things I’d suggest to Canonical for desktop Ubuntu:

  • Double the recommended minimum drive size and/or tell end users they should have 2x the drive space they think they need if they already think they need more than the minimum
  • Reduce the default number of snapshots to 10
  • Provide a UI for setting the number of snapshots
  • Provide a UI for selectively removing snapshots from bpool or rpool when free space goes below the dreaded 20%
  • After prompting for confirmation automatically remove the oldest snapshots to get back to 20% free when the condition occurs

Both my pcs are now above 20% free space on rpool and bpool and updating without complaint. It took a while and some learning to make that happen. It wasn’t the type of thing an average end user would ever want to face or even know about.

77% and 48% free – will probably bump rpool space issues first again on this pc
89% and 90% free – plenty of room before bpool is a problem again on this pc

ZFS focus on Ubuntu 20.04 LTS: ZSys general presentation · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys general principle on state management · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys commands for state management · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys state collection · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys for system administrators · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys partition layout · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys dataset layout · ~DidRocks

ZFS focus on Ubuntu 20.04 LTS: ZSys properties on ZFS datasets · ~DidRocks

apt – Out of space on boot zpool and cant run updates anymore – Ask Ubuntu
For this link see especially Hannu's answer on Nov 19, 2020 at 17:22

docs.oracle.com | Displaying and Accessing ZFS Snapshots

docs.oracle.com | Destroying a ZFS File System

docs.oracle.com | Creating and Destroying ZFS Snapshots


Ubuntu server upgrade 16.04 to 18.04 (20.04 pending)

Virtualize, document, and test. The surest way to upgrade success.

For years my server has been running my personal websites and other services without a hitch. It was on Ubuntu 16.04, more than four years old at this point, with only a year left on the 16.04 support schedule. Plus 20.04 is out. Time to move to the latest platform without rushing rather than make the transition with support ended or time running out.

With the above in mind I decided to upgrade my 16.04.6 server to 20.04 and get another five years of support on deck. I'm halfway there, at 18.04.4, and hovering for the next little while before the bump up to 20.04. The pause is because of a behavior of do-release-upgrade that I learned about while planning and testing the upgrade.

It turns out that do-release-upgrade won’t actually run the upgrade until a version’s first point release is out. A switch, -d, must be used to override that. Right now 20.04 is just that, 20.04. Once it’s 20.04.1 the upgrade will run without the switch. Per “How to upgrade from Ubuntu 18.04 LTS to 20.04 LTS today” the switch, which is intended to enable upgrading to a development release, does the upgrade to 20.04 because it is released.
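
In practice that means the difference between these two invocations:

$ sudo do-release-upgrade        # won't offer 20.04 until 20.04.1 is released
$ sudo do-release-upgrade -d     # treats 20.04 as a development target and upgrades now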

I’m interested to try out the VPN that is in 20.04, WireGuard, so may try the -d before 20.04.1 gets here. In the meantime let me tell you about the fun I had with the upgrade.

First, as you should always see in any story about an upgrade: backup! I did, several different ways, mostly as experiments to see if I want to change how I'm doing it now, which is rsync. An optional feature of 20.04 that looks to make backup simpler and more comprehensive is ZFS. It's newly integrated into Ubuntu and I want to try it for backups.

I got my backups then took the server offline to get a system image with Clonezilla. Then I used VBoxManage convertfromraw to turn the Clonezilla disk image into a VDI file. That gave me a clone of the server in VirtualBox to practice upgrading and work out any kinks.
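
The conversion itself is a one-liner, assuming the Clonezilla image has been saved as (or restored to) a single raw disk image; the file names are placeholders.

$ VBoxManage convertfromraw server-disk.img server-clone.vdi --format VDI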

The server runs several websites, a MySQL server for the websites and other things, an SSH server for remote access, NFS, phpmyadmin, DNS, and more. They are either accessed remotely or from a LAN client. Testing those functions required connecting a client to the server. VirtualBox made that a simple trick.

In the end my lab setup was two virtual machines, my cloned server and a client, on a virtual network. DHCP for the client was provided by the VirtualBox Internal Network, the server had a fixed ip on the same subnet as the VirtualBox Internal Network and the server provided DNS for the network.

I ran the 16.04 to 18.04 upgrade on the server numerous times taking snapshots to roll back as I made tweaks to the process to confirm each feature worked. Once I had a final process I did the upgrade on the virtual machine three times to see if I could find anything I might have missed or some clarification to make to the document. Success x3 with no changes to the document!

Finally I ran the upgrade on the production hardware. Went exactly as per the document which of course is a good thing. Uneventful but slower than doing it on the virtual machine, which was expected. The virtual machine host is at least five years newer than the server hardware and has an SSD too.

I’ll continue running on 18.04 for a while and monitor logs for things I might have missed. Once I’m convinced everything is good then I’ll either use -d to get to 20.04 or wait until 20.04.1 is out and do it then.

MySQL backup and restore

Dig in and do it, and repeat. Get the desired result faster by combining research and testing.

Maintenance is important. A car needs oil changes or eventually the engine will be damaged by regular operation. A server needs software updates to fix bugs and protect against threats. Even when those things are done failures can happen that create problems returning to normal operation.

For a car there needs to be a spare ready to go in case of a flat. If there’s not a spare ready for use it will take longer to get the car back in operation when a flat happens. For a computer, programs and data need to be backed up. If a disk drive crashes the information stored there may be lost or too expensive to recover, so just as good as lost.

This website has not been well protected for too long and I knew that needed to change. There’s a server operating system, a web server, WordPress software, and a MySQL database that all operate interdependently to make it work. As the amount of content slowly continues to grow my manual system to back everything up has become too cumbersome and is not done frequently enough to ensure minimal to no loss of data.

That needed to change.

Step one – automate the MySQL backups. Documentation states the "logical" backup method is slow and not recommended for large databases. The alternative "physical" backup entails stopping the database server and copying the files. The licensed MySQL Enterprise Backup performs physical backups and, from what I'm able to tell, runs clone databases so one can be stopped and the files backed up while the clone continues to run and is available for use.

This is a hobby operation and has limited resources so purchasing a license for Enterprise Backup is out of the question. Taking the whole thing offline to backup probably doesn’t bother anyone except me. Still, I did want to be able to continue to run the server while the databases are being backed up. Enter logical backup.

It didn't take long to find the command, mysqldump. Confirming that it would back up everything, including user names and passwords so all the accounts got restored with all the data, took longer.

Despite my best search-fu I was unable to find any documentation that explicitly says “do this” to back up user accounts in addition to system databases and other databases. Let me fill that gap by saying “do this to back up user accounts, system databases, and other databases”. mysqldump -u root -p -h server.yourdomain.org --all-databases -r backup_file.sql. I did find the preceding command as the backup command. Nothing I could find said this backs up user accounts and system databases. I tested it. It does.
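
To actually automate it, a cron entry along these lines does the job; the schedule, paths, and the use of a credentials file are my own choices for the example, not part of the command above.

# /etc/cron.d/mysql-backup (illustrative): nightly logical backup at 02:30
# --defaults-extra-file keeps the password off the command line; cron needs % escaped as \%
30 2 * * * root mysqldump --defaults-extra-file=/root/.my.cnf -h server.yourdomain.org --all-databases -r /var/backups/mysql/all-databases-$(date +\%F).sql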

With the backup done, the next step is restore. And confirming the restore works as expected. Another case of things that otherwise seem obvious not being declared in the documentation.

Restore from the command line looks like this: mysql -u root -p database < backup_file.sql. But wait, I want to restore all databases. Search-fu failed again to find any explicit instruction how to restore all databases and what database to name on the command line.

Try the command without naming a database to see if all are restored. No, that fails. Then a flash of insight. Create an empty database, name that on the command line, and try the restore again. It works!

$ mysql -u root -p
> create database scratch;
> exit
$ mysql -u root -p scratch < backup_file.sql

Did this a few times and then restored the tables. As far as I’ve been able to determine the restore is an exact replica of the backed up data.

It seems odd that important use cases, complete backup of a database server and complete restore of a database server, aren't clearly documented. The information is there but important nuggets are left out. The only way to be sure you'll get what you need is to experiment until you're able to produce the results you need.

So yes, do the research but also just do the work and inspect the results. When research doesn't clearly answer the questions, back it up with experimentation. Do both and get a result faster.