Real-time, constant backups with ZFS + Zrepl

This is a guide to making your home network backups seamless, secure, and awesome. It’s comparable in many ways to Apple Time Machine.

Prerequisites:
A machine in the cloud with lots of disk space
Mine comes from zfs.rent, where I sent some physical hard drives and rent a small machine with them attached.
CA Infrastructure.
Creating your own private CA is a whole topic of its own, one I hope to make simpler. For the short term, creating a private CA just for Zrepl is the best option. Create the CA with cfssl genkey -initca zrepl-ca.json | cfssljson -bare zrepl-ca, using a CSR definition like this:

# /etc/zrepl/zrepl-ca.json - On your backup host

{
  "CN": "Personal Zrepl Root CA",
  "key": {
    "algo": "rsa",
    "size": 4096
  },
  "names": [
    {
      "C": "US",
      "L": "Seattle",
      "O": "Thomas Hahn",
      "OU": "Zrepl",
      "ST": "Washington"
    }
  ]
}


Setup
Create certificates for your hosts. They should use the hostname as the CN, but also have appropriate Subject Alternative Names. I recommend CFSSL to do so.

Example contents of your cfssl csr json:

# /etc/zrepl/zrepl.json
{
  "CN": "timemachine",
  "hosts": [
    "timemachine.internal",
    "timemachine.zfs.rent",
    "timemachine.gauntletwizard.net"
  ],
  "key": {
    "algo": "rsa",
    "size": 4096
  },
  "names": [
    {
      "C": "US",
      "L": "Seattle",
      "O": "Thomas Hahn",
      "OU": "Zrepl",
      "ST": "Washington"
    }
  ]
}
mkdir /etc/zrepl
cfssl genkey zrepl.json | cfssljson -bare zrepl-$HOSTNAME

You should now have a zrepl-$HOSTNAME-key.pem and a zrepl-$HOSTNAME.csr in the /etc/zrepl folder on each of your machines. Copy all of the .csr files to your CA for signing. Don’t touch the -key.pem files! These are your private keys, and need to stay secret. Once you have the CSRs on your CA’s machine, sign them with:

host=HOSTNAME # Replace HOSTNAME with the name of each host
cfssl sign -ca ca.pem -ca-key ca-key.pem "zrepl-${host}.csr" | cfssljson -bare ${host}

Signing them with the above will leave you with a set of .pem files, one per host, named host.pem. You should verify that they signed correctly: openssl verify -CAfile ca.pem host.pem. It should print ‘host.pem: OK’. Copy each certificate back to its host and rename it there to /etc/zrepl/zrepl.crt, rename that host’s key to /etc/zrepl/zrepl-key.pem so it matches the paths in the configs below, and copy the ca.pem file to each host as /etc/zrepl/ca.pem.
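
Concretely, the steps look something like this (the hostnames are the example CNs used in this post, and the renames just make the filenames match the paths in the configs below):

# On the CA machine: check each signed cert against the CA
for host in desktop laptop timemachine; do
  openssl verify -CAfile ca.pem "${host}.pem"
done

# On each host, after copying its signed cert and ca.pem into /etc/zrepl:
cd /etc/zrepl
mv "${HOSTNAME}.pem" zrepl.crt
mv "zrepl-${HOSTNAME}-key.pem" zrepl-key.pem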

Next, set up the backup host as a sink:

global:
  logging:
    # use syslog instead of stdout because it makes journald happy
    - type: syslog
      format: human
      level: warn
  monitoring:
    - type: prometheus
      listen: ':9811'


jobs:
  - name: backups
    type: sink
    root_fs: tank/backups
    recv:
      placeholder:
        encryption: inherit
    serve:
      type: tls
      listen: ":8826"
      ca: "/etc/zrepl/ca.pem"
      cert: "/etc/zrepl/zrepl.crt"
      key: "/etc/zrepl/zrepl-key.pem"
      client_cns:
        - "desktop"
        - "laptop"
        - "timemachine"

zrepl uses the CN field for disambiguation. Add each host you signed above to the client_cns section in zrepl.yaml. zrepl will create a new ZFS filesystem under your root_fs for each of these client CNs upon its first backup, e.g. tank/backups/desktop and so on.
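
Once the first backups have run you can sanity-check the layout (pool name from the example config):

zfs list -r tank/backups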

Next, configure each machine you want backed up with a push job:

global:
  logging:
    # use syslog instead of stdout because it makes journald happy
    - type: syslog
      format: human
      level: warn
  monitoring:
    - type: prometheus
      listen: ':9811'


jobs:
  - name: desktop
    type: push
    filesystems:
      "desktop/home<": true
    send:
      encrypted: false
    connect:
      type: tls
      address: "timemachine:8826"
      ca: /etc/zrepl/ca.pem
      cert: /etc/zrepl/zrepl.crt
      key:  /etc/zrepl/zrepl-key.pem
      server_cn: "timemachine"

    snapshotting:
      type: periodic
      prefix: zrepl_
      interval: 5m
    pruning:
      keep_sender:
        - type: not_replicated
        - type: regex
          regex: ".*"
      keep_receiver:
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 30x1d | 6x30d
          regex: "^zrepl_"

Building and pushing multiple manifests

I’m trying to build multi-arch Docker images, and I’d like to sand the process down as much as possible. In a better world, that would mean just one simple command. It’s not that simple.

Here’s the first error I got from my simple script:

podman build --platform linux/amd64 . -t "${TAG}-amd64"
podman build --platform linux/arm64/v8 . -t "${TAG}-arm64"
podman manifest create "$TAG" "${TAG}-amd64" "${TAG}-arm64"

Error: setting up to read manifest and configuration from "docker://account.dkr.ecr.us-east-1.amazonaws.com/image:tag": reading manifest docker://account.dkr.ecr.us-east-1.amazonaws.com/image:tag: manifest unknown: Requested image not found

This didn’t work, and it turned out the reason was quite simple, if obtuse – podman manifest create wants to read the referenced images from the real repositories. As I hadn’t pushed those images yet, it couldn’t find them on the remote repository.

I spent some time searching for a solution to build images locally, then build them into a manifest, and then finally tag them. I found a couple of things that should work, but didn’t:

podman manifest add MANIFEST containers-storage:image:tag
reference "[overlay@/home/ted/.local/share/containers/storage+/run/user/1000/containers]docker.io/library/image:tag" does not resolve to an image ID: identifier is not an image

I don’t know why this didn’t work. According to the docs on transports, containers-storage is the transport we can use to inspect local images. This is somewhat consistent in behavior:

podman build . -t image:tag
podman tag 
podman inspect image:tag
...
podman  inspect containers-storage:repo/image:tag
...
podman inspect containers-storage:image:tag
Error: no such object: "containers-storage:image:tag
podman inspect containers-storage:localhost/image:tag
...

Containers-storage somewhat works, but you have to supply a hostname, which is “localhost” for otherwise unspecified images.

Another angle I tried was building both architectures into a single tag with the --manifest flag. This seems like it should work:

podman build --platform linux/amd64,linux/arm64/v8 . --manifest image:tag

This actually worked – I didn’t realize it at first, but it built both architectures:

podman manifest inspect image:tag
{
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "size": 2444,
            "digest": "sha256:ea95462b074c650e6c477f8bf88bcfa0b6a021de7c550e2faca25c7f833bdc5f",
            "platform": {
                "architecture": "amd64",
                "os": "linux"
            }
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "size": 2444,
            "digest": "sha256:f1eb75a71b89b3655b845acd79076bc8d640d3db8fb0f24367748fb50b2e6001",
            "platform": {
                "architecture": "arm64",
                "os": "linux",
                "variant": "v8"
            }
        }
    ]
}

However, when I pushed my image, the wrong architecture was pulled on my k8s nodes:

podman push image:tag

Containers:
  loadtest:
    Container ID:  containerd://5d157712c742aa63220c34eb2b5213b0cf580a50c5768406ff434910700a2638
    Image:         image:tag
    Image ID:      image:tag@sha256:d0345fbc0ec7c38fdcbedfb90e7b21986e2e9642856e7e2a62a0591d68d48f85

A significant amount of consternation later, I realized that because I was using podman push, the image reference was being resolved first, and then just that one architecture was pushed (but under the tag for the whole manifest list). What I needed to do instead was podman manifest push, which pushes the whole manifest and all of its sub-images.
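
Putting it all together, the flow that ended up working for me looks roughly like this (registry and tag are illustrative):

TAG=account.dkr.ecr.us-east-1.amazonaws.com/image:tag
podman build --platform linux/amd64,linux/arm64/v8 . --manifest "${TAG}"
podman manifest push --all "${TAG}" "docker://${TAG}"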

Adventures in EFI boot

I have a server that didn’t boot on its own, requiring manual intervention at each startup. This is not optimal for a server-type machine (though the motherboard was never intended for that purpose). This machine had originally booted via Windows, and the BIOS on the motherboard would not let me set an MBR entry above an EFI entry via conventional means.

The first problem I encountered: I couldn’t change EFI settings while booted in classic (MBR) mode. This was resolvable through pretty simple means: I booted into an installer/recovery Linux image. That was enough for me to be able to use efibootmgr to set up the boot order priority, and to move the entry it had created for the classic MBR boot into that position.
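
From the recovery image, that looks roughly like this (the entry numbers are illustrative; yours will differ):

efibootmgr                  # list existing boot entries and the current BootOrder
efibootmgr -o 0003,0000     # put the desired entry first in the boot order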

That was enough for me to resolve my basic issue, but I wanted to do one better – I wanted to boot via EFI on my existing Ubuntu install. Grub supports EFI and it wasn’t that hard to get installed, but there are some gotchas. My first attempt was thus:

grub-install --target=x86_64-efi --efi-directory=/boot/efi --debug

grub-install will install to the default subdirectory /boot/grub, with the EFI directory specified separately. The EFI System Partition is just a FAT32-formatted partition; that’s all that’s required.

Next, I had to create the boot entry myself (Grub will generally do this for you, but it couldn’t here, because I was doing my grub install from my MBR disk):

efibootmgr -c -L "ted" -l '\efi\ubuntu\grubx64.efi' -d /dev/sda -p 2

This didn’t quite work. Eventually I gave up and reinstalled from scratch. 🙂

I’ve had more fun EFI adventures – I had a motherboard that wouldn’t respect a Grub EFI image unless there was a “Windows” image around, so I had to install with grub-install --removable, which installs the image at the fallback path (EFI/BOOT/BOOTX64.EFI) but is otherwise the same.

Last but not least of my recent EFI problems: I managed to partially reinstall grub – my grub modules directory was updated, but the grub core image was still an old version, and the system was in a crash loop because grub.cfg was loading modules that didn’t match it. Reinstalling grub was enough to fix it.

Borg Priorities

https://www.cs.cmu.edu/~harchol/Papers/EuroSys20.pdf

The priority of a job helps define how the scheduler treats it. Ranges of priorities that share similar properties are referred to as tiers:
• Free tier: jobs running at these lowest priorities incur no internal charges, and have no Service Level Objectives (SLOs). 2019 trace priority <= 99; 2011 trace priority bands 0 and 1.
• Best-effort Batch (beb) tier: jobs running at these priorities are managed by the batch scheduler and incur low internal charges; they have no associated SLOs. 2019 trace priority 110–115; 2011 trace priority bands 2–8.
• Mid-tier: jobs in this category offer SLOs weaker than those offered to production tier workloads, as well as lower internal charges. 2019 trace priority 116–119; not present in the 2011 trace.
• Production tier: jobs in this category require high availability (e.g., user-facing service jobs, or daemon jobs providing storage and networking primitives); internally charged for at “full price”. Borg will evict lower-tier jobs in order to ensure production tier jobs receive their expected level of service. 2019 trace priority 120–359; 2011 trace priority bands 9–10.
• Monitoring tier: jobs we deem critical to our infrastructure, including ones that monitor other jobs for problems. 2019 trace priority >= 360; 2011 trace priority band 11. (We merged the small number of monitoring jobs into the Production tier for this paper.)


https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf

2.5 Priority, quota, and admission control
What happens when more work shows up than can be accommodated? Our solutions for this are priority and quota.
Every job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that involves preempting (killing) the latter. Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free). For this paper, prod jobs are the ones in the monitoring and production bands.

Upgrading PHP on Ubuntu

One of the oddities of my personal server is that my public-facing site – www.gauntletwizard.net – is served from my personal `~/public_html/` folder. PHP is disabled in these folders by default, for good reason, but that reason is to keep PHP out of the hands of randos, and I’m careful about who’s on my machine.

Anyway – There’s a stanza in /etc/apache2/mods-enabled/php-[7].conf that begins with `Running PHP scripts in user directories is disabled by default` – Do as it says and comment that section out.
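
If you can’t remember which file that is, something like this will find it; Apache then needs a reload once the block is commented out (the filename varies by PHP version):

grep -n -A8 "user directories" /etc/apache2/mods-enabled/php*.conf
# comment out the <IfModule mod_userdir.c>...</IfModule> block it finds, then:
sudo systemctl reload apache2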

Delete keys in redis non-atomically

There’s a lot of information out there about how to atomically delete a sequence of keys in Redis. That’s great, if you want to cause your production cluster to block for minutes at a time while you do so. If you want to delete a bunch of keys with a scan, though, there’s less info.

redis-cli does support a --scan flag which, combined with a --pattern flag, lets you incrementally list a set of matching keys – like the KEYS command, except without causing your Redis server to block. You can then use this output to feed an xargs command.

For example: redis-cli --scan -h "${REDISHOST}" --pattern "PATTERN" | tee keys | xargs redis-cli -h "${REDISHOST}" del | tee deletions
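
A slightly gentler variant of the same idea, batching the deletes and using UNLINK (Redis 4.0+) so the server frees memory in the background:

redis-cli -h "${REDISHOST}" --scan --pattern "PATTERN" \
  | tee keys \
  | xargs -n 100 redis-cli -h "${REDISHOST}" unlink \
  | tee deletions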

Prometheus alerting and questions

I’ve been switching my company over to Prometheus, and I’ve come across a few things that need discussion and opinions.

First, concrete advice:
Don’t just write an alert like:

alert: foo
expr: sum(rate(bar[5m])) > 5

Write it so you record the rate, and then alert on that recorded metric:

record: bar:rate
expr: sum(rate(bar[5m]))

alert: foo
expr: bar:rate > 5

From my Google days, I can say I should probably encode the rate window in the recorded metric’s name – something like bar:rate5m rather than bar:rate.
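
Whatever shape your rules take, promtool will catch most mistakes before you reload Prometheus:

promtool check rules rules.yml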

Questions:
1) How long should the rate window be? [5m]? [2m]? 3? 10?
* I’ve adopted 5m as standard across my company, being a compromise between being fast-moving and not overly smoothed
2) How long should alert `for`s be?
3) Metric naming
* I’m using `A_Metric_Name`; Not sure if this is right
4) Recorded rule naming
* I like `product:metric[:submetric]:unit`; e.g. houseparty:websockets_open:byDeviceType:sum

Kubernetes Build best practices

1) Squash your builds
This is now part of default Docker, but it was well worth it even before. Docker will create a new tarball for each step – each ADD, RUN, etc. creates a new layer that, by default, you upload. This means that if you add secret material and then delete it in a later step, you haven’t really deleted it. More commonly, it bloats your image sizes: a couple of intermediate files can be a huge pain, and waste your time and bandwidth uploading.

Don’t squash down to a single, monolithic image – Pick a good base point. Having a fully-featured image as a base layer is not a sin – So long as you reuse it, it doesn’t take up any more space or download time, so your lightweight squashed build can build on top of it.
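
With podman, for example, --squash collapses only the layers your build adds and leaves the shared base image’s layers alone (--squash-all is the one that flattens everything):

podman build --squash -t image:tag .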

2) Use Multistage builds
Your build environment should be every bit as much a container as your output. Don’t build your artifacts on your local machine and then add them to your images – you’re likely polluting your output with local state more than you know. Deterministic builds require you to understand the state of the build machine and make sure it doesn’t leak, and containers are a wonderful tool for that.
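
A minimal sketch of what that looks like, assuming a Go service purely for illustration – the first stage has the full toolchain, the second ships only the artifact:

cat > Dockerfile <<'EOF'
# build stage: full toolchain, never shipped
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# runtime stage: only the compiled artifact
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
ENTRYPOINT ["/app"]
EOF
docker build -t image:tag .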

Alternatively:
Just use Bazel. Bazel’s https://github.com/bazelbuild/rules_docker is pretty simple to use, powerful, and generates docker-compatible images without actually running docker.

Migrating an SBT project to Bazel

I’ve been working today on migrating an SBT project to Bazel. I’ve taken a few wrong turns, and I’ll document them later, but this will be my working doc and I’ll add some failures to the end.

There are two major components – Bazel’s generate_workspace tool and SBT’s make-pom command. You’ll use make-pom to create a POM file listing the dependencies and repos.

ted:growth$ sbt make-pom
[warn] Executing in batch mode.
[warn] For better performance, hit [ENTER] to switch to interactive mode, or
[warn] consider launching sbt without any commands, or explicitly passing 'shell'
[info] Loading project definition from /Users/ted/dev/growth/project
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[info] Set current project to growth (in build file:/Users/ted/dev/growth/)
[warn] Multiple resolvers having different access mechanism configured with same name 'Artifactory-lib'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Wrote /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom
[success] Total time: 5 s, completed Jun 14, 2017 4:00:17 PM

This generates a pom file, but not exactly as generate_workspace wants it: generate_workspace requires a directory containing a pom.xml. So make a tempdir and copy the file into it:

TMPDIR="$(mktemp -d)"; cp /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom "${TMPDIR}/pom.xml"

Next, run generate_workspace against that directory.
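
I don’t have the exact invocation in front of me any more, but as I recall it was roughly the following – treat the flag name as an assumption and check the generate_workspace README for your Bazel version:

# from a checkout of the repository that contains the generate_workspace tool
bazel run //generate_workspace -- --maven_project="${TMPDIR}"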

So, on to the failures:
I initially tried to do my own workspace code generation. I took the output of sbt libraryDependencies and turned it into maven_jar stanzas via a script. This didn’t work, for the simple reason that I wasn’t doing it transitively – they mention that in the generate_workspace docs. I also tried specifying that list of deps as a big list of --archive stanzas; that turned out to be a mistake, mostly because of alternate repos. I also had to clean out a broken set of SBT repos; Bazel does not play well with repeated repo definitions, while SBT is happy to ignore them.

Security

The big companies I’ve worked at have all had security policies. The small companies haven’t. Frequently, all access to production machines has been controlled by a single shared SSH key. This sucks, but it’s inevitable given the lack of time to spend on tooling. However, there are some low-cost tools to make this better.

The basic developer workflow has been: type in a command, which will generate an SSH certificate – it asks for your password and U2F auth, then talks to the central signing server and gets that cert signed. This is surprisingly doable for a small org – BLESS and CURSE are two alternatives.

For myself, though, the right thing to do is to run ssh-agent. ssh-agent keeps your keys in memory, and can hold several keys at once. It also allows forwarding the auth socket to a remote host – so if you need to ssh through a bastion host, you don’t have to copy your SSH key to the bastion machine; it can live on your local drive and all authentication requests go back through it. ssh -A enables this forwarding.
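
The basic usage, for reference (the key file and bastion name are just examples):

eval "$(ssh-agent -s)"        # start an agent; exports SSH_AUTH_SOCK and SSH_AGENT_PID
ssh-add ~/.ssh/id_ed25519     # load a key, prompting for its passphrase once
ssh -A bastion.example.com    # forward the agent, so onward hops use your local keys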

The other problem I’ve encountered a few times is that I want to share my ssh-agent across several terminals. This can be a blessing or a curse, but on most of my machines I only have one or two keys, and while I want them encrypted at rest I don’t care if they’re loaded in memory a lot. I’ve written the shell script that does this many times, and today I asked myself why it’s not in the default ssh toolkit (like ssh-copy-id). Well, it’s not, but there is a tool that does what I’m looking for: Keychain, not to be confused with the OSX tool of the same name. Though, to my surprise, OSX *already has this functionality*: my default terminal opens with an SSH_AUTH_SOCK already populated, and it’s managed by the system. That’s pretty cool.
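
Keychain’s usual setup is one line in your shell rc file, roughly (the key name is again an example):

# in ~/.bashrc or ~/.zshrc
eval "$(keychain --eval --quiet id_ed25519)"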
