I moved my SpacetimeDB to a new VM mostly by typing into Claude Code

Background

FrameGrabber is a small Next.js app I use to pull frames out of video files and group them by face. The frontend is uninteresting; the interesting bit is that everything (the sessions, frames, embeddings, and full-resolution JPEGs) lives in SpacetimeDB, running locally on my Mac Studio. After a few months of use the data directory had grown to 30 GB, the Studio had become the only place I could access my own photos from, and I had a perfectly good Ubuntu VM named aegis sitting on my Proxmox cluster doing nothing. So I decided to move the database, the module, and the frontend over to it.

I did the deploy with an agent as the primary operator. I sat at the keyboard, but most of the typing was prompts. This is a write-up of how that went, with the failure modes left in.

The instruction

I gave it the goal, an SSH alias, and a vibes-level pointer to a similar project's Justfile for inspiration:

I am hosting this app on a VM named aegis you should be able to ssh there via ssh aegis@aegis.

I need you to go there, pull down the code changes, install spacetimedb, migrate the DB from here to there and start the app via some sort of service, I cannot remember how I did it on aegis but more than likely a simple systemctl service.

Do some research, if you need inspiration on a Justfile workflow than I like look at ../nano-collector/Justfile. Ask me questions if anything is unclear.

This is, deliberately, not a spec. I wanted to see how much reconnaissance the agent would do before reaching for rm or systemctl. The answer turned out to be: a lot. It SSH'd in and found an old framegrabber:latest Docker container on port 3000 from two months ago, an inactive spacetimedb.service systemd unit pointing at /stdb/spacetime on port 7832, no Rust toolchain, an old git checkout one commit behind origin, and Caddy already running locally as well as on a public VPS. Only then did it start asking me questions. The questions were good. The first one stopped the deploy cold.

Seam 1: the disk-space wall

Reconnaissance is supposed to find the things that aren't in your prompt. In this case it found:

/dev/mapper/ubuntu--vg-ubuntu--lv 50G 27G 21G 57% /

Twenty-one gigabytes free. A 30 GB SpacetimeDB data directory. Even if I deleted the old 2 GB container image I'd still be seven gigs underwater. The agent didn't try to be clever; it surfaced this as a hardware decision and laid out four options:

A. Grow the Proxmox VM disk
B. Attach a second volume and bind-mount it
C. Slim the local data first
D. Move the whole stack to a beefier host.

It explicitly recommended option A, then waited. I told it: can you ssh to ssh root@pve-node-gamma and let me know if A is feasible?

It walked the Proxmox node, found my VM (102 aegis) on the local-lvm thinpool, confirmed the pool had 72 GB free, and did a qm resize 102 scsi0 +50G online. Inside aegis, the rest was the textbook online-LVM grow:

sudo growpart /dev/sda 3
sudo pvresize /dev/sda3
sudo lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
sudo resize2fs /dev/ubuntu-vg/ubuntu-lv

50 GB → 100 GB, no reboot, ~30 seconds. This is the kind of moment I'd flag as the agent earning its keep: it caught a wall I would not have caught from my desk, refused to paper over it, and proposed the right fix instead of the convenient one.
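The check behind that catch is plain arithmetic. A minimal sketch in shell, assuming whole-gigabyte inputs; shortfall_gb is a hypothetical helper name, not something the agent actually ran:

```shell
# Hypothetical pre-flight helper: how many GB short is the target disk?
# Arguments (whole gigabytes): needed, free, reclaimable.
shortfall_gb() {
    short=$(( $1 - $2 - $3 ))
    # A negative result means the data fits; report zero in that case.
    if [ "$short" -lt 0 ]; then short=0; fi
    echo "$short"
}

# One way to gather the "free" input on an Ubuntu target (GNU df).
# Shown for reference, not run here:
#   ssh aegis@aegis "df -BG --output=avail / | tail -n 1 | tr -dc '0-9'"
```

Fed the numbers above (30 needed, 21 free, 2 reclaimable), it reports a positive shortfall, which is the signal to stop and ask rather than start copying.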

Migrating the data

The next obstacle was operational. Every command on aegis needed sudo, and the Bash tool the agent runs through has no terminal, so there's nowhere to type a password. The agent flagged this. Being both lazy and a little curious about what would happen, I added a one-line /etc/sudoers.d/99-aegis-nopasswd entry for the duration of the deploy, hoping I wouldn't soon regret it. (I removed the entry at the end. I would not call this best practice, but I would call it pragmatic.)

Then the actual data: 30 GB rsync'd from ~/.local/share/spacetime/data/ on the Mac to /stdb/data/ on aegis. Direct over Tailscale, ~70 MB/s, a little under seven minutes. The funniest beat in this section: macOS's bundled rsync is openrsync, which is only rsync 2.6.9-compatible and doesn't speak --info=progress2. The agent's first attempt failed instantly. It paused, ran brew install rsync, and continued with the Homebrew one. I would have done exactly the same dance, though I only found this out after the fact, reading the thread more closely.
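For what it's worth, that first failure is avoidable with a version check before choosing flags. A rough sketch, assuming the version number has already been pulled out of rsync --version by hand; both function names are made up for illustration. --info=progress2 needs rsync 3.1.0 or newer:

```shell
# Hypothetical helper: succeed if a version string such as "3.2.7" or
# "2.6.9" is new enough for --info=progress2 (rsync >= 3.1.0).
rsync_has_progress2() {
    ver=$1
    major=${ver%%.*}        # text before the first dot
    rest=${ver#*.}          # text after the first dot
    minor=${rest%%.*}       # second component
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }
}

# Choose a progress flag the given rsync version will accept,
# falling back to plain verbose output on older builds.
progress_flag() {
    if rsync_has_progress2 "$1"; then
        echo "--info=progress2"
    else
        echo "-v"
    fi
}
```

So progress_flag 2.6.9 yields -v and progress_flag 3.2.7 yields --info=progress2, and the transfer command can use whichever comes back.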

I'd also pre-stopped the local STDB so I wouldn't be copying a moving target. The agent noticed via pgrep and lsof that no process was actually holding the data dir, told me so, and proceeded.

Seam 2: the JWT-key 403

Data on disk, ownership corrected back to the spacetimedb user, service started, port 127.0.0.1:7832 listening. First publish:

Error: Pre-publish check failed with status 403 Forbidden:
c2009b541394… is not authorized to perform action on
database c2001fcc7a21…: update database

This one was actually interesting, and the agent worked through it correctly with no help from me.

The migrated database carries its provenance. It was created and owned by my Mac's CLI identity (hex c2000c6b…). My CLI had a token authorising me as that identity. But on aegis, SpacetimeDB issues tokens signed by a different ECDSA keypair, the one in /stdb/config/id_ecdsa{,.pub} that was generated when I first set up the VM years ago. So the token I had wasn't valid on aegis even though it represented the right identity. The aegis CLI, talking to its own server, transparently fell back to creating a new identity (c2009b54…), which unsurprisingly doesn't own anything.

Fix: replace aegis's server keypair with my Mac's:

sudo systemctl stop spacetimedb.service
sudo install -o spacetimedb -g spacetimedb -m 600 \
    /tmp/id_ecdsa /stdb/config/id_ecdsa
sudo install -o spacetimedb -g spacetimedb -m 644 \
    /tmp/id_ecdsa.pub /stdb/config/id_ecdsa.pub
sudo systemctl start spacetimedb.service

Then re-add the server alias on aegis (because the fingerprint changed) and republish:

Checking for breaking changes...
Database Migration Plan
Publishing module...
Updated database with name: frequent-boundary-8032

Empty migration plan, because I'd cleanly committed master before starting and the wasm matched the schema in the data. SELECT COUNT(*) FROM frame returned 2113. session: 56. person: 38. The data was real, on aegis, and queryable.

The wider lesson (whose identity owns this thing?) doesn't show up in manuals or tutorials, but it always shows up in real migrations. Convex, Postgres role grants, Cloudflare Tunnel: all of them have this same shape. The data is portable but ownership of the data isn't, and the seams between them are where the surprises live. The agent navigated this one entirely from the error message, which I appreciated.

Seam 3: STDB on loopback / Caddy 502

By this point I'd told it to make all the rational choices and let me know, partly because I trusted the recon, partly because answering a question every five minutes is its own kind of work, and partly because I had a meeting I couldn't miss. Thankfully, most of the rest of the deploy ran without my intervention.

It built and started a Docker container for the frontend on port 3000, edited my public VPS's Caddyfile to add a framegrabber.nodegroup.ca:8443 block reverse-proxying to http://aegis:7832 (so the frontend's wss://…:8443 connection has somewhere to land), reloaded Caddy, and ran the smoke test. Everything 200'd except for one thing:

$ curl -sI https://framegrabber.nodegroup.ca:8443/
HTTP/2 502
server: Caddy

The fix was one line. The systemd unit had been authored with --listen-addr=127.0.0.1:7832, presumably from whatever cargo-culting I'd been doing the day I provisioned the VM. Caddy on the public VPS was reaching aegis over Tailscale and getting connection refused, because STDB was bound to the loopback interface only and nothing was listening on the tailnet address. sed -i 's|--listen-addr=.127.0.0.1:7832.|--listen-addr=0.0.0.0:7832|', daemon-reload, restart, re-curl: 426 Upgrade Required from the WebSocket endpoint, which is the correct response to a curl without Sec-WebSocket-Key.

This is one I'm a little annoyed about. The agent "knew" the abstract pattern, loopback vs. all-interfaces, but didn't pre-flight the existing unit. A human operator would have read the ExecStart= line during the recon pass and clocked it. The agent only "noticed" once Caddy threw 502.
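The pre-flight I'd have wanted is small. A sketch, assuming the unit passes the address as a --listen-addr flag on its ExecStart= line, as mine did; binds_loopback_only is a hypothetical helper, not part of any tool:

```shell
# Hypothetical recon helper: read unit file text on stdin and succeed
# (exit 0) if a --listen-addr flag pins the listener to 127.0.0.1.
# The ".?" tolerates an optional quote character before the address.
binds_loopback_only() {
    grep -Eq -e '--listen-addr[= ].?127\.0\.0\.1:'
}

# Intended use on the target host (not run here):
#   systemctl cat spacetimedb.service | binds_loopback_only \
#       && echo "warning: STDB is only reachable from localhost"
```

Run during the recon pass, a warning like that would have surfaced the bind address before Caddy ever got involved.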

The optimization round, and seam 4

I told it the app worked but felt slow. It volunteered two changes without my asking. Both were good. Both shipped. One of them broke the build immediately, in a way that was foreseeable and that the agent did not foresee.

Change one: rewrite the Dockerfile as multi-stage, switch next.config.ts to output: "standalone", and copy only .next/standalone, .next/static, and public/ into the runtime image. The image dropped from 2.04 GB to 191 MB. Cold start dropped from 1184 ms to 126 ms. A docker image prune -af afterwards reclaimed 11 GB on aegis.

Change two: replace the hard-coded wss://${hostname}:8443 in src/lib/spacetimedb.ts with a small resolver. HTTPS goes through Caddy on :8443, plain HTTP to a non-loopback hostname goes direct to :7832, localhost goes to dev STDB on :3000. So when I'm on the tailnet, my browser hits http://aegis:3000 for the page and ws://aegis:7832 for the data, and the public VPS doesn't enter the loop at all.
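Those rules are easier to see as a decision table. The real resolver is TypeScript; this is the same logic restated as a shell function purely for illustration. The function name stdb_url and the order in which the rules are checked are assumptions, not copied from src/lib/spacetimedb.ts:

```shell
# Illustration only: restates the routing rules, not the real code.
# Arguments: page scheme ("http" or "https") and page hostname.
stdb_url() {
    scheme=$1; host=$2
    if [ "$scheme" = "https" ]; then
        # Public path: TLS terminates at Caddy on :8443.
        echo "wss://$host:8443"
    else
        case $host in
            localhost|127.0.0.1)
                # Local development: dev STDB on :3000.
                echo "ws://$host:3000" ;;
            *)
                # Tailnet: talk to STDB directly on :7832.
                echo "ws://$host:7832" ;;
        esac
    fi
}
```

For example, stdb_url http aegis gives ws://aegis:7832, which is the direct tailnet path that keeps the public VPS out of the loop.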

The first deploy of these changes died here:

COPY failed: "/app/public": not found

public/ existed locally but was empty, because Next 15 doesn't put anything there by default and, as we know, git doesn't track empty directories. The fresh checkout on aegis had no public/ for Docker to copy. The fix is one byte (public/.gitkeep) and one of those mistakes that would have made me roll my eyes if I'd seen it in a code review. The agent had every input it needed to anticipate this (the ls public/ it had run earlier, the .gitignore, the COPY line in the Dockerfile it was literally writing) and it just didn't connect them. It caught the problem after Docker did.
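A check that would have connected those dots, sketched under the assumption that it runs from the repository root before docker build; dir_survives_clone is a hypothetical name:

```shell
# Hypothetical pre-build check: succeed only if the directory will
# exist in a fresh clone, i.e. git tracks at least one file under it.
dir_survives_clone() {
    [ -n "$(git ls-files -- "$1" | head -n 1)" ]
}

# Typical remedy when the check fails (not run here):
#   dir_survives_clone public || {
#       touch public/.gitkeep
#       git add public/.gitkeep
#   }
```

Run against every source path named in a COPY line, this fails on an empty public/ locally, before a remote build gets the chance to.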

I'm flagging this one in writing because it's the most representative seam in the whole deploy. The work that went well was structurally easy: each step had clear inputs and a clear next move. The work that went poorly was the work that required holding two things in mind at once and noticing that they were going to collide. That's exactly the kind of work I personally find easy and exactly the kind of work agents currently struggle with.

Postscript: seam 5, in the harness this time

A few hours after I called the deploy done I noticed the app felt sluggish, and a curl to the API hung for half a second when it should have been instant. I told the agent: it looks like stdb went down on aegis, can you take a look?

It hadn't. STDB was fine, port 7832 was listening, the database endpoint returned 200 in eleven milliseconds, the data was there. But:

$ uptime
 17:56:29 up 15 days, 16:10,  1 user,
 load average: 3.03, 3.05, 3.00

Three runaway processes were pegging three of aegis's four cores at 100% each. They had been running for three and a half hours. They were:

312195  /stdb/spacetime --root-dir=/stdb list --server http://127.0.0.1:7832
312294  /stdb/spacetime --root-dir=/stdb list
312338  /stdb/spacetime --root-dir=/stdb server add aegis-local --url …

These are the agent's own commands, from the very first hour of the deploy. They were the failed --root-dir=/stdb experiments, the ones where the CLI tried to write its config into a directory that the aegis user couldn't write to. Instead of erroring out, the CLI silently busy-looped. The agent's harness, having gotten no response, moved on to the next tool call. The remote processes kept spinning. The SSH session ended. The processes did not.

kill 312195 312294 312338, load average back to under 1, app feels normal again. But the lesson is uncomfortable: the seams aren't just in the system you're deploying. They're in the harness too. An agent that runs commands over SSH needs to bind the lifetime of the remote process to the lifetime of its own session (ssh -tt, timeout 30s, something), or you accumulate orphans on every box you touch. I went back and reviewed three hours of CPU graphs to find the only spike that mattered.
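A sketch of what capping those lifetimes could look like, assuming GNU coreutils timeout is installed on the remote host. run_remote and the SSH override variable are hypothetical names, not part of any harness:

```shell
# Run a command on a remote host with a hard cap on its lifetime.
# Usage: run_remote HOST LIMIT COMMAND [ARGS...]
# SSH is a hypothetical override hook for swapping out the ssh binary.
run_remote() {
    host=$1; limit=$2; shift 2
    # BatchMode: fail fast instead of hanging on a password prompt.
    # timeout sends TERM after LIMIT, then KILL 5s later if the
    # process is still alive (which a busy loop may well be).
    # Note: ssh joins the arguments with spaces and the remote shell
    # re-parses them, so this suits simple commands only.
    ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$host" \
        timeout -k 5s "$limit" "$@"
}

# Example (not run here):
#   run_remote aegis@aegis 30s /stdb/spacetime --root-dir=/stdb list
```

The timeout runs on the remote side, so the cap holds even if the local session has already given up and moved on. ssh -tt attacks the same problem from the other end: with a pseudo-terminal allocated, the remote process should get a hangup when the session goes away.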

Reflecting

The trustable agent moves were the ones with cheap blast radius. Reconnaissance, dry-runs, drafting a Justfile, drafting a Caddy block, building a wasm artifact, scp'ing a file. The moves I had to be on the keyboard for were the ones that touched shared or destructive state. Editing a production Caddyfile that serves a dozen of my domains. Adding (and later removing) a sudoers.d entry. git reset --hard on a stale checkout. Anything that touched data I cared about — and there were 30 GB of it I cared about, which is why my final instruction before the long rsync was it is imperative the data gets onto the new instance. That instruction landed; the agent treated the data with appropriate respect from then on.

The agent was good at escalating uncertainty (the disk-space wall, the Proxmox resize question) and bad at anticipating second-order failures (the loopback bind, the empty public/, the orphaned CLI processes). I'd run a deploy this way again, and I'd run it faster than I did this one because I now know the failure modes. I would not run one with my hands off the keyboard.

The other half of this is that every single seam I described — the JWT keys, the --listen-addr, the :8443 in the Caddyfile, the Tailscale IP in DNS — is a thing a managed platform would have hidden. That's the trade. You eat the seams in exchange for owning the whole pipeline, and the result is that a 30 GB database, a Rust module, and a Next.js frontend all moved between two boxes I rent for less than the price of a coffee a month. The agent didn't change that math. It just compressed the time.

It's at https://framegrabber.nodegroup.ca. The cold start really is faster.