
Conversation

@kradalby
Collaborator

@kradalby kradalby commented May 21, 2025

This PR is kind of a "stretch goal".

This PR is very much needed to address the long-standing issues with the database schema and data integrity.

It adds a schema.sql which should be the source-of-truth representation of our schema after all migrations are applied (for SQLite). This ensures you can look in one place to see how the current schema is defined, instead of replaying the patchwork of migrations in your head to get an overview.
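For readers unfamiliar with this kind of check: Headscale does it in Go with the squibble library, but the idea can be sketched with Python's stdlib sqlite3 (the table below is invented for illustration). Build one database by running the migrations, build another straight from schema.sql, and compare the normalized CREATE statements:

```python
import sqlite3

def schema_of(conn: sqlite3.Connection) -> set:
    """Collect normalized CREATE statements for all user objects."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    # Collapse whitespace so formatting differences do not cause mismatches.
    return {" ".join(sql.split()) for (sql,) in rows}

# Database produced by running the migrations (hypothetical table).
migrated = sqlite3.connect(":memory:")
migrated.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Database created directly from the schema.sql source of truth.
truth = sqlite3.connect(":memory:")
truth.executescript("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);")

assert schema_of(migrated) == schema_of(truth), "schema drifted from schema.sql"
```

squibble does a much more thorough structural diff, but the principle is the same: the migrated database must end up byte-for-byte equivalent to the declared schema.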

Thanks to @nblock for discovering this.

How did we get here?

As far as I understand, this is a combination of several things: parts we did not know at the time, things we didn't think about, and naivety. I have identified at least the following:

SQLite requires foreign keys to be enabled on each connection
From the early days of Headscale, we added constraints to the schema via GORM, but for many early versions they were never enforced, because SQLite connections were opened without setting the PRAGMA foreign_keys, meaning SQLite simply ignored them all.
Running for years without any enforcement likely let all sorts of odd data relationships creep in, causing havoc when we eventually enabled it.

This has since been addressed, and from this PR on, we also enforce foreign keys during all migrations (we had to turn it off for some old migrations, as they are a bit all over the place).

All migrations were done in ONE function
Until Headscale 0.23.0, all migrations lived in one big function. What does that mean? It meant that no matter which migration version you started at, or whether parts of the migrations had been applied previously, Headscale had no clue and just tried to run them all. This is not very smart, and, let us say, it is amazing that it has not ended up worse than it has.

This has since been addressed: all changes are now versioned and run only if needed.
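What "versioned, run only if needed" means can be sketched in a few lines of Python (Headscale uses a Go migration library, and the migration IDs below are invented): each migration records its ID once applied, and a rerun skips everything already recorded.

```python
import sqlite3

# Invented migration IDs; the point is the bookkeeping, not the SQL.
MIGRATIONS = [
    ("202105210000-create-users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("202507070000-add-display-name",
     "ALTER TABLE users ADD COLUMN display_name TEXT"),
]

def migrate(conn):
    # Remember which migrations have already been applied.
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (id TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT id FROM migrations")}
    for mig_id, sql in MIGRATIONS:
        if mig_id in done:
            continue  # applied on a previous run; skip it
        conn.execute(sql)
        conn.execute("INSERT INTO migrations (id) VALUES (?)", (mig_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: the second run applies nothing
```

Contrast this with the old single-function approach, which would blindly re-run every step regardless of where the database actually was.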

GORM AutoMigrate is a bit of a trap
The ORM Headscale uses, GORM, has a potentially very nice migration helper called AutoMigrate. Simplified, it looks at the model (your Go struct) and the current table, calculates the difference, and creates a migration path.

The problem is that if you change a model multiple times over several versions and call AutoMigrate in the first migration, it will apply changes that are further ahead than it is supposed to. It does not use the "snapshot" of the model from the codebase at the time the migration was written. This is a bit of a minefield, because you can end up trying to apply a migration, in between two AutoMigrate calls, that depends on the data being in the shape of an older model, before upgrading to the final state.
I am sure we "used AutoMigrate wrong", and it is probably even covered in the docs, but since we often try to clean up old code, we do not want a lot of deprecated fields lingering on our models. It just does not fit our codebase. In addition, the opaqueness of AutoMigrate hides "what is actually changing" from someone who comes back to the migration later and needs to debug something or wants to understand what is going on.

To handle this, from this PR on we no longer accept the use of AutoMigrate, and all migrations have to be explicit.
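For contrast, an explicit migration spells out every statement, so the change is pinned to the schema shape of its day rather than to whatever the model struct looks like now. A hypothetical example (the columns are loosely inspired by, but not taken from, Headscale's real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, hostname TEXT)")
conn.execute("INSERT INTO nodes (hostname) VALUES ('node-1')")

# An explicit migration: every statement is spelled out, so a reader
# can see exactly what changed without diffing Go structs across versions.
conn.executescript("""
    ALTER TABLE nodes ADD COLUMN given_name TEXT;
    UPDATE nodes SET given_name = hostname WHERE given_name IS NULL;
""")
```

Reading this a year later, both the structural change and the data backfill are visible in one place; with AutoMigrate, only the end state would be.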

Why is this PR 6000+ lines?

We are doing a very large migration that essentially rebuilds all of the SQLite tables to ensure they match the new source-of-truth schema on disk. Since some tables out there might still carry columns that are no longer used (AutoMigrate does not remove columns), we have opted for rebuilding.

To make sure this is safe, we have done the following:

  • For every version of Headscale from 0.2.0 to 0.26.1 we have

    • extracted the schema at the time
    • dumped a database (empty, but with partial migrations)
    • tested that it can be migrated to the newest schema and matches it
  • For the other databases we have gotten hold of with previously "broken" migrations

    • we test migration to the new version
    • we try to validate that the data is as expected once migration has finished

Note: None of these changes apply to Postgres, which is in "maintenance only" mode.

@ghost

ghost commented May 21, 2025

Pull Request Revisions

  • r12: GitHub workflow files reformatted. Multiple GitHub workflow YAML files were reformatted with improved line breaks and multiline string handling, primarily affecting indentation and line wrapping.
  • r11: Commented out gopls build configuration. Removed the active gopls override configuration, converting it to a commented-out block in the Nix flake configuration.
  • r10: No changes since last revision.
  • r9: Database migration SQL simplified. Removed redundant EXISTS checks in database migration SQL insert statements, streamlining the data copy process.
  • r8: Refactored GitHub Actions integration tests. Created a reusable integration test template and updated the workflow to split SQLite and PostgreSQL test configurations with separate job matrices.
  • r7: The changes are too large to summarize.
  • r6: Added detailed SQLite configuration documentation. Expanded inline documentation for SQLite configuration enums, explaining performance and durability tradeoffs for journal modes, auto-vacuum, and synchronous settings.
  • r5: Refactored database code with minor improvements. Made multiple small code quality improvements across database-related files, including error handling, constant definitions, test cleanup, and minor code style adjustments.
  • r4: The changes are too large to summarize.
  • r3: Complex SQLite schema migration implemented. Added robust SQLite schema migration with comprehensive data preservation, validation, and handling of legacy database structures, including routes table migration and index recreation.
  • r2: SQLite database schema migration improved. Implemented comprehensive SQLite schema migration with robust data preservation, foreign key validation, index recreation, and advanced error handling mechanisms.
  • r1: Added database schema validation logic. Introduced schema validation for the SQLite database using an embedded SQL schema file and the squibble library, with specific validation for migrations.


@kradalby kradalby force-pushed the kradalby/schema-truth branch 3 times, most recently from cdec2b5 to e8d7570 Compare July 3, 2025 09:49
@kradalby
Collaborator Author

kradalby commented Jul 3, 2025

I'm marking this as ready, but I would like to gather some more "broken" schemas, so I will post in Discord. Also posted in #2475 and #2597.

If you have a broken database, I would love to get a dump of it. The most valuable would be:

  • The older (or multiple steps along the way) the better (an old backup for example)
  • The more data the better
  • Scrub PII, auth keys, and API keys, but don't remove things that don't matter (you can leave the node, disco, and machine public keys)
  • A description of the type of error you got when it was bad, and possibly how you fixed it

A dump could be taken with:

sqlite3 <DATABASE PATH> ".dump" > database.sql

If you have one of these, you can email me an SQL dump; my email is in my GitHub profile.

@kradalby kradalby marked this pull request as ready for review July 3, 2025 10:25
@kradalby kradalby requested a review from juanfont as a code owner July 3, 2025 10:25
@kradalby kradalby force-pushed the kradalby/schema-truth branch from e8d7570 to 7dca4e9 Compare July 3, 2025 11:21
@kradalby kradalby force-pushed the kradalby/schema-truth branch from 7dca4e9 to f068b3f Compare July 3, 2025 12:30
@kradalby kradalby requested a review from nblock July 3, 2025 12:30
Collaborator

@nblock nblock left a comment


Tested this PR with several databases and they got migrated properly.

ctx, cancel := context.WithTimeout(context.Background(), contextTimeoutSecs*time.Second)
defer cancel()

if err := squibble.Validate(ctx, sqlConn, dbSchema); err != nil {
Collaborator


I deliberately modified the schema and it reported the difference. The error is reported as single line which makes it hard to read: FTL cmd/headscale/cli/serve.go:24 > Error initializing error="creating new headscale: init state: init database: validating schema: invalid schema (-got, +want):\n\n>> Modify index \"idx_policies_deleted_at\"\n@@ -1 +1 @@\n-CREATE INDEX idx_policies_deleted_at ON \"policies2\"(deleted_at)\n+CREATE INDEX idx_policies_deleted_at ON policies(deleted_at)\n\n>> Remove table \"policies2\"\n\n>> Add table \"policies\"\n+ CREATE TABLE policies(\n+ id integer PRIMARY KEY AUTOINCREMENT,\n+ data text,\n+ \n+ created_at datetime,\n+ updated_at datetime,\n+ deleted_at datetime\n+ )\n"

Is it possible to print the error on multiple lines?

Collaborator Author


It should now be printed more clearly in human mode; the logging library was not very keen on respecting newlines.

log.Info().Int("routes", len(routes)).Int("nodes_with_routes", len(tableData.RoutesData)).Msg("Collected routes data")
}

// Drop all existing tables to avoid foreign key constraint issues
Collaborator


This seems dangerous to me. If any of the restore steps below fail, we'd leave the user with an empty/half-filled database.

Have you considered the alternative:

  • Rename existing tables to $TABLE_old
  • Drop all indices if exists
  • Restore data
  • Drop old tables

Data could be inserted directly from the old table into the new one (in case no changes are required), and the data would still be there in the renamed tables if anything fails during restore.

The migration itself could probably not run a second time, because the old tables are already renamed, but at least the data is still there in case the user has no backup.
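The rename-first approach suggested above can be sketched like this (Python stdlib sqlite3 with an invented table; the real implementation is Go): the original data survives under users_old until the copy has committed, and only then is the backup dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users (name) VALUES ('alice');
""")

# Keep the original data around until the copy has fully succeeded:
# rename, recreate, copy, and commit before anything is destroyed.
conn.executescript("""
    BEGIN;
    ALTER TABLE users RENAME TO users_old;
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users (id, name) SELECT id, name FROM users_old;
    COMMIT;
""")

# Only after everything above has committed is it safe to drop the backup.
conn.execute("DROP TABLE users_old")
```

If any step before the DROP fails, the transaction rolls back and the data is still intact under its original name, which is exactly the safety property the review asks for.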

Collaborator Author


Smarter, PTAL

@kradalby kradalby force-pushed the kradalby/schema-truth branch from f068b3f to cde5b80 Compare July 7, 2025 10:14
@kradalby kradalby requested a review from ohdearaugustin as a code owner July 7, 2025 10:14
@kradalby kradalby requested a review from nblock July 7, 2025 10:15
kradalby added 3 commits July 7, 2025 13:28
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
We are already being punished by github actions, there seem to be
little value in running all the tests for both databases, so only
run a few key tests to check postgres isnt broken.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
@kradalby kradalby force-pushed the kradalby/schema-truth branch from 48a4054 to 4d160a0 Compare July 7, 2025 11:28
@kradalby kradalby requested a review from nblock July 7, 2025 11:30
kradalby added 2 commits July 7, 2025 15:02
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
@kradalby kradalby merged commit 5ba7120 into juanfont:main Jul 7, 2025
83 of 87 checks passed
@kradalby kradalby deleted the kradalby/schema-truth branch July 7, 2025 14:48
