
Conversation

@gerhard gerhard commented Mar 14, 2022

This is the story of migrating changelog.com from LKE to fly.io, a PaaS where Phoenix screams.

The main reason was to challenge assumptions (mostly mine) and experiment with doing things differently, in a simpler way. I am genuinely curious to discover what we will actually miss most from our Kubernetes setup, what alternatives there are (some may be better!), and most importantly how well this actually works in practice. I still think that AND propositions are better than OR, and while I don't rule Kubernetes out of our future, now is a good time to spring clean and shake things up a bit 😉

Also, I am heeding wise advice and migrating our Postgres to a managed fly.io instance:

[screenshot of the advice]

I ran most of the changes imperatively, by issuing fly commands. While I tried to capture some of them with the latest Dagger 0.2.0 implementation, at some point I started getting slowed down because I was trying to reconcile two things at once. I have since shifted focus to fly only, in order to achieve the primary goal of migrating changelog.com to fly.io quicker.

The backstory is that we are currently running on Linode Kubernetes Engine (LKE) 1.20, which will be forcefully upgraded to 1.21 on March 17, 2022 (so in less than 48 hours). K8s 1.20 is EOL and upgrades + security patches are no longer available. While we started a 1.22 K8s LKE cluster in the 2022 dir, we didn't finish that migration - life happened - and at this point the quickest thing is to migrate to fly.io. In all honesty, we have been talking about this for months: Ship It #44, #43, #40 & #31. This is the simplest thing right now. We will make some new friends - 👋🏻 @mrkurt @chrismccord @brainlid 👋🏻 - and most likely have a great story to share.

OK, let's get this party started!

What happened so far?

  1. Create a 2-node PostgreSQL 14 cluster on fly.io
  2. Import prod-2021-04_changelog_2022-03-13T13.00.09Z.sql, an hourly backup taken from our current LKE production origin
  3. Load all secrets from LastPass into fly.io
  4. Deploy changelog.com app as a container image
  5. Scale the app instance to dedicated-cpu-4x (the app was crashing due to OOM errors)

The app instance is currently running at https://changelog-2022-03-13.fly.dev/ 🙌🏻
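
For reference, this is roughly the sequence of fly commands behind the steps above. It is a sketch from memory rather than a verbatim transcript, and the secrets file name & image tag are stand-ins:

# 1. Create a 2-node PostgreSQL 14 cluster
fly postgres create --name postgres-2022-03-12 --region iad --initial-cluster-size 2
# 2. Import the hourly LKE backup (proxy in one terminal, psql in another)
flyctl proxy 5432 -a postgres-2022-03-12
psql -h 127.0.0.1 changelog postgres < ~/Downloads/prod-2021-04_changelog_2022-03-13T13.00.09Z.sql
# 3. Load all secrets exported from LastPass as NAME=VALUE pairs
fly secrets import -a changelog-2022-03-13 < changelog-secrets.env
# 4. Deploy the app as a container image
fly deploy -a changelog-2022-03-13 --image thechangelog/changelog.com:<tag>
# 5. Scale the app instance up to stop the OOM crashes
fly scale vm dedicated-cpu-4x -a changelog-2022-03-13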

What happens next?

# stop the app so that no new writes hit the db during the re-import
fly scale count 0
# drop & re-create the changelog database on the fly.io Postgres cluster
fly postgres connect postgres-2022-03-12
drop database changelog with (FORCE);
create database changelog;
# proxy the fly.io Postgres locally & import the latest LKE backup
flyctl proxy 5432 -a postgres-2022-03-12
psql -h 127.0.0.1 changelog postgres < ~/Downloads/prod-2021-04_changelog_2022-03-15T23.24.19Z.sql
# start the app back up
fly scale count 1
  • Figure out why this is happening on app boot → this feels like a @akoutmos task
TelemetryMetricsPrometheus.Core.EventHandler has failed and has been detached
20:31:29.901 [error] Handler {TelemetryMetricsPrometheus.Core.EventHandler, #PID<0.518.0>,
 [:changelog, :prom_ex, :oban, :init, :circuit, :backoff, :milliseconds]} has failed and has been detached. Class=:error
Reason={:badkey, :circuit_backoff,
 %Oban.Config{
   dispatch_cooldown: 5,
   engine: Oban.Queue.BasicEngine,
   get_dynamic_repo: nil,
   log: false,
   name: Oban,
   node: "31eb7755",
   notifier: Oban.Notifiers.Postgres,
   plugins: [
     {Oban.Plugins.Cron,
      [
        timezone: "US/Central",
        crontab: [
          {"0 4 * * *", Changelog.ObanWorkers.StatsProcessor},
          {"0 3 * * *", Changelog.ObanWorkers.SlackImporter},
          {"* * * * *", Changelog.ObanWorkers.NewsPublisher}
        ]
      ]},
     Oban.Plugins.Stager,
     Oban.Plugins.Pruner
   ],
   prefix: "public",
   queues: [comment_notifier: [limit: 10], scheduled: [limit: 5]],
   repo: Changelog.Repo,
   shutdown_grace_period: 15000
 }}
Stacktrace=[
  {PromEx.Plugins.Oban, :"-oban_supervisor_init_event_metrics/2-fun-2-", 2,
   [file: 'lib/prom_ex/plugins/oban.ex', line: 389]},
  {TelemetryMetricsPrometheus.Core.EventHandler, :get_measurement, 3,
   [file: 'lib/core/event_handler.ex', line: 63]},
  {TelemetryMetricsPrometheus.Core.LastValue, :handle_event, 4,
   [file: 'lib/core/last_value.ex', line: 56]},
  {:telemetry, :"-execute/3-fun-0-", 4,
   [file: '/app/deps/telemetry/src/telemetry.erl', line: 150]},
  {:lists, :foreach, 2, [file: 'lists.erl', line: 1342]},
  {Task.Supervised, :invoke_mfa, 2, [file: 'lib/task/supervised.ex', line: 90]},
  {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
]
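
Judging by the stack trace alone, PromEx's Oban plugin is reading a :circuit_backoff key which the %Oban.Config{} struct above no longer carries, so this smells like a PromEx ↔ Oban version mismatch - but I will leave the proper diagnosis to the task above.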
  


  • Perform one last db re-import before production traffic starts getting routed to the new origin
  • Route changelog.com production traffic to changelog-2022-03-13.fly.dev origin
  • Check Fastly stats & Honeycomb events for any origin issues
  • Stop auto-deploys on 22.changelog.com (they trigger LKE db backups) - remove keel
  • Delete backup db cron job (this will stop LKE db backups every hour)
  • Migrate vanity domains to fly.io (see the sketch after this list)
  • Capture Fastly VCL working config
  • Merge PR
  • Update image tag in fly.toml to thechangelog/changelog.com:master
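
On the vanity domains item above, I expect each domain to roughly follow this shape - a sketch, with <vanity-domain> standing in for the actual domains:

# issue a certificate for the domain on the Fly app
fly certs add <vanity-domain> -a changelog-2022-03-13
# check the DNS instructions & certificate status, then point the domain at changelog-2022-03-13.fly.dev
fly certs show <vanity-domain> -a changelog-2022-03-13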

How can you help?

  • Does the above list sound reasonable?
  • Have we missed anything? NOOP if nothing comes to mind.
  • Do the code changes look reasonable? → just one change for now

All Changeloggers & Shippers that are able to help, now is a good time to assemble 🛡
@jerodsanto @adamstac @lawik @akoutmos @mrkurt @chrismccord @brainlid @nickjj

What could have gone better?

I will circle back to this; there are a few other important things which I need to cover first.

Just a brain offload for now:

fly logs stopped streaming logs for 1' 20" after the app restarted
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.375 [info] Access ChangelogWeb.Endpoint at https://changelog-2022-03-13.fly.dev
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.893 [info] PromEx.LifecycleAnnotator successfully created start annotation in Grafana.
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.983 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/application.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.167 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/beam.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.315 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/ecto.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.530 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/phoenix.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.719 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/oban.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.839 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/app.json to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.983 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/erlang.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.118 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/ecto.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.253 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/phoenix.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.907 [error] Handler {TelemetryMetricsPrometheus.Core.EventHandler, #PID<0.531.0>,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] [:changelog, :prom_ex, :oban, :init, :circuit, :backoff, :milliseconds]} has failed and has been detached. Class=:error
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]Reason={:badkey, :circuit_backoff,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] %Oban.Config{
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]   dispatch_cooldown: 5,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]   engine: Oban.Queue.BasicEngine,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]   get_dynamic_repo: nil,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]   log: false,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]   name: Oban,
2022-03-15T22:42:37Z app[d6b64e8b] iad [info]22:42:37.080 request_id=FtyvKaNewnaiVAUAABEB [info] GET /manifest.json
2022-03-15T22:42:37Z app[d6b64e8b] iad [info]22:42:37.080 request_id=FtyvKaNewnaiVAUAABEB [info] Sent 200 in 732µs
  

Based on my CF Loggregator & Heroku Logplex experience, I am pretty sure that when our app boots, we stream too many logs and the overload protection mechanism gets triggered. This sheds a bunch of logs with the 2022-03-15T22:41:17Z timestamp, and unfortunately these are exactly the logs that we need to debug an error. I will check to see if this crash gets logged in Sentry - it should! Nope, it doesn't 😑


lawik commented Mar 16, 2022

This is fun. Hope the journey goes smoothly :)
I have a client project on fly right now and haven't had much in the way of problems.
Running postgres cli commands against my db over wireguard was a bit finicky but once I'd gotten it to work it worked well. For one thing I had to match the cli tools version locally so ended up turning to docker for that.


gerhard commented Mar 16, 2022

> This is fun. Hope the journey goes smoothly :)

I'm glad that you are enjoying it!

We have hit a couple of issues, but nothing that we couldn't work around. I will capture them in the above description as we get closer to wrapping this up.

The big issue that is currently blocking progress is that admin logins are not working. There are no errors in the app log, but it still doesn't work. The issue that I am referring to is:

  • Figure out why we cannot log in as admins via email → this feels like a @jerodsanto task, and he already has the context. I am assuming that Jerod will need to set up the fly CLI so that he can interact with the new infra, but I am waiting for him to pull my leg 😀
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.025 request_id=Ftyyw2bsYSdbQdkAACfi [info] GET /in/67657268617264406368616E67656C6F672E636F6D7C30363945313631313432333137363639
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.074 request_id=Ftyyw2bsYSdbQdkAACfi [info] Sent 302 in 48ms
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.164 request_id=Ftyyw28rpuBF5-YAACgi [info] GET /~
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.170 request_id=Ftyyw28rpuBF5-YAACgi [info] Sent 302 in 6ms
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.256 request_id=Ftyyw3SvY3s2S4YAAChC [info] GET /in
2022-03-15T23:48:36Z app[85c3240d] iad [info]23:48:36.260 request_id=Ftyyw3SvY3s2S4YAAChC [info] Sent 200 in 3ms

> I have a client project on fly right now and haven't had much in the way of problems. Running postgres cli commands against my db over wireguard was a bit finicky but once I'd gotten it to work it worked well.

flyctl proxy 5432 -a postgres-2022-03-12
psql -h 127.0.0.1 changelog postgres

This worked well for me from a developer experience (DX) perspective, but the speed was dog slow. Importing the db was at least 2x slower over wireguard than when I was running it directly in our K8s cluster. I didn't spend too much time on it since we are talking 5 minutes slower, but it was something that made me 🤔

> For one thing I had to match the cli tools version locally so ended up turning to docker for that.

I know exactly what you mean. That is the higher-level perspective that I hold in this migration: how do we capture all these tasks in a Dagger plan that anyone can use on their host, without polluting their environment? Fun times ahead indeed 😉 @aluzzardi @samalba @shykes



mrkurt commented Mar 16, 2022

> This worked well for me from a developer experience (DX) perspective, but the speed was dog slow. Importing the db was at least 2x slower over wireguard than when I was running it directly in our K8s cluster. I didn't spend too much time on it since we are talking 5 minutes slower, but it was something that made me 🤔

A 2x slowdown is somewhat normal for big operations over fly proxy. This uses user-mode WireGuard, which is not nearly as fast as what's in the Linux kernel.

We have a migrator project you can use to run imports from a Fly VM to another Fly VM. It will be much faster: https://github.com/fly-apps/postgres-migrator

You can also run fly ssh console -a <database> and do the dump + import "locally".
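
Roughly, the second option looks like this - hostnames & credentials are placeholders, and it assumes the source db is reachable from inside the Fly VM:

fly ssh console -a postgres-2022-03-12
# then, inside the VM, stream the dump from the source db straight into the local one
pg_dump postgres://<user>:<password>@<source-db-host>:5432/changelog | psql postgres://postgres:<password>@localhost:5432/changelog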

@gerhard gerhard marked this pull request as draft March 16, 2022 21:57

@gerhard gerhard changed the title from Deploy https://changelog-2022-03-13.fly.dev/ to ✈️ Migrate changelog.com to fly.io ✈️ on Mar 16, 2022
@gerhard gerhard force-pushed the 2022.fly branch 3 times, most recently from 27ecdc1 to 2d33dbe on March 16, 2022 at 22:50
gerhard added a commit that referenced this pull request Mar 17, 2022
Follow-up from #400

Plan B for #407

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard added a commit that referenced this pull request Mar 27, 2022
The grafana-agent migration is more involved, going for the simplest
thing right now. Everything is going to change soon anyways:
#407

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
(cherry picked from commit f4e8cfb844ab908031b5d22200519aaf79af084d)
@gerhard gerhard force-pushed the 2022.fly branch 3 times, most recently from 6c641ea to 977501b on April 3, 2022 at 15:19
gerhard added 3 commits April 3, 2022 16:44
Most of this was done imperatively, by running fly CLI. Some of it was
captured in the main.cue. The focus of this is to get this working, have
a discussion, and then to make it right.

Capture a bunch more fly commands as dagger actions

Deploy to fly.io if prod_image gets built & pushed successfully

This *must* be moved into a standalone fly workflow right after we
migrate. While this was the simplest thing, it's wrong to combine build
& deploy concerns into a single pipeline.

This next step requires build_test_ship to commit a fly.toml update, and
this workflow to trigger on fly.toml updates

🔥 This is fine for now 🔥

⚠️  cue.mod is for Dagger 0.2.x and will break the
0.1.x workflow that LKE 2022 current prod depends on. ⚠️

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This is a *temporary* change and *must* be reverted before this gets
merged, otherwise new admin logins on changelog.com will stop working.

This commit shows how to get admin logins working on fly.io, when the
app is running behind changelog-2022-03-13.fly.dev. Due to security
reasons, .fly.dev. domains are *not* allowed and will *not* work.

The gotcha is that this is a *compile-time* config, meaning that setting it
in the env is *not* going to work.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This reverts commit 4d47781.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
changelog.com in Fastly -> Fly is not working correctly (all requests
are being sent to the S3 backend) and I wanted to double-check my
understanding of how all the pieces fit together. They fit as expected:

- https://old-flower-9005.fly.dev/
- https://lazu.ch/

But not for changelog.com... time for some extra pairs of eyes.

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>

gerhard commented Apr 3, 2022

When we configured Fastly to route all production traffic from Fastly (changelog.com) to Fly (changelog-2022-03-13.fly.dev), Fastly was sending this traffic to our S3 bucket origin instead 🤦‍♂️

This was wrong, since only cdn.changelog.com requests should be sent to the S3 bucket. The condition was right there, and yet... it was not working as expected.

I thought that we had something misconfigured, but clicking through the Fastly service config UI is just pants, so I downloaded the VCL config and then 😲

$ wc -l fastly-changelog-v113.vcl
   12970 fastly-changelog-v113.vcl

There are ~13k lines of VCL config, out of which ~12.3k are just shielding config 🤔

Having trimmed all that excess, we were left with ~700 lines of relevant VCL config:

$ wc -l fastly-changelog-v113.no-shielding.vcl
     718 fastly-changelog-v113.no-shielding.vcl

Even so, we couldn't find the problem. Everything looked right, and yet changelog.com requests were not getting sent to the correct backend. Maybe I was getting too tired and I just couldn't see the problem even if my life depended on it...

We scrambled to finish setting up another LKE instance and route all changelog.com traffic to the new LKE origin, 22.changelog.com, and call it a long, 16h day.

I was now able to clear some headspace, start with a clean slate, and confirm that the Fastly → Fly integration works as expected: https://lazu.ch/

[screenshot: https://lazu.ch/ response served via Fastly → Fly, negotiated over h3]

NOTE: as dope as that h3 protocol is, let's focus on the fact that this works as expected 😎

And it works directly too: https://old-flower-9005.fly.dev/

[screenshot: https://old-flower-9005.fly.dev/ response]

So there is something wrong with the 2022.fly/lazu.ch/fastly-changelog-v113.no-shielding.vcl config, and we are going to find out what it is next. Before we can do that, there is something else that we need to do first:

  • Configure the managed PostgreSQL instance to get backed up periodically (every 1h)
  • Backup the managed PostgreSQL instance on every deploy
  • Reload managed PostgreSQL instance with latest backup
  • Configure the LKE 1.22 app instance to use the managed PostgreSQL instance

The reason why we need to do this first is that we want a single db instance serving both the LKE 1.22 & Fly app instances. Regardless of what happens with the experiments, db data will remain consistent and we won't have to worry about missed writes. Currently, the LKE 1.22 app instance uses a local db, and the Fly app instance uses the managed PostgreSQL instance.
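
For the hourly backups mentioned above, this is roughly the shape I have in mind - a sketch only; the bucket name & credentials are placeholders, and the same dump can double as the on-deploy backup:

# proxy the managed PostgreSQL instance locally (separate terminal or background job)
flyctl proxy 5432 -a postgres-2022-03-12
# dump the db & ship the compressed backup to object storage, e.g. from an hourly cron
pg_dump -h 127.0.0.1 -U postgres changelog | gzip | aws s3 cp - s3://<backups-bucket>/prod_changelog_$(date -u +%Y-%m-%dT%H.%M.%SZ).sql.gz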

In hindsight, we should have split this issue into multiple PRs, and that should have been the first PR to get merged. Things got hectic with the LKE 1.20 migration, so I am not surprised that this did not go as smoothly as it could have, but now that we have had more time to think about it, the incremental steps are obvious.


If anyone is interested in the VCL configs or the test app code, everything is in this PR, in the 2022.fly dir.

gerhard added 2 commits April 4, 2022 00:40
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>

gerhard commented Apr 4, 2022

Now that we have gone from LKE -> fly.io, the Miss Latency metric in Fastly was interesting to notice:

[screenshot: Fastly Miss Latency graph]

I suspect that this Honeycomb query is a good starting point to continue exploring:

[screenshot: Honeycomb query]


gerhard commented Apr 4, 2022

Reminder to investigate Fastly cache errors (convert to issue). Honeycomb starting point:

[screenshot: Honeycomb query]

gerhard added 2 commits April 5, 2022 00:18
It would have been nice to capture these declaratively, in the fly.toml

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This makes everything work together as expected.
FTR: https://support.fastly.com/hc/en-us/requests/477850

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard gerhard marked this pull request as ready for review April 4, 2022 23:35
@gerhard gerhard merged commit 607bad6 into master Apr 5, 2022
@gerhard gerhard deleted the 2022.fly branch April 5, 2022 00:24
gerhard added a commit that referenced this pull request Apr 5, 2022
Follow-up to #407

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
@gerhard gerhard changed the title from ✈️ Migrate changelog.com to fly.io ✈️ to ✈️ Migrate changelog.com to Fly.io ✈️ on Aug 17, 2022