✈️ Migrate changelog.com to Fly.io ✈️ #407
Conversation
This is fun. Hope the journey goes smoothly :)
I'm glad that you are enjoying it! We have hit a couple of issues, but nothing that we couldn't work around. I will capture them in the above description as we get closer to wrapping it up. The big issue that is currently blocking progress is admin logins not working. There are no errors in the app log, but it still doesn't work. The issue that I am referring to is:
This worked well for me from a developer experience (DX) perspective, but it was dog slow. Importing the db was at least 2x slower over WireGuard than when I was running it directly in our K8s cluster. I didn't spend too much time on it since we are talking 5 minutes slower, but it was something that made me 🤔
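For reference, the slow path looked roughly like the sketch below. This is an illustration only: the Postgres app name, credentials and database name are placeholders, not the real ones.

```sh
# Tunnel the Fly Postgres to localhost (flyctl proxy uses WireGuard under
# the hood), then import the hourly LKE backup with psql.
# "changelog-postgres", "changelog" and $PGPASSWORD are placeholders.
fly proxy 5432 -a changelog-postgres &

psql "postgres://postgres:${PGPASSWORD}@localhost:5432/changelog" \
  < prod-2021-04_changelog_2022-03-13T13.00.09Z.sql
```

Every byte of that import goes through the WireGuard tunnel, which is where the slowdown comes from.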
I know exactly what you mean. That is the higher bit perspective that I hold in this migration: how do we capture all these tasks in a Dagger plan which anyone is able to use on their host, without polluting their environment? Fun times ahead indeed 😉 @aluzzardi @samalba @shykes
A 2x slowdown is somewhat normal for big operations over WireGuard. We have a migrator project you can use to run imports from a Fly VM to another Fly VM. It will be much faster: https://github.com/fly-apps/postgres-migrator You can also run …
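For completeness, the "run the import from a Fly VM" idea boils down to something like the sketch below. This is not the postgres-migrator's exact mechanics, and the app names, the .internal hostname and the path to the backup file are placeholders.

```sh
# Shell into a VM that is already inside Fly's private network...
fly ssh console -a changelog-2022-03-13

# ...and run the import from there, so the data never leaves Fly's network.
# Assumes the backup file has already been copied onto the VM.
psql "postgres://postgres:${PGPASSWORD}@changelog-postgres.internal:5432/changelog" \
  < /tmp/prod-2021-04_changelog_2022-03-13T13.00.09Z.sql
```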
The grafana-agent migration is more involved, going for the simplest thing right now. Everything is going to change soon anyways: #407 Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk> (cherry picked from commit f4e8cfb844ab908031b5d22200519aaf79af084d)
Most of this was done imperatively, by running the fly CLI. Some of it was captured in the main.cue. The focus of this is to get this working, have a discussion, and then to make it right.

- Capture a bunch more fly commands as dagger actions
- Deploy to fly.io if prod_image gets built & pushed successfully. This *must* be moved into a standalone fly workflow right after we migrate. While this was the simplest thing, it's wrong to combine build & deploy concerns into a single pipeline. This next step requires build_test_ship to commit a fly.toml update, and this workflow triggering on fly.toml updates 🔥 This is fine for now 🔥
- ⚠️ cue.mod is for Dagger 0.2.x and will break the 0.1.x workflow that LKE 2022 current prod depends on. ⚠️

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
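The "deploy to fly.io if prod_image gets built & pushed successfully" step is, at its core, a single command. A sketch, with the image reference and app name as placeholders rather than the real pipeline values:

```sh
# Deploy an image that was already built & pushed by the build pipeline,
# instead of letting fly build one. Image tag and app name are placeholders.
fly deploy \
  --app changelog-2022-03-13 \
  --image thechangelog/changelog.com:prod
```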
This is a *temporary* change and *must* be reverted before this gets merged, otherwise new admin logins on changelog.com will stop working. This commit shows how to get admin logins working on fly.io when the app is running behind changelog-2022-03-13.fly.dev. For security reasons, .fly.dev domains are *not* allowed and will *not* work. The gotcha is that this is a *compile-time* config, meaning that setting it in the env is *not* going to work. Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This reverts commit 4d47781. Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
changelog.com in Fastly -> Fly is not working correctly (all requests are being sent to the S3 backend) and I wanted to double-check my understanding of how all the pieces fit together. They fit as expected:

- https://old-flower-9005.fly.dev/
- https://lazu.ch/

But not for changelog.com... time for some extra pairs of eyes. Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
When we configured Fastly to route all production traffic from Fastly (changelog.com) to Fly (changelog-2022-03-13.fly.dev), Fastly was sending this traffic to our S3 bucket origin instead 🤦♂️ This was wrong, since only a subset of requests should ever go to the S3 backend.

I thought that we had something misconfigured, but clicking through the Fastly service config UI is just pants, so I downloaded the VCL config and then 😲 There are ~13k lines of VCL config, out of which ~12.3k are just shielding config 🤔 Having trimmed all that excess, we were left with ~700 lines of relevant VCL config. Even so, we couldn't find the problem. Everything looked right, and yet.

We scrambled to finish setting up another LKE instance, route all changelog.com traffic to the new LKE origin, 22.changelog.com, and call it a long, 16h day. I was now able to make some headspace, start with a clean slate, and confirm that the Fastly → Fly integration works as expected: https://lazu.ch/
And it works directly too: https://old-flower-9005.fly.dev/ So there is something wrong with the changelog.com service config in Fastly specifically.
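For anyone who wants to reproduce the check: a quick way to see which origin actually served a request is to look at the response headers. Fastly typically adds X-Served-By / X-Cache, and a response served by a Fly app typically carries a fly-request-id header. The exact headers depend on the service config, so treat this as a sketch:

```sh
# Compare what Fastly reports with what the origin reports, for both the
# test domain and changelog.com.
curl -sI https://lazu.ch/       | grep -iE 'x-served-by|x-cache|fly-request-id'
curl -sI https://changelog.com/ | grep -iE 'x-served-by|x-cache|fly-request-id'
```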
If anyone is interested in the VCL configs, or the test app code, everything is in this PR.
Now that we have LKE -> fly.io, the Miss Latency in Fastly was interesting to notice. I suspect that this Honeycomb query is a good starting point to continue exploring:
Reminder to investigate Fastly cache errors (convert to issue). Honeycomb starting point:
It would have been nice to capture these declaratively, in the fly.toml. Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This makes everything work together as expected. FTR: https://support.fastly.com/hc/en-us/requests/477850 Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
Follow-up to #407 Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
This is the story of migrating changelog.com from LKE to fly.io, a PaaS where Phoenix screams.
The main reason was to challenge assumptions (mostly mine) and experiment with doing things differently, in a simpler way. I am genuinely curious to discover what we will actually miss most from our Kubernetes setup, what alternatives there are (some may be better!), and most importantly how well this actually works in practice. I still think that `AND` propositions are better than `OR`, and while I don't rule Kubernetes out of our future, now is a good time to spring clean and shake things up a bit 😉

Also, I am heeding wise advice, and migrating Postgres to a managed fly.io one:
Most of the changes I ran imperatively, by issuing `fly` commands. While I tried to capture some of them in the latest implementation of Dagger 0.2.0, at some point I started getting slowed down because I was trying to reconcile two things at once. I have since shifted focus to `fly` only, in order to achieve the primary goal of migrating changelog.com to fly.io quicker.

The backstory is that we are currently running on Linode Kubernetes Engine 1.20, which will be forcefully migrated to 1.21 on March 17, 2022 (so in less than 48 hours). K8s 1.20 is EOL and upgrades + security patches are no longer available. While we started a 1.22 K8s LKE cluster in the 2022 dir, we didn't finish that migration - life happened - and at this point the quickest thing is to migrate to fly.io. In all honesty, we have been talking about this for months: Ship It #44, #43, #40 & #31. This is the simplest thing right now. We will make some new friends - 👋🏻 @mrkurt @chrismccord @brainlid 👋🏻 - and most likely have a great story to share.
OK, let's get this party started!
What happened so far?
- prod-2021-04_changelog_2022-03-13T13.00.09Z.sql, an hourly backup running in our current LKE production origin
- dedicated-cpu-4x (app was crashing due to OOM errors)
- The app instance is currently running at https://changelog-2022-03-13.fly.dev/ 🙌🏻 (a rough command sketch follows below)
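A rough reconstruction of those steps as fly commands. This is a sketch of the flow, not a transcript of what was actually run; app and Postgres names, region and sizes are assumptions:

```sh
fly apps create changelog-2022-03-13
fly postgres create --name changelog-postgres --region iad

# The default VM size kept OOM-ing the app, hence the bump to dedicated-cpu-4x.
fly scale vm dedicated-cpu-4x -a changelog-2022-03-13

fly deploy -a changelog-2022-03-13
```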
What happens next?
- Re-import the changelog db from the local backup file → running the `psql` command as-is will fail, the db needs to be re-created for this to work (see the sketch after this list)
- Investigate why `TelemetryMetricsPrometheus.Core.EventHandler` has failed and has been detached:

```
20:31:29.901 [error] Handler {TelemetryMetricsPrometheus.Core.EventHandler, #PID<0.518.0>,
  [:changelog, :prom_ex, :oban, :init, :circuit, :backoff, :milliseconds]} has failed and has been detached. Class=:error
Reason={:badkey, :circuit_backoff,
 %Oban.Config{
   dispatch_cooldown: 5,
   engine: Oban.Queue.BasicEngine,
   get_dynamic_repo: nil,
   log: false,
   name: Oban,
   node: "31eb7755",
   notifier: Oban.Notifiers.Postgres,
   plugins: [
     {Oban.Plugins.Cron,
      [
        timezone: "US/Central",
        crontab: [
          {"0 4 * * *", Changelog.ObanWorkers.StatsProcessor},
          {"0 3 * * *", Changelog.ObanWorkers.SlackImporter},
          {"* * * * *", Changelog.ObanWorkers.NewsPublisher}
        ]
      ]},
     Oban.Plugins.Stager,
     Oban.Plugins.Pruner
   ],
   prefix: "public",
   queues: [comment_notifier: [limit: 10], scheduled: [limit: 5]],
   repo: Changelog.Repo,
   shutdown_grace_period: 15000
 }}
Stacktrace=[
  {PromEx.Plugins.Oban, :"-oban_supervisor_init_event_metrics/2-fun-2-", 2, [file: 'lib/prom_ex/plugins/oban.ex', line: 389]},
  {TelemetryMetricsPrometheus.Core.EventHandler, :get_measurement, 3, [file: 'lib/core/event_handler.ex', line: 63]},
  {TelemetryMetricsPrometheus.Core.LastValue, :handle_event, 4, [file: 'lib/core/last_value.ex', line: 56]},
  {:telemetry, :"-execute/3-fun-0-", 4, [file: '/app/deps/telemetry/src/telemetry.erl', line: 150]},
  {:lists, :foreach, 2, [file: 'lists.erl', line: 1342]},
  {Task.Supervised, :invoke_mfa, 2, [file: 'lib/task/supervised.ex', line: 90]},
  {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 226]}
]
```

- Commit the fly.toml to thechangelog/changelog.com:master
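The re-import item above needs the database dropped and re-created before `psql` will succeed. Roughly, as a sketch with connection details and the db name as placeholders:

```sh
# Re-create the database before re-running the import, otherwise the psql
# import fails (presumably on objects that already exist).
fly proxy 5432 -a changelog-postgres &

psql "postgres://postgres:${PGPASSWORD}@localhost:5432/postgres" \
  -c 'DROP DATABASE IF EXISTS changelog;' \
  -c 'CREATE DATABASE changelog;'

psql "postgres://postgres:${PGPASSWORD}@localhost:5432/changelog" \
  < prod-2021-04_changelog_2022-03-13T13.00.09Z.sql
```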
How can you help?
All Changeloggers & Shippers that are able to help, now is a good time to assemble 🛡
@jerodsanto @adamstac @lawik @akoutmos @mrkurt @chrismccord @brainlid @nickjj
What could have gone better?
I will circle back to this; there are a few other important things which I need to cover first.
Just a brain offload for now:
- `fly logs` stopped streaming logs for 1' 20" after the app restarted:

```
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.375 [info] Access ChangelogWeb.Endpoint at https://changelog-2022-03-13.fly.dev
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.893 [info] PromEx.LifecycleAnnotator successfully created start annotation in Grafana.
2022-03-15T22:41:15Z app[d6b64e8b] iad [info]22:41:15.983 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/application.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.167 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/beam.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.315 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/ecto.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.530 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/phoenix.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.719 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/prom_ex/priv/oban.json.eex to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.839 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/app.json to Grafana.
2022-03-15T22:41:16Z app[d6b64e8b] iad [info]22:41:16.983 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/erlang.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.118 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/ecto.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.253 [info] PromEx.DashboardUploader successfully uploaded /app/_build/prod/lib/changelog/priv/grafana_dashboards/phoenix.json to Grafana.
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]22:41:17.907 [error] Handler {TelemetryMetricsPrometheus.Core.EventHandler, #PID<0.531.0>,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] [:changelog, :prom_ex, :oban, :init, :circuit, :backoff, :milliseconds]} has failed and has been detached. Class=:error
2022-03-15T22:41:17Z app[d6b64e8b] iad [info]Reason={:badkey, :circuit_backoff,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] %Oban.Config{
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] dispatch_cooldown: 5,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] engine: Oban.Queue.BasicEngine,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] get_dynamic_repo: nil,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] log: false,
2022-03-15T22:41:17Z app[d6b64e8b] iad [info] name: Oban,
2022-03-15T22:42:37Z app[d6b64e8b] iad [info]22:42:37.080 request_id=FtyvKaNewnaiVAUAABEB [info] GET /manifest.json
2022-03-15T22:42:37Z app[d6b64e8b] iad [info]22:42:37.080 request_id=FtyvKaNewnaiVAUAABEB [info] Sent 200 in 732µs
```

With my CF loggregator & Heroku Logplex experience, I am pretty sure that when our app boots, we stream too many logs and the overload protection mechanism gets triggered. This sheds a bunch of logs with the 2022-03-15T22:41:17Z timestamp and unfortunately, these are exactly the logs which we need to debug an error. I will check to see if this crash gets logged in Sentry - it should! Nope, it doesn't 😑