alsuren.github.io

A Love Letter to Grist

2026-03-09T00:00:00+00:00

My previous company was called Opvia. It was basically Notion, but with stronger guarantees around audit logging and version management, so it was easier to sell into regulated industries. I was pretty proud of a lot of the things I built there, but I wasn’t so okay with the perverse incentives that come with being a SaaS system of record company. Specifically, I was haunted by the question “are we sticky with our customers because we make their lives better than using Excel, or are we sticky because once you’ve spent all this time getting data into the system in a nice relational way, it’s a massive pain to get the data out again?”

After leaving Opvia, I wanted to make something that did all of the same stuff but without the data lock-in and perverse incentives that we had. I had stumbled across the local-first movement, and all the crazy things people were doing with SQLite. In my vision, I wanted something that worked like Excel from my childhood (where you pass around .xls files) but each file contained the whole database. I decided that I would solve the multiplayer story using something like cr-sqlite, and was blissfully naive about the conflict resolution piece. I suppose what I was building in my head was Microsoft Access.

Recently, I was browsing around the EMF Camp site and stumbled upon Grist. It is open source and uses sqlite as the file format. I think this is basically what I would have built at the end of that summer, if I was not in danger of running out of money. I suddenly became really excited about this, and told everyone who would listen.

It is basically Airtable, but open source and with data sovereignty. Each column directly maps to a column in an sqlite database, and each row to a row. There are a few reserved metadata tables for configuring widgets and views, but if you fetch the .grist file from the cloud server (or grist-static) then you can open it in the sqlite3 CLI and select * from mytable will do exactly what you expect.
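To make the "no grist-specific tooling" point concrete, here is a minimal sketch that reads a document using nothing but Python's built-in sqlite3 module. The assumption that bookkeeping tables share a `_grist` name prefix is mine, so check it against your own file before relying on it.

```python
import sqlite3


def user_tables(conn):
    """List tables that look like user data rather than bookkeeping.

    Assumption: metadata tables start with "_grist" (and SQLite's own
    internal tables start with "sqlite_").
    """
    rows = conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    ).fetchall()
    return [
        name
        for (name,) in rows
        if not name.startswith("_grist") and not name.startswith("sqlite_")
    ]


def read_table(conn, table):
    """Return every row of `table` as a dict keyed by column name."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"select * from {quoted}")
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]
```

To try it, pass a connection opened with sqlite3.connect() on a downloaded .grist file (the path is whatever you saved it as) to user_tables(), then call read_table() on any name it returns.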

In the last few years I have gone all in on sqlite and local-first (including going to Local-First Conf in Berlin and the local-first devroom at FOSDEM, and even joining the D1 managed sqlite database team at Cloudflare).

I was surprised not to see more hype from Grist around local-first. I had stumbled across grist-static (which is a fully in-browser version of Grist that can be hosted on GitHub Pages). I assumed that they were probably doing pretty well against most of the local-first ideals from the paper:

1. No spinners: your work at your fingertips

If grist-static exists then surely everything can be done in the browser completely offline, right?

The future looks like it could be pulled in this direction, but unfortunately that is not the primary architecture right now. As it stands, when you open a Grist document, it fetches the sqlite file from S3 into a server and starts a Python sandbox to serve the backend for your database. If you lose connection then it will show a toast for a second telling you it is in read-only mode before it reconnects.

10 out of 10 for grist-static, but 0 out of 10 for grist-proper.

2. Your work is not trapped on one device

Everything is done on the server, but you can access it from anywhere and the format that the server stores your work in is the same sqlite .grist file format that you are given when you download the database for local use.

10 out of 10.

3. The network is optional

8 out of 10 for grist-static but 0 out of 10 for grist-proper. See point 1.

4. Seamless collaboration with your colleagues

I haven’t been using grist to collaborate with anyone so far, but it has public projects and you can invite people to your team. They even have a fork and merge ui, which I’m looking for an excuse to play with. Grist also got a bunch of contributions from the French government, so I expect it to work reasonably well for government department sized teams.

If I could change anything here, I would make it easier to export a .grist file, edit it in some external program and then use the fork and merge ui to compare changes before overwriting the original on the server. It currently defaults to creating a new document, which is technically nondestructive, but not very satisfying if you have public links that point into the original and you don’t want to break them by archiving it. I get that this kind of reconciliation problem is extremely hard, and an area of ongoing research. I also understand that getting it wrong can be disastrous, so I understand why they’ve done it like they have. A boy can dream though, right?

0 out of 10 for grist-static but 9 out of 10 for grist-proper.

5. The Long Now

This is the whole fucking deal. I can’t emphasize this enough. This is important especially right now, and especially for people who paid attention in history class.

When I got bitten by the local-first bug during my summer after Opvia, it felt like going back to my Freedesktop roots. People were taking principled stances on things, and talking about certain ideals in the same way that Free Software people talk about certain freedoms. This ideal is the place where the two communities overlap.

So how does Grist do here?

Grist’s backing store and interchange format are .grist files. These are sqlite databases under the hood, which is what your browser uses, and also most iphone and android apps. The sqlite file format is basically infrastructure at this point, and grist uses it in a way that I would have never dreamed of when I was working at Opvia. A .grist file’s database structure closely mirrors what you see in the UI (with some extra book-keeping tables off on the side). Whenever you make a change to your table structure, grist runs ALTER TABLE on the sqlite database to make the change. This means that you can drop into the sqlite3 cli or tableplus and do whatever you want to the underlying data without any grist-specific tooling in sight, and without any fear of data getting silently omitted in the export process. For a multi-tenant system running on top of postgres, this would be unheard of. Really nice.

To give another comparison, the other local-first Notion clone that I tried out in my summer off was AFFiNE. It can also run entirely in the browser, and has an offline-capable CRDT-based sync engine. On the surface of it, .grist files and .affine files are both sqlite databases, but .affine’s tables are just a bunch of opaque binary CRDT nonsense. I would not expect to be able to edit a .affine file in any tool other than affine itself.

AFFiNE is AGPL licensed, so in theory you can keep it going in the long now. I don’t expect anyone to build a hosting business around it if affine folds though. When I tried to contribute to it, their issue tracker was in a closed linear project, and it was an uphill struggle because I couldn’t see whether my contributions were aligned with their vision at all. It seems like their github issue tracker is more active now, so maybe that changed since then.

Grist’s core is Apache licensed. This is a solid license that smells like Internet Infrastructure. Issues are public, and there are a bunch of contributions from external organisations, including the french government.

10 out of 10.

6. Security and privacy by default

I was in the radical openness camp until Facebook made that deeply unfashionable. I used to keep a public trello board of all of my project ideas (which at some point got moved to airtable and then maybe affine before being abandoned). I’m now starting to build that up again in grist (one day I will build a kanban view that I actually like, I promise). I also have a private database for tracking house things that I share only with people I live with. I do not expect to be served adverts about the contents of either of these databases.

Grist also has some fancy row level security things that are enforced server-side, but fundamentally the server needs to be able to read the whole database, and retrofitting end to end encryption into the thing would be quite tricky.

[Note: when I shared this post with folks on the grist discord, they mentioned that grist started out as a desktop app with end-to-end encrypted sync. You can still feel those roots in a bunch of places, even though the architecture has changed a lot since then.]

7 out of 10?

7. You retain ultimate ownership and control

I fundamentally trust Grist to act in my interests more than I trust Airtable or Notion. The fact that a self hosted version is just a single docker command away, and the statically hosted version is so capable, gives me confidence that I will be able to eject onto fly.io if I need to. I would love to say Durable Objects here, but the lack of ability to fetch/post the underlying sqlite backing store for a Durable Object makes some things hard.

One thing to note is that widgets are basically iframes, and can point basically anywhere. I didn’t mention it in The Long Now, but I wonder if there might be some scope for golang-style caching proxies, or even a system for vendoring widgets into your .grist file (grist has a jsfiddle-like UI for authoring widgets if you want, and they do get stored in the .grist file). Honestly though, if I lose a couple of widgets over the decades, but my data is in the shape that I specified in the UI, it won’t be much work to build replacements.

Bonus: if you consider Grist widgets to be apps and Grist to be some data infrastructure/container then maybe it is already approaching the vision that Orion Reed is describing in this post.

10 out of 10 again.

Conclusion

I would not say that Grist has all of the local-first ideals nailed. I don’t think anyone does. I realised as I was using it that the data sovereignty piece is the one that I really care about though, and that is why I am all in on Grist at the moment. I would much rather have Grist with its long-now-proof sqlite storage than AFFiNE with its magical real time sync engine and no trust that I will be able to get my data out in 10 years time.

Local-First Notes #1: Riffle

2025-05-17T00:00:00+00:00

I’m starting a series of notes as I prepare for Local First Conference. As I read through related literature, I’ll share my thoughts and questions here. This post is inspired by Riffle’s Prelude essay.

Short-circuiting query re-runs on the block level

For reactive queries, could we:

submit initial query to query engine
receive and store:
- results
- merkle tree of index-related pages that have been read to satisfy the query
- merkle tree of data-related pages that have been read to satisfy the query

When an update comes in:

re-run query, passing in merkle trees
if none of the pages have changed, return 304 Not Modified
if index pages have changed, re-run up to the point where you know which data pages you will read. If none of these have changed, return 304 Not Modified

This means that if you have a massive song list and you’re looking at the top page, a change to a song in the bottom page won’t cause your UI to re-render.
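Here is a toy sketch of the idea, to make it easier to poke holes in. It is not SQLite: the "pages" are just slices of a Python list, the fingerprint is a flat map of page hashes rather than a real merkle tree, and there is no index/data split. All names are made up for illustration.

```python
import hashlib

PAGE_SIZE = 4  # rows per page; tiny on purpose so the example is easy to follow


def page_hash(page):
    """Fingerprint one page of rows."""
    return hashlib.sha256(repr(page).encode()).hexdigest()


class PagedTable:
    """A list of rows chopped into fixed-size pages."""

    def __init__(self, rows):
        self.pages = [rows[i:i + PAGE_SIZE] for i in range(0, len(rows), PAGE_SIZE)]

    def update(self, index, value):
        self.pages[index // PAGE_SIZE][index % PAGE_SIZE] = value


def run_query(table, limit):
    """Return the first `limit` rows, plus hashes of the pages that were read."""
    results, pages_read = [], {}
    for page_no, page in enumerate(table.pages):
        if len(results) >= limit:
            break
        pages_read[page_no] = page_hash(page)
        results.extend(page[: limit - len(results)])
    return results, pages_read


def is_unchanged(table, pages_read):
    """True if every page the query touched still has the same hash."""
    return all(page_hash(table.pages[n]) == h for n, h in pages_read.items())
```

If is_unchanged() returns True after an update, the stored results can be reused (the "304 Not Modified" case); otherwise the query has to be re-run.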

I have no idea whether this kind of low-level control of SQLite is even possible, or whether it is possible to split its pages into index and data at the VFS layer (or whether it even makes VFS read calls if it’s already got the page in its cache).

I’d be interested in hearing thoughts from others who have experience with SQLite internals or similar reactive query optimizations with sqlite. Reach out on bluesky, mastodon, or discord.

Nested results

Standard SQL doesn’t support nesting, even in the projection step (i.e., what describes the shape of the results). We’re big fans of data normalization, but it’s very convenient to nest data when producing outputs.

There are various extensions to SQL that support nesting, but many of them are not that good and the good ones are not widely available.

I am reminded of postgraphile’s query builder, which generates a bunch of nested json_agg() calls and with statements like this:

with __local_0__ as (select "user"."name", "user"."age", "user"."height" from "user" where created_at > NOW() - interval '3 years' and age > $1)
select
  (select json_agg(row_to_json(__local_0__)) from __local_0__) as all_data,
  (select max(age) from __local_0__) as max_age

(example from pg-sql2 but many of the postgraphile queries take a similar form).

I wonder if this approach would be valuable in a reactive sqlite browser datastore.
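As a rough feasibility check, the same shape of query can be written against SQLite using its JSON functions. This is only a sketch: it assumes a SQLite build with the JSON functions available, uses a made-up "user" table matching the columns in the example above, and drops the created_at filter to keep things short.

```python
import json
import sqlite3


def nested_user_summary(conn, min_age):
    """Return matching users as a nested list, plus their max age, in one query."""
    row = conn.execute(
        """
        with matching as (
            select name, age, height from "user" where age > ?
        )
        select
            (select json_group_array(
                        json_object('name', name, 'age', age, 'height', height))
             from matching) as all_data,
            (select max(age) from matching) as max_age
        """,
        (min_age,),
    ).fetchone()
    # SQLite hands the nested part back as JSON text, so decode it here.
    return {"all_data": json.loads(row[0]), "max_age": row[1]}
```

The nesting still arrives as a JSON string that has to be parsed on the client, which is the same trade-off as the json_agg() approach in postgres.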

Stay Tuned

Stay tuned on RSS or bluesky or mastodon for more half-formed thoughts.

FOSDEM Highlights

2025-02-07T00:00:00+00:00

I was at FOSDEM last weekend, after ~15 years away. I thought I should probably write up my highlights while they’re fresh in my head.

They should be in chronological order.

The road to open source General Purpose Humanoids with dora-rs by Tao xavier

This talk is about a replacement for the ROS framework. It is explicitly trying to be a more Machine-Learning-Researcher-friendly alternative to ROS. Rather than having to have a separate machine/vm running linux + CMake, you can just pip install everything on Windows/Mac/Linux, and it will download the precompiled (typically rust or python) code. It seems to have some funding from huggingface, so from a money point of view, it could be considered a distant cousin of lerobot.

The core seems to be a message bus written with shared memory and apache arrow, which can be used to connect bits of a robot together. This is very similar to ROS.

It also does orchestration by reading yaml files which describe how the bits of the robot fit together. I find it interesting that lerobot previously relied heavily on yaml configurations (using hydra), and recently switched to using straight up python files for its configuration. The reason given by lerobot was that IDE completion support was better if everything is in one language. I guess this doesn’t apply to dora, because it is already in two (python and rust).

They finished with a demo involving a reachy robot and some voice commands, using rerun.io for all of the UI elements.

Zap the Flakes! Leveraging AI to Combat Flaky Tests with CANNIER by Daniel Hiller

This one was an interesting one. I feel like a lot of projects I’ve worked on would benefit from a tool that can parse github actions logs to spot flakey tests, but that’s not what this is about. This is about something that tries to stop flakey tests getting merged to main in the first place. It does this by running the test once and doing a bunch of static+dynamic analysis and then uses the resulting features to predict the probability of it being flakey (using a random forest model). The original implementation is in python, and the talk was about a WIP port to Go for Kubernetes tests.

I caught Daniel after the talk and had a great discussion about the inverted testing pyramid approach. It is definitely much easier to write integration tests and satisfy yourself and stakeholders that they actually test the thing you care about, but 1 hour CI test runs are no fun at all.

I wonder whether it would be possible to write a tool that can collect coverage information from an end-to-end integration tests suite and help you pick targets to distill into a sans-io data driven integration test suite. I’ve heard that LLMs are quite good at writing tests these days, so this coverage-driven test distillation might be something that is already within reach.

How a City Platform Became a Global Community by Carolina Romero Cruz

I went to this wondering whether they would be talking about something like Polis. It seems like Decidim is a much more traditional and multi-faceted collaboration tool than this. It started out in Barcelona and then exploded all over the place during lockdowns, with a bunch of twists and turns in its internal governance and funding journey. One of the takeaways for me was that government procurement is a nightmare.

How I optimized zbus by 95% by Zeeshan Ali Khan

I arrived late to this one, because I was actually interested in Programming ROS 2 with Rust by Júlia Marsal Perendreu, which was afterwards. This talk made my day because right at the very end, he mentions that varlink is poised to replace dbus in a bunch of places. I’m not currently using desktop linux, so I will probably watch this one from a distance. I was part of the Telepathy core team back at Collabora, and mijia-homie is built on top of the BlueZ dbus interface, so I have first hand experience of how funky it can be.

Zeeshan has done a fantastic job with zbus, and I expect him to do the same with varlink.

As a side note, I am hoping that I can watch the video for Adopting BlueZ in production: challenges and caveats and see how it compares with our experience on mijia-homie.

Programming ROS 2 with Rust by Júlia Marsal Perendreu

I got the impression from this talk that rclrs is a pretty good way to make robust production robotics applications. I mostly care about fucking around in python with desktop robot arms though, so I am more excited about lerobot and dora-rs.

Could we actually replace containers? by Dan Phillips

This talk was a little piece of gold. The whole project is pure audacity and I love it.

The gist is “Why do we need to wait for this WASI thing to be specified and stabilize? We already have POSIX, that’s been stable for decades.”

The project provides a libc implementation that can be used to compile basically-unmodified Dockerfile descriptions to WASM targets. They have demos that can run python code in their runtime, and then use basically-standard WASM pre-execution techniques to bring the cold start times way way down compared to native. They seem to have a runtime that works in the browser, but I suspect the networking implementation needs to call out to an external proxy, because posix sockets are quite a lot more low level than you normally have access to in the browser (I’ve not checked).

I caught up with Dan afterwards to congratulate him on his audacity. We had a bit of a chin-wag about how server side wasm projects are doing (we both independently know the wasmCloud people).

A long, short history of realtime AI agents by Rob Pickering

The AV wasn’t working for this one, which is a shame, but it was interesting to get an insight into how modern realtime speech agents work. The general gist is that you don’t try to do speech to text and then pass text into the LLM. What you do instead is tokenize the speech directly and pass the tokens into the LLM. This is a bit like how multimodal models work with images. I get the impression that you also skip the text-to-speech step, but then I’m not sure how you turn the output voice tokens into a text transcript.

Kites for Future - Airborne Wind Energy for everyone by Marc de Laporte, Benjamin Kutschan

This one holds a special place in my heart because of a lockdown project that I worked on with my housemate. The kitesforfuture.be team are using an architecture that’s halfway between Google’s Makani project and the SkySails Mauritius project. They use a rigid kite with flaps for control and propellers for launch, but they generate power at the ground station by alternating between using the crosswind effect in the power zone of the wind window and gliding with minimal tension at the edge of the window.

They include a bunch of cool hacks, like:

abusing the magnetic compass sensor on the esp32 for tracking line angles
starting off with esp32’s wifi protocol, and switching to lora to get better range
a fishing-rod-like rig for catching the kite and suspending it in mid-air for automatic re-launch
- now that I’m having another look at the SkySails marketing site, I’m realising that they also have something similar, but much more heavyweight.

They also showed a video of a soft kite based demo at burning man that uses their power plant ground station, but otherwise has a very similar architecture to SkySails. I’m very interested in this, because it sounds like the teleoperation setup could be brought into the field without a ground station, and that would be much lighter than the full hoverkite setup. While I’m quite fond of our “take these COTS hardware components and hack them together” pluckiness in the hoverkite project, I do admit that adding a steering box up near the kite makes a lot of sense, especially as your lines get longer.

This was another big highlight for me. The talk abstract mentions slowreader as an example, but the talk was mostly about motivating you to actually consider privacy when building modern web apps.

Genuinely entertaining and very approachable.

He called out that most web developers only consider top-down government-based privacy threats: it is also worth considering threats from family, religious groups, local officials and local ISPs in many locales. Concretely, if you don’t trust your users’ local ISPs then it might be a good idea to proxy their requests for them, to hide them from the ISP.

He also called out current data hoarding tendencies, and suggested practical ways to be good to your users, like advocating for analytics tools that track events rather than users (and therefore don’t need GDPR popups). He suggested a good technique for advocating for the removal of user-tracking info: ask your product manager if they’ve actually used this piece of information to make a decision in the last 6 months, and when they say no, propose deleting it.

As an aside, I guess the FOSDEM schedule webapp that I am using could be considered privacy-first. It has a similar architecture to slowreader, storing your bookmarks locally in the browser with no cross-device synchronisation, and if you want to share your bookmarks with others (or yourself), you send a url with a comma-separated list of bookmark ids.
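The share-by-url trick is simple enough to sketch. I have not looked at how the schedule webapp actually encodes its links, so the "bookmarks" query parameter and the integer ids below are assumptions made purely for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def share_url(base_url, bookmark_ids):
    """Build a link that carries the whole bookmark list in its query string."""
    ids = ",".join(str(i) for i in sorted(bookmark_ids))
    # safe="," keeps the commas readable instead of escaping them as %2C.
    return f"{base_url}?{urlencode({'bookmarks': ids}, safe=',')}"


def load_bookmarks(url):
    """Recover the set of bookmark ids from a shared link."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("bookmarks", [""])[0]
    return {int(part) for part in raw.split(",") if part}
```

Nothing ever touches a server-side database: the link is the sync mechanism, which is why there is nothing to leak.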

Row-Level Security sucks. Can we make it usable? by Jimmy Angelakos

I went to this one because we used RLS at my previous company (this is the recommended approach with postgraphile). The scheme that he ended up with was very similar to ours, but he used a GIN index to make the && set overlap operation fast.

Where to next?

I think the next thing I’m going to is local-first conf. Give me a shout if you’re going and want to meet up.

Why cargo-quickbuild?

2022-07-30T00:00:00+00:00

In my previous post, I reviewed the history of cargo-quickinstall and introduced my cargo-quickbuild idea. When reviewing it with workmates, we thought it might be useful to compare it against the other solutions that already exist in this space. I will attempt to do so in this blog post.

Why not use sccache, nix, bazel or cargo-chef?

These tools are all in a similar space, and I will definitely be stealing ideas from all of these projects.

sccache is intended to be a thin wrapper around rustc, and is easy to integrate into your workflow. In order to calculate its cache keys, it reads the rust source code. It has the ability to use a shared cache on s3, but it is all a single cache, so you need to trust everyone who has write access to it, and all of the crates that they ever compile. There is also no mechanism (that I know of) for identifying cache misses and farming them off to a background worker, because it assumes that the user will also have write access to the cache, and will upload the result when they’re done. This is okay for teams in a large company like Mozilla, but less good if you are in a team of one, hacking on projects in your spare time.
nix is promising, and its cache architecture is very powerful. It is not very portable though (no native windows support, and getting it set up on macos has traditionally been very painful). Integrating it into a crate with a large dependency tree also requires a lot of boilerplate.
bazel is more portable, but also requires a lot of boilerplate. It is made by Google, and the trust model for its shared cache appears to be similar to that of sccache.
cargo-chef is a docker-specific tool. If you need to build a docker image today, and you’re not able to just pre-build your binary outside of docker and COPY it in, I recommend cargo-chef (my first Rust London Hack and Learn was spent working on cargo-chef, so I might be biased). Its key innovation is to turn your dependency tree into a separate layer that is independent of the rest of your source code. This means that you can skip rebuilding dependencies if you are only making changes to your source code. The downside of this approach is that if you bump any dependencies then the whole layer gets thrown away and built from scratch (this is similar to the behaviour that you will find with GitHub Actions’ recommended cache configuration). Sharing individual docker layers between build servers and developers is also a pain, so your developers will probably not feel the benefit of the massive CI bill that you pay every month.

Rust’s tooling excellence owes a lot to the unifying influence of cargo for build + docs + testing. Its major shortcoming is long build times. My aim with quickbuild is to meet users where they are, because cargo is an excellent place to be. I’m hoping to produce meaningful speed-ups of from-scratch builds, without requiring configuration changes for the user’s computer/project.

I also aim to build on shared infrastructure, available to all, so you don’t need any involvement from finance or your ops team. I will be making use of free “open source tier” compute resources for building my packages, but they will be available for use by anyone to reduce their build times and CI costs, as long as they are happy to share their rust flags, and the list of dependencies from their Cargo.toml.

I have come to the end of my time at Tably, and I plan to spend August house hunting and working on quickbuild, before looking for my next job. If you would like to become the first sponsor of this work, please go to my GitHub Sponsors page.

The cargo-quickinstall journey - how I made a thing for installing rust programs quickly

2022-07-10T00:00:00+00:00

I made a thing.

I made a thing that I threw together in a week.

I made a thing that is horrifically complicated, and held together with hot glue and string.

I made a thing that people seem to be using.

Halp!

Pre-built binaries of Rust programs

Back when I was working at Red Badger, we had some GitHub Actions pipelines that relied on some tools that were written in Rust. We had our GitHub actions cache set up correctly and everything, but every so often we would blast away the cache, by some innocent-looking operation, like bumping a dependency. This would result in a dog-slow build, as it rebuilt all of the tools that we were using, before even starting to compile our own project.

When that project wound down, I had some bench time between projects, so I decided to try doing something about it. I decided to build a service that would pre-build your Rust tools for you. That way, whenever you would usually write something like:

cargo install ripgrep

you could write:

cargo quickinstall ripgrep

This would install pre-compiled versions of any binaries in the crate. If we didn’t have a pre-compiled version, it would fall back to cargo install automatically.
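The fallback behaviour boils down to something like the sketch below. It is written in Python for brevity rather than being the real client, and fetch_prebuilt is a hypothetical callable standing in for "download and unpack a pre-built package, returning False if there isn't one".

```python
import subprocess


def quickinstall(crate, fetch_prebuilt, run=subprocess.run):
    """Install `crate` from a pre-built package if possible, else from source.

    Returns "prebuilt" or "source" so the caller can tell which path was taken.
    """
    if fetch_prebuilt(crate):
        return "prebuilt"
    # No pre-built package available: do what the user would have done anyway.
    run(["cargo", "install", crate], check=True)
    return "source"
```

Passing run as a parameter is only there so that the fallback can be exercised without actually compiling anything.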

The initial implementation of cargo-quickinstall was hacked together in less than a week. I also took the opportunity to make as many terrible architectural decisions as possible. Proper resume-driven development. Good times.

Do the thing

At Red Badger, there is a saying “Do the right thing. Do the thing right”.

“Do the right thing” is about finding ways to make sure you are actually making something that people find useful. “Do the thing right” is about getting products into the hands of users in a sustainable way, so that we can gather feedback and iterate quickly.

I decided to start by building the feedback bit first. If my system knows which packages most people want, it can “do the right thing” without my intervention, by making sure that those packages get built. Having download counts for my packages would also let me know how valuable my thing is, and whether I should keep doing it.

Stats Server

I knew that the requirements for the stats server were pretty simple, and also that the rest of the system would function just fine without this piece of the puzzle. I optimised for ticking things off of my tech bucket list, rather than building a rock-solid server.

I started out by creating a Cloudflare Workers project in Rust, using their Wrangler devtool, and their KV store. Rust support for cloudflare workers had only just come out at the time, and I quickly realised that taking this approach would be an uphill battle. There weren’t even official rust bindings to their KV store at the time. I had also read somewhere that their KV store was heavily read-optimised (as in “please think of this as a configuration store, and don’t try to make more than 1 write per second”, or something). I had grand dreams that I might one day receive more than 1 request per second, so I decided to switch tack.

The boring choice would be to spin up a Heroku app and write to PostgreSQL. That was too boring though. What other fun resume-expanding technology stack could I use?

One of the most fundamental requirements for cargo-quickinstall has always been that it shouldn’t cost me anything to maintain, so stringing together free-tier teaser offerings was the order of the day.

I remembered meeting with an ad-tech company at a careers fair a few years earlier, and they had a fun architecture. They didn’t have any of their own servers in the hot loop of serving customers. They would serve everything from CDN, including tracking pixels, and then have a cronjob that parsed the CDN logs and used that to generate invoices to their customers. Clever, right? Entirely too web-scale for my own good.

Following this piece of slightly inappropriate architectural inspiration, I span up an empty Vercel project, and started spamming it with requests to random non-existent pages. I then hooked up the log drain to sematext. My client would make a request to a non-existent page, and immediately receive a 404 response. I would then periodically query the sematext Elasticsearch API. No cold-start lambda delays to worry about ¹. Brilliant.

Artifact Storage

For artifact storage, I picked a service that I had used before for hosting debian packages, specifically JFrog’s Bintray service.

I hacked up a script to build a package and upload it to Bintray from my laptop. I ran it on a single package to get me started, and moved on.

Client

Next on the list was the cargo-quickinstall client.

This is basically a glorified bash script.

I wanted cargo install cargo-quickinstall to be as quick as possible, so I only used things that were in std, and shelled out to the system’s curl and tar binaries to do the actual work. curl and tar are both available on modern Windows boxes by default, so this turns out to be a surprisingly portable choice. I also initially did json parsing with jq, but this has since been replaced with tinyjson because apparently nobody has jq installed (they don’t know what they’re missing).

The initial client basically did this:

Automated Builder

The automated builder is responsible for this half of the architecture diagram:

The initial implementation got its list of requested crates from sematext’s Elasticsearch API. It was pretty simple - it would just make a list of all requested packages, and try to build + upload the first one that we didn’t already have a package of in Bintray. If there was nothing to do then it would just build cargo-quickinstall for good luck (which only takes a couple of seconds, so isn’t that much wasted work).
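The selection logic described above fits in a few lines. This is a Python sketch of the idea rather than the real builder, and the function and parameter names are made up.

```python
def next_crate_to_build(requested, already_built):
    """Pick the first requested crate that has no package yet.

    `requested` is the list of crate names seen in the request logs and
    `already_built` is the set of crates that already have a package uploaded.
    """
    for crate in requested:
        if crate not in already_built:
            return crate
    # Nothing to do: build cargo-quickinstall itself, which is cheap.
    return "cargo-quickinstall"
```

Because it only ever picks one crate per run, a backlog gets cleared one cron tick at a time.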

Security and Trust

It’s worth digging into the cargo-quickinstall trust model at this point.

The trust model is currently:

You trust the author of the crate that you asked for, and its dependencies.
You trust me to be acting in good faith, and to have configured GitHub actions and GitHub releases correctly, and my sandboxing to be adequate.
You trust GitHub not to replace everyone’s released binaries with cryptomalware.

This means that (assuming that you trust cargo-quickinstall, and that our sandboxing is solid) by running cargo quickinstall $CRATE, you’re not forced to trust anyone that you’re not already trusting by running cargo install $CRATE.

cargo-quickinstall does not trust the author of any package on crates.io. As soon as we have run the crate’s build.rs or any proc macros, we must treat the build box as compromised. There is some gymnastics involved in achieving this, so bear with me.

GitHub Actions Gymnastics

The cronjob works out which crate needs to be built next for each target architecture, and which runner OS we need to build it on.

The workflow that does the building is given $CRATE $VERSION $BUILD_OS and $TARGET_ARCH. We currently supply these variables by running sed over a template, and committing the result to git. If I was writing it again today from scratch, I might revisit this decision, but this works well enough for now.

We spin up a runner with $BUILD_OS and permissions: {} on it, and do the build. This essentially runs cargo install $CRATE, tars up the resulting binaries, and uses actions/upload-artifact to upload the tarball with a known filename, so that it is available for other jobs in the same build pipeline.

Security notice: I’m assuming that all runners are able to use actions/upload-artifact without any extra creds. ~I’ve not really dug into it that much.~ If it turns out that the runner is being given some kind of god token, and that token is available to $CRATE’s untrusted build.rs for doing anything other than uploading build artifacts then we’re in big trouble. If you believe this to be the case, please email me so that I can stop building new packages and do a proper audit/redesign.

Once the builder is finished, we throw it in the bin, and spin up a new ubuntu-20.04 runner. This downloads the tarball from actions/upload-artifact² and uploads it to GitHub Releases (previously bintray).

By doing this whole dance, we ensure that a malicious crate author can only poison the tarball of their own crate, or of any crates that depend on their crate. If you run cargo install $CRATE then you already trust every crate in $CRATE’s dependency tree, and you already trust GitHub for the crates.io index. Assuming that you trust cargo-quickinstall and that our sandboxing is solid, by running cargo quickinstall $CRATE, you’re not forced to trust anyone that you’re not already trusting by running cargo install $CRATE.

There are probably massive holes in this logic. Even if it’s all sound, cargo-quickinstall has never been audited. If you work at Microsoft/GitHub and/or would like to sponsor a security researcher to help me audit this, please leave a comment on this issue or contact me privately.

EDIT 2022-07-24: When pair-reviewing this post with @alecmocatta, he pointed me to a github security article on the matter. It seems sensible to assume that the runner for the “build” is a VM that contains both the untrusted code and an orchestration process that has access to the GITHUB_TOKEN for the whole build.

He also advised me not to trust any isolation boundary that’s less strong than full-on VM isolation.

This stalled me for a while. I came up with an excessively complicated scheme, and we agreed that it would probably work, but was very unsatisfying. When I sat down to implement it on the weekend, I stumbled on a mechanism to neuter the GITHUB_TOKEN that is sent to the runner. The fix ended up being a single line.

I have not yet found a way for a malicious crate to extract the secret and use it to do anything. I also don’t have any reason to believe that anyone else has done so. I will therefore be keeping all previously built packages published in the repo as-is. If you believe that this is unwise, and would like to help me implement a better security process, please comment on this issue and I will set up a call to pair on it.

Bootstrapping the package list

There is a bit of a chicken-and-egg problem with the approach I have described so far. If you don’t have any users then you won’t have any idea which packages need to be built. New users will always find that we don’t have the packages that they want, so they will stop using our service. This means they will not tell us the names of any more packages that they want building.

To break this cycle, I made a list of popular packages by grabbing the html from https://lib.rs/command-line-utilities and pulling out the package names into a flat text file using pup and jq.

Skipping broken packages

Not all crates build on all platforms. Initially, my builder didn’t have any memory of what it had attempted to build, so when it came across a package that it couldn’t build, it would get stuck and attempt to rebuild it every hour until I manually excluded it. It also did each of the platforms in series, so a broken windows build would block progress on all platforms. It was also racy, so if a build took over an hour (or if I got impatient and triggered multiple builds in an hour) it would sometimes build the same package twice, and then crash when trying to upload it.

This is where the “sed the template and commit it to git” approach comes in. There is a branch for each target (trigger/$TARGET), and each time the cronjob builds, it checks out each trigger/$TARGET branch, using git worktree, and checks what it last attempted to build. It then walks down the list of popular/requested crates, and makes a new commit to trigger a build of the next crate after the one that was last attempted. We still did a lot of useless builds of packages that would never compile, but at least we weren’t head-of-line blocking anymore.
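The "walk down the list" step can be sketched as a small pure function. The name `next_after` and the wrap-around at the end of the list are assumptions; reading the last attempt out of the trigger/$TARGET branch is left out:

```rust
/// Return the crate to build after `last_attempted`, wrapping around at
/// the end of the list. Start from the top if the last attempt is unknown
/// or no longer in the list.
fn next_after<'a>(crates: &[&'a str], last_attempted: Option<&str>) -> Option<&'a str> {
    if crates.is_empty() {
        return None;
    }
    let next_index = last_attempted
        .and_then(|last| crates.iter().position(|name| *name == last))
        .map_or(0, |index| (index + 1) % crates.len());
    Some(crates[next_index])
}
```

Because the choice depends only on the last attempt, a crate that fails to build is simply skipped on the next run instead of blocking everything behind it.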

Later, when we started pushing tags for each successful build (as part of the switch to GitHub releases), we were able to detect repeatedly-failing builds and automatically add the offending packages to the exclude list. This process is a little fragile, and it currently errs on the side of building known-broken packages occasionally, but it’s better than nothing.

Free Tiers Don’t Last Forever

The danger of relying on free-tier stuff is that your provider is not beholden to you in any way. They may take away your service at any time.

The first service to fall was Bintray. Bintray was still serving my compiled crates read-only, and I had a bit of time before they would start deleting them entirely, so I wasn’t in too much of a rush, but if I didn’t find an alternative host eventually then I would have to put cargo-quickinstall in the bin.

Around this time, I was mentoring a Hack and Learn, and the spotify-tui maintainer pointed out that they use GitHub actions to make their releases, and that the release artifacts would show up with predictable URLs. I kicked off a new version of the builder that could upload to GitHub Releases, and then made a release of the client which could fetch from both places.

The next free-tier service to go away was sematext. When sematext ended the free tier that my log pipeline was using, I decided that it was probably time to pick a more traditional architecture. I added a couple of typescript endpoints to my Vercel site so I was no longer relying on logs of 404 errors. Boring. I like boring. Boring is good, especially for things that people are using, and that I need to actually maintain.

The joy of working with other people 🤝

I have really enjoyed working with external contributors on cargo-quickinstall. The client is super simple, so it is reasonably approachable for beginners. After mentoring two Rust London Hack and Learn events, most of the low-hanging fruit has been picked, but there are approachable issues that show up from time to time. If you want to have a go at one, check out the good first issue label.

Next Steps

There are a few open issues on the board, and I’m happy to mentor people on any of them. The issue that I’m especially interested in mentoring someone on is the one for building static binaries for non-ubuntu-20.04 support, and shelling out to cargo-binstall for the more complex fallback behaviour.

At the moment, there are no time-critical issues on the board (no security issues, and nothing that represents a regression for existing users in CI), so I am mostly leaving things open and offering mentoring on them. It is more valuable at the moment to get more people familiar with the codebase, and improve the bus-factor of the project.

Speeding up `cargo build` as well

The other reason for me taking this approach with cargo-quickinstall is cargo-quickbuild. This is a project idea to prebuild parts of dependency trees, rather than just the end result. I have started work on this over in a new cargo-quick repo. The idea is to have a tool to make from-scratch builds quicker, by providing a central repo of prebuilt crates (think docker layers for your target dir). It will have the same trust model as cargo-quickinstall, but a slightly more complex architecture. Once quickbuild has come along a bit further, I will port the quickinstall builder to use it, and then merge cargo-quickinstall into the cargo-quick repo.

This raises the question:

Why not use sccache, nix, bazel or cargo-chef?

To avoid bloating the end of this post, I have split this out into its own post. The conclusion of which is:

Rust’s tooling excellence owes a lot to the unifying influence of cargo for build + docs + testing. Its major shortcoming is long build times. My aim with quickbuild is to meet users where they are, because cargo is an excellent place to be. I’m hoping to produce meaningful speed-ups of from-scratch builds, without requiring configuration changes for the user’s computer/project.

I also aim to build on shared infrastructure, available to all, so you don’t need any involvement from finance or your ops team. I will be making use of free “open source tier” compute resources for building my packages, but they will be available for use by anyone to reduce their build times and CI costs, as long as they are happy to share their rust flags, and the list of dependencies from their Cargo.toml.

I am coming to the end of my time at Tably and I plan to spend August house hunting and working on quickbuild, before looking for my next job. If you would like to become the first sponsor of this work, please go to my GitHub Sponsors page.

¹ In practice, when I made the client, I made it do this request in a background thread anyway, so it doesn’t really matter how long my cold-start time is.
² previously bintray

Monitoring Temperature (with too many Bluetooth thermometers)

2022-02-19T00:00:00+00:00

This is the blog post form of a presentation given at Rust London - 27 April 2021.

A video of the talk is available on youtube, and slides are available in the project’s repo.

The slides are written in markdown using remark, and this blog is also markdown. Let’s see how well this translation job goes.

Backstory

We started with a few ESP32 dev-boards like this:

These cost around US$16 each, and don’t last more than about a day on battery power.

ESP32 is a super-cheap system on chip with bluetooth and wifi, but dev-boards will always be more expensive than commercial off-the-shelf hardware.

During lockdown, we were sitting around the dinner table, and I asked my housemate “Wouldn’t it be nice to have a hundred temperature sensors? What could we do with that many sensors?”

So we bought 20 of these, at $3 each, and hooked them up to the internet.

System Overview

This is what we built:

Let’s take a look at the decisions we made, and how they turned out.

Rust

We picked Rust because we were both starting to use Rust for work, so using Rust for a personal project was a good opportunity for learning, for both of us.

Andrew is working on crosvm and Virt Manager for Android.

I was using Rust for the backend of the FutureNHS project (I have since worked on other Rust projects at Red Badger, and moved on to work at Tably.com, with backend and frontend written in Rust).

It was also a good chance to work on something together during lockdown.

I also found a blog post describing how to connect to these sensors with Rust. This gave us the burst of momentum that we needed to start the project, but in the end, we outgrew its initial structure, and made our own project.

MQTT

MQTT is the pubsub of choice for low-powered gadgets.

Homie is an auto-discovery convention built on MQTT.

In Rust, the rumqttc library is pretty good:

It works using channels, which is a nice interface.
Andrew has submitted patches, and they were well received.

Rust Bluetooth in 2020

The state of Rust Bluetooth in 2020 was a little underwhelming. The options were:

blurz - “Bluetooth from before there was Tokio”
- We started with this.
- Talks to BlueZ over D-Bus, but single-threaded and synchronous.
- Blocking device.connect() calls. 😧
- Unmaintained (for 2 years).

btleplug - “cross-platform jumble”
- Theoretically cross platform, but many features not implemented.
- Linux implementation needed root access.
- Too many panics for us to use.

Aside: Concurrency

The main problem with blurz was that it exposed a single-threaded blocking library interface:

We realised that there was a third approach:

dbus-rs - aka “roll your own BlueZ wrapper”
- We could generate a “-sys” crate from D-Bus introspection, using the tools provided by the dbus-rs project.
- The dbus-rs codegen produces synchronous or async interfaces, so you can pick whichever approach you want.

After switching to an async library, we got:

This almost solves the problem, but not quite. In our case, everything lives in a big Arc<Mutex<GlobalState>>.

The solution is to hold the Mutex for as little time as possible.

This is much better.

These are the concurrency tools that we use:

Arc<Mutex<GlobalState>>
- Used for all of our state.
- Easy refactor from &mut GlobalState.
- Fine as long as you know where the lock contention is.
- Only hold the mutex when you need it, be careful of await points.
Unbounded Channels
- Used for all bluetooth events, and all MQTT traffic.
- Fine if you know they’re not going to back up.
Stream
- Used as the consumption API of the Channels.
- Just the async version of Iterator.
- map(), filter() and select_all() are easy to use.
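The "only hold the mutex when you need it" advice can be sketched with std alone. This uses blocking code rather than async to stay dependency-free, and the `sensor_names` field and function names are hypothetical:

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the application's shared state (the field is hypothetical).
struct GlobalState {
    sensor_names: Vec<String>,
}

impl GlobalState {
    fn shared(names: &[&str]) -> Arc<Mutex<GlobalState>> {
        let sensor_names = names.iter().map(|name| name.to_string()).collect();
        Arc::new(Mutex::new(GlobalState { sensor_names }))
    }
}

/// Copy out what we need while holding the lock, and nothing more.
fn snapshot_sensor_names(state: &Arc<Mutex<GlobalState>>) -> Vec<String> {
    let guard = state.lock().unwrap();
    guard.sensor_names.clone()
    // `guard` is dropped here, so the mutex is released when we return.
}

/// The slow work (standing in for a Bluetooth connect) runs without the lock.
fn connect_all(state: &Arc<Mutex<GlobalState>>) -> Vec<String> {
    let names = snapshot_sensor_names(state);
    names
        .into_iter()
        .map(|name| format!("connected to {name}"))
        .collect()
}
```

In async code the same shape applies: let the guard go out of scope before any await point, so other tasks are not blocked while a slow connect is in flight.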

Bluetooth Developments

We ended up building on top of our “-sys” Bluetooth library, and created: bluez-async

Linux only
Typesafe async wrapper around BlueZ D-Bus interface.
Sent patches upstream to dbus-rs to improve code generation and support for complex types.
Didn’t announce it anywhere, but issues filed (and a PR) by two other users so far.

Andrew has also been contributing to btleplug

Ported btleplug to use bluez-async on Linux.
Exposes an async interface everywhere.
There are a few bugs that need fixing before they make a release though.

Results

We now have graphs like this, with inside and outside readings:

and readings from our fridge:

and we can plot trends using Pandas and Plotly:

Will’s setup, with MiFlora sensors

I gave some to my workmate:

so you can tell when Will waters his plants:

and when the dehumidifier kicks in in the cellar:

CloudBBQ

We also got it working with a meat thermometer (backstory: one of the people who sent us patches was using it with a bbq meat thermometer, so I bought one for Andrew as a joke present):

so now we have a graph of our roast:

Closing Remarks

Separating things into layers (and crates) worked well:

App (mijia-homie) -> Sensor (mijia) -> Bluetooth (bluez-async) -> D-Bus.
App (mijia-homie) -> Homie (homie-device) -> MQTT.
MQTT -> Homie (homie-controller) -> homie-influx -> InfluxDB

Deployment

Everything is supervised by systemd.
Built with Github Actions and cross, packaged with cargo-deb.
Test coverage is a bit thin (blame me for this).

One major limitation is that the Raspberry Pi only supports 10 connected BLE devices (10 ≪ 100). The way to get around this problem would be to make the mijia sensors include the temperature and humidity data in their advertising broadcast packets, and then passively listen to them on the Raspberry Pi. There are a handful of projects that provide custom firmware to flash onto the mijia sensors, and many of them let you do exactly this. If anyone has done this to their sensors, we would be really interested to hear from you. Adding support for reading sensors in this way would allow us to deploy many more sensors, and would also drastically reduce how many cell batteries we go through in a year.
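To give a feel for what passive listening involves, here is a sketch of decoding a reading from an advertisement payload. The byte layout is invented for illustration; each custom firmware defines its own format, and none of this is in mijia-homie today:

```rust
#[derive(Debug, PartialEq)]
struct Reading {
    temperature_celsius: f32,
    humidity_percent: u8,
}

/// Decode a reading from advertisement service data.
/// Hypothetical layout: bytes 0-1 hold the temperature in tenths of a
/// degree (signed, big-endian) and byte 2 holds relative humidity in percent.
fn decode_advertisement(data: &[u8]) -> Option<Reading> {
    if data.len() < 3 {
        return None;
    }
    let raw_temperature = i16::from_be_bytes([data[0], data[1]]);
    Some(Reading {
        temperature_celsius: f32::from(raw_temperature) / 10.0,
        humidity_percent: data[2],
    })
}
```

Since no connection is made, the 10-device limit never comes into play: the Raspberry Pi only has to scan and decode whatever it hears.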

Docker Yaks

2021-08-13T00:00:00+00:00

I have given myself a target of not installing Docker Desktop on this laptop since I did the reinstall. Let’s see how much pain that causes.

I also don’t want to install VirtualBox, because it feels like a dead end. For some reason, brew install --cask vagrant wants root access (maybe it’s just because they decided to dump files in /opt/vagrant? Maybe it bundles VirtualBox? At any rate, I decided against installing it)

Qemu is capable of installing without root, because it uses the MacOS Hypervisor.Framework. It is also what the Android AVD emulation system uses under the hood, last time I checked.

What can I install with qemu and no vagrant?

My first thought was to try packer, but all of the public packer templates that mention qemu are significantly out of date, and don’t build.

I started looking around for qcow2 images, and searching for “qcow2 debian” finds some images for use with OpenStack at https://cdimage.debian.org/cdimage/openstack/current/. I don’t have any of the openstack tools installed on my mac, but maybe I can still use the images?

Fedora has a cloud spin, and also coreos. I downloaded both. CoreOS from https://getfedora.org/en/coreos/download?tab=metal_virtualized&stream=stable

The coreos installation requires virt-install, which requires python gobject-introspection bindings, which don’t seem easy to install.

Following https://docs.fedoraproject.org/en-US/fedora-coreos/producing-ign/ to produce an ignition file, and then using the virt-install command on https://docs.fedoraproject.org/en-US/fedora-coreos/getting-started/ to guess what the underlying qemu command might look like, we get:

qemu-system-x86_64  -nographic /Users/alsuren/Downloads/fedora-coreos-34.20210725.3.0-qemu.x86_64.qcow2 -fw_cfg name=opt/com.coreos/config,file=coreos/docker-host.ign

which gets quite a long way, and then says:

[    1.680861] Trying to unpack rootfs image as initramfs...
[    1.681558] Initramfs unpacking failed: invalid magic at start of compressed archive

(killall qemu-system-x86_64 is the only way to escape from this, and it puts the shell into a strange state where readline editing of long commands stops working, so be ready to throw away a lot of shells)

They do have a page on using qemu directly: https://docs.fedoraproject.org/en-US/fedora-coreos/provisioning-qemu/

Adapting their example to our filenames, we get:

qemu-system-x86_64 -m 2048 -nographic -snapshot \
    -drive if=virtio,file=/Users/alsuren/Downloads/fedora-coreos-34.20210725.3.0-qemu.x86_64.qcow2 \
    -fw_cfg name=opt/com.coreos/config,file=coreos/docker-host.ign \
    -nic user,model=virtio,hostfwd=tcp::2222-:22

After about 200 seconds, you end up with a box that you can ssh into, like this:

ssh core@localhost -p 2222

… but can I use it for doing docker-fwd things?

Dump this in ~/.ssh/config:

Host localhost
  User core
  HostName localhost
  Port 2222

ssh localhost mkdir -p ./$PWD
ssh localhost git init ./$PWD
git remote add docker-fwd --fetch localhost:./$PWD

git push docker-fwd HEAD:incoming
ssh localhost "cd ./$PWD && git reset --hard incoming"

git commit
git push docker-fwd HEAD:incoming
ssh localhost "cd ./$PWD && git merge --ff-only incoming"

🎉

And then you can open vscode remote over ssh and keep hacking.

Running docker is hella-slow though.

docker run --rm \
  --volume="$PWD:/srv/jekyll" \
  -it jekyll/jekyll:latest \
  jekyll build

TODO: work out how to specify groups in the butane configs.

Once you’ve added yourself to the docker group, it takes half an age, and then gives:

/usr/local/lib/ruby/2.7.0/fileutils.rb:250:in `mkdir': Permission denied @ dir_s_mkdir - /srv/jekyll/.jekyll-cache (Errno::EACCES)

It seems that I’m going to be fighting against half a decade of sloppy docker permissions

Saturday

persistent filesystem:

qemu-img create -f qcow2 -b /Users/alsuren/Downloads/fedora-coreos-34.20210725.3.0-qemu.x86_64.qcow2 my-fcos-vm.qcow2

qemu-system-x86_64 -m $((1024*8)) -nographic \
    -drive if=virtio,file=my-fcos-vm.qcow2 \
    -fw_cfg name=opt/com.coreos/config,file=$HOME/src/docker-fwd/coreos/docker-host.ign \
    -nic user,model=virtio,hostfwd=tcp::2222-:22

For some reason, vscode doesn’t like to connect to this today. I will try a distribution that I understand and come back to it.

Let’s try this first: https://fabianlee.org/2020/03/14/kvm-testing-cloud-init-locally-using-kvm-for-a-centos-cloud-image/

… okay, maybe https://sumit-ghosh.com/articles/create-vm-using-libvirt-cloud-images-cloud-init/

genisoimage -output cidata.iso -V cidata -r -J user-data meta-data

becomes (according to https://apple.stackexchange.com/questions/121491/equivalents-for-genisoimage-and-qemu-img-on-ubuntu)

brew install cdrtools
mkisofs -output cidata.iso -V cidata -r -J user-data meta-data

qemu-img create -f qcow2 -b  ~/Downloads/debian-10-openstack-amd64.qcow2 debian-10-openstack-amd64.qcow2 30G

qemu-system-x86_64 -m $((1024*8)) -nographic \
    -drive if=virtio,file=debian-10-openstack-amd64.qcow2 \
    -cdrom cidata.iso \
    -nic user,model=virtio,hostfwd=tcp::2222-:22

Each time you nuke the image, you need to clear out the [localhost] entry in known_hosts, like this:

sed -i '' -n  '/^[^[]/p' ~/.ssh/known_hosts

VSCode ssh seems a bit fucked. It starts up fine and then errors out a few seconds after opening a folder.

The green-and-blue debian prompt makes me feel more powerful than the coreos monochrome one. Funny how our minds work.

Back to docker again

I ended up adding this to my ssh config (dof being shorthand for docker-fwd):

Host dof
  HostName localhost
  Port 2222

I now have a vm that’s capable of running docker images, so now I can start debugging my theme.

Musings about synced git commits

I think I want something like:

git commit $file -m "auto-commit $file"
git branch -D outgoing
git branch outgoing  # this will make a branch, but leave you on master
git reset HEAD^  # without changing what's checked out
git push dof outgoing:incoming

Open questions:

How do you go about running this in a watchexec loop? Do you need to have a local git worktree off to the side, that does git restore --source=outgoing $file each time you make a change, to get a fast-forward history for pushing to the remote?
How do you deal with staged files?

I think what I really want is a bare git repo living in another directory, that knows to ignore everything about the .git dir in the local repo.

In the case where you’ve just made a commit and want the remote to sync up, you might be able to add this to your post-commit hook:

#!/bin/bash

git push docker-fwd HEAD:incoming
ssh localhost "cd ./$PWD && git merge --ff-only incoming"

Initial impressions of nushell

2021-07-26T00:00:00+00:00

Recently I’ve been recording my impressions of technologies as I start using them. This is so that I can remember what my pain points are when I’m onboarding other people later, or contributing to the project.

Why am I doing this?

I am trying out nushell because I really don’t like zsh’s input system (zle). Readline is much better than zle. Zle doesn’t have a concept of short vs long words. If you set zle into “bash” word mode, it treats - like whitespace, so typing esc-backspace after ls -- . will delete everything (from memory). Nushell uses a rust-based readline clone for its text input, so my muscle memory is intact.

I had a go at teaching zle how to differentiate ^W from esc-backspace, but it’s a mix of C and shell horribleness. I suspect that bending nushell to my will will be easier than fixing zsh.

I also tried fish a while back. The thing that forced me back to bash was the lack of !!/!$ support, I think. nushell also doesn’t have these things, but they haven’t explicitly ruled out supporting it, and maybe I could submit it as a patch.

I also really love apache arrow, and running a shell with polars/arrow dataframes built in makes me really excited.

History

The history recording seems completely broken. I was trying to look at the history of what I’ve done, and ended up with a bunch of repeated sequences with entries like this:

│ n
│ false
│ ckout main

I wonder if it’s similar to the problem that bash had, where multiple shell processes confuse each other when they close. I quite like what apple did with per-session history files in a directory, and a cleanup job to consolidate them (this was also used to recover sessions with separate scrollback and history after a restart)

Table rendering

As I was pasting that, I noticed that it was padded with trailing whitespace. This means that when you resize your terminal, it will cause all sorts of problems.

Unix command compatibility

I get the impression that nushell wants to be a data processing language like R, but doesn’t really care about being a capable unix shell.

Piping things into unix commands

In bash, echo automatically separates argument words with spaces, and appends a newline at the end.

Nushell’s echo turns its arguments into a list. Piping a list into a unix command will concatenate the contents without any separators, and won’t add a trailing newline.

I’ve taken to doing this, but it feels very sad:

echo 'some/path' | fmt | tee -a .gitignore
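The difference between the two echo behaviours described above can be modelled in a few lines. This is only a model of what ends up on the external command's stdin, with made-up function names, not nushell's implementation:

```rust
/// What bash's `echo` writes: arguments joined by spaces, plus a newline.
fn bash_echo(args: &[&str]) -> String {
    format!("{}\n", args.join(" "))
}

/// What an external command receives when a nushell list is piped to it,
/// as described above: the items concatenated, with no separator or newline.
fn nushell_list_piped(args: &[&str]) -> String {
    args.concat()
}
```

The missing trailing newline is the part that bites when appending to a file like .gitignore, which is why the workaround pipes through fmt first.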

Exit codes

By default, nushell ignores the exit code of unix programs, so running /usr/bin/false will not tell you that anything is wrong.

This is concerning, because it means that all pipeline failures are going to be ignored. It’s the philosophical opposite of set -euo pipefail at the top of your bash scripts.

If you set nonzero_exit_errors = true in your config, you end up with a slightly-too-verbose error:

$ /usr/bin/false
error: External command failed
  ┌─ shell:1:1
  │
1 │ /usr/bin/false
  │ ^^^^^^^^^^^^^^ command failed

I’ve started trying to hack this up in https://github.com/nushell/nushell/pull/3840

Quoting of semicolons

I wanted to write this:

git commit -am "do first thing; do second thing"

but the semicolon got parsed by the shell somehow. My commit message got truncated, and I got sh: do second thing: command not found printed by the shell.

Interestingly, using single quotes avoids this problem, so I suspect that this is a bug. I like to write words like don't in my commit messages, so I prefer to use double-quotes for this.

Quoted paste mode

When setting up a git repo, you are told to:

git remote add origin git@github.com:alsuren/docker-fwd.git
git branch -M main
git push -u origin main

If you are using readline with quoted paste enabled, this will give you a multiline input, and you can hit enter once to execute the whole lot. nushell’s default behaviour is similar to the readline default settings, so it will execute the first command and only read the second and third lines once the first has finished executing (unless the first command reads from stdin, in which case, they may go missing entirely).

That’s all for now

I will keep editing this post until I have run out of beginner papercuts.

Early impressions of NATS

2021-07-06T00:00:00+00:00

First impressions of technologies are quite important for driving adoption, so I’ve started writing down my early impressions as I explore different technologies.

NATS is a distributed messaging system.

(Originally posted on the red-badger tech site)

Protocol

NATS prides itself on its protocol’s simplicity. It is a text based protocol with length-encoded payloads and a very small number of verbs, a bit like http 1.0. My first introduction to NATS was the keynote video on their website though (https://www.youtube.com/watch?v=lHQXEqyH57U), which talks about private keys and JWTs, so this initial simplicity feels like a trap.
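To show how small the core protocol is, here is a sketch of encoding a single PUB frame by hand: a text header that carries the payload length, then the raw payload, each terminated by CRLF. The `encode_pub` helper is my own name, not part of any NATS client library:

```rust
/// Encode a NATS `PUB` frame: `PUB <subject> [reply-to] <#bytes>` followed
/// by CRLF, then the payload, then another CRLF.
fn encode_pub(subject: &str, reply_to: Option<&str>, payload: &[u8]) -> Vec<u8> {
    let header = match reply_to {
        Some(reply) => format!("PUB {subject} {reply} {}\r\n", payload.len()),
        None => format!("PUB {subject} {}\r\n", payload.len()),
    };
    let mut frame = header.into_bytes();
    frame.extend_from_slice(payload);
    frame.extend_from_slice(b"\r\n");
    frame
}
```

The length prefix means the payload can contain any bytes, including CRLF, without escaping.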

Reliability

If the client reaches this internal limit, it will drop messages and continue to process new messages. This is aligned with NATS at most once delivery. It is up to your application to detect the missing messages and recover from this condition.

– https://docs.nats.io/nats-server/nats_admin/slow_consumers

This is a different approach from other queueing systems that I have used. I am a little bit wary.

Security

It’s slightly concerning to find out that --routes=nats://ruser:T0pS3cr3t@nats:6222 is baked into the default nats image. I wonder how many installations are running with that default exposed to the public internet. (I have heard that the security doesn’t come from secrets stored in the cluster though, so this might not be important?)

Documentation

I started reading down the sidebar of the NATS docs from https://docs.nats.io/nats-server/installation and ended up encountering NATS 2.0 auth concepts at the same time as JetStream. I can’t remember how that happened. I’m now cycling back around to the top of the docs sidebar, and seeing whether the learning concepts flow a bit better.

Request-Reply

This feels like the way you do blocking-style function calls to other actors in Erlang (typically the message-sending and response-waiting logic is hidden away behind the actor’s public API interface in Erlang. I expect that you would do something similar when using Request-Reply in NATS. Maybe you would wrap up the interface in an IDL and do codegen to generate an ergonomic interface?).

This feels very positive. This pattern of communication isn’t really a thing in RabbitMQ-land, and I have seen people do highly questionable things like dispatching a message and then polling a database to ensure that the message was actioned before returning from a web request. Hopefully we won’t see anything like this in NATS-land. I’m sure we’ll find many other pathological anti-patterns though.

Queue Groups

Their concepts page on queue groups says that you can do some magic with consumer groups, but doesn’t really say how they’re configured. I wonder whether misconfigured queue groups are a thing that you see in production. I guess I’ll find out when I read their tutorial.

Acknowledgements and Sequence Numbers

These both fall into the “we’re providing you with an unreliable channel” bucket. I don’t think that I would want my business logic to be worrying about things like this. I wouldn’t be surprised if real-world NATS applications end up having multiple layers of abstraction on top of these concepts.
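This is the kind of bookkeeping that I would expect to end up in one of those abstraction layers. A minimal sketch of gap detection, assuming sequence numbers increase by one per message and arrive in order (the function name is made up):

```rust
/// Given the sequence numbers seen so far, in arrival order, return the
/// ones that were skipped and therefore need application-level recovery.
/// Duplicates and already-seen numbers are ignored.
fn missing_sequences(received: &[u64]) -> Vec<u64> {
    let mut missing = Vec::new();
    let mut expected = match received.first() {
        Some(first) => *first,
        None => return missing,
    };
    for &seq in received {
        if seq > expected {
            // Everything between what we expected and what arrived was dropped.
            missing.extend(expected..seq);
        }
        if seq >= expected {
            expected = seq + 1;
        }
    }
    missing
}
```

For example, receiving 1, 2, 5, 6, 9 reports 3, 4, 7 and 8 as missing.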

Acronyms and glossary

The NATS ecosystem is full of bullshit acronyms. I think the maintainers think they’re funny. They are also big into truncating words in their command-line tools, rather than just writing proper bash completions for them. Mostly all it does is make it harder to keep track of what everything means.

So far, we have:

NATS: “Neural Autonomic Transport System” – https://docs.nats.io/faq#what-does-the-nats-acronym-stand-for
NGS: ??? (maybe “NATS Global Service”?)
From https://docs.nats.io/nats-tools/nats-tools:
- nats - Interact with and manage NATS
- nk - Generate NKey
- nsc - Configure Operators, Accounts and Users (What does NSC mean? This tool also appears to include functionality from the nats and nk tools. Feels like one or more of these tools should be deprecated)
- nats account server - Serve Account JWTs (this is sometimes abbreviated as nas in the docs, which collides in my head with “network attached storage”. This also sounds like it is a deprecated way to do auth with nats, since it is inlined)
- nats top - Monitor NATS Server
- nats-bench - Benchmark NATS Server
- prometheus-nats-exporter - Export NATS server metrics to Prometheus and a Grafana dashboard.

NGS

Following along with https://synadia.com/ngs/signup

I quite like the curl | python approach. I guess maybe this is more portable than curl | sh?

nsc init -o synadia -n First - I suppose if you made the tool then you can make sure your company is in the list of preconfigured operators. I wonder who else is in that list. Would there be any pushback if a big player like AWS started offering the same service, and wanted to be included in the list of operators?

I wonder how the ngs.echo topic works (whether it’s patched into the server, or a process that has access to that topic on all accounts). Same goes for the other things in ngs.*.

Default connections and hidden context

nsc tool pub implicitly reaches into some configs somewhere and works out where it needs to connect. Legacy tools like nats-pub don’t do this, and neither do client libraries (or tools that use them, like wasmcloud).

Leaf nodes

It feels like everyone wants you to install a nats leaf node as a sidecar (wasmcloud’s wash cli tool doesn’t even have a way to configure it to connect directly to NGS as a control plane). Following along with https://docs.nats.io/nats-server/configuration/leafnodes#leaf-node-example-using-a-remote-global-service shows you how to do it. You need a verified email address and payment method in order to do this.

Messaging vs Queueing systems

In its most basic configuration, NATS is not a queueing system, because it doesn’t have persistence or retries. It is more reasonable to think of it as a messaging system. I suppose in some ways it’s like UDP, but wrapped in TLS+auth, and with foo.bar.baz-style broadcast addresses. It’s a good fit for writing your app cluster’s backplane in, but not necessarily your job scheduler.
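
The "broadcast address" part works through wildcards on dot-separated subjects: `*` matches exactly one token, and `>` matches one or more trailing tokens. A minimal sketch of that matching rule (subject_matches is a name invented here, not something from a NATS library):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subscription pattern matches a subject."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # ">" must be the last token and needs at least one token to swallow
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[i]:
            return False
    return len(pattern_tokens) == len(subject_tokens)
```

So a subscriber to foo.*.baz or foo.> would receive a message published to foo.bar.baz, but a subscriber to foo.* would not.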

There is a mode in NATS (req) where you can send a return-address with your message, and the receiver[s] can respond to that return address. This primitive can be used to build reliable systems (like how TCP is built on top of IP packets).
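
The return-address idea can be sketched with a toy synchronous bus. This is an illustration of the pattern only: ToyBus is invented for this example and is not the NATS client API, although the `_INBOX.` prefix mirrors the naming convention that NATS clients use for reply subjects.

```python
import itertools
from collections import defaultdict


class ToyBus:
    """Synchronous in-memory bus (illustration only, not the NATS client API)."""

    def __init__(self):
        self._subs = defaultdict(list)  # subject -> callbacks
        self._inbox_ids = itertools.count(1)

    def subscribe(self, subject, callback):
        self._subs[subject].append(callback)

    def unsubscribe(self, subject):
        self._subs.pop(subject, None)

    def publish(self, subject, payload, reply_to=None):
        for callback in list(self._subs.get(subject, [])):
            callback(payload, reply_to)

    def request(self, subject, payload):
        """Publish with a throwaway return address and collect the first reply."""
        inbox = f"_INBOX.{next(self._inbox_ids)}"
        replies = []
        self.subscribe(inbox, lambda reply, _reply_to: replies.append(reply))
        try:
            self.publish(subject, payload, reply_to=inbox)
        finally:
            self.unsubscribe(inbox)
        # With no responder there is no reply; a real client would time out here.
        return replies[0] if replies else None


def make_echo_bus():
    bus = ToyBus()
    # The responder never learns who asked: it just answers the return address.
    bus.subscribe("echo", lambda payload, reply_to: bus.publish(reply_to, payload.upper()))
    return bus
```

Calling make_echo_bus().request("echo", "hello") returns the responder's reply, while a request to a subject with no responder returns None, which is the case that a reliability layer on top would have to handle with timeouts and retries.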

On top of this NATS backplane, you can add a bunch of applications, and connect them up. The NATS server comes bundled with a JetStream application, that you can enable with the -js flag. This is basically rabbitmq-over-NATS, but it can get away with being a lot simpler, because its communication protocol is NATS rather than TCP.

Conclusions

I can see why WasmCloud chose NATS for its “lattice” backplane. The basic protocol feels simple and solid. It feels like the kind of protocol that you’d expect to see described in an RFC. Once you have this communication system in place, the work of doing inter-process communication feels like it will have a lot less friction. The difference between using HTTPS and NATS is like the difference between setting up unix domain sockets for each of your processes vs using D-Bus (although D-Bus is terrible in ways that NATS manages to sidestep).

It is a reasonable initial reaction that the basic NATS protocol doesn’t quite get you all the way. I think that JetStream covers some of those holes, but I’m not sure whether it belongs in the core NATS server implementation. I’m hoping that the nats-server implementation continues to gain traction without adding too much bloat (or if it does get bloated, that they release a tiny-nats-server binary that can be used as a low-overhead kubernetes sidecar).

Early impressions of WasmCloud

2021-07-06T00:00:00+00:00

First impressions of technologies are quite important for driving adoption, so I’ve started writing down my early impressions as I explore different technologies.

WasmCloud is an application platform built on WASM (like erlang’s BEAM, but you deploy WASM to it rather than erlang).

(Originally posted on the red-badger tech site)

Documentation

You start at wasmcloud.dev (“Docs Home”), and can follow a getting started guide from there, but there is also a “Home” link in the top bar that takes you to wasmcloud.com, which links out to a bunch of commercial courses. I haven’t looked into the courses, but the fact that they exist feels reassuring.

The rest of the documentation is not particularly polished. There are broken links in a few places, and a handful of typos, including:

$ ctl link MCUDTMOOZCVAM5EBNN4U3X2OGHNIY3BKPEW66HY4RTCYYVWXOE7ESVDQ VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4A YJ5RKAX2UNSCAPHA5M wasmcloud:httpserver PORT=8080 in https://github.com/wasmcloud/examples/tree/main/echo (probably caused by copy-pasta out of the wash shell tui)
Both Architecture sections are missing (https://wasmcloud.dev/reference/host-runtime/architecture/ and https://wasmcloud.dev/reference/lattice/architecture/). This feels like a job for excalidraw.
s/on-premise/on-premises/ - https://wasmcloud.dev/reference/lattice/leaf-nodes/
https://crates.io/crates/wash-cli “control-interface” links to https://github.com/wasmcloud/wasmcloud/tree/main/crates/control-interface, which is a broken link.
Get some CI to check that they don’t have any broken links?
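
A link check of that kind doesn't need much. The following is a minimal sketch using only the Python standard library, not a recommendation of a particular tool; the function names are made up, and treating any fetch failure or 4xx/5xx response to a HEAD request as "broken" is a simplifying assumption (some servers reject HEAD).

```python
import http.client
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s()<>\[\]]+")


def extract_urls(text):
    """Return the unique URLs in text, in order of first appearance."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:'\"")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def is_broken(url, timeout=10):
    """True if the URL cannot be fetched or answers with a 4xx/5xx status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, as are timeouts.
        return True


def broken_links(text):
    """List every URL in text that looks broken (performs network requests)."""
    return [url for url in extract_urls(text) if is_broken(url)]
```

Running broken_links over each documentation page in CI and failing the build when the result is non-empty would have caught the control-interface link above.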

`wash`

The getting started guide recommends that you have wash set to 250 characters wide. This is a bit ridiculous. It also has a “REPL (Standalone)” thing that flashes in the top-left corner. Why? Can you please just not. (I ended up cloning the repo and patching this out)

Can I just run wash in a daemon mode, and inject commands into it from bash? When looking into this, I went to https://github.com/wasmCloud/wash and found a broken link to control-interface (looks like this has been split into the wasmCloud repo).

wasmcloud

My instinct is always to live in the root of a git repo, so when following along with https://github.com/wasmcloud/examples/tree/main/echo, I did wasmcloud -m echo/manifest.yaml. This caused wasmcloud to try resolving .wasm filenames in the root of the repo rather than in the echo dir. It would be much cleaner for imports to be relative to the manifest file rather than wasmcloud’s $PWD.

This probably isn’t a very difficult fix. I should open an issue or make a PR at some point.
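
The rule being asked for is small. Here it is as a Python sketch of the path logic only: wasmcloud itself is written in Rust, resolve_reference is a made-up name, and treating anything that doesn't end in .wasm as a registry reference is an assumption.

```python
from pathlib import Path


def resolve_reference(manifest_path, reference):
    """Resolve a manifest entry; local .wasm paths are relative to the manifest's directory."""
    if not reference.endswith(".wasm"):
        # Assumed to be a registry reference such as localhost:5000/echo:0.2.2
        return reference
    path = Path(reference)
    if path.is_absolute():
        return str(path)
    return str(Path(manifest_path).parent / path)
```

With this rule, running from the repo root with echo/manifest.yaml would look for a relative .wasm file under echo/ rather than under $PWD.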

docker-compose

I like that their instructions get you to run a docker registry on localhost:5000 in their docker-compose setup. When following along with the dapr kubernetes instructions, they wanted you to use a real container registry, so you end up uploading your images and downloading them, even though you are only running things on localhost (Cedd flagged this at the time, but I couldn’t remember off the top of my head what the best way to do it was). This way is much nicer.

Once you have this working, it starts a NATS server, so you end up in lattice mode. You can then leave the wash shell in a background tab and type commands like wash ctl get hosts and get output from stdout where you expect, which makes me want to throw up a bit less.

zero downtime migrations

I wanted to tear down the wash ui entirely, because it makes me sad. Could I add a wasmcloud host on the command-line and then give it a new web server capability provider and tear down the wash node?

Starting a new node:
- wasmcloud
- wash ctl get hosts
… there is no way to migrate actors, so create a new one
- wash ctl start actor localhost:5000/echo:0.2.2
- … oh, that comes up with the same ID, doesn’t it? Does that mean it’s already linked to the http server?
- Maybe? I’m not sure.
What about a capability provider?
- wash ctl start provider wasmcloud.azurecr.io/httpserver:0.11.1
- the capability provider then spits out: thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 48, kind: AddrInUse, message: "Address already in use" }', src/lib.rs:91:14, which is not hugely surprising. This wouldn’t be a problem if it was running in a docker container, I suppose. It is possible to make this work on linux using SO_REUSEPORT (https://lwn.net/Articles/542629/), but it’s a huge hack, and I would not expect actix to support this out of the box.
Okay, then bring down the original node, and see what happens:
- curl localhost:8080/echo: curl: (7) Failed to connect to localhost port 8080: Connection refused
- Not surprising, given the panic
Can we kick the provider into action again?
- wash ctl link MDNYN4IMBBLOCOTWSRES4NLLLHAMH6UEX527K7M6HPEGRZFG3HVHTXJL VAG3QITQQ2ODAOWB5TTQSDJ53XK3SHBEIFNK4AYJ5RKAX2UNSCAPHA5M wasmcloud:httpserver PORT=8080
- curl localhost:8080/echo: {"method":"GET","path":"/echo","query_string":"","headers":{"host":"localhost:8080","accept":"*/*","user-agent":"curl/7.64.1"},"body":[]}
- looks like that worked.

I didn’t get any indication that my second capability provider was in a non-functioning state, apart from the panic in the logs. This isn’t hugely encouraging.
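
To make the SO_REUSEPORT aside concrete, the sketch below uses Python's socket module (not the capability provider's Rust code) to show that a second listener on the same port fails with "Address already in use" unless both listeners opted in to SO_REUSEPORT before binding. It assumes a platform that exposes SO_REUSEPORT, such as Linux or macOS, and the helper names are invented.

```python
import errno
import socket


def listen_on(port, reuse_port):
    """Open a TCP listener on 127.0.0.1, optionally opting in to SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def second_listener_error(reuse_port):
    """Try to open two listeners on one port; return the errno of the failure, or None."""
    first = listen_on(0, reuse_port)  # port 0 lets the OS pick a free port
    port = first.getsockname()[1]
    try:
        listen_on(port, reuse_port).close()
        return None
    except OSError as error:
        return error.errno
    finally:
        first.close()
```

second_listener_error(False) reports errno.EADDRINUSE, the same failure as the panic above, while second_listener_error(True) lets both listeners coexist.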

What next?

Probably try starting a wasmcloud host inside docker, and retry the above dance.

In order to do this, I made an excursion into setting up the appropriate security creds for my own NATS cluster. This ended up defeating me. I might try again following a tutorial, rather than going it alone in a docker-compose sandbox. I ended up creating an NGS Developer account.

Connecting via NGS

If you want to connect via a leaf node, follow along with this tutorial to get a leaf node set up: https://docs.nats.io/nats-server/configuration/leafnodes#leaf-node-example-using-a-remote-global-service

If you want to do it all in a single process, you can do it via command-line arguments or environment variables (thanks structopt):

CONTROL_HOST=connect.ngs.global \
RPC_HOST=connect.ngs.global \
CONTROL_CREDS=$HOME/.nkeys/creds/synadia/leaftest/leaftestuser.creds \
RPC_CREDS=$HOME/.nkeys/creds/synadia/leaftest/leaftestuser.creds \
wasmcloud

`wash` control plane access

Connecting wash to a local leaf node seems to be a supported configuration. There doesn’t seem to be a way to specify a control-plane credentials file.

Maybe I will make a patch for this. Done: https://github.com/wasmCloud/wash/pull/146

It seems that control plane access from the wasm host is typically passwordless via the leaf node. This makes me nervous. Control-plane access feels like it’s equivalent to root access to your entire cluster. If anyone manages to compromise the wasm sandbox or any of your capability providers, then they just need to write pub $some.topic $some.payload to tcp localhost:4222.

TODO: work out what $some.topic $some.payload needs to look like to be sufficiently scary.

Developer experience experiment

Let’s make a TodoMVC demo

https://github.com/redbadger/wasmcloud-examples/projects/1

There are a bunch of things in our backlog that were paper cuts that are probably easy to fix.

Adding a logging capability

[2021-07-01T12:53:03Z ERROR] The target wasmcloud:logging was not found as an actor public key, an actor call alias, or as the contract ID in an existing link from source actor MBQ3PZKT6JSEAL4UH7WPZDR2OK7WESH352XH56ORRFC6EE4YPYE5JAEX

This happened because I wrote a warn!() line in my #[actor::init] function. There could probably be some diagnostic hints for this.

Actor ids

It’s very frustrating that you have to find a way to take the actor id from target/*/*/whatever_s.wasm or wasmcloud.azurecr.io/kvcounter:0.2.0 and paste it into your manifest.yaml (or side-load it via ${VAR:default} environment variable hackery).

Really, I feel like the manifest format wants a way to say “whatever id you find at the following path/container registry url”, and then you could store the mappings from path to signature in a lockfile, that you can optionally check in. I should make a ticket for this.
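
A rough Python sketch of that lockfile idea follows. Everything in it is hypothetical: the JSON lockfile layout is invented, and read_actor_id stands in for whatever would actually extract the id from a signed module or registry image.

```python
import json
from pathlib import Path


def resolve_actor_ids(references, lockfile, read_actor_id, update=False):
    """Map each manifest reference (path or registry URL) to an actor id.

    Ids come from the lockfile when present; otherwise read_actor_id(reference)
    is called and the result is recorded. With update=True every id is re-read.
    """
    lockfile = Path(lockfile)
    locked = json.loads(lockfile.read_text()) if lockfile.exists() else {}
    resolved = {}
    for reference in references:
        if update or reference not in locked:
            locked[reference] = read_actor_id(reference)
        resolved[reference] = locked[reference]
    # Sorted keys keep the lockfile diff-friendly if it is checked in.
    lockfile.write_text(json.dumps(locked, indent=2, sort_keys=True) + "\n")
    return resolved
```

The manifest would then only mention paths and registry URLs, and the ids would change only when someone explicitly asks for an update.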

`wash` again

We ended up using wash --watch in our example’s Makefile. This gave the hot-reloading experience, but also forced us to use wash. It makes sense that wasmcloud --watch isn’t a thing, since it’s not really a thing that you would expect to need in production, but wash’s TUI is really terrible. A reasonable middle-ground might be to have a mode similar to bluetoothctl on linux, where there is a prompt at the bottom, and logs are printed above the prompt, so wash --no-tui becomes a repl with streaming logs above. I think there is some black magic involved in making this work though, so I’m not expecting anything soon.

Scaling

In wasmcloudland, the unit of scale is the wasmcloud host (just as it is the pod in kubernetesland). There is currently no way to scale actors within a single wasmcloud host. There is already a ticket about this at https://github.com/wasmCloud/wasmCloud/issues/54. It’s definitely worth waiting for them to get this right before adopting wasmcloud for real applications.

Conclusions

(Yes: I’m introducing things in my conclusion section that don’t appear anywhere else in the document. Sue me.)

I think that most of the things above are paper-cuts, and symptoms of the fact that wasmcloud isn’t very mature yet. The wasmcloud team is small, and they seem to be focussing on the big architectural decisions, so we can expect a bit of a lack of polish as the big ticket items are figured out.

Architecturally, wasmcloud feels like it’s what the future will look like. In the next couple of years, I expect that either wasmcloud will get there or something will come along that looks surprisingly like wasmcloud and get to a polished platform first. Either way, learning wasmcloud has given me a glimpse into the future.
