Automating open source: How Ersilia distributes AI models to advance global health equity

Discover how the Ersilia Open Source Initiative accelerates drug discovery by using GitHub Actions to disseminate AI/ML models.


Developing a single new medication takes an average of 10 years and $1.3 billion, so pharmaceutical companies often focus their drug discovery efforts on a high return on investment. They develop drugs for diseases prevalent in high-income countries and leave lower- and middle-income countries behind.

In response, investments in building AI/ML models for drug discovery have soared in the last five years. By using these models, scientists can shorten their research and development timeline by getting better at identifying drug prospects. However, access to these models is limited by data science expertise and computational resources.

The nonprofit Ersilia Open Source Initiative is tackling this problem with the Ersilia Model Hub.

Through the hub, Ersilia aims to disseminate AI/ML models and computational power to researchers focused on drug discovery for infectious diseases in regions outside of Europe and North America.

In this post, we’ll share how Ersilia and GitHub engineers built a self-service IssueOps process to make AI/ML models in the hub publicly available, allowing researchers to find and run them for free on public repositories using GitHub Actions. 👇

Ersilia Model Hub: What it is and who uses it

Though largely overlooked by for-profit pharmaceutical companies, research on infectious diseases in low- and middle-income countries is ongoing. The hub taps into that research by serving as a curated collection of AI/ML models relevant to the discovery of antibiotic drugs.

Ersilia disseminates published findings and models, as well as its own, through public repositories on GitHub so that under-resourced researchers and institutions can use them for free to improve drug discovery in their respective countries.

“At some point, I realized that there was a need for a new organization that was flexible enough to actually travel to different countries and institutions, identify their data science needs, which are often critically lacking, and develop some data science tools,” says Ersilia co-founder, Miquel Duran-Frigola, PhD.

That realization crystallized into Ersilia and the Ersilia Model Hub, which Duran-Frigola founded with two other biomedicine experts, Gemma Turon, PhD, and Edo Gaude, PhD.

“The hub contains computational models, which are relatively very cheap to run compared to doing experiments in the laboratory,” Duran-Frigola says. “Researchers can create simulations using computational models to predict how a candidate molecule might treat a particular disease. That’s why these models are often good starting points to perform drug discovery research.”

Currently, there are about 150 models in the Ersilia Model Hub.

Who uses and contributes to Ersilia?

Most of the contributors who add models to the hub are data scientists and software developers, while most who run those models are researchers in biomedicine and drug discovery at institutions in various countries throughout Sub-Saharan Africa. Over the next two years, Ersilia aims to establish the hub in 15 institutions throughout Africa.

Ersilia’s biggest partner is the University of Cape Town’s Holistic Drug Discovery and Development (H3D) Centre. Founded in 2010 as Africa’s first integrated drug discovery and development center, H3D uses the data science tools disseminated by the Ersilia Model Hub to advance innovative drug discovery across the African continent.

Ersilia is also partnering with emerging research centers, such as the University of Buea’s Center for Drug Discovery in Cameroon. A fellowship from the Bill & Melinda Gates Foundation provided the center with the seed funding it needed to start in 2022, and today it has 25 members.

“The center aims to discover new medicines based on natural products collected from traditional healers, but it doesn’t have a lot of resources yet,” explains Duran-Frigola. “The idea is that our tool will become a core component of the center so its researchers can benefit from computational predictions.”

How the Ersilia Model Hub works

Contributors can request a model be added to the hub by opening an issue.

The vast majority of models are open source, all are publicly available, and most are pulled from scientific literature. For example, biochemists at the David Braley Centre for Antibiotic Discovery created an ML model to predict how likely a chemical compound is to inhibit the activity of Acinetobacter baumannii, a pathogen often transmitted in healthcare settings and known for its resistance to multiple antibiotics.

But Ersilia develops some models in-house, like one that predicts the efficacy of chemicals against lab-grown Mycobacterium tuberculosis (M. tuberculosis), using data from Seattle Children’s Hospital. M. tuberculosis is the agent that causes tuberculosis, an infectious disease that primarily affects individuals in low- and middle-income countries.

While the Ersilia team manually approves which models enter the hub, it uses GitHub Actions to streamline requests and solicit the following information from model contributors:

  • The model’s schema (what input is expected and what output will be returned).
  • Open source license information.
  • Whether the model can run on CPUs or GPUs.
  • Link to model’s open source code.
  • Link to publication (either peer-reviewed or preprint).
  • Labels to describe the model’s use case, with tags like malaria, classification, regression, unsupervised, or supervised.
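As a rough illustration, the information requested above could be captured in a small record like the one below. This is a sketch only: the class name `ModelRequest`, its field names, and the checks in `problems()` are assumptions made for this example, not Ersilia's actual schema or validation rules.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRequest:
    """Hypothetical record of what a contributor supplies when requesting a model."""

    input_schema: str      # what input the model expects
    output_schema: str     # what output the model returns
    license: str           # open source license information
    hardware: str          # "CPU" or "GPU"
    source_code_url: str   # link to the model's open source code
    publication_url: str   # peer-reviewed paper or preprint
    tags: list[str] = field(default_factory=list)  # e.g. "malaria", "classification"

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the request looks complete."""
        found = []
        if self.hardware not in {"CPU", "GPU"}:
            found.append("hardware must be CPU or GPU")
        for name in ("source_code_url", "publication_url"):
            if not getattr(self, name).startswith("https://"):
                found.append(f"{name} should be an https:// link")
        if not self.tags:
            found.append("at least one tag is expected")
        return found


# Placeholder values; example.org is not a real model location.
request = ModelRequest(
    input_schema="SMILES string",
    output_schema="probability",
    license="MIT",
    hardware="CPU",
    source_code_url="https://example.org/code",
    publication_url="https://example.org/paper",
    tags=["malaria", "classification"],
)
```

A structured record like this makes it easier to reject incomplete requests automatically before a person reviews them.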

When Ersilia approves the model, the contributor submits a pull request that triggers a set of tests. If all those tests are successful, GitHub Actions merges the pull request and the model is incorporated into the hub.

Rachel Stanik, a software engineer at GitHub, breaks down the steps to adding an AI model to the Ersilia Model Hub:

From the user side, researchers interested in drug discovery can fetch static, ready-to-use AI/ML models from the hub’s public repositories, input candidate molecules, and receive predictions that indicate how well each candidate molecule performs against a specific disease, all online and for free. The self-service process includes an important note on privacy: any activity on the repository is open and available to the general public, including those predictions, which are stored as GitHub Actions artifacts.

“Right now, Ersilia is focused on information and tool dissemination,” says Duran-Frigola. “For the future, we’re working on providing a metric of confidence for the models. And, with a bigger user base, Ersilia could aggregate inputs to capture the candidate molecules that people are testing against infectious diseases.”

Using an aggregation of candidate molecules, researchers could glean which drugs are available in certain countries and experiment with repurposing those drugs to fight against other microbes. The information could help them to treat neglected diseases without having to develop a new drug from scratch.

How GitHub built a self-service process for the Ersilia Model Hub

Before Ersilia reached out to GitHub, researchers couldn’t independently access or run the models in the hub.

GitHub customer success architect, Jenna Massardo, and social impact engineer, Sean Marcia, who’s also the founder of the nonprofit Ruby For Good, worked with Ersilia to fix that by creating a self-service process for the hub. GitHub’s Skills-Based Volunteer program, run by GitHub’s Social Impact team, organized the opportunity. The program partners employees with social sector organizations for a period of time to help solve strategic and technical problems.

Creating an IssueOps process

Massardo and Marcia’s first step in problem-solving was understanding and learning how the software works: How would a researcher share information? What kind of outputs should a researcher expect?

“I had them walk me through the process of setting up and using the Ersilia Model Hub on my workstation. It was only once it was running on my workstation, where I could actually test it and do the process myself, that I began to pick it apart,” Massardo says.

Massardo and Marcia then broke the phases into pieces: How would a researcher make a request to use a model? How would the model process the researcher’s input data? How would that input be handled? What notifications would researchers get?

Massardo and Marcia decided to bring in a standard IssueOps pattern, which uses GitHub issues to drive automation with GitHub Actions.

“It’s a super common pattern. A lot of our internal tooling at GitHub is built on it, like some of our migration tooling for our enterprise customers,” Massardo explains. She quickly ruled out using a pull request flow, where collaborators propose changes to the source code.

“People are contributing to the repository but they’re not actually making code changes. They’re just adding files for processing,” Massardo says. “Using pull requests would have meant a lot of noise in the repository’s history. But issues are perfect for this sort of thing.”

Once a plan was set in place, Massardo began to build while Marcia kept the collaboration running smoothly.

Researchers, biologists, and even students can now use the self-service process by simply going to the hub, creating an issue, filling out the template, and submitting it. The template requires users to select the model they want to run and to input one or more candidate molecules in SMILES (Simplified Molecular Input Line Entry System) format, a computer-readable way to represent complex molecules as text.
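To make the SMILES input concrete, here is a minimal sketch of how multi-line input might be split into individual molecules before it is handed to a model. The function name `split_smiles` and the character check are assumptions for illustration. The check is only a coarse filter on allowed characters and does not confirm that a string describes a chemically valid molecule.

```python
import re

# Characters that can appear in a SMILES string. This is only a coarse
# sanity check: it does not verify that the string is a valid molecule.
_SMILES_CHARS = re.compile(r"[A-Za-z0-9@+\-\[\]()=#$/\\%.:*]+")


def split_smiles(text: str) -> tuple[list[str], list[str]]:
    """Split multi-line input into plausible SMILES strings and rejected lines."""
    accepted, rejected = [], []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        if _SMILES_CHARS.fullmatch(candidate):
            accepted.append(candidate)
        else:
            rejected.append(candidate)
    return accepted, rejected


# Ethanol and aspirin in SMILES notation, plus one line that is not SMILES.
accepted, rejected = split_smiles(
    "CCO\n\nCC(=O)OC1=CC=CC=C1C(=O)O\nnot a molecule!\n"
)
```

A real pipeline would likely rely on a cheminformatics library for proper validation; the filter above only catches obviously malformed lines.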

[Image: The GitHub issue template that powers the Ersilia Model Hub's self-service process.]

Setting up a GitHub Actions workflow

Originally, Ersilia wanted to build a custom GitHub Action, but Massardo—someone who’s written multiple custom actions used internally and externally—knew that it comes with a fair amount of maintenance.

“There’s a lot of code you’re writing on your own, and that means you have to manage a bunch of dependencies and security updates,” Massardo says. “At that point, it becomes a full application.”

Understanding the problem as a series of individual tasks allowed her to scope an effective and cost-efficient solution.

“We created a series of simple workflows using readily available actions from GitHub Marketplace and just let GitHub Actions do its thing,” Massardo says. “By understanding Ersilia’s actual desires and needs, we avoided overcomplicating and obfuscating the issue.”

When a researcher files an issue to run a candidate molecule through a model, it triggers a GitHub Actions workflow to run. Here’s a look at the process:

  1. GitHub Actions spins up a GitHub-hosted runner to execute the workflow.
  2. The GitHub Issue Forms Body Parser action parses the content out of the issue and translates it from Markdown into structured, usable data.
  3. The workflow fetches the user-requested model and then triggers Ersilia’s software to run.
  4. Ersilia’s software configures the model, and the user-provided input is put into a file that the model can process.
  5. Ersilia’s software then generates a CSV output, saved as an artifact in GitHub Actions.
  6. The workflow lets the user know that it was successfully completed by leaving a comment in the open issue, which includes a link to the artifact that the user can click to download.
  7. This particular workflow has a 30-day retention period, so five days before the artifact expires, stale bot notifies users to download the output. After 30 days, stale bot automatically closes the issue.
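The parsing in step 2 can be sketched in a few lines. Issue forms are stored as Markdown in which each field appears as a `### Label` heading followed by the answer, and an empty optional field is written as `_No response_`. The code below illustrates that idea only; it is not the code of the GitHub Issue Forms Body Parser action, and the field labels and the `example-model` identifier are hypothetical.

```python
def _clean(value: str) -> str:
    """Trim whitespace and map GitHub's empty-field placeholder to an empty string."""
    value = value.strip()
    return "" if value == "_No response_" else value


def parse_issue_form(body: str) -> dict[str, str]:
    """Turn an issue-form body ('### Label' headings followed by answers) into a dict."""
    raw: dict[str, list[str]] = {}
    key = None
    for line in body.splitlines():
        if line.startswith("### "):
            # A new field starts; normalize "Some Label" to "some_label".
            key = line[4:].strip().lower().replace(" ", "_")
            raw[key] = []
        elif key is not None:
            raw[key].append(line)
    return {name: _clean("\n".join(lines)) for name, lines in raw.items()}


# Hypothetical issue body with three fields, the last one left empty.
body = (
    "### Model\n\nexample-model\n\n"
    "### Molecules\n\nCCO\nCC(=O)O\n\n"
    "### Notes\n\n_No response_\n"
)
fields = parse_issue_form(body)
```

Once the issue is a dictionary, the later steps can look up the requested model and write the molecules to an input file without touching Markdown again.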

“Everything happens right on GitHub,” Massardo explains. “The user doesn’t have to worry about anything. They just submit the issue, and Ersilia’s workflow processes it and lets them know when everything’s done. Importantly, the Ersilia staff, who are busy running the nonprofit, don’t have to do any maintenance.”

Using Docker containers to run AI models on GitHub runners

To streamline the process of creating model images, Ersilia uses a Dockerfile template. When a researcher submits a new model to the hub, Ersilia copies the template to the model’s repository, which kicks off the Docker image build for that model—a process that’s powered by GitHub-hosted runners. Once built, the model image lives in the hub and researchers can run it as many times as needed. A model can also be rebuilt if fixes are needed later.
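One way to picture a Dockerfile template is as a text file with placeholders that are filled in per model. The sketch below assumes that approach; the template contents, the placeholder names, the `python:3.10-slim` base image, and the `render_dockerfile` helper are all hypothetical, and Ersilia's real template may simply be copied as-is and look quite different.

```python
from string import Template

# Hypothetical template; Ersilia's real Dockerfile template may look different.
DOCKERFILE_TEMPLATE = Template(
    "FROM $base_image\n"
    "WORKDIR /model\n"
    "COPY . /model\n"
    'LABEL model_id="$model_id"\n'
)


def render_dockerfile(model_id: str, base_image: str = "python:3.10-slim") -> str:
    """Fill in the template's placeholders for a single model."""
    return DOCKERFILE_TEMPLATE.substitute(model_id=model_id, base_image=base_image)


dockerfile = render_dockerfile("example-model")
```

Keeping one shared template means a fix to the build recipe can be applied to every model image by rebuilding, rather than by editing each repository by hand.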

The models in the hub are available in public repositories, where GitHub Actions runs at no cost. When researchers use the self-service process, GitHub Actions runs these Docker images on GitHub’s runners for free, which in turn allows researchers to run these models for free. Models in the hub are also designed and optimized to run on CPUs so that researchers can run the models locally on their machines, making them more accessible to the global scientific community.

The models aren’t very large, explains Ersilia CTO, Dhanshree Arora, because they’re built for very specific use cases. “We’re actively working to reduce the size of our model images, so they use less network resources when transferred across machines, occupy less space on the machine where they run, and enable faster spin-up times for the containers created from these model images,” Arora says.

The ability to package these models as Docker containers also means that researchers can collaborate more easily, as the models run in consistent and reproducible environments.

Automating daily model fetching

When researchers file an issue to use a model, they see a list of available models. That list is updated every day by a workflow that Massardo built using GitHub Actions and some custom code.

Every day, the workflow:

  1. Fetches the file containing the list of models managed by the Ersilia team. The file is automatically updated whenever the team modifies or deprecates a model, or adds a new model.
  2. Runs a Python script to process the file and pull out data that captures new, updated, or deprecated models.
  3. Updates the list of models in the issues template with the extracted data.
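The daily steps above might look roughly like the following. This is a sketch under stated assumptions, not Massardo's script: it assumes the model list is a JSON array whose entries carry hypothetical `identifier` and `status` fields, and the function names are invented for this example.

```python
import json


def active_models(models_json: str) -> list[str]:
    """Return sorted identifiers of models that are not marked as deprecated."""
    models = json.loads(models_json)
    return sorted(m["identifier"] for m in models if m.get("status") != "deprecated")


def diff_options(current: list[str], latest: list[str]) -> tuple[list[str], list[str]]:
    """Return (added, removed) between the template's current options and the latest list."""
    added = sorted(set(latest) - set(current))
    removed = sorted(set(current) - set(latest))
    return added, removed


# Hypothetical contents of the fetched model list.
models_json = json.dumps([
    {"identifier": "model-b", "status": "ready"},
    {"identifier": "model-a", "status": "ready"},
    {"identifier": "model-c", "status": "deprecated"},
])

latest = active_models(models_json)
# Options currently shown in the issue template (also hypothetical).
added, removed = diff_options(["model-b", "model-c"], latest)
```

The `latest` list would then replace the options in the issue template, while `added` and `removed` are handy for a commit message or a log line describing what changed that day.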

“This is another example of how we built this process to be as hands-off as possible while still making it as easy as possible for researchers to actually use the tool,” Massardo says.

Ersilia wants your contributions

Ersilia has grown an open source community of contributors and users, and believes that everything it does needs to continue to be open source. It was initially drawn to GitHub Actions because it’s free to use in public repositories. After witnessing the impact of GitHub Actions on the model hub, Duran-Frigola wants to identify more use cases.

“I want to find creative ways to use GitHub Actions, beyond CI/CD, to help more researchers use our tools,” he says.

He also wants Ersilia’s many interns to practice using GitHub Copilot and gain hands-on experience with using AI coding tools that are changing the landscape of software development.

3 tips for contributing to open source projects, from a Hubber

➡️ Read Massardo’s three tips for contributing to open source projects and Ersilia’s contribution guidelines, then start engaging with GitHub’s open source community.

  1. Find a project that interests you. Working on a project that’s personally interesting generally means you’ll stick with it and not get bored.
  2. Look through the issues in a project’s repository to find something that you can fix or add. A lot of projects use the good first issue label to identify things that newcomers can tackle.
  3. Be prepared to iterate. Some project owners require several smaller contributions before they’ll entertain a larger product change. Some folks are in different parts of the world, so you may need to rewrite things to make them clearer. If you’re thinking about a major change to a project, open an issue to discuss it with the owners first because they might have a different vision.

Contribute to another nonprofit using For Good First Issue

Ersilia was recently designated as a Digital Public Good (DPG) by the United Nations. DPGs are open source solutions—ranging from open source software and data to AI systems and content collections—that are designed to unlock a more equitable world. DPGs are freely accessible, intended to be used and improved by anyone to benefit the public, and they’re designed to address a societal challenge and promote sustainable development.

If you’re inspired by Ersilia and want to contribute to more DPGs, check out GitHub’s For Good First Issue, a curated list of recognized DPGs that need contributors.

For Good First Issue is designed as a tool for nonprofits to connect with technologists around the world. As nonprofits often lack funding and resources to solve society’s challenges through technology, For Good First Issue can connect nonprofits that need support with the people who want to make positive change.
