feat(providers/github): add GitHub provider #6

Merged
nickscamara merged 2 commits into firecrawl:main from mogery:mog/github
Jan 29, 2024

@mogery
Member

@mogery mogery commented Jan 29, 2024

Fixes #3
/claim #3

This PR adds a GitHub provider for retrieving files from public GitHub repositories, as described in #3.

Remarks:

  • Since GitHub only allows roughly 50 requests per 5 minutes for unauthenticated API callers, I had to add authentication; otherwise the tests would take about 30 minutes, mostly waiting on rate limiting. Nango is supported, as well as manual authorization by providing an Octokit auth strategy and parameters.
    • This means that if set up correctly (with the correct scopes/permissions specified), this is not only a public GitHub provider, but a private one too.
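To illustrate the waiting described above, here is a minimal sketch of how a caller could decide how long to pause once the quota is used up. It is not taken from this PR. It assumes the response headers are available as a plain object and relies on GitHub's `x-ratelimit-remaining` and `x-ratelimit-reset` (epoch seconds) response headers; `rateLimitDelayMs` and `HeaderMap` are hypothetical names.

```typescript
// Hypothetical helper, not part of the PR: given the headers of a GitHub API
// response, return how many milliseconds to wait before the next request.
type HeaderMap = Record<string, string | undefined>;

function rateLimitDelayMs(headers: HeaderMap, nowMs: number): number {
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const resetSec = Number(headers["x-ratelimit-reset"]);

  // If either header is missing or malformed, don't guess; just continue.
  if (!isFinite(remaining) || !isFinite(resetSec)) {
    return 0;
  }
  // Quota left: no need to wait.
  if (remaining > 0) {
    return 0;
  }
  // Quota exhausted: wait until the reset time (never a negative delay).
  return Math.max(0, resetSec * 1000 - nowMs);
}
```

With authentication the quota is much larger, so a helper like this would rarely return a non-zero delay during tests.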

provider: "github",
type: this.docOnly
  ? "document" // don't run iterating computation if we only retrieved documents anyways
  : isDoc(file.path) ? "document" : "code",
Member Author

What's the schema for Document.type? I know "document" is valid, but is "code" allowed?

Member

That's fine!
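For readers following the thread, the `type` decision in the snippet above can be sketched as a standalone function. The real `isDoc` from the PR is not shown in this conversation, so the version below is a stand-in that assumes documents are recognised by file extension; the extension list and the name `documentType` are hypothetical.

```typescript
// Stand-in for the PR's isDoc (its real implementation is not shown here).
// Assumption: a file counts as a document if it has a common prose extension.
function isDoc(path: string): boolean {
  return /\.(md|mdx|txt|rst)$/i.test(path);
}

// Mirrors the ternary in the reviewed snippet: when only documents were
// retrieved, skip the per-file check entirely.
function documentType(path: string, docOnly: boolean): "document" | "code" {
  if (docOnly) {
    return "document";
  }
  return isDoc(path) ? "document" : "code";
}
```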

Comment on lines +186 to +198
// Construct pretty source URL.
sourceURL: `https://github.com/${
  encodeURIComponent(this.owner)
}/${
  encodeURIComponent(this.repo)
}/blob/${
  encodeURIComponent(branchName)
}/${
  file.path
    .split("/") // Don't escape slashes, they're a part of the path.
    .map(part => encodeURIComponent(part))
    .join("/")
}`,
Member Author

This is pretty dodgy, but I couldn't find a better way to construct a URL. file.url points to an api.github.com link, which won't bring up the GitHub UI. This "pretty URL" does.

Member

Sounds good, this should be fine.
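The URL construction discussed above can be pulled out into a small pure function, which makes the escaping behaviour easy to check. This is a sketch that mirrors the snippet rather than the PR's actual code layout; `buildSourceURL` and its parameter names are hypothetical.

```typescript
// Hypothetical extraction of the template literal from the reviewed snippet.
function buildSourceURL(
  owner: string,
  repo: string,
  branch: string,
  filePath: string,
): string {
  // Escape each path segment on its own so the slashes between them survive.
  const encodedPath = filePath
    .split("/")
    .map((part) => encodeURIComponent(part))
    .join("/");

  return `https://github.com/${encodeURIComponent(owner)}/${encodeURIComponent(
    repo,
  )}/blob/${encodeURIComponent(branch)}/${encodedPath}`;
}

// Example call with made-up values:
// buildSourceURL("octo-org", "demo", "main", "docs/getting started.md")
```

Note that, like the snippet, this sketch escapes a slash inside the branch name (for example `mog/github`) as `%2F`; whether GitHub resolves such a URL has not been checked here.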

@nickscamara
Member

Wow! This is awesome @mogery! This all looks really good. Thank you!

Comment on lines +175 to +177
// Decode the content blob as it is encoded
const decodedContent = Buffer.from(blob.data.content, 'base64').toString('utf8');

Member Author

@mogery mogery Jan 29, 2024

This may output large amounts of non-text data if the repo contains binaries. Is that okay? Should we add UTF-8 detection?
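One possible shape for the UTF-8 detection raised in this comment is sketched below. It is not what the PR does. It assumes a runtime with the standard `atob` and `TextDecoder` globals (current Node.js and browsers), treats any NUL byte as a sign of binary data, and otherwise relies on `TextDecoder` with `fatal: true` to reject invalid UTF-8; `decodeIfText` is a hypothetical name.

```typescript
// Hypothetical helper: decode a base64 blob, returning null when the bytes
// do not look like UTF-8 text.
function decodeIfText(base64Content: string): string | null {
  // Base64 payloads may be wrapped across lines; strip whitespace first.
  const binary = atob(base64Content.replace(/\s/g, ""));

  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    const byte = binary.charCodeAt(i);
    // Heuristic: text files practically never contain NUL bytes.
    if (byte === 0) {
      return null;
    }
    bytes[i] = byte;
  }

  try {
    // fatal: true makes decode() throw on invalid UTF-8 sequences
    // instead of silently inserting replacement characters.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}
```

A caller could skip files for which this returns `null`. The NUL-byte check is only a heuristic, so some binary formats that happen to be valid UTF-8 would still pass.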

@nickscamara
Member

/tip $20

@algora-pbc

algora-pbc Bot commented Jan 29, 2024

@mogery
Member Author

mogery commented Jan 29, 2024

Thank you!

@algora-pbc

algora-pbc Bot commented Jan 29, 2024

🎉🎈 @mogery has been awarded $20! 🎈🎊

@nickscamara nickscamara merged commit 1ae2da8 into firecrawl:main Jan 29, 2024
@mogery mogery deleted the mog/github branch January 29, 2024 19:58
Development

Successfully merging this pull request may close these issues.

Add Public GitHub Connector

2 participants