Fix null bytes in PostgreSQL #49

mrdb · 2025-03-17T21:09:44Z

Claude 3.7 Sonnet (I apologise if I am doing this incorrectly; I am new to GitHub and development and am learning as I go):

Based on my analysis, I've found the root cause of your issue with Buildware. The error "invalid byte sequence for encoding UTF8: 0x00" occurs when the application tries to store binary data with null bytes (0x00) in a PostgreSQL text field.

Here's the issue and solution:

Problem Diagnosis

The error occurs in the embedBranch function while it updates the embedded_files table in your Supabase database.
Specifically, in actions/github/fetch-files.ts, every file's content, including that of binary files, is decoded as UTF-8 text with:
```
content: Buffer.from(data.content, "base64").toString("utf-8")
```
While there is some filtering to exclude common image files, other binary files in your repository likely contain null bytes (0x00). A null byte survives UTF-8 decoding, but PostgreSQL cannot store it in a text value, so the insert is rejected.
The current file exclusion list is incomplete and doesn't catch all binary file types.
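To make the failure mode concrete, here is a small sketch that runs the same decoding call on four made-up bytes. The sample bytes and variable names are assumptions for illustration; only the `Buffer.from(..., "base64").toString("utf-8")` call comes from fetch-files.ts.

```typescript
// Build a base64 payload for the bytes 0x61 0x00 0x62 0x00 ("a", NUL, "b", NUL),
// standing in for what the GitHub API might return for a binary file.
const encoded: string = Buffer.from([0x61, 0x00, 0x62, 0x00]).toString("base64")

// The same decoding call used in fetch-files.ts.
const content: string = Buffer.from(encoded, "base64").toString("utf-8")

// 0x00 is a valid UTF-8 code unit, so decoding keeps it in the string.
const hasNul: boolean = content.includes("\u0000")

console.log(content.length, hasNul)
```

The decoded string still carries both null bytes, which is what PostgreSQL rejects when the value reaches a text column.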

Solution

I recommend improving the binary file detection and handling in the fetch-files.ts file:
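A minimal sketch of what such detection could look like. This is not the actual content of fetch-files.ts: the extension list and the helper names (`hasBinaryExtension`, `containsNullByte`, `decodeTextFile`) are assumptions made for illustration.

```typescript
// Hypothetical extension list; extend it to match the repository being embedded.
const BINARY_EXTENSIONS = new Set<string>([
  ".png", ".jpg", ".jpeg", ".gif", ".ico", ".pdf", ".zip", ".gz",
  ".woff", ".woff2", ".ttf", ".exe", ".dll", ".so", ".bin"
])

// Cheap first pass: reject files whose extension is known to be binary.
function hasBinaryExtension(path: string): boolean {
  const dot = path.lastIndexOf(".")
  if (dot === -1) return false
  return BINARY_EXTENSIONS.has(path.slice(dot).toLowerCase())
}

// Second pass: scan the raw bytes for 0x00 before decoding.
function containsNullByte(bytes: Uint8Array): boolean {
  return bytes.indexOf(0) !== -1
}

// Returns the decoded text, or null when the file should be skipped.
function decodeTextFile(path: string, base64Content: string): string | null {
  if (hasBinaryExtension(path)) return null
  const bytes = Buffer.from(base64Content, "base64")
  if (containsNullByte(bytes)) return null
  return bytes.toString("utf-8")
}
```

Scanning the bytes catches binary files that the extension list misses, at the cost of one pass over each file.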

Additionally, I recommend modifying the embedFiles function to handle empty file arrays and provide better error handling:
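The idea can be sketched as a wrapper that returns early on an empty array and skips files that fail instead of aborting the batch. The real signature of embedFiles is not shown here, so `embedFilesSafely`, `embedOne`, `FileInput`, and `BatchResult` are all hypothetical names.

```typescript
// Hypothetical shapes; the real types in embed-files.ts may differ.
interface FileInput {
  path: string
  content: string
}

interface BatchResult<R> {
  stored: R[]
  skipped: { path: string; reason: string }[]
}

// embedOne stands in for whatever embeds and stores a single file.
async function embedFilesSafely<R>(
  files: FileInput[],
  embedOne: (file: FileInput) => Promise<R>
): Promise<BatchResult<R>> {
  const result: BatchResult<R> = { stored: [], skipped: [] }

  // Nothing to do: avoid calling the embedding API or database at all.
  if (files.length === 0) return result

  for (const file of files) {
    try {
      result.stored.push(await embedOne(file))
    } catch (err) {
      // Record the failure and continue with the remaining files.
      const reason = err instanceof Error ? err.message : String(err)
      result.skipped.push({ path: file.path, reason })
    }
  }
  return result
}
```

The `skipped` list doubles as a record of which files were left out, which can be logged by the caller.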

Implementation Steps

Replace the existing actions/github/fetch-files.ts with the modified version I provided.
Replace the existing actions/github/embed-files.ts with the modified version I provided.
Restart your application to apply these changes.

Explanation

The primary issue is that content containing null bytes (0x00) can't be stored in PostgreSQL text fields, and decoding binary files as UTF-8 produces exactly that. My solution:

Enhanced Binary File Detection: Added more thorough binary file type detection based on file extensions.
Null Byte Scanning: Added explicit checks for null bytes in file content before attempting to store them.
Error Handling: Improved error handling to skip problematic files rather than failing the entire operation.
Content Sanitization: Added content sanitization as a fallback to remove any null bytes that might have slipped through.
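As a rough illustration of the sanitization fallback, the sketch below strips null bytes from a string just before it is stored. The function name `stripNullBytes` is an assumption, not a name taken from the modified files.

```typescript
// Remove every NUL character so the value is acceptable to a PostgreSQL text column.
function stripNullBytes(content: string): string {
  return content.replace(/\u0000/g, "")
}

const cleaned = stripNullBytes("a\u0000b\u0000")
console.log(cleaned.length)
```

Stripping changes the stored content, so it works best as a last resort after binary files have already been filtered out.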

This approach should prevent the UTF-8 encoding errors while still allowing the system to process and embed the relevant text files in your codebase.

Additional Recommendations

Consider using a BYTEA column type in PostgreSQL to store binary content if you need to analyze binary files in the future.
You might want to add a database migration to handle any existing corrupted records.
Consider adding a logging system to track which files are being skipped to help debug any future issues.

Would you like me to explain any part of the solution in more detail?

Fix null bytes in PostgreSQL

42a5a6f
