Skip to content

Conversation

sahil-shubham
Copy link
Collaborator

@sahil-shubham sahil-shubham commented Aug 22, 2025

In this PR

  • Endpoint for indexing json data (single or multiple objects)
  • Updated config to work with a single database connection string
  • Disabled qdrant

Copy link

linear bot commented Aug 22, 2025


temp_path = temp_file.name

logger.info(f"Starting JSON indexing: site={site}, objects={len(data) if isinstance(data, list) else 1}")

Check failure

Code scanning / CodeQL

Log Injection High

This log entry depends on a
user-provided value
.

Copilot Autofix

AI about 1 month ago

To fix this log injection vulnerability, sanitize the site user-supplied value before logging it. The generally recommended approach for plain-text logs is to strip newlines and carriage returns (\r, \n, \r\n) from the value to prevent attackers from injecting extra log lines. This can be achieved by calling .replace() on the variable before interpolation in the log message.

Best practice is to sanitize as close as possible to usage, but for readability and to prevent accidental omission, it makes sense to sanitize just above where the variable is first used in a log statement. In this file, sanitize site just before it's logged:

  • Replace newlines and carriage returns from site using site.replace('\r', '').replace('\n', '') and use the sanitized variable in the log message.
  • Optionally, explicitly mark the sanitized version, e.g. site_for_log.
  • No external dependencies are required.

Make this change where site is logged on line 60 (and optionally line 72), by introducing a sanitized version just before.


Suggested changeset 1
code/python/webserver/routes/indexing.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/code/python/webserver/routes/indexing.py b/code/python/webserver/routes/indexing.py
--- a/code/python/webserver/routes/indexing.py
+++ b/code/python/webserver/routes/indexing.py
@@ -57,7 +57,8 @@
                 
                 temp_path = temp_file.name
             
-            logger.info(f"Starting JSON indexing: site={site}, objects={len(data) if isinstance(data, list) else 1}")
+            site_for_log = site.replace('\r', '').replace('\n', '') if isinstance(site, str) else str(site)
+            logger.info(f"Starting JSON indexing: site={site_for_log}, objects={len(data) if isinstance(data, list) else 1}")
             
             # Use loadJsonToDB with the temporary file
             total_documents = await loadJsonToDB(
@@ -69,7 +70,7 @@
                 database=database
             )
             
-            logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site}")
+            logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site_for_log}")
             
             # Return success=False if no documents were indexed
             success = total_documents > 0
EOF
@@ -57,7 +57,8 @@

temp_path = temp_file.name

logger.info(f"Starting JSON indexing: site={site}, objects={len(data) if isinstance(data, list) else 1}")
site_for_log = site.replace('\r', '').replace('\n', '') if isinstance(site, str) else str(site)
logger.info(f"Starting JSON indexing: site={site_for_log}, objects={len(data) if isinstance(data, list) else 1}")

# Use loadJsonToDB with the temporary file
total_documents = await loadJsonToDB(
@@ -69,7 +70,7 @@
database=database
)

logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site}")
logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site_for_log}")

# Return success=False if no documents were indexed
success = total_documents > 0
Copilot is powered by AI and may make mistakes. Always verify output.
database=database
)

logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site}")

Check failure

Code scanning / CodeQL

Log Injection High

This log entry depends on a
user-provided value
.

Copilot Autofix

AI about 1 month ago

To fix the problem, we should sanitize the user-provided site value before it is included in the log entry. Specifically, we should remove any carriage return (\r) and line feed (\n) characters, which are most often used in log injection attacks. We can do this using the replace() method as described in the background, applying both .replace('\n', '') and .replace('\r', '') (or combining them). This sanitization should be done before logging operations that interpolate or concatenate user input.

The relevant change is needed in the region of code before line 72, where the log statement is made. We should sanitize site after extracting it from the body and before it is used in logging statements (lines 60, 72, and possibly any other log sink in the handler). For completeness and future safety, it is best to sanitize site at the top after extraction and use the sanitized version throughout the function.

No additional dependencies are required. Only a code edit is needed.

Suggested changeset 1
code/python/webserver/routes/indexing.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/code/python/webserver/routes/indexing.py b/code/python/webserver/routes/indexing.py
--- a/code/python/webserver/routes/indexing.py
+++ b/code/python/webserver/routes/indexing.py
@@ -32,6 +32,8 @@
         
         data = body.get('data')
         site = body.get('site')
+        # Sanitize site input to prevent log injection
+        safe_site = site.replace('\r', '').replace('\n', '') if site else ''
         
         if not data or not site:
             return web.json_response({
@@ -57,7 +59,7 @@
                 
                 temp_path = temp_file.name
             
-            logger.info(f"Starting JSON indexing: site={site}, objects={len(data) if isinstance(data, list) else 1}")
+            logger.info(f"Starting JSON indexing: site={safe_site}, objects={len(data) if isinstance(data, list) else 1}")
             
             # Use loadJsonToDB with the temporary file
             total_documents = await loadJsonToDB(
@@ -69,7 +71,7 @@
                 database=database
             )
             
-            logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site}")
+            logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {safe_site}")
             
             # Return success=False if no documents were indexed
             success = total_documents > 0
EOF
@@ -32,6 +32,8 @@

data = body.get('data')
site = body.get('site')
# Sanitize site input to prevent log injection
safe_site = site.replace('\r', '').replace('\n', '') if site else ''

if not data or not site:
return web.json_response({
@@ -57,7 +59,7 @@

temp_path = temp_file.name

logger.info(f"Starting JSON indexing: site={site}, objects={len(data) if isinstance(data, list) else 1}")
logger.info(f"Starting JSON indexing: site={safe_site}, objects={len(data) if isinstance(data, list) else 1}")

# Use loadJsonToDB with the temporary file
total_documents = await loadJsonToDB(
@@ -69,7 +71,7 @@
database=database
)

logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {site}")
logger.info(f"JSON indexing completed: {total_documents} documents indexed for site {safe_site}")

# Return success=False if no documents were indexed
success = total_documents > 0
Copilot is powered by AI and may make mistakes. Always verify output.
@sahil-shubham
Copy link
Collaborator Author

Merging this, tested search and indexing locally.

@sahil-shubham sahil-shubham merged commit 9f90304 into main Aug 29, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant